Introduction

Housing prices are influenced by a wide range of socioeconomic, environmental, and structural factors. Understanding how these variables interact is critical for economists, policymakers, real estate professionals, and prospective homebuyers. This project applies statistical modeling and exploratory data analysis techniques to investigate the drivers of housing prices using two real-world datasets: the Boston Housing dataset and King County housing data.

In Part 1, we analyze the Boston Housing dataset to explore relationships between median home values and key covariates such as pollution levels, number of rooms, distance to employment centers, pupil-teacher ratios, and socioeconomic status indicators. We begin with exploratory data analysis and summary statistics, followed by bootstrap resampling to estimate the variability of the median home price. Correlation analysis and a multivariate linear regression model are then used to identify statistically significant predictors and evaluate model assumptions.

In Part 2, we extend this analysis to a larger and more modern dataset from King County, Washington. Using a random subsample of housing records, we explore price distributions, correlations among structural features, and construct a multivariate regression model to predict housing prices. Diagnostic plots and performance metrics such as Root Mean Squared Error (RMSE) are used to assess model fit and highlight limitations of linear regression in the presence of skewed and heteroscedastic data.

Overall, this project demonstrates how statistical inference, regression modeling, and diagnostic evaluation can be used to extract meaningful insights from housing data, while also illustrating the practical challenges of model assumptions when working with real-world datasets.

PART 1: BOSTON HOUSING DATA

We will extract the Boston dataset from the MASS library in R.

We are interested in the variables:

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data(Boston)
nox<-Boston[,5]
rm<-Boston[,6]
dis<-Boston[,8]
pr<-Boston[,11]
ls<-Boston[,13]
medv<-Boston[,14]
data.boston<- data.frame(medv, nox, rm, dis, pr, ls)
head(data.boston)
##   medv   nox    rm    dis   pr   ls
## 1 24.0 0.538 6.575 4.0900 15.3 4.98
## 2 21.6 0.469 6.421 4.9671 17.8 9.14
## 3 34.7 0.469 7.185 4.9671 17.8 4.03
## 4 33.4 0.458 6.998 6.0622 18.7 2.94
## 5 36.2 0.458 7.147 6.0622 18.7 5.33
## 6 28.7 0.458 6.430 6.0622 18.7 5.21
par(mfrow = c(2, 3))
hist(data.boston$medv, main = "Median Value of Homes",
     xlab = "Median Value in $1000", col = "lightblue", border = "black")
hist(data.boston$nox, main = "Nitric Oxide Concentration",
     xlab = "Nox parts per 10 million", col = "lightblue", border = "black")
hist(data.boston$rm, main = "Average Number of Rooms",
     xlab = "Average Number of Rooms Per Dwelling", col = "lightblue", border = "black")
hist(data.boston$dis, main = "Weighted Distances",
     xlab = "Distances to Employment Centers", col = "lightblue", border = "black")
hist(data.boston$pr, main = "Pupil-Teacher Ratio",
     xlab = "Pupil-Teacher Ratio", col = "lightblue", border = "black")
hist(data.boston$ls, main = "Percentage of Lower Status in Populaton",
     xlab = "Percentage of Lower Status", col = "lightblue", border = "black")

Let’s look at the summary statistics to see if they align with what we could conclude from the graphs.

Summary Statistics

You can also embed plots, for example:

summary(data.boston)
##       medv            nox               rm             dis        
##  Min.   : 5.00   Min.   :0.3850   Min.   :3.561   Min.   : 1.130  
##  1st Qu.:17.02   1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 2.100  
##  Median :21.20   Median :0.5380   Median :6.208   Median : 3.207  
##  Mean   :22.53   Mean   :0.5547   Mean   :6.285   Mean   : 3.795  
##  3rd Qu.:25.00   3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 5.188  
##  Max.   :50.00   Max.   :0.8710   Max.   :8.780   Max.   :12.127  
##        pr              ls       
##  Min.   :12.60   Min.   : 1.73  
##  1st Qu.:17.40   1st Qu.: 6.95  
##  Median :19.05   Median :11.36  
##  Mean   :18.46   Mean   :12.65  
##  3rd Qu.:20.20   3rd Qu.:16.95  
##  Max.   :22.00   Max.   :37.97

Based on the histogram of target variable medv, a median statistic should be used for a bootstrap simulation of the data because it is skewed, so the mean will be heavily affected by potential outliers while the median is less affected.

Median Statistic Bootstrap

set.seed(123)
B = 3000
np_boot = replicate(B, {
  boot_samp = sample(data.boston$medv, size = length(data.boston$medv), replace = TRUE)
  median(boot_samp)
})
sd(np_boot)
## [1] 0.3794048
summary(np_boot)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20.10   20.90   21.20   21.18   21.40   22.40
hist(np_boot,
     main = "Histogram of Bootstrap Medians",
     xlab = "Bootstrap Medians",
     col = "lightblue",
     border = "black",
     breaks = 30)

Based on the histogram of the bootstrap medians, the sampling distribution of the median appears approximately normal.

With 3,000 bootstrap resamples, the distribution of the median appears approximately normal, consistent with the Central Limit Theorem.

Although the original median home value data is skewed, the bootstrap distribution of the median appears approximately normal, allowing normal-based inference.

Correlations

cor_matrix = cor(data.boston)
print(cor_matrix)
##            medv        nox         rm        dis         pr         ls
## medv  1.0000000 -0.4273208  0.6953599  0.2499287 -0.5077867 -0.7376627
## nox  -0.4273208  1.0000000 -0.3021882 -0.7692301  0.1889327  0.5908789
## rm    0.6953599 -0.3021882  1.0000000  0.2052462 -0.3555015 -0.6138083
## dis   0.2499287 -0.7692301  0.2052462  1.0000000 -0.2324705 -0.4969958
## pr   -0.5077867  0.1889327 -0.3555015 -0.2324705  1.0000000  0.3740443
## ls   -0.7376627  0.5908789 -0.6138083 -0.4969958  0.3740443  1.0000000
# We can just look at the "lower half" of the correlation matrix since it is symmetric.
corrplot(cor(data.boston[, 1:6]), type = "lower")

# What if we don't care about negative and positive and just want the magnitude of correlation?
corrplot(abs(cor(data.boston[,1:6])), method = "shade",  type = "lower", col.lim = c(0, 1))

With the target variable being median value of homes (medv), we only care about the first column of the correlation plot.

The variable with the least interactions/correlation for median value of homes (medv) is the weighted distances to five Boston employment centers (dis). This may be because much of Boston’s population lives outside of the city and more towards suburban areas, so the demand for living close to employment centers is not that extreme to where it heavily impacts the median home value.

The variables with the greatest interactions/correlation for median value of homes (medv) include: percentage of lower status in population (ls) and average number of rooms (rm) where ls is negatively correlated to medv while rm is positively correlated. This means that if the percentage of lower status in the population is greater, then the median value of homes would most likely be significantly less. On the other hand, if the average number of rooms is greater, then the median value of homes would most likely be significantly more. Thus, there is a strong correlation between ls and rm covariates with medv. This could support how many Bostonians would prefer to live in places that are more “safe” and are larger with more rooms as it seems to be a strict demand.

In terms of correlation between covariates themselves, it can be seen that the weighted distances to five Boston employment centers (dis) and the nitric oxide concentration has a very strong correlation. This may be because the urban areas of Boston have very high levels of pollution while places farther from the city have less pollution. This does make sense as downtown and other very urban areas have greater population density and production of nitric oxides.

For least correlation between the covariates, pupil-teacher ratio (pr) and nox has very little correlation probably because it does not directly affect each other. rm and dis also have low correlation, but still positive which could signify that farther away from urban areas mean more land and more rooms for the house although it isn’t direct.

Justification

Because several covariates are correlated with one another, a multivariate linear regression model allows us to evaluate the effect of each predictor on median home value while controlling for the others. This approach reduces omitted variable bias and provides a clearer understanding of how multiple socioeconomic and environmental factors jointly influence housing prices.

Multivariate Linear Regression

Recall that the equation for a multivariate linear regression for the Boston data we have is: \[\text{medv}_i = \beta_0 + \beta_1\text{nox}_i+ \beta_2\text{rm}_i + \beta_3 \text{dis}_i + \beta_4 \text{pr}_i + \beta_5 \text{ls}_i + \epsilon_i \quad i = 1, 2, ..., n\] * \(\beta_0 = \text{intercept}\) * \(\beta_1, ..., \beta_5 = \text{slopes (effect of each predictor holding others constant)}\) * \(\epsilon_i = \text{error term (often assumed mean 0, constant variance)}\)

# Randomly split the data into two parts
# 80% for training and 20% for testing
split = sample.split(data.boston$medv, SplitRatio = 0.8)
train_split = data.boston[split == TRUE, ]
test_split = data.boston[split == FALSE, ]
boston_lm = lm(medv ~ nox + rm + dis + pr + ls, data = train_split)

Model Regression Diagnostic

plot(boston_lm)

For the residuals vs fitted plot, it seems like the residuals are randomly scattered around the horizontal line suggesting that linear assumption is possible. There are a couple of outliers that are very far from the horizontal line which could signify some errors made by the model, but there seems to be an equal number of points above and below the line which supports linearity.

Overall, the QQ residuals plot does tend to align with the dashed line meaning that overall it supports the idea that the residuals are normally distributed. Towards the ends (< -2 & > 2 Theoretical Quantiles) is when it deviates from the line which may suggest that there may be a few extreme values. This means that outside of the second standard deviations of the normal curve, is when data starts to be “less normal” and could limit our regression model in terms of accuracy.

The scale-location plot shows have points relatively scattered and the line is close to horizontal but not exactly which could suggest that it would probably be mostly homoscedasticity. In terms of the residuals vs leverage plot, there are some points that are in the far right, so removing some of these points could critically affect the linear regression model. One thing to be a bit worried about is that the points do seem to cluster at certain points as it suggests that the variance of the residuals is not constant across all levels of the fitted values. This can potentially violate the assumption of homoscedasticity which we want for the model.

The residuals vs leverage plot does have some outliers in the far right which could mean that these data points (if removed) could significantly impact our regression model. However, none of the points are beyond the Cook’s distance so this is a beneficial thing to take about the model as no points are incredibly influential.

Covariates Analysis

summary(boston_lm)
## 
## Call:
## lm(formula = medv ~ nox + rm + dis + pr + ls, data = train_split)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.4827  -2.9234  -0.8236   1.8485  27.8618 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.74529    5.10764   7.390 7.77e-13 ***
## nox         -18.87990    3.50365  -5.389 1.17e-07 ***
## rm            4.31837    0.44408   9.724  < 2e-16 ***
## dis          -1.28107    0.19348  -6.621 1.07e-10 ***
## pr           -1.06806    0.12113  -8.818  < 2e-16 ***
## ls           -0.57497    0.05084 -11.310  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.985 on 428 degrees of freedom
## Multiple R-squared:  0.7243, Adjusted R-squared:  0.7211 
## F-statistic: 224.9 on 5 and 428 DF,  p-value: < 2.2e-16

It can be seen that average number of rooms (rm) and percentage of low status population (ls) are the covariates which have the greatest predictive power on the median home price in Boston because of its high absolute test statistic and its very low p-value which signifies statistical significance.

This finding does back our hypothesis on how rm and ls are the covariates which have the greatest correlation for medv using the correlation plot from earlier.

This makes sense as more rooms typically indicate larger and more expensive houses, which is valuable information for homebuyers. The information about low status population can be important for economists analyzing socioeconomic factors in housing markets.

In addition, the overall residuals do look pretty good where the median is close to 0 meaning that overall, the differences between the predicted values and the actual values are not extreme. It is a quite skewed, however, where the quantiles and ranges are not symmetrical which is something to note about the validity and accuracy of our linear model.

prediction = predict(boston_lm, newdata = test_split)
data.frame(Actual = test_split$medv, Predicted = prediction)
##     Actual Predicted
## 7     22.9 23.309550
## 11    15.0 19.267334
## 25    15.6 15.732544
## 31    12.7 11.412493
## 36    18.9 23.567936
## 39    24.7 22.827825
## 59    23.3 20.732762
## 60    19.6 19.565360
## 61    18.7 16.126652
## 67    19.4 23.668308
## 69    17.4 18.143362
## 72    21.7 22.728907
## 77    20.0 24.563627
## 82    23.9 26.922711
## 87    22.5 22.418506
## 92    22.0 28.484824
## 104   19.3 20.901885
## 107   19.5 17.245996
## 108   20.4 21.243750
## 123   20.5 19.135331
## 126   21.4 21.157611
## 133   23.0 21.462860
## 134   18.4 16.657076
## 138   17.1 20.433666
## 159   24.3 30.869474
## 163   50.0 40.596221
## 164   50.0 42.110299
## 166   25.0 28.403253
## 190   34.9 35.338457
## 216   25.0 24.844034
## 218   28.7 28.573712
## 239   23.7 28.335808
## 246   18.5 12.607924
## 249   24.5 21.488242
## 251   24.4 24.353423
## 252   24.8 25.469999
## 287   20.1 18.863092
## 298   20.3 20.205362
## 309   22.8 29.614451
## 311   16.1 18.802495
## 315   23.8 26.230604
## 319   23.1 24.900158
## 320   21.0 21.774506
## 328   22.2 19.480229
## 330   22.6 27.458073
## 331   19.8 24.791481
## 346   17.5 18.971176
## 350   26.6 23.920397
## 352   24.1 21.973542
## 354   30.1 22.779303
## 355   18.2 12.716039
## 362   19.9 17.527643
## 365   21.9 35.048422
## 370   50.0 29.234323
## 374   13.8  3.248799
## 378   13.3 18.889107
## 380   10.2 16.076835
## 392   23.2 15.521549
## 396   13.1 18.976734
## 402    7.2 16.778294
## 409   17.2 12.112779
## 410   27.5 21.238424
## 423   20.8 18.361748
## 431   14.5 19.785590
## 432   14.1 20.655838
## 447   14.9 16.699127
## 451   13.4 18.849702
## 454   17.8 21.867826
## 468   19.1 15.555219
## 485   20.6 18.075918
## 486   21.2 21.224290
## 489   15.2 15.615239
ggplot(test_split, aes(x = medv, y = prediction)) +
  geom_point(alpha = 0.5) +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  labs(title = "Actual vs Predicted Prices", x = "Actual Prices", y = "Predicted Prices")

# Root mean squared error:
paste("The root mean squared error:", sqrt(mean((test_split$medv - prediction)^2)))
## [1] "The root mean squared error: 5.06851220452036"

The \(y=x\) line is the benchmark where it represents perfect prediction. This is where the predicted price is exactly the same as the actual price. While testing, the closer our points are to the red line means the better our linear regression model does in predicting the most accurate medv prices.

The smaller the root mean squared error is, the closer on average the expected values are to the actual values. In this case, the root mean squared error is about 5. As this Boston linear model has a reasonable root mean squared error close to 0, it suggests that this model performs reasonably well in predicting the median home values. Because this model is good for predicting home values as seen with the relatively low root mean squared error, it can potentially be used by market analysts, real estate agents, and home buyers. This model can really help home buyers see what covariates they should consider the most and how it effects the overall value of the house(s) they might buy.

PART 2: KING COUNTY DATA

We will now look at market housing trends in King County, Washington. This is because we can see similarities and differences between what influences housing prices in King County versus Boston. Recall that Boston is in the east coast while King County is in the west coast, so we can expect that there will be some differences that may be present.

Furthermore, this King County data that we are importing has a lot more variables which will make trying to predict home prices even more interesting.

Retrieve the King County Data CSV.

king_county_housing <- read.csv("kc_house_data.csv")

head(king_county_housing)
##           id            date   price bedrooms bathrooms sqft_living sqft_lot
## 1 7129300520 20141013T000000  221900        3      1.00        1180     5650
## 2 6414100192 20141209T000000  538000        3      2.25        2570     7242
## 3 5631500400 20150225T000000  180000        2      1.00         770    10000
## 4 2487200875 20141209T000000  604000        4      3.00        1960     5000
## 5 1954400510 20150218T000000  510000        3      2.00        1680     8080
## 6 7237550310 20140512T000000 1225000        4      4.50        5420   101930
##   floors waterfront view condition grade sqft_above sqft_basement yr_built
## 1      1          0    0         3     7       1180             0     1955
## 2      2          0    0         3     7       2170           400     1951
## 3      1          0    0         3     6        770             0     1933
## 4      1          0    0         5     7       1050           910     1965
## 5      1          0    0         3     8       1680             0     1987
## 6      1          0    0         3    11       3890          1530     2001
##   yr_renovated zipcode     lat     long sqft_living15 sqft_lot15
## 1            0   98178 47.5112 -122.257          1340       5650
## 2         1991   98125 47.7210 -122.319          1690       7639
## 3            0   98028 47.7379 -122.233          2720       8062
## 4            0   98136 47.5208 -122.393          1360       5000
## 5            0   98074 47.6168 -122.045          1800       7503
## 6            0   98053 47.6561 -122.005          4760     101930
set.seed(123)
king_county_housing <- slice_sample(king_county_housing, n = 3000)

head(king_county_housing)
##           id            date  price bedrooms bathrooms sqft_living sqft_lot
## 1 4254000220 20150307T000000 475000        4      2.50        2040    16200
## 2 7302000610 20150508T000000 316000        4      1.50        2120    46173
## 3 7855600820 20140904T000000 802000        4      2.25        2130     8734
## 4 4139460200 20150325T000000 905000        4      2.50        3330     9557
## 5 8901500178 20141030T000000 700000        4      2.25        2440     9450
## 6 2730000270 20150212T000000 178500        3      1.00         900    10511
##   floors waterfront view condition grade sqft_above sqft_basement yr_built
## 1    2.0          0    0         3     8       2040             0     1997
## 2    2.0          0    0         3     7       2120             0     1974
## 3    2.0          0    2         4     8       2130             0     1961
## 4    2.0          0    0         3    10       3330             0     1995
## 5    1.5          0    0         3     7       2440             0     1947
## 6    1.0          0    0         4     6        900             0     1961
##   yr_renovated zipcode     lat     long sqft_living15 sqft_lot15
## 1            0   98019 47.7366 -121.958          2530      15389
## 2            0   98053 47.6503 -121.968          2000      46173
## 3            0   98006 47.5672 -122.161          2550       8800
## 4            0   98006 47.5526 -122.102          3360       9755
## 5         2014   98125 47.7061 -122.307          1720       7503
## 6            0   98001 47.2883 -122.272          1460      10643
# Time to plot the histogram:
ggplot(king_county_housing, aes(x = price)) +
  geom_histogram(binwidth = 50000, fill = "lightblue", color = "gray") +
  labs(title = "Histogram of House Prices", x = "Price", y = "Frequency")

summary(king_county_housing)
##        id                date               price            bedrooms    
##  Min.   :1.200e+06   Length:3000        Min.   :  82000   Min.   :0.000  
##  1st Qu.:2.131e+09   Class :character   1st Qu.: 324666   1st Qu.:3.000  
##  Median :3.905e+09   Mode  :character   Median : 460000   Median :3.000  
##  Mean   :4.614e+09                      Mean   : 542444   Mean   :3.344  
##  3rd Qu.:7.305e+09                      3rd Qu.: 649960   3rd Qu.:4.000  
##  Max.   :9.842e+09                      Max.   :5300000   Max.   :8.000  
##    bathrooms      sqft_living      sqft_lot          floors        waterfront  
##  Min.   :0.000   Min.   : 390   Min.   :   690   Min.   :1.000   Min.   :0.00  
##  1st Qu.:1.500   1st Qu.:1420   1st Qu.:  5000   1st Qu.:1.000   1st Qu.:0.00  
##  Median :2.250   Median :1880   Median :  7530   Median :1.000   Median :0.00  
##  Mean   :2.097   Mean   :2058   Mean   : 14654   Mean   :1.487   Mean   :0.01  
##  3rd Qu.:2.500   3rd Qu.:2552   3rd Qu.: 10464   3rd Qu.:2.000   3rd Qu.:0.00  
##  Max.   :6.000   Max.   :8010   Max.   :920423   Max.   :3.500   Max.   :1.00  
##       view          condition         grade          sqft_above  
##  Min.   :0.0000   Min.   :1.000   Min.   : 4.000   Min.   : 390  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.: 7.000   1st Qu.:1180  
##  Median :0.0000   Median :3.000   Median : 7.000   Median :1550  
##  Mean   :0.2357   Mean   :3.417   Mean   : 7.658   Mean   :1769  
##  3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.: 8.000   3rd Qu.:2180  
##  Max.   :4.0000   Max.   :5.000   Max.   :13.000   Max.   :6220  
##  sqft_basement       yr_built     yr_renovated       zipcode     
##  Min.   :   0.0   Min.   :1900   Min.   :   0.0   Min.   :98001  
##  1st Qu.:   0.0   1st Qu.:1951   1st Qu.:   0.0   1st Qu.:98033  
##  Median :   0.0   Median :1975   Median :   0.0   Median :98065  
##  Mean   : 289.5   Mean   :1971   Mean   :  87.8   Mean   :98078  
##  3rd Qu.: 550.0   3rd Qu.:1996   3rd Qu.:   0.0   3rd Qu.:98117  
##  Max.   :3500.0   Max.   :2015   Max.   :2015.0   Max.   :98199  
##       lat             long        sqft_living15    sqft_lot15    
##  Min.   :47.18   Min.   :-122.5   Min.   : 399   Min.   :   659  
##  1st Qu.:47.47   1st Qu.:-122.3   1st Qu.:1480   1st Qu.:  5040  
##  Median :47.58   Median :-122.2   Median :1840   Median :  7560  
##  Mean   :47.56   Mean   :-122.2   Mean   :1983   Mean   : 12589  
##  3rd Qu.:47.68   3rd Qu.:-122.1   3rd Qu.:2370   3rd Qu.: 10008  
##  Max.   :47.78   Max.   :-121.3   Max.   :5790   Max.   :434728

Correlations

# Calculate the correlation matrix
cor_matrix <- cor(king_county_housing %>% dplyr::select(price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, condition, grade, sqft_above, sqft_basement, yr_built, yr_renovated))

# Plot the correlation matrix
corrplot(cor_matrix, type = "upper", tl.cex = 0.8, tl.srt = 45)

From the correlation plot, it seems like the variables with the lowest interactions are waterfront, condition, year built especially with the price. This may be because there are new homes that are not that high demand and homes being based on condition may have some subjectivity as different buyers might perceive the home’s looks and conditions differently.

Variables with the highest interactions are living square footage (sqft_living) and sqft_above, grade, and bathrooms especially with price. This is because a larger house with greater space is more needed for the typical buyer as it grants more things to do and furniture which is the ultimate goal especially for a larger family. Furthermore, grade is the ultimate measurable factor on if a house is “good” (like a rating for any subject) which many buyers heavily consider when trying to purchase a home.

From interpreting this correlation plot, we hypothesize that the most important covariate predictors of King County home prices during 2014 to 2015 include sqft_living, sqft_above, sqft_basement, grade, and bathrooms. We can also predict that sqft_living and grade will have the greatest positive correlation with the price.

It can also be seen that every covariate has a positive correlation with the price. This means that the greater value of each covariate will theoretically result in higher prices.

Justification

Similar to the Boston analysis, there are several covariates are correlated with one another. As we identified, bathrooms, sqft_living, sqft_above, and grade have relatively strong correlation with other covaraites. Therefore, a multivariate linear regression model is a good idea for a prediction model because it allows us to evaluate the effect of each predictor on median home value while controlling for the others. This approach reduces omitted variable bias and provides a clearer understanding of how multiple socioeconomic and environmental factors jointly influence housing prices.

Multivariate Linear Regression Model on King County Homes

set.seed(123)
split <- sample.split(king_county_housing$price, SplitRatio = 0.7)
training_data <- subset(king_county_housing, split == TRUE)
testing_data <- subset(king_county_housing, split == FALSE)

kc_model <- lm(price ~ sqft_living + grade + bathrooms + sqft_above
               + sqft_basement, data = training_data)

Model Regression Diagnostic

plot(kc_model)

The diagnostic plots for this linear regression model on King County housing data is worse than the Boston one in terms of normality of variance and to have a linear model that properly and accurate predicts prices based on covariates.

The residuals vs fitted plot does show a trend where there is a huge cluster to the left of the plot and then it suddenly starts to scatter towards the right. The point are not really random around the red line of the graph, so this suggests that the residuals are not evenly distributed which could indicate a potential non-linearity relationship between the covariates and the housing prices.

The Q-Q plot does have greater deviation from the dashed line after the second standard deviations suggesting that the residuals are skewed rather than normal which raises a red flag about our linear regression model and how the residuals may have different values of certain data.

The scale-location plot has extreme clustering towards the left of the graph rather than having points equally spaced out around the whole graph. This violates the assumption of homoscedasticity. This suggests that the variance of the residuals changes with the level of the predictors. We want our residuals to be evenly distributed throughout all of our predictive data, so that our model can properly predict the prices for all potential covariate values. However, the clustering here can potentially limit the accuracy of our model’s predictions.

The residuals vs leverage plot looks much worse than the plot of the regression model for the Boston dataset. Not only does the red line not stay around 0, but there are points in the graph that lie to the far right and even outside of Cook’s distance. Therefore, if we were to remove some points that are the outliers, this will heavily affect how our linear regression model.

Covariates Analysis

summary(kc_model)
## 
## Call:
## lm(formula = price ~ sqft_living + grade + bathrooms + sqft_above + 
##     sqft_basement, data = training_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -816582 -146507  -20762  103551 3184689 
## 
## Coefficients: (1 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -712467.63   44213.82 -16.114  < 2e-16 ***
## sqft_living       272.11      14.06  19.358  < 2e-16 ***
## grade          120465.47    7768.32  15.507  < 2e-16 ***
## bathrooms      -33735.69   11245.68  -3.000  0.00273 ** 
## sqft_above        -85.25      14.16  -6.021 2.02e-09 ***
## sqft_basement         NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 260000 on 2259 degrees of freedom
## Multiple R-squared:  0.5657, Adjusted R-squared:  0.5649 
## F-statistic: 735.5 on 4 and 2259 DF,  p-value: < 2.2e-16

Based on the summary statistics: All of the five covariates chosen for the linear model may have significant predictive power on the price of homes. In fact, living square footage (sqft_living) and grade (grade) have the greatest predictive power due to their high test statistic and also their low p-values which stands for greater statistical significance. This does back our hypothesis that sqft_living and grade have some of the greatest correlation to the price of all the covariates. This may be due to the fact that the price of a home depends on how large it is as it is what the buyer is ultimately getting from their purchase. Therefore, with a constant distance and other variables, a very large house with large living room space for a large family should cost more than a smaller house with a small living room for an individual.

Furthermore, many buyers and real estate professionals look for the grade of the house and whether it is “good” or “bad”. Therefore the grade plays a significant role on the price of the home. In conclusion, homebuyers would greatly benefit from this model focusing on the predictive power of sqft_living and grade because they are critical factors influencing the price of a home. By understanding how these variables impact pricing, homebuyers can make more informed decisions about which properties offer the best value for their needs and budget.

prediction <- predict(kc_model, newdata = testing_data)
ggplot(testing_data, aes(x = price, y = prediction)) +
  geom_point(alpha = 0.5) +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  labs(title = "Actual vs Predicted Prices", x = "Actual Prices", y = "Predicted Prices")

paste("Average housing prices of King County dataset:", mean(king_county_housing$price))
## [1] "Average housing prices of King County dataset: 542444.273"
# RMSE:
paste("The root mean squared error:", sqrt(mean((testing_data$price - prediction)^2)))
## [1] "The root mean squared error: 198072.992144565"

This is quite a high root mean squared error of about 198,073 dollars. Because our mean house prices based on our randomly subsetted data set of King County homes is around 542,000 dollars, the root mean squared error of 198,073 dollars is very significant to how accurate our model is in predictive home prices based on the covariates. This may be due to the fact that our data is quite skewed particularly with the residuals as seen from the diagnostics which will impact our RMSE.

One way we can work towards being better at predicting the prices and seeing the correlations of price and covariates is to potentially choose a different model. Maybe a linear regression model is not fit for this specific scenario as correlations may not be linear. We can also try to find data with lower covariate correlation to improve the accuracy of our model. By either improving our linear regression model or choosing a better model for this data set, we can try to make the overall residuals lower and more normal without huge trends and that will benefit our model and reduce the RMSE.