The study analyzes residential house sales in King County, Washington State using Multiple Linear Regression (MLR). The dataset spans sales from May 2014 to May 2015 and contains 21,613 observations across 21 variables. Our primary goals are:
The analysis proceeds in seven stages: data exploration, relationship analysis, initial model building, model improvement, performance evaluation, price prediction, and interpretation of nonlinear and interaction effects.
| Variable | Description |
|---|---|
| id | Unique house identifier |
| date | Date of sale |
| price | Sale price (response variable) |
| bedrooms | Number of bedrooms |
| bathrooms | Number of bathrooms |
| sqft_living | Interior living space (sq ft) |
| sqft_lot | Lot size (sq ft) |
| floors | Number of floors |
| waterfront | Waterfront property (0/1) |
| view | View quality rating (0–4) |
| condition | Overall condition (1–5) |
| grade | King County construction grade (1–13) |
| sqft_above | Above-ground sq ft |
| sqft_basement | Basement sq ft |
| yr_built | Year built |
| yr_renovated | Year of renovation |
| zipcode | ZIP code |
| lat / long | Geographic coordinates |
| sqft_living15 | Living area as of 2015 |
| sqft_lot15 | Lot size as of 2015 |
# Load the dataset
houses <- read.csv("kc_house_data.csv")
# Inspect the data structure
head(houses)
id date price bedrooms bathrooms sqft_living sqft_lot
1 7129300520 20141013T000000 221900 3 1.00 1180 5650
2 6414100192 20141209T000000 538000 3 2.25 2570 7242
3 5631500400 20150225T000000 180000 2 1.00 770 10000
4 2487200875 20141209T000000 604000 4 3.00 1960 5000
5 1954400510 20150218T000000 510000 3 2.00 1680 8080
6 7237550310 20140512T000000 1225000 4 4.50 5420 101930
floors waterfront view condition grade sqft_above sqft_basement yr_built
1 1 0 0 3 7 1180 0 1955
2 2 0 0 3 7 2170 400 1951
3 1 0 0 3 6 770 0 1933
4 1 0 0 5 7 1050 910 1965
5 1 0 0 3 8 1680 0 1987
6 1 0 0 3 11 3890 1530 2001
yr_renovated zipcode lat long sqft_living15 sqft_lot15
1 0 98178 47.5112 -122.257 1340 5650
2 1991 98125 47.7210 -122.319 1690 7639
3 0 98028 47.7379 -122.233 2720 8062
4 0 98136 47.5208 -122.393 1360 5000
5 0 98074 47.6168 -122.045 1800 7503
6 0 98053 47.6561 -122.005 4760 101930
# Summary statistics for sale price
summary(houses$price)
Min. 1st Qu. Median Mean 3rd Qu. Max.
75000 321950 450000 540088 645000 7700000
# Visualize the distribution of prices
hist(
houses$price,
breaks = 100,
main = "Distribution of House Sale Prices",
xlab = "Price (USD)",
col = "#3498db",
border = "white"
)
abline(v = mean(houses$price), col = "red", lwd = 2, lty = 2)
legend("topright", legend = "Mean Price", col = "red", lwd = 2, lty = 2)
# Summary of key predictors
summary(houses[, c("bedrooms", "bathrooms", "sqft_living",
"condition", "grade", "waterfront", "view")])
bedrooms bathrooms sqft_living condition
Min. : 0.000 Min. :0.000 Min. : 290 Min. :1.000
1st Qu.: 3.000 1st Qu.:1.750 1st Qu.: 1427 1st Qu.:3.000
Median : 3.000 Median :2.250 Median : 1910 Median :3.000
Mean : 3.371 Mean :2.115 Mean : 2080 Mean :3.409
3rd Qu.: 4.000 3rd Qu.:2.500 3rd Qu.: 2550 3rd Qu.:4.000
Max. :33.000 Max. :8.000 Max. :13540 Max. :5.000
grade waterfront view
Min. : 1.000 Min. :0.000000 Min. :0.0000
1st Qu.: 7.000 1st Qu.:0.000000 1st Qu.:0.0000
Median : 7.000 Median :0.000000 Median :0.0000
Mean : 7.657 Mean :0.007542 Mean :0.2343
3rd Qu.: 8.000 3rd Qu.:0.000000 3rd Qu.:0.0000
Max. :13.000 Max. :1.000000 Max. :4.0000
The dataset is complete with 21,613 observations and 21 variables. Key takeaways from the summary statistics:
Price ranges from $75,000 to $7.7 million, with a mean of $540,088 and a median of $450,000. The right skew of the histogram indicates a small number of very expensive luxury homes pulling the mean upward.
Bedrooms range from 0 to 33; the maximum of 33 is likely a data entry anomaly worth noting.
Grade has a median of 7 and a mean of 7.66, suggesting most homes fall in the middle of the King County rating scale.
Waterfront is rare: only about 0.75% of homes have waterfront access, but this variable is likely to carry a strong price premium.
View is also skewed toward 0 (no view), with a mean of 0.23.
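The completeness claim and the bedroom anomaly can both be checked with a few lines of base R. The sketch below runs on a small made-up data frame (`toy` and all of its values are invented for illustration, and the cutoff of 10 bedrooms is an arbitrary assumption); the same calls would apply to `houses`.

```r
# Toy stand-in for the houses data; every value here is invented
toy <- data.frame(
  price       = c(250000, 520000, 190000, 500000),
  bedrooms    = c(3, 4, 2, 33),
  sqft_living = c(1200, 2500, 800, 1500)
)

# Count missing values per column; all zeros means no gaps
na_counts <- colSums(is.na(toy))
print(na_counts)

# Flag rows whose bedroom count looks implausible (cutoff is an assumption)
suspect <- toy[toy$bedrooms > 10, ]
print(suspect)
```

Whether a flagged row should be corrected, dropped, or kept is a judgment call that depends on what the other columns say about the home.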
# Correlation matrix
numeric_vars <- c("price", "sqft_living", "sqft_lot", "bedrooms",
"bathrooms", "sqft_above", "sqft_basement")
cor_matrix <- cor(houses[, numeric_vars], use = "complete.obs")
print(round(cor_matrix, 3))
price sqft_living sqft_lot bedrooms bathrooms sqft_above
price 1.000 0.702 0.090 0.308 0.525 0.606
sqft_living 0.702 1.000 0.173 0.577 0.755 0.877
sqft_lot 0.090 0.173 1.000 0.032 0.088 0.184
bedrooms 0.308 0.577 0.032 1.000 0.516 0.478
bathrooms 0.525 0.755 0.088 0.516 1.000 0.685
sqft_above 0.606 0.877 0.184 0.478 0.685 1.000
sqft_basement 0.324 0.435 0.015 0.303 0.284 -0.052
sqft_basement
price 0.324
sqft_living 0.435
sqft_lot 0.015
bedrooms 0.303
bathrooms 0.284
sqft_above -0.052
sqft_basement 1.000
# Scatterplot: price vs. sqft_living
plot(
houses$sqft_living, houses$price,
main = "Price vs. Living Area",
xlab = "Living Area (sq ft)",
ylab = "Price (USD)",
col = adjustcolor("#3498db", alpha.f = 0.2),
pch = 16,
cex = 0.5
)
abline(lm(price ~ sqft_living, data = houses), col = "red", lwd = 2)
# Scatterplot: price vs. bedrooms
plot(
houses$bedrooms, houses$price,
main = "Price vs. Number of Bedrooms",
xlab = "Bedrooms",
ylab = "Price (USD)",
col = adjustcolor("#2ecc71", alpha.f = 0.3),
pch = 16,
cex = 0.6
)
# Boxplot: price by waterfront
boxplot(
price ~ waterfront, data = houses,
main = "Price by Waterfront Status",
xlab = "Waterfront (0 = No, 1 = Yes)",
ylab = "Price (USD)",
col = c("#ecf0f1", "#3498db"),
border = "#2c3e50"
)
# Boxplot: price by grade
boxplot(
price ~ grade, data = houses,
main = "Price by Construction Grade",
xlab = "Grade",
ylab = "Price (USD)",
col = "#f39c12",
border = "#2c3e50"
)
sqft_living has the strongest correlation with price (r = 0.70), making living area the numeric predictor most strongly associated with price.
bathrooms (r = 0.53) and sqft_above (r = 0.61) also show moderate-to-strong positive correlations.
sqft_lot (r = 0.09) has a surprisingly weak correlation, suggesting lot size alone does not drive value in King County.
bedrooms (r = 0.31) has a weaker relationship than expected. This is partly because more bedrooms without more square footage can signal smaller rooms and lower quality.
From the scatterplots, the price vs. living area plot shows a clear positive trend, but with increasing variance at higher square footages. This fan shape points to heteroscedasticity and motivates testing a nonlinear term, which we model explicitly later.
The boxplots confirm that waterfront properties command a dramatically higher median price, and that grade has a strong, monotonically increasing relationship with price — higher-grade homes are substantially more valuable.
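The point about bedrooms can be illustrated with simulated data: a variable can correlate positively with price overall and still carry a negative effect once living area is held fixed. The sketch below is an illustration only; the data-generating process, the variable names, and every number in it are assumptions, not estimates from the King County data.

```r
set.seed(42)
n <- 2000

# Simulated living area and a bedroom count that tends to rise with it
sqft <- runif(n, min = 800, max = 4000)
bedrooms <- round(sqft / 700 + rnorm(n, sd = 0.7))

# Assumed process: size adds value, an extra bedroom at fixed size subtracts it
price <- 200 * sqft - 10000 * bedrooms + rnorm(n, sd = 50000)

# Raw correlation ignores size; the regression coefficient holds size fixed
marginal_r <- cor(price, bedrooms)
partial_coef <- coef(lm(price ~ sqft + bedrooms))["bedrooms"]

cat("Correlation of price with bedrooms:", round(marginal_r, 3), "\n")
cat("Bedroom coefficient controlling for sqft:", round(partial_coef), "\n")
```

The raw correlation is positive because bedrooms act as a proxy for size, while the coefficient that controls for size recovers the negative effect built into the simulation.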
We convert the categorical variables to factors and fit an initial MLR model using the most theoretically justified predictors: living area, bedrooms, bathrooms, grade, and waterfront status.
# Convert categorical variables to factors
houses$waterfront <- as.factor(houses$waterfront)
houses$view <- as.factor(houses$view)
houses$grade <- as.factor(houses$grade)
# Fit the initial regression model
house_model <- lm(
price ~ sqft_living + bedrooms + bathrooms + grade + waterfront,
data = houses
)
# View coefficients
coef(house_model)
(Intercept) sqft_living bedrooms bathrooms grade3 grade4
93480.0383 167.3102 -15512.6019 -6546.5970 29507.4525 39322.7284
grade5 grade6 grade7 grade8 grade9 grade10
22191.8670 54371.7825 86971.7646 148142.7057 268809.1699 450673.6551
grade11 grade12 grade13 waterfront1
716990.9940 1180137.7116 2472679.0735 766704.2185
# Summarize the initial model
summary(house_model)
Call:
lm(formula = price ~ sqft_living + bedrooms + bathrooms + grade +
waterfront, data = houses)
Residuals:
Min 1Q Median 3Q Max
-1520979 -125123 -25017 92399 3912553
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.348e+04 2.277e+05 0.411 0.68140
sqft_living 1.673e+02 3.475e+00 48.143 < 2e-16 ***
bedrooms -1.551e+04 2.154e+03 -7.201 6.17e-13 ***
bathrooms -6.547e+03 3.258e+03 -2.009 0.04453 *
grade3 2.951e+04 2.629e+05 0.112 0.91064
grade4 3.932e+04 2.316e+05 0.170 0.86518
grade5 2.219e+04 2.282e+05 0.097 0.92253
grade6 5.437e+04 2.278e+05 0.239 0.81135
grade7 8.697e+04 2.278e+05 0.382 0.70261
grade8 1.481e+05 2.278e+05 0.650 0.51555
grade9 2.688e+05 2.279e+05 1.180 0.23819
grade10 4.507e+05 2.280e+05 1.977 0.04809 *
grade11 7.170e+05 2.283e+05 3.141 0.00169 **
grade12 1.180e+06 2.294e+05 5.144 2.71e-07 ***
grade13 2.473e+06 2.371e+05 10.428 < 2e-16 ***
waterfront1 7.667e+05 1.811e+04 42.339 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 227700 on 21597 degrees of freedom
Multiple R-squared: 0.6156, Adjusted R-squared: 0.6154
F-statistic: 2306 on 15 and 21597 DF, p-value: < 2.2e-16
The initial model achieves an Adjusted R² of 0.6154, meaning it explains approximately 61.5% of the variation in house prices using five predictors.
Notable findings:
sqft_living is highly significant
(p < 2e-16). Each additional square foot of living space adds
approximately $167 to the predicted price, all else
equal.
bedrooms has a negative
coefficient (−$15,513 per bedroom). This counterintuitive result occurs
because, after controlling for square footage, adding a bedroom means
smaller rooms, which buyers may perceive as less desirable.
bathrooms is marginally significant (p
= 0.045) with a small negative coefficient — a similar explanation
applies when bathrooms are added without increasing total living
space.
waterfront (= 1) adds approximately
$766,704 to a home’s price — the single largest
categorical effect in the model.
Grade shows a clear progressive pattern: grades 10, 11, 12, and 13 are all statistically significant, with grade 13 adding over $2.47 million compared to the baseline grade (grade 1).
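Lower grade levels (3–9) are not statistically distinguishable from the baseline, but this should not be read as evidence that grade only matters at the top of the scale. Every grade coefficient carries a standard error of roughly $228,000, about the size of the residual standard error, which points to a baseline level containing very few homes and therefore an imprecisely estimated reference price. The point estimates themselves rise steadily from grade 5 upward.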
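When the reference level of a factor is sparse, every contrast against it inherits a large standard error. One common remedy is to re-level the factor so that a well-populated category becomes the baseline. The sketch below uses simulated data (the grade counts, prices, and the choice of grade 7 as reference are all assumptions for illustration) and base R's `relevel()`.

```r
set.seed(1)

# Simulated grades: the default baseline (grade 1) has a single home
toy <- data.frame(grade = factor(c(1, rep(7, 40), rep(8, 30), rep(10, 10))))
grade_num <- as.numeric(as.character(toy$grade))
toy$price <- 150000 + 50000 * grade_num + rnorm(nrow(toy), sd = 50000)

# Default fit: contrasts are measured against grade 1
fit_default <- lm(price ~ grade, data = toy)

# Re-leveled fit: contrasts are measured against grade 7
toy2 <- toy
toy2$grade <- relevel(toy2$grade, ref = "7")
fit_relevel <- lm(price ~ grade, data = toy2)

# Standard error of the grade 8 contrast under each baseline
se_default <- summary(fit_default)$coefficients["grade8", "Std. Error"]
se_relevel <- summary(fit_relevel)$coefficients["grade8", "Std. Error"]
cat("SE of grade8 vs grade 1 baseline:", round(se_default), "\n")
cat("SE of grade8 vs grade 7 baseline:", round(se_relevel), "\n")
```

Re-leveling changes only which comparisons the coefficients report; the fitted values and R² of the model are unchanged.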
Overall, the model is highly significant (F-statistic p < 2.2e-16), but we can improve it by adding more predictors and capturing nonlinear effects.
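We enhance the model by adding sqft_lot, condition, yr_built, and view, along with a squared term for sqft_living (to capture nonlinear returns to size) and an interaction term between bathrooms and sqft_living. The bathrooms main effect is dropped, so bathrooms enter the improved model only through the interaction.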
# Add squared term for sqft_living (nonlinear effect)
houses$sqft_living_sq <- houses$sqft_living^2
# Add interaction term between bathrooms and sqft_living
houses$bath_sqft_interaction <- houses$bathrooms * houses$sqft_living
# Fit the improved model
house_model_improved <- lm(
price ~ sqft_living + sqft_living_sq + bedrooms + sqft_lot +
bath_sqft_interaction + grade + waterfront + view +
condition + yr_built,
data = houses
)
# Summarize the improved model
summary(house_model_improved)
Call:
lm(formula = price ~ sqft_living + sqft_living_sq + bedrooms +
sqft_lot + bath_sqft_interaction + grade + waterfront + view +
condition + yr_built, data = houses)
Residuals:
Min 1Q Median 3Q Max
-3763103 -107128 -11358 84639 3241909
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.001e+06 2.368e+05 25.339 < 2e-16 ***
sqft_living 3.846e-01 6.024e+00 0.064 0.949097
sqft_living_sq 1.258e-02 1.116e-03 11.269 < 2e-16 ***
bedrooms -1.598e+04 1.986e+03 -8.048 8.87e-16 ***
sqft_lot -3.091e-01 3.438e-02 -8.991 < 2e-16 ***
bath_sqft_interaction 2.339e+01 1.201e+00 19.466 < 2e-16 ***
grade3 -3.402e+04 2.361e+05 -0.144 0.885422
grade4 -5.535e+04 2.080e+05 -0.266 0.790164
grade5 -5.214e+04 2.050e+05 -0.254 0.799211
grade6 1.445e+04 2.046e+05 0.071 0.943701
grade7 1.308e+05 2.046e+05 0.639 0.522725
grade8 2.484e+05 2.046e+05 1.214 0.224896
grade9 3.978e+05 2.047e+05 1.944 0.051955 .
grade10 5.609e+05 2.048e+05 2.739 0.006161 **
grade11 7.614e+05 2.050e+05 3.714 0.000204 ***
grade12 1.074e+06 2.060e+05 5.213 1.88e-07 ***
grade13 1.833e+06 2.138e+05 8.573 < 2e-16 ***
waterfront1 5.005e+05 1.999e+04 25.041 < 2e-16 ***
view1 1.219e+05 1.140e+04 10.693 < 2e-16 ***
view2 5.759e+04 6.902e+03 8.344 < 2e-16 ***
view3 1.064e+05 9.430e+03 11.287 < 2e-16 ***
view4 2.441e+05 1.459e+04 16.731 < 2e-16 ***
condition 2.401e+04 2.313e+03 10.380 < 2e-16 ***
yr_built -2.994e+03 6.055e+01 -49.453 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 204400 on 21589 degrees of freedom
Multiple R-squared: 0.6903, Adjusted R-squared: 0.6899
F-statistic: 2092 on 23 and 21589 DF, p-value: < 2.2e-16
The improved model raises the Adjusted R² from 0.6154 to 0.6899, an increase of approximately 7.5 percentage points, meaning the model now explains roughly 69% of price variation.
Key improvements:
sqft_living_sq is highly
significant (p < 2e-16), confirming a nonlinear (quadratic)
relationship between living area and price. Its positive coefficient
means price increases accelerate at larger home sizes.
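bath_sqft_interaction is highly significant (p < 2e-16), with a coefficient of about $23.39 per bathroom per square foot. The price premium from additional bathrooms therefore grows with living area: a large home gains more value from extra bathrooms than a small one.
yr_built is strongly negative (−$2,994 per year, p < 2e-16): each later year of construction lowers the predicted price, so older homes are priced higher, all else equal. Because the model contains no location variables, this may partly reflect older homes sitting in established, centrally located neighborhoods.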
condition is significant and
positive ($24,010 per condition point), confirming that well-maintained
homes command higher prices.
view levels 1–4 are all highly
significant, with view level 4 adding approximately
$244,100 compared to no view.
sqft_lot is unexpectedly negative
(−$0.31 per sq ft), which may reflect that in urban King County, very
large lots can indicate distance from desirable areas.
The residual standard error dropped from $227,700 to $204,400, indicating better fit and fewer large prediction errors.
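The squared and interaction terms can also be written inside the model formula instead of being stored as extra columns, which means `predict()` only needs the raw inputs. The sketch below compares the two approaches on simulated data; the data frame `toy`, its coefficients, and the reduced set of predictors are assumptions chosen to keep the example short.

```r
set.seed(7)
n <- 300
toy <- data.frame(
  sqft_living = runif(n, 800, 4000),
  bathrooms   = sample(c(1, 1.5, 2, 2.5, 3), n, replace = TRUE)
)
toy$price <- 50000 + 0.01 * toy$sqft_living^2 +
  20 * toy$bathrooms * toy$sqft_living + rnorm(n, sd = 40000)

# Manual approach: precompute the derived columns, as in the report
toy$sqft_living_sq <- toy$sqft_living^2
toy$bath_sqft_interaction <- toy$bathrooms * toy$sqft_living
fit_manual <- lm(price ~ sqft_living + sqft_living_sq + bath_sqft_interaction,
                 data = toy)

# Formula approach: I() protects the arithmetic, ":" builds the product term
fit_formula <- lm(price ~ sqft_living + I(sqft_living^2) + bathrooms:sqft_living,
                  data = toy)

# The manual model needs hand-computed columns in the new data
new_manual <- data.frame(sqft_living = 2000, sqft_living_sq = 2000^2,
                         bath_sqft_interaction = 2 * 2000)
pred_manual <- predict(fit_manual, newdata = new_manual)

# The formula model derives them from the raw inputs
new_raw <- data.frame(sqft_living = 2000, bathrooms = 2)
pred_formula <- predict(fit_formula, newdata = new_raw)

cat("Manual:", round(pred_manual), " Formula:", round(pred_formula), "\n")
```

Both fits use the same design matrix, so the predictions agree; the formula version simply removes the risk of a derived column falling out of sync with its source columns.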
We conduct a systematic review of the improved model’s coefficients, significance levels, and overall fit statistics.
Reviewing the improved model in detail:
Statistically significant predictors (p < 0.05):
| Predictor | Direction | Interpretation |
|---|---|---|
| sqft_living_sq | + | Increasing returns at higher square footages |
| bedrooms | − | More bedrooms (same sqft) lowers price |
| sqft_lot | − | Larger lots slightly lower price in this market |
| bath_sqft_interaction | + | Bathrooms add more value in larger homes |
| grade10–grade13 | + | Top-grade construction commands large premiums |
| waterfront1 | + | Waterfront adds ~$500,500 |
| view1–view4 | + | Any view adds $57,590–$244,100 relative to no view |
| condition | + | Each condition point adds ~$24,010 |
| yr_built | − | Each later construction year lowers price by ~$2,994 |
sqft_living (linear term) is no longer
significant on its own (p = 0.95) because its effect is now fully
captured through the squared term and the interaction — this is expected
behavior in polynomial regression and does not indicate a problem.
The Adjusted R² of 0.6899 falls short of our target of 75–80%. The gap is plausible given that we have not included location variables like zipcode or lat/long, which are known to be powerful predictors in real estate.
We generate in-sample predictions, visualize model accuracy, and use the model to price a hypothetical new house.
# Generate in-sample predictions
houses$predicted_price <- predict(house_model_improved, newdata = houses)
# Visualize actual vs. predicted prices
plot(
houses$price, houses$predicted_price,
main = "Actual vs. Predicted House Prices",
xlab = "Actual Price (USD)",
ylab = "Predicted Price (USD)",
col = adjustcolor("#3498db", alpha.f = 0.15),
pch = 16,
cex = 0.5
)
abline(a = 0, b = 1, col = "red", lwd = 2)
legend("topleft", legend = "Perfect Prediction Line", col = "red", lwd = 2)
# Correlation between actual and predicted
cat("Correlation between actual and predicted prices:",
round(cor(houses$price, houses$predicted_price), 4))
Correlation between actual and predicted prices: 0.8308
# Predict price for a hypothetical new house
new_house <- data.frame(
sqft_living = 2000,
sqft_living_sq = 2000^2,
bedrooms = 3,
bathrooms = 2,
bath_sqft_interaction = 2 * 2000,
sqft_lot = 5000,
grade = factor(8, levels = levels(houses$grade)),
waterfront = factor(0, levels = levels(houses$waterfront)),
view = factor(0, levels = levels(houses$view)),
condition = 3,
yr_built = 1990
)
predicted_price_new <- predict(house_model_improved, newdata = new_house)
cat(sprintf("\nPredicted price for the hypothetical house: $%s",
format(round(predicted_price_new), big.mark = ",")))
Predicted price for the hypothetical house: $457,944
The correlation between actual and predicted prices is 0.8308. For an OLS model with an intercept, the square of this correlation equals the model's R² (0.8308² ≈ 0.690), so it restates the 69% of price variance explained and should not be read as 83% of anything being explained.
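The link between this correlation and R² can be verified numerically. The sketch below fits a small OLS model on simulated data (the variables `x1`, `x2`, `y` and their coefficients are made up) and compares the squared correlation between observed and fitted values with the R² reported by `summary()`.

```r
set.seed(123)
n <- 200

# Simulated predictors and response; coefficients are arbitrary
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)

# Correlation between observed and fitted values, and the model's R-squared
r <- cor(y, fitted(fit))
r_squared <- summary(fit)$r.squared

cat("cor(y, fitted)^2:", round(r^2, 6), "\n")
cat("R-squared:       ", round(r_squared, 6), "\n")

# The same arithmetic applied to the figure reported for the house model
cat("0.8308^2 =", round(0.8308^2, 4), "\n")
```

The identity holds for in-sample fitted values from a least-squares fit that includes an intercept; it does not carry over to predictions on held-out data.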
From the scatter plot:
Points cluster reasonably close to the red perfect-prediction line for homes priced below ~$2 million, indicating good accuracy in the typical price range.
The model underestimates prices for the most expensive homes (upper right of the plot), which is common in OLS regression: outliers at the extremes are hard to capture with a linear functional form.
Hypothetical house prediction: For a 3-bedroom, 2-bathroom, 2,000 sq ft home with a 5,000 sq ft lot, grade 8 construction, no waterfront or view, in average condition (3), built in 1990, the model predicts a price of $457,944. This sits close to the dataset’s median price of $450,000, a plausible result for a mid-tier King County property.
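A single predicted price hides how uncertain that prediction is. `predict()` for `lm` objects can return a prediction interval alongside the point estimate. The sketch below shows the call on a simulated one-predictor model; the data, the coefficients, and the 95% level are assumptions, and the resulting numbers say nothing about the King County model itself.

```r
set.seed(2024)
n <- 500

# Simulated living area and price with a known linear relationship
toy <- data.frame(sqft_living = runif(n, 800, 4000))
toy$price <- 100000 + 180 * toy$sqft_living + rnorm(n, sd = 80000)

fit <- lm(price ~ sqft_living, data = toy)

# Point prediction plus a 95% prediction interval for one new home
new_home <- data.frame(sqft_living = 2000)
pi95 <- predict(fit, newdata = new_home, interval = "prediction", level = 0.95)
print(round(pi95))
```

The result is a matrix with columns `fit`, `lwr`, and `upr`. With a residual standard error above $200,000, an interval of this kind for the improved model would be wide, which is worth stating next to any single-number valuation.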
We interpret the substantive meaning of the squared term
(sqft_living_sq) and the interaction term
(bath_sqft_interaction), and summarize the overall
improvement from the initial to the improved model.
sqft_living_sq
The coefficient on sqft_living_sq is +0.01258. Because sqft_living also enters the model through bath_sqft_interaction (bathrooms × sqft_living), the combined marginal effect of living area on price is:
\[\frac{\partial \text{Price}}{\partial \text{sqft\_living}} = \beta_{\text{sqft\_living}} + 2 \cdot \beta_{\text{sqft\_living\_sq}} \cdot \text{sqft\_living} + \beta_{\text{bath\_sqft\_interaction}} \cdot \text{bathrooms}\]
Since the linear sqft_living coefficient is statistically indistinguishable from zero (p = 0.95) and the sqft_living_sq coefficient is strongly positive, the marginal price effect of an additional square foot increases as the home gets larger. This represents increasing returns to scale, not diminishing returns. A 500 sq ft addition to a 4,000 sq ft mansion adds more dollar value than the same addition to a 1,200 sq ft starter home with the same number of bathrooms.
This makes intuitive sense in the luxury real estate market: buyers of very large homes tend to have higher budgets, and large homes often signal high-end finishes, better locations, and prestige, amplifying the price effect of size.
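To put numbers on the increasing marginal effect, the coefficients from the improved-model summary can be plugged into the derivative of price with respect to living area. Because bath_sqft_interaction is the product of bathrooms and sqft_living, it contributes a term proportional to the bathroom count as well. The sketch below is a hand calculation; the function name `marginal_per_sqft` and the choice of 2 bathrooms are assumptions for illustration.

```r
# Coefficients copied from the improved-model summary
b_linear <- 0.3846    # sqft_living
b_square <- 0.01258   # sqft_living_sq
b_inter  <- 23.39     # bath_sqft_interaction

# Derivative of predicted price with respect to sqft_living
marginal_per_sqft <- function(sqft_living, bathrooms) {
  b_linear + 2 * b_square * sqft_living + b_inter * bathrooms
}

# Same bathroom count, two different home sizes
starter <- marginal_per_sqft(1200, bathrooms = 2)
mansion <- marginal_per_sqft(4000, bathrooms = 2)

cat("Marginal $ per sq ft at 1,200 sq ft:", round(starter, 2), "\n")
cat("Marginal $ per sq ft at 4,000 sq ft:", round(mansion, 2), "\n")
```

Under these assumptions the marginal value of a square foot is roughly $77 at 1,200 sq ft and roughly $148 at 4,000 sq ft, so the larger home gains nearly twice as much per added square foot.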
bath_sqft_interaction
The coefficient on bath_sqft_interaction is +$23.39. This means:
The price premium of an additional bathroom grows by $23.39 for every additional square foot of living space.
For a 1,000 sq ft home: adding a bathroom is worth approximately 1,000 × $23.39 = $23,390.
For a 3,000 sq ft home: adding a bathroom is worth approximately 3,000 × $23.39 = $70,170.
This captures an important real-world dynamic: in larger homes, a bathroom is a more expected and valued feature. A 5-bedroom home with only 1 bathroom is severely undersupplied; a studio with 2 bathrooms is unusual. The interaction term lets the model adapt the value of bathrooms depending on the overall scale of the home.
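The arithmetic behind these bathroom premiums can be reproduced for any home size. The sketch below assumes, as in the improved model, that bathrooms enter only through the interaction term, so the value of one extra bathroom is the coefficient times living area; the helper name `bathroom_premium` and the chosen sizes are hypothetical.

```r
# Coefficient copied from the improved-model summary
b_inter <- 23.39  # bath_sqft_interaction

# Predicted value of one additional bathroom at a given living area
bathroom_premium <- function(sqft_living) {
  b_inter * sqft_living
}

sizes <- c(1000, 2000, 3000)
premiums <- bathroom_premium(sizes)
print(data.frame(sqft_living = sizes, premium_usd = premiums))
```

This gives $23,390, $46,780, and $70,170 for the three sizes. If the model also contained a bathrooms main effect, that coefficient would have to be added to each figure.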
=== Model Comparison ===
cat(sprintf("Initial model - R²: %.4f | Adjusted R²: %.4f | RSE: $%s\n",
summary(house_model)$r.squared,
summary(house_model)$adj.r.squared,
format(round(summary(house_model)$sigma), big.mark = ",")))
Initial model - R²: 0.6156 | Adjusted R²: 0.6154 | RSE: $227,684
cat(sprintf("Improved model - R²: %.4f | Adjusted R²: %.4f | RSE: $%s\n",
summary(house_model_improved)$r.squared,
summary(house_model_improved)$adj.r.squared,
format(round(summary(house_model_improved)$sigma), big.mark = ",")))
Improved model - R²: 0.6903 | Adjusted R²: 0.6899 | RSE: $204,427
The improved model represents a meaningful step forward:
| Metric | Initial Model | Improved Model | Change |
|---|---|---|---|
| R² | 0.6156 | 0.6903 | +7.47 pp |
| Adjusted R² | 0.6154 | 0.6899 | +7.45 pp |
| Residual Std. Error | $227,700 | $204,400 | −$23,300 |
| Predictors | 5 | 10 | +5 |
Adding sqft_lot, condition, yr_built, view, and the nonlinear/interaction terms reduced the residual standard error by over $23,000, a practically significant improvement in precision. Adjusted R² sits only 0.0004 below R² in the improved model, so the penalty for additional complexity is minimal and the new terms collectively add explanatory power.
Areas for further improvement could include adding
lat/long or zipcode as location
controls, applying a log transformation to price to better
handle the right-skewed distribution, or using regularized regression
(e.g., Ridge or LASSO) to manage multicollinearity among the
size-related variables.
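As a rough illustration of the log-transformation idea, the sketch below fits a model to log(price) on simulated data. Everything in it is an assumption: the data are generated so that price grows by a fixed percentage per square foot, and only one predictor is used. It shows the mechanics and the change in interpretation, not a result for King County.

```r
set.seed(99)
n <- 500

# Simulated right-skewed prices: log(price) is linear in living area
toy <- data.frame(sqft_living = runif(n, 800, 4000))
toy$price <- exp(12 + 0.0004 * toy$sqft_living + rnorm(n, sd = 0.3))

# Regress log(price) instead of price
fit_log <- lm(log(price) ~ sqft_living, data = toy)
slope <- coef(fit_log)["sqft_living"]

# In a log-linear model the slope is a proportional effect:
# percent change in price per additional 100 sq ft
pct_per_100sqft <- (exp(100 * slope) - 1) * 100
cat("Estimated % price change per 100 sq ft:", round(pct_per_100sqft, 2), "\n")

# Predictions come back on the log scale and must be exponentiated
pred_log <- predict(fit_log, newdata = data.frame(sqft_living = 2000))
cat("Back-transformed prediction at 2,000 sq ft:", round(exp(pred_log)), "\n")
```

Coefficients in such a model describe percentage changes, not dollar amounts, and exponentiating a log-scale prediction estimates something closer to the median price than the mean, so the two models' outputs are not directly comparable.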
This analysis demonstrates that house prices in King County can be modeled reasonably well using a small set of structural, location-quality, and construction-quality variables. The key findings are: