In this homework, you will analyze house sales in King County, Washington State, using multiple linear regression to predict house prices. Understanding the dataset structure is essential before building a predictive model.
The aim of this homework is to predict house sale prices in King County, Washington State, USA, using multiple linear regression (MLR). The dataset consists of historical data on houses sold between May 2014 and May 2015.
We aim to predict sale prices with an accuracy of at least 75–80% and to identify the factors associated with higher property values, particularly for homes priced at $650K and above.
The dataset covers King County, including Seattle, and contains 21 variables and 21,613 observations.
Meaning of the variables in the dataset:
Variable | Description |
---|---|
id | Unique identifier for a house |
date | Date the house was sold |
price | Sale price (the prediction target) |
bedrooms | Number of bedrooms |
bathrooms | Number of bathrooms |
sqft_living | Square footage of the home |
sqft_lot | Square footage of the lot |
floors | Total floors (levels) in the house |
waterfront | Whether the house has a waterfront view (1 = yes, 0 = no) |
view | View rating, coded 0 to 4 |
condition | Overall condition of the house |
grade | Overall grade given to the housing unit, based on the King County grading system |
sqft_above | Square footage of the house, excluding the basement |
sqft_basement | Square footage of the basement |
yr_built | Year the house was built |
yr_renovated | Year the house was renovated (0 if not renovated) |
zipcode | Zip code |
lat | Latitude coordinate |
long | Longitude coordinate |
sqft_living15 | Living room area in 2015 (implies some renovations); this may or may not have affected the lot size area |
sqft_lot15 | Lot size area in 2015 (implies some renovations) |
- Use `read.csv()` to load the dataset from the file `kc_house_data.csv`.
- Use `str()` to display variable names, types, and sample values.
- Use `summary()` to view statistics like the minimum, mean, and maximum price.
- Use `hist()` to examine the distribution of prices.
- Adjust the `breaks` parameter to provide a detailed view.
- Use `summary()` and `head()` to explore predictor variables (e.g., `bedrooms`, `sqft_living`, `condition`).
- Examine categorical variables such as `waterfront`, `view`, and `grade`.
## 'data.frame': 21613 obs. of 21 variables:
## $ id : num 7.13e+09 6.41e+09 5.63e+09 2.49e+09 1.95e+09 ...
## $ date : chr "20141013T000000" "20141209T000000" "20150225T000000" "20141209T000000" ...
## $ price : num 221900 538000 180000 604000 510000 ...
## $ bedrooms : int 3 3 2 4 3 4 3 3 3 3 ...
## $ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
## $ sqft_living : int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
## $ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
## $ floors : num 1 2 1 1 1 1 2 1 1 2 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 0 0 0 0 0 0 0 0 0 ...
## $ condition : int 3 3 3 5 3 3 3 3 3 3 ...
## $ grade : int 7 7 6 7 8 11 7 7 7 7 ...
## $ sqft_above : int 1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
## $ sqft_basement: int 0 400 0 910 0 1530 0 0 730 0 ...
## $ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
## $ yr_renovated : int 0 1991 0 0 0 0 0 0 0 0 ...
## $ zipcode : int 98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
## $ lat : num 47.5 47.7 47.7 47.5 47.6 ...
## $ long : num -122 -122 -122 -122 -122 ...
## $ sqft_living15: int 1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
## $ sqft_lot15 : int 5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 75000 321950 450000 540088 645000 7700000
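The calls that produced the `str()` and `summary()` output above are not shown. A minimal sketch of those steps follows, assuming the data frame is named `houses` as in the later code. To stay self-contained, it reads three rows copied from the `str()` output (the name `houses_demo` and the inline text are illustrative); with the real file, `read.csv("kc_house_data.csv")` would take the place of the `text` argument. It also shows one possible way to turn the character `date` column into a `Date`.

```r
# Three rows copied from the str() output; a stand-in for kc_house_data.csv
csv_text <- "date,price,bedrooms,bathrooms,sqft_living
20141013T000000,221900,3,1,1180
20141209T000000,538000,3,2.25,2570
20150225T000000,180000,2,1,770"

houses_demo <- read.csv(text = csv_text, stringsAsFactors = FALSE)

str(houses_demo)            # variable names, types, and sample values
summary(houses_demo$price)  # minimum, quartiles, mean, and maximum

# The date strings start with YYYYMMDD; keep that part and convert it
houses_demo$sale_date <- as.Date(substr(houses_demo$date, 1, 8), format = "%Y%m%d")
print(houses_demo$sale_date)
```

Parsing the date is optional for the models in this homework, which do not use `date` as a predictor.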
# Visualize the response variable
hist(houses$price, breaks = 50, main = "Distribution of House Prices", xlab = "Price (USD)")
# Understand the dataset features
summary(houses[, c("bedrooms", "bathrooms", "sqft_living", "condition", "grade", "waterfront", "view")])
## bedrooms bathrooms sqft_living condition
## Min. : 0.000 Min. :0.000 Min. : 290 Min. :1.000
## 1st Qu.: 3.000 1st Qu.:1.750 1st Qu.: 1427 1st Qu.:3.000
## Median : 3.000 Median :2.250 Median : 1910 Median :3.000
## Mean : 3.371 Mean :2.115 Mean : 2080 Mean :3.409
## 3rd Qu.: 4.000 3rd Qu.:2.500 3rd Qu.: 2550 3rd Qu.:4.000
## Max. :33.000 Max. :8.000 Max. :13540 Max. :5.000
## grade waterfront view
## Min. : 1.000 Min. :0.000000 Min. :0.0000
## 1st Qu.: 7.000 1st Qu.:0.000000 1st Qu.:0.0000
## Median : 7.000 Median :0.000000 Median :0.0000
## Mean : 7.657 Mean :0.007542 Mean :0.2343
## 3rd Qu.: 8.000 3rd Qu.:0.000000 3rd Qu.:0.0000
## Max. :13.000 Max. :1.000000 Max. :4.0000
head(houses[, c("bedrooms", "bathrooms", "sqft_living", "condition", "grade", "waterfront", "view")])
## bedrooms bathrooms sqft_living condition grade waterfront view
## 1 3 1.00 1180 3 7 0 0
## 2 3 2.25 2570 3 7 0 0
## 3 2 1.00 770 3 6 0 0
## 4 4 3.00 1960 5 7 0 0
## 5 3 2.00 1680 3 8 0 0
## 6 4 4.50 5420 3 11 0 0
##
## 0 1
## 21450 163
##
## 1 3 4 5 6 7 8 9 10 11 12 13
## 1 3 29 242 2038 8981 6068 2615 1134 399 90 13
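The two tables above look like `table()` output for `waterfront` and `grade`, although the calls are not shown. As a rough check, the sketch below rebuilds both columns from the printed counts (so no data file is needed) and confirms that they reproduce the means reported by `summary()`. The vector names are illustrative.

```r
# Rebuild the columns from the counts printed above
waterfront <- rep(c(0L, 1L), times = c(21450L, 163L))
grade_levels <- c(1L, 3:13)
grade_counts <- c(1, 3, 29, 242, 2038, 8981, 6068, 2615, 1134, 399, 90, 13)
grade <- rep(grade_levels, times = grade_counts)

table(waterfront)                        # counts per category
round(prop.table(table(waterfront)), 4)  # shares: under 1% are waterfront
table(grade)

# Both means agree with the summary() output (0.007542 and 7.657)
mean(waterfront)
mean(grade)

# Grades with fewer than 30 houses; grade 1 has a single observation
names(which(table(grade) < 30))
```

With the real data, `table(houses$waterfront)` and `table(houses$grade)` would give the same counts directly. The sparse grades matter later, when `grade` enters the regression as a factor.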
Examining relationships between predictors and the target variable is a critical step before building a regression model.
- Use `cor()` with numeric variables (e.g., `price`, `sqft_living`, `sqft_lot`, `bedrooms`, `bathrooms`).
- Use `plot()` to visualize relationships between `price` and predictors like `sqft_living`, `bedrooms`, and `bathrooms`.
- Compare `price` across categorical variables like `waterfront` and `grade`.
# Compute the correlation matrix
numeric_vars <- c("price", "sqft_living", "sqft_lot", "bedrooms", "bathrooms", "sqft_above", "sqft_basement")
cor_matrix <- cor(houses[, numeric_vars], use = "complete.obs")
print(cor_matrix)
## price sqft_living sqft_lot bedrooms bathrooms
## price 1.00000000 0.7020554 0.08966148 0.30836580 0.52514986
## sqft_living 0.70205535 1.0000000 0.17284073 0.57676330 0.75468405
## sqft_lot 0.08966148 0.1728407 1.00000000 0.03170956 0.08773009
## bedrooms 0.30836580 0.5767633 0.03170956 1.00000000 0.51597353
## bathrooms 0.52514986 0.7546840 0.08773009 0.51597353 1.00000000
## sqft_above 0.60556709 0.8766441 0.18351066 0.47761635 0.68536337
## sqft_basement 0.32384249 0.4349249 0.01530090 0.30325114 0.28373700
## sqft_above sqft_basement
## price 0.60556709 0.32384249
## sqft_living 0.87664406 0.43492490
## sqft_lot 0.18351066 0.01530090
## bedrooms 0.47761635 0.30325114
## bathrooms 0.68536337 0.28373700
## sqft_above 1.00000000 -0.05197575
## sqft_basement -0.05197575 1.00000000
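To make the matrix easier to read, one option is to pull out the `price` row and sort it. The sketch below copies that row from the printed output so it runs without the data; with the real objects, `cor_matrix["price", -1]` should give the same vector. The name `price_cor` is illustrative.

```r
# Correlations with price, copied from the printed matrix
price_cor <- c(
  sqft_living   = 0.7020554,
  sqft_lot      = 0.08966148,
  bedrooms      = 0.30836580,
  bathrooms     = 0.52514986,
  sqft_above    = 0.60556709,
  sqft_basement = 0.32384249
)

# Strongest linear association first
sorted_cor <- sort(price_cor, decreasing = TRUE)
print(round(sorted_cor, 3))

# Squared correlation: share of price variance a single predictor explains alone
r_squared_single <- round(price_cor^2, 3)
print(r_squared_single)
```

The matrix also shows `sqft_living` and `sqft_above` correlated at about 0.88, so the two carry overlapping information and are probably best not used together without care.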
# Create scatterplots
plot(houses$sqft_living, houses$price, main = "Price vs. Square Footage", xlab = "Square Footage", ylab = "Price")
plot(houses$bedrooms, houses$price, main = "Price vs. Number of Bedrooms", xlab = "Number of Bedrooms", ylab = "Price")
# Boxplots for categorical variables
boxplot(price ~ waterfront, data = houses, main = "Price by Waterfront", xlab = "Waterfront", ylab = "Price")
boxplot(price ~ condition, data = houses, main = "Price by Condition", xlab = "Condition", ylab = "Price")
# Based on the correlations and plots, the strong predictors are sqft_living, bathrooms, grade, and waterfront; the weaker ones are sqft_lot and bedrooms. sqft_living is the strongest numeric predictor (r = 0.70 with price), and bathrooms (r = 0.53) contributes mainly through larger homes. grade and waterfront reflect the quality and location of the home, which is why they predict price well. bedrooms (r = 0.31) shows diminishing returns beyond a certain number of bedrooms, and sqft_lot (r = 0.09) is secondary to sqft_living, so it has little impact.
Build an initial regression model using key predictors, including categorical variables, to understand their impact on house prices.
- Ensure `waterfront`, `view`, and `grade` are treated as factors.
- Use `lm()` with `price` as the dependent variable and predictors such as `sqft_living`, `bedrooms`, `bathrooms`, `grade`, and `waterfront`.
- Use `summary()` to review R-squared, p-values, and residual statistics.
# Convert categorical variables to factors
houses$waterfront <- factor(houses$waterfront)
houses$view <- factor(houses$view)
houses$grade <- factor(houses$grade)
# Fit the regression model
house_model <- lm(price ~ sqft_living + bedrooms + bathrooms + grade + waterfront, data = houses)
# View coefficients
coef(house_model)
## (Intercept) sqft_living bedrooms bathrooms grade3 grade4
## 93480.0383 167.3102 -15512.6019 -6546.5970 29507.4525 39322.7284
## grade5 grade6 grade7 grade8 grade9 grade10
## 22191.8670 54371.7825 86971.7646 148142.7057 268809.1699 450673.6551
## grade11 grade12 grade13 waterfront1
## 716990.9940 1180137.7116 2472679.0735 766704.2185
##
## Call:
## lm(formula = price ~ sqft_living + bedrooms + bathrooms + grade +
## waterfront, data = houses)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1520979 -125123 -25017 92399 3912553
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.348e+04 2.277e+05 0.411 0.68140
## sqft_living 1.673e+02 3.475e+00 48.143 < 2e-16 ***
## bedrooms -1.551e+04 2.154e+03 -7.201 6.17e-13 ***
## bathrooms -6.547e+03 3.258e+03 -2.009 0.04453 *
## grade3 2.951e+04 2.629e+05 0.112 0.91064
## grade4 3.932e+04 2.316e+05 0.170 0.86518
## grade5 2.219e+04 2.282e+05 0.097 0.92253
## grade6 5.437e+04 2.278e+05 0.239 0.81135
## grade7 8.697e+04 2.278e+05 0.382 0.70261
## grade8 1.481e+05 2.278e+05 0.650 0.51555
## grade9 2.688e+05 2.279e+05 1.180 0.23819
## grade10 4.507e+05 2.280e+05 1.977 0.04809 *
## grade11 7.170e+05 2.283e+05 3.141 0.00169 **
## grade12 1.180e+06 2.294e+05 5.144 2.71e-07 ***
## grade13 2.473e+06 2.371e+05 10.428 < 2e-16 ***
## waterfront1 7.667e+05 1.811e+04 42.339 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 227700 on 21597 degrees of freedom
## Multiple R-squared: 0.6156, Adjusted R-squared: 0.6154
## F-statistic: 2306 on 15 and 21597 DF, p-value: < 2.2e-16
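One thing worth noticing in the summary above: every `grade` coefficient has a standard error near 2.28e+05, about the size of the residual standard error. A likely reason is that the baseline level, grade 1, contains a single house, so each grade is compared against one observation. The sketch below uses simulated data (all values hypothetical, not the King County data) to show how `relevel()` can move the baseline to a well-populated level. The fitted values do not change; only the comparisons do.

```r
set.seed(1)
n <- 500

# Simulated stand-in for the housing data; every number here is hypothetical
grade <- sample(c(6, 7, 8, 9), n, replace = TRUE, prob = c(0.2, 0.4, 0.3, 0.1))
grade[1] <- 1  # a single house in the lowest grade
sqft_living <- round(runif(n, 800, 4000))
price <- 50000 + 150 * sqft_living + 30000 * grade + rnorm(n, sd = 50000)
demo <- data.frame(price, sqft_living, grade = factor(grade))

# Default baseline is the first level, grade 1 (one observation)
fit_default <- lm(price ~ sqft_living + grade, data = demo)

# Same model with grade 7, a well-populated level, as the baseline
demo_relevel <- demo
demo_relevel$grade <- relevel(demo$grade, ref = "7")
fit_relevel <- lm(price ~ sqft_living + grade, data = demo_relevel)

# Standard error of the grade 8 coefficient under each baseline
se_default <- coef(summary(fit_default))[, "Std. Error"]
se_relevel <- coef(summary(fit_relevel))[, "Std. Error"]
print(round(c(default = se_default[["grade8"]], releveled = se_relevel[["grade8"]])))

# Identical fit, different reference level
all.equal(unname(fitted(fit_default)), unname(fitted(fit_relevel)))
```

Applied to this homework, the analogous call would be `houses$grade <- relevel(houses$grade, ref = "7")` before fitting, which should make the individual grade p-values easier to interpret.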
Refine your model to improve its fit, include more variables, and interpret the results.
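- Plot the residuals with `plot()` and check for patterns.
- Add more predictors (e.g., `sqft_lot`, `condition`, `yr_built`, `view`) to improve the model.
- Consider nonlinear effects of key predictors (e.g., `sqft_living`).
- Add a squared term for `sqft_living` to capture the nonlinear relationship.
- Add an interaction term (e.g., between `bathrooms` and `sqft_living`).
# Add squared term for sqft_living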
houses$sqft_living_sq <- houses$sqft_living^2
# Add interaction term between bathrooms and sqft_living
houses$bath_sqft_interaction <- houses$bathrooms * houses$sqft_living
# Fit the updated model
house_model_improved <- lm(price ~ sqft_living + sqft_living_sq + bedrooms + bathrooms + bath_sqft_interaction + grade + waterfront + view + condition + yr_built, data = houses)
# Summarize the improved model
summary(house_model_improved)
##
## Call:
## lm(formula = price ~ sqft_living + sqft_living_sq + bedrooms +
## bathrooms + bath_sqft_interaction + grade + waterfront +
## view + condition + yr_built, data = houses)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3862624 -106026 -10871 83978 3080024
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.248e+06 2.399e+05 26.038 < 2e-16 ***
## sqft_living -1.189e+01 6.214e+00 -1.914 0.05567 .
## sqft_living_sq 1.965e-02 1.680e-03 11.698 < 2e-16 ***
## bedrooms -1.520e+04 1.985e+03 -7.657 1.98e-14 ***
## bathrooms 3.992e+04 6.747e+03 5.916 3.35e-09 ***
## bath_sqft_interaction 1.035e+01 2.570e+00 4.027 5.66e-05 ***
## grade3 -4.319e+04 2.364e+05 -0.183 0.85501
## grade4 -8.451e+04 2.083e+05 -0.406 0.68490
## grade5 -8.392e+04 2.052e+05 -0.409 0.68263
## grade6 -1.372e+04 2.049e+05 -0.067 0.94659
## grade7 9.900e+04 2.049e+05 0.483 0.62900
## grade8 2.149e+05 2.050e+05 1.049 0.29434
## grade9 3.670e+05 2.050e+05 1.790 0.07344 .
## grade10 5.316e+05 2.051e+05 2.592 0.00954 **
## grade11 7.340e+05 2.053e+05 3.576 0.00035 ***
## grade12 1.044e+06 2.063e+05 5.058 4.26e-07 ***
## grade13 1.827e+06 2.140e+05 8.536 < 2e-16 ***
## waterfront1 4.976e+05 2.001e+04 24.866 < 2e-16 ***
## view1 1.240e+05 1.141e+04 10.866 < 2e-16 ***
## view2 5.610e+04 6.909e+03 8.120 4.93e-16 ***
## view3 1.022e+05 9.431e+03 10.833 < 2e-16 ***
## view4 2.450e+05 1.461e+04 16.772 < 2e-16 ***
## condition 2.307e+04 2.319e+03 9.946 < 2e-16 ***
## yr_built -3.121e+03 6.338e+01 -49.244 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 204600 on 21589 degrees of freedom
## Multiple R-squared: 0.6896, Adjusted R-squared: 0.6893
## F-statistic: 2086 on 23 and 21589 DF, p-value: < 2.2e-16
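The residual check from the task list is not shown in the output. As an illustration of what such a check can reveal, the sketch below simulates a price that curves upward with living area (hypothetical numbers, not the King County data), fits a straight line, and plots residuals against fitted values. A U-shaped pattern in that plot is the usual sign that a squared term is worth adding.

```r
set.seed(42)

# Simulated data with a convex price curve; all values are hypothetical
sqft <- runif(300, 500, 6000)
price <- 100000 + 0.02 * sqft^2 + rnorm(300, sd = 40000)

fit_linear <- lm(price ~ sqft)               # misses the curvature
fit_squared <- lm(price ~ sqft + I(sqft^2))  # I() adds the square inside the formula

# Residuals vs. fitted values for the straight-line model
plot(fitted(fit_linear), resid(fit_linear),
     main = "Residuals vs. Fitted (linear model)",
     xlab = "Fitted price", ylab = "Residual")
abline(h = 0, col = "red", lwd = 2)

# Residual standard error drops once the squared term is included
sigma_compare <- c(linear = summary(fit_linear)$sigma,
                   squared = summary(fit_squared)$sigma)
print(round(sigma_compare))
```

For the models in this homework, the same plot could be drawn with `plot(fitted(house_model), resid(house_model))`.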
## [1] 594576.7
## [1] 589972.8
# The model improved: adjusted R-squared rose from 0.6154 to 0.6893, and the AIC fell from 594576.7 to 589972.8, which also indicates a better model.
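The two values printed above are consistent with `AIC()` being called on each model, though the calls themselves are not shown. A small sketch of that comparison on simulated data follows (variable names are illustrative). Because the initial model's predictors are a subset of the improved model's, `anova()` can additionally test whether the extra terms improve the fit.

```r
set.seed(7)
n <- 400

# Simulated data in which x2 has a real effect on y (hypothetical values)
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)

fit_small <- lm(y ~ x1)       # nested inside the larger model
fit_large <- lm(y ~ x1 + x2)

# Lower AIC is better; the penalty grows with the number of parameters
print(AIC(fit_small, fit_large))

# F-test for the nested models: do the extra terms reduce residual variance?
nested_test <- anova(fit_small, fit_large)
print(nested_test)
```

With the objects from this homework, the equivalent calls would be `AIC(house_model, house_model_improved)` and `anova(house_model, house_model_improved)`.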
Assess the performance of your improved model and interpret the significance of predictors.
- Interpret the nonlinear and interaction terms, `sqft_living_sq` and `bath_sqft_interaction`.
##
## [1] "Coefficients in regression models represent the change in the dependent variable"
# To identify significant predictors, we look for p-values below 0.05. In the summary of the improved model, sqft_living_sq, bedrooms, bathrooms, bath_sqft_interaction, waterfront, every level of view, condition, and yr_built are all significant, as are grade levels 10 through 13. The linear sqft_living term (p = 0.056) and grade levels 3 through 9 have p-values above 0.05, so they are not significant at that threshold; for grade, each p-value tests the difference from the baseline grade 1.
# For the nonlinear terms, we look at sqft_living and sqft_living_sq together. In this model sqft_living has a small negative coefficient (-11.89, not significant) and sqft_living_sq has a positive coefficient (0.01965), so the fitted price curve is convex: each additional square foot adds more to the price in a large home than in a small one. This is the opposite of diminishing returns. Including the squared term improves the model's ability to predict prices across homes of different sizes, because a single straight line would underestimate the price of the largest homes.
# Interaction terms measure how the effect of one variable depends on the level of another. The product of bathrooms and sqft_living has a positive coefficient (10.35), which means the price added by each bathroom grows with living area, by about $10.35 per square foot. A model with only the two separate terms would miss this and could give a misleading picture of either effect.
Use the improved model to predict house prices and evaluate its performance.
- Use `predict()` on the dataset to compute predicted prices.
- Define a hypothetical house and use `predict()` to estimate the price of this hypothetical house.
# Generate predictions
houses$predicted_price <- predict(house_model_improved,newdata = houses)
# Visualize predicted vs. actual prices
plot(houses$price, houses$predicted_price, main = "Predicted vs. Actual Prices", xlab = "Actual Prices", ylab = "Predicted Prices")
abline(a = 0, b = 1, col = "Red", lwd = 2)
## [1] 0.8304311
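The code behind the value printed above is not shown. It is consistent with the correlation between actual and predicted prices: for a linear model with an intercept, that correlation squared equals R-squared, and 0.8304311 squared is about 0.6896. The sketch below checks the identity on simulated data and also computes RMSE, a common error measure in the units of the response. All names are illustrative.

```r
set.seed(11)

# Simulated data; values are hypothetical
x <- rnorm(100)
y <- 3 + 2 * x + rnorm(100)
fit <- lm(y ~ x)
pred <- predict(fit)

# Correlation between actual and predicted values, squared, equals R-squared
r <- cor(y, pred)
print(c(cor_squared = r^2, r_squared = summary(fit)$r.squared))

# Root mean squared error, in the same units as y
rmse <- sqrt(mean((y - pred)^2))
print(rmse)

# The printed value squared, for comparison with Multiple R-squared: 0.6896
print(0.8304311^2)
```

For this homework, the matching calls would be `cor(houses$price, houses$predicted_price)` and `sqrt(mean((houses$price - houses$predicted_price)^2))`.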
# Predict for a new house
new_house <- data.frame(
sqft_living = 2500,
sqft_living_sq = 2500^2,
bedrooms = 3,
bathrooms = 2,
bath_sqft_interaction = 2 * 2500,
grade = factor(8, levels = levels(houses$grade)),
waterfront = factor(0, levels = levels(houses$waterfront)),
view = factor(0, levels = levels(houses$view)),
condition = 3,
yr_built = 2000
)
predicted_price_new <- predict(house_model_improved, newdata = new_house)
print(predicted_price_new)
## 1
## 468469.1
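Building `new_house` by hand requires recomputing `sqft_living_sq` and `bath_sqft_interaction` correctly every time. One alternative, sketched below on simulated data (hypothetical values and names, not the homework's model), is to write the squared and interaction terms inside the formula with `I()` and `:`, so that `predict()` derives them from the raw columns. The sketch also checks that both approaches give the same prediction.

```r
set.seed(3)
n <- 200

# Simulated stand-in for the housing data; all numbers are hypothetical
demo <- data.frame(
  sqft_living = runif(n, 500, 5000),
  bathrooms   = sample(1:4, n, replace = TRUE)
)
demo$price <- 50000 + 100 * demo$sqft_living + 0.01 * demo$sqft_living^2 +
  20000 * demo$bathrooms + 5 * demo$bathrooms * demo$sqft_living +
  rnorm(n, sd = 30000)

# Approach 1: precomputed columns, as in the code above
demo$sqft_living_sq <- demo$sqft_living^2
demo$bath_sqft_interaction <- demo$bathrooms * demo$sqft_living
fit_manual <- lm(price ~ sqft_living + sqft_living_sq + bathrooms +
                   bath_sqft_interaction, data = demo)

# Approach 2: derived terms written in the formula
fit_formula <- lm(price ~ sqft_living + I(sqft_living^2) + bathrooms +
                    bathrooms:sqft_living, data = demo)

# Only the raw columns are needed for the formula-based model
new_house <- data.frame(sqft_living = 2500, bathrooms = 2)
pred_formula <- predict(fit_formula, newdata = new_house)

# The manual model needs the derived columns filled in by hand
new_house_manual <- transform(new_house,
                              sqft_living_sq = sqft_living^2,
                              bath_sqft_interaction = bathrooms * sqft_living)
pred_manual <- predict(fit_manual, newdata = new_house_manual)

print(c(formula = unname(pred_formula), manual = unname(pred_manual)))
```

Either approach is fine; the formula version mainly removes the chance of a mismatch between `sqft_living` and its derived columns in `newdata`.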
Understanding how nonlinear transformations and interaction terms influence the model is crucial for interpreting the results accurately.
Nonlinear term (`sqft_living_sq`):
The squared term sqft_living_sq represents a nonlinear relationship between living area and price. A positive coefficient on sqft_living with a negative coefficient on sqft_living_sq would indicate diminishing returns. The improved model shows the reverse: sqft_living is slightly negative (-11.89) and sqft_living_sq is positive (0.01965), so the price gained per extra square foot increases as homes get larger.
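To see what the coefficients imply together, it can help to compute the slope of price with respect to living area. Differentiating the fitted equation gives: sqft_living coefficient + 2 × sqft_living_sq coefficient × sqft + bath_sqft_interaction coefficient × bathrooms. The sketch below plugs in the rounded estimates from the improved model's summary table; the function name `marginal_sqft` is made up for this example.

```r
# Rounded estimates read from summary(house_model_improved)
b_sqft        <- -11.89    # sqft_living
b_sqft_sq     <- 0.01965   # sqft_living_sq
b_interaction <- 10.35     # bath_sqft_interaction

# Predicted change in price (USD) for one more square foot of living area
marginal_sqft <- function(sqft, bathrooms) {
  b_sqft + 2 * b_sqft_sq * sqft + b_interaction * bathrooms
}

sizes <- c(1000, 2500, 4000)
slopes <- data.frame(
  sqft = sizes,
  usd_per_extra_sqft = marginal_sqft(sizes, bathrooms = 2)
)
print(slopes)
```

For a 2-bathroom home, the slope rises from about $48 per square foot at 1,000 sqft to about $166 at 4,000 sqft, so under these estimates the returns to living area increase with size.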
Interaction term (`bath_sqft_interaction`):
The interaction term bath_sqft_interaction captures the combined effect of bathrooms and living area on price. Its positive coefficient (10.35) means that the value of an additional bathroom increases with square footage. For example, compare a 2-bathroom, 1,000 sqft home with a 2-bathroom, 2,000 sqft home: the interaction term contributes 10.35 × 2 × 1,000 ≈ $20,700 to the first and 10.35 × 2 × 2,000 ≈ $41,400 to the second, exactly double. Adding a bathroom therefore adds more value to a larger home than to a smaller one.
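The arithmetic behind this comparison can be sketched with the rounded estimates from the improved model's summary table (bathrooms = 3.992e+04, bath_sqft_interaction = 10.35). The helper names below are hypothetical and exist only for this example.

```r
# Rounded estimates read from summary(house_model_improved)
b_bath        <- 39920   # bathrooms
b_interaction <- 10.35   # bath_sqft_interaction

# Contribution of the interaction term alone to the predicted price
interaction_part <- function(bathrooms, sqft) {
  b_interaction * bathrooms * sqft
}

# Predicted value of one additional bathroom at a given living area
value_of_bathroom <- function(sqft) {
  b_bath + b_interaction * sqft
}

small_home <- interaction_part(2, 1000)
large_home <- interaction_part(2, 2000)
print(c(small_home = small_home, large_home = large_home))

print(value_of_bathroom(c(1000, 2000)))
```

Under these estimates the interaction contribution doubles when living area doubles, while the total value of an extra bathroom rises more modestly, from roughly $50,000 at 1,000 sqft to roughly $61,000 at 2,000 sqft, because the main effect of `bathrooms` is the same for both homes.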
In the initial model the adjusted R^2 was 0.6154, while in the improved model it was 0.6893, so the improved model explains about 7 percentage points more of the variance in house prices. Part of this gain comes from the nonlinear and interaction terms and part from the added predictors view, condition, and yr_built. The new terms matter because they let the model capture how the effect of living area changes with home size and how bathrooms and living area act together. The residual standard error also fell from 227,700 to 204,600, which indicates a better fit.