In this homework, you will analyze house sales in King County, Washington State, using multiple linear regression to predict house prices. Understanding the dataset structure is essential before building a predictive model.
The aim of this homework is to predict house sale prices in King County, Washington State, USA, using Multiple Linear Regression (MLR). The dataset consists of historical data on houses sold between May 2014 and May 2015.
We aim to predict sale prices with an accuracy of at least 75–80% and to identify the factors associated with higher property values, particularly for homes priced at $650K and above.
This dataset contains house sale prices for King County, including Seattle. It has 21 variables and 21,613 observations.
Meaning of abbreviations in the dataset:
| Variable | Description |
|---|---|
| id | Unique identifier for the house |
| date | Date the house was sold |
| price | Sale price (the prediction target) |
| bedrooms | Number of bedrooms |
| bathrooms | Number of bathrooms |
| sqft_living | Square footage of the home |
| sqft_lot | Square footage of the lot |
| floors | Total floors (levels) in house |
| waterfront | Whether the house has a waterfront view (0 = no, 1 = yes) |
| view | Rating of the view from the house (0 to 4 in this dataset) |
| condition | How good the condition is overall |
| grade | Overall grade given to the housing unit, based on King County grading system |
| sqft_above | Square footage of house apart from basement |
| sqft_basement | Square footage of the basement |
| yr_built | Year the house was built |
| yr_renovated | Year the house was renovated |
| zipcode | Zip code |
| lat | Latitude coordinate |
| long | Longitude coordinate |
| sqft_living15 | Living room area in 2015 (implies some renovations); this may or may not have affected the lot size |
| sqft_lot15 | Lot size in 2015 (implies some renovations) |
- Use `read.csv()` to load the dataset from the file `kc_house_data.csv`.
- Use `str()` to display variable names, types, and sample values.
- Use `summary()` to view statistics like the minimum, mean, and maximum price.
- Use `hist()` to examine the distribution of prices.
- Adjust the `breaks` parameter to provide a detailed view.
- Use `summary()` and `head()` to explore predictor variables (e.g., `bedrooms`, `sqft_living`, `condition`).
- Note the categorical variables, such as `waterfront`, `view`, and `grade`.

## 'data.frame': 21613 obs. of 21 variables:
## $ id : num 7.13e+09 6.41e+09 5.63e+09 2.49e+09 1.95e+09 ...
## $ date : chr "20141013T000000" "20141209T000000" "20150225T000000" "20141209T000000" ...
## $ price : num 221900 538000 180000 604000 510000 ...
## $ bedrooms : int 3 3 2 4 3 4 3 3 3 3 ...
## $ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
## $ sqft_living : int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
## $ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
## $ floors : num 1 2 1 1 1 1 2 1 1 2 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 0 0 0 0 0 0 0 0 0 ...
## $ condition : int 3 3 3 5 3 3 3 3 3 3 ...
## $ grade : int 7 7 6 7 8 11 7 7 7 7 ...
## $ sqft_above : int 1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
## $ sqft_basement: int 0 400 0 910 0 1530 0 0 730 0 ...
## $ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
## $ yr_renovated : int 0 1991 0 0 0 0 0 0 0 0 ...
## $ zipcode : int 98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
## $ lat : num 47.5 47.7 47.7 47.5 47.6 ...
## $ long : num -122 -122 -122 -122 -122 ...
## $ sqft_living15: int 1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
## $ sqft_lot15 : int 5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 75000 321950 450000 540088 645000 7700000
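The loading and inspection steps can be sketched as follows. Because `kc_house_data.csv` is not bundled here, this sketch writes a three-row stand-in file whose dates, prices, bedrooms, and living areas are copied from the `str()` output; the `id` values and the names `sample_csv` and `houses_demo` are made up for illustration. It also converts the character `date` column to a proper `Date`, a step the analysis itself does not perform.

```r
# Write a tiny stand-in for kc_house_data.csv
# (hypothetical ids; the other values come from the str() output)
sample_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "id,date,price,bedrooms,sqft_living",
  "1,20141013T000000,221900,3,1180",
  "2,20141209T000000,538000,3,2570",
  "3,20150225T000000,180000,2,770"
), sample_csv)

# Load the file and inspect its structure
houses_demo <- read.csv(sample_csv)
str(houses_demo)

# Summary statistics for the response variable
summary(houses_demo$price)

# The date column is read as text; keep the first 8 characters (YYYYMMDD) and parse them
houses_demo$date <- as.Date(substr(houses_demo$date, 1, 8), format = "%Y%m%d")
range(houses_demo$date)
```

With the real file, the same calls would be applied to `read.csv("kc_house_data.csv")`, assuming the file sits in the working directory.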
# Visualize the response variable
hist(houses$price, breaks = 50, main = "Distribution of House Prices", xlab = "House Prices")
# Understand the dataset features
summary(houses[, c("bedrooms", "bathrooms", "sqft_living", "condition", "grade", "waterfront", "view")])
## bedrooms bathrooms sqft_living condition
## Min. : 0.000 Min. :0.000 Min. : 290 Min. :1.000
## 1st Qu.: 3.000 1st Qu.:1.750 1st Qu.: 1427 1st Qu.:3.000
## Median : 3.000 Median :2.250 Median : 1910 Median :3.000
## Mean : 3.371 Mean :2.115 Mean : 2080 Mean :3.409
## 3rd Qu.: 4.000 3rd Qu.:2.500 3rd Qu.: 2550 3rd Qu.:4.000
## Max. :33.000 Max. :8.000 Max. :13540 Max. :5.000
## grade waterfront view
## Min. : 1.000 Min. :0.000000 Min. :0.0000
## 1st Qu.: 7.000 1st Qu.:0.000000 1st Qu.:0.0000
## Median : 7.000 Median :0.000000 Median :0.0000
## Mean : 7.657 Mean :0.007542 Mean :0.2343
## 3rd Qu.: 8.000 3rd Qu.:0.000000 3rd Qu.:0.0000
## Max. :13.000 Max. :1.000000 Max. :4.0000
Examining relationships between predictors and the target variable is a critical step before building a regression model.
- Use `cor()` with numeric variables (e.g., `price`, `sqft_living`, `sqft_lot`, `bedrooms`, `bathrooms`).
- Use `plot()` to visualize relationships between `price` and predictors like `sqft_living`, `bedrooms`, and `bathrooms`.
- Use `boxplot()` to compare `price` across categorical variables like `waterfront` and `grade`.

# Compute the correlation matrix
numeric_vars <- c("price", "sqft_living", "sqft_lot", "bedrooms", "bathrooms", "sqft_above", "sqft_basement")
cor_matrix <- cor(houses[, numeric_vars], use = "complete.obs")
print(cor_matrix)
## price sqft_living sqft_lot bedrooms bathrooms
## price 1.00000000 0.7020554 0.08966148 0.30836580 0.52514986
## sqft_living 0.70205535 1.0000000 0.17284073 0.57676330 0.75468405
## sqft_lot 0.08966148 0.1728407 1.00000000 0.03170956 0.08773009
## bedrooms 0.30836580 0.5767633 0.03170956 1.00000000 0.51597353
## bathrooms 0.52514986 0.7546840 0.08773009 0.51597353 1.00000000
## sqft_above 0.60556709 0.8766441 0.18351066 0.47761635 0.68536337
## sqft_basement 0.32384249 0.4349249 0.01530090 0.30325114 0.28373700
## sqft_above sqft_basement
## price 0.60556709 0.32384249
## sqft_living 0.87664406 0.43492490
## sqft_lot 0.18351066 0.01530090
## bedrooms 0.47761635 0.30325114
## bathrooms 0.68536337 0.28373700
## sqft_above 1.00000000 -0.05197575
## sqft_basement -0.05197575 1.00000000
# Create scatterplots
plot(houses$sqft_living, houses$price, main = "Price vs. Sqft Living", xlab = "Sqft Living", ylab = "Price")
plot(houses$bathrooms, houses$price, main = "Price vs. Bathrooms", xlab = "Bathrooms", ylab = "Price")
# Boxplots for categorical variables
boxplot(price ~ waterfront, data = houses, main = "Price by Waterfront", xlab = "Waterfront (0 = No, 1 = Yes)", ylab = "Price")

Build an initial regression model using key predictors, including categorical variables, to understand their impact on house prices.
- Ensure that `waterfront`, `view`, and `grade` are treated as factors.
- Use `lm()` with `price` as the dependent variable and predictors such as `sqft_living`, `bedrooms`, `bathrooms`, `grade`, and `waterfront`.
- Use `summary()` to review R-squared, p-values, and residual statistics.

# Convert categorical variables to factors
houses$waterfront <- as.factor(houses$waterfront)
houses$view <- as.factor(houses$view)
houses$grade <- as.factor(houses$grade)
# Fit the regression model
house_model <- lm(price ~ sqft_living + bedrooms + bathrooms + grade + waterfront, data = houses)
# View coefficients
coef(house_model)
## (Intercept) sqft_living bedrooms bathrooms grade3 grade4
## 93480.0383 167.3102 -15512.6019 -6546.5970 29507.4525 39322.7284
## grade5 grade6 grade7 grade8 grade9 grade10
## 22191.8670 54371.7825 86971.7646 148142.7057 268809.1699 450673.6551
## grade11 grade12 grade13 waterfront1
## 716990.9940 1180137.7116 2472679.0735 766704.2185
# Summarize the model
summary(house_model)
##
## Call:
## lm(formula = price ~ sqft_living + bedrooms + bathrooms + grade +
## waterfront, data = houses)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1520979 -125123 -25017 92399 3912553
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.348e+04 2.277e+05 0.411 0.68140
## sqft_living 1.673e+02 3.475e+00 48.143 < 2e-16 ***
## bedrooms -1.551e+04 2.154e+03 -7.201 6.17e-13 ***
## bathrooms -6.547e+03 3.258e+03 -2.009 0.04453 *
## grade3 2.951e+04 2.629e+05 0.112 0.91064
## grade4 3.932e+04 2.316e+05 0.170 0.86518
## grade5 2.219e+04 2.282e+05 0.097 0.92253
## grade6 5.437e+04 2.278e+05 0.239 0.81135
## grade7 8.697e+04 2.278e+05 0.382 0.70261
## grade8 1.481e+05 2.278e+05 0.650 0.51555
## grade9 2.688e+05 2.279e+05 1.180 0.23819
## grade10 4.507e+05 2.280e+05 1.977 0.04809 *
## grade11 7.170e+05 2.283e+05 3.141 0.00169 **
## grade12 1.180e+06 2.294e+05 5.144 2.71e-07 ***
## grade13 2.473e+06 2.371e+05 10.428 < 2e-16 ***
## waterfront1 7.667e+05 1.811e+04 42.339 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 227700 on 21597 degrees of freedom
## Multiple R-squared: 0.6156, Adjusted R-squared: 0.6154
## F-statistic: 2306 on 15 and 21597 DF, p-value: < 2.2e-16
Refine your model to improve its fit, include more variables, and interpret the results.
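- Plot the residuals with `plot()` and check for patterns.
- Add more predictors (e.g., `sqft_lot`, `condition`, `yr_built`, `view`) to improve the model.
- Check for nonlinear relationships (e.g., with `sqft_living`).
- Add a squared term for `sqft_living` to capture the nonlinear relationship.
- Add an interaction term (e.g., between `bathrooms` and `sqft_living`).

# Add squared term for sqft_living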
houses$sqft_living_sq <- houses$sqft_living^2
# Add interaction term between bathrooms and sqft_living
houses$bath_sqft_interaction <- houses$bathrooms * houses$sqft_living
# Fit the updated model
house_model_improved <- lm(price ~ sqft_living + sqft_living_sq + bedrooms + bathrooms + bath_sqft_interaction + grade + waterfront + view + sqft_lot + yr_built, data = houses)
# Summarize the improved model
summary(house_model_improved)
##
## Call:
## lm(formula = price ~ sqft_living + sqft_living_sq + bedrooms +
## bathrooms + bath_sqft_interaction + grade + waterfront +
## view + sqft_lot + yr_built, data = houses)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3812754 -105948 -11763 84257 3032377
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.632e+06 2.364e+05 28.054 < 2e-16 ***
## sqft_living -6.037e+00 6.227e+00 -0.970 0.332294
## sqft_living_sq 2.001e-02 1.681e-03 11.905 < 2e-16 ***
## bedrooms -1.559e+04 1.989e+03 -7.840 4.70e-15 ***
## bathrooms 4.154e+04 6.745e+03 6.159 7.44e-10 ***
## bath_sqft_interaction 9.356e+00 2.568e+00 3.643 0.000270 ***
## grade3 7.455e+03 2.364e+05 0.032 0.974847
## grade4 -4.252e+04 2.083e+05 -0.204 0.838273
## grade5 -3.469e+04 2.053e+05 -0.169 0.865804
## grade6 3.342e+04 2.049e+05 0.163 0.870462
## grade7 1.461e+05 2.050e+05 0.713 0.475981
## grade8 2.600e+05 2.050e+05 1.268 0.204814
## grade9 4.103e+05 2.051e+05 2.001 0.045444 *
## grade10 5.736e+05 2.051e+05 2.796 0.005175 **
## grade11 7.749e+05 2.054e+05 3.774 0.000161 ***
## grade12 1.085e+06 2.064e+05 5.259 1.46e-07 ***
## grade13 1.851e+06 2.141e+05 8.645 < 2e-16 ***
## waterfront1 4.966e+05 2.002e+04 24.800 < 2e-16 ***
## view1 1.227e+05 1.142e+04 10.749 < 2e-16 ***
## view2 5.760e+04 6.914e+03 8.331 < 2e-16 ***
## view3 1.066e+05 9.445e+03 11.287 < 2e-16 ***
## view4 2.468e+05 1.461e+04 16.888 < 2e-16 ***
## sqft_lot -2.983e-01 3.445e-02 -8.658 < 2e-16 ***
## yr_built -3.303e+03 6.013e+01 -54.930 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 204800 on 21589 degrees of freedom
## Multiple R-squared: 0.6893, Adjusted R-squared: 0.6889
## F-statistic: 2082 on 23 and 21589 DF, p-value: < 2.2e-16
# Plot residuals against fitted values, with a horizontal reference line at zero
plot(fitted(house_model_improved), resid(house_model_improved), main = "Residual Plot", xlab = "Fitted Values", ylab = "Residuals", pch = 20)
abline(h = 0, col = "red")

Assess the performance of your improved model and interpret the significance of predictors.
- Review the model summary, paying particular attention to the added terms `sqft_living_sq` and `bath_sqft_interaction`.

##
## Call:
## lm(formula = price ~ sqft_living + sqft_living_sq + bedrooms +
## bathrooms + bath_sqft_interaction + grade + waterfront +
## view + sqft_lot + yr_built, data = houses)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3812754 -105948 -11763 84257 3032377
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.632e+06 2.364e+05 28.054 < 2e-16 ***
## sqft_living -6.037e+00 6.227e+00 -0.970 0.332294
## sqft_living_sq 2.001e-02 1.681e-03 11.905 < 2e-16 ***
## bedrooms -1.559e+04 1.989e+03 -7.840 4.70e-15 ***
## bathrooms 4.154e+04 6.745e+03 6.159 7.44e-10 ***
## bath_sqft_interaction 9.356e+00 2.568e+00 3.643 0.000270 ***
## grade3 7.455e+03 2.364e+05 0.032 0.974847
## grade4 -4.252e+04 2.083e+05 -0.204 0.838273
## grade5 -3.469e+04 2.053e+05 -0.169 0.865804
## grade6 3.342e+04 2.049e+05 0.163 0.870462
## grade7 1.461e+05 2.050e+05 0.713 0.475981
## grade8 2.600e+05 2.050e+05 1.268 0.204814
## grade9 4.103e+05 2.051e+05 2.001 0.045444 *
## grade10 5.736e+05 2.051e+05 2.796 0.005175 **
## grade11 7.749e+05 2.054e+05 3.774 0.000161 ***
## grade12 1.085e+06 2.064e+05 5.259 1.46e-07 ***
## grade13 1.851e+06 2.141e+05 8.645 < 2e-16 ***
## waterfront1 4.966e+05 2.002e+04 24.800 < 2e-16 ***
## view1 1.227e+05 1.142e+04 10.749 < 2e-16 ***
## view2 5.760e+04 6.914e+03 8.331 < 2e-16 ***
## view3 1.066e+05 9.445e+03 11.287 < 2e-16 ***
## view4 2.468e+05 1.461e+04 16.888 < 2e-16 ***
## sqft_lot -2.983e-01 3.445e-02 -8.658 < 2e-16 ***
## yr_built -3.303e+03 6.013e+01 -54.930 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 204800 on 21589 degrees of freedom
## Multiple R-squared: 0.6893, Adjusted R-squared: 0.6889
## F-statistic: 2082 on 23 and 21589 DF, p-value: < 2.2e-16
Use the improved model to predict house prices and evaluate its performance.
- Use `predict()` on the dataset to compute predicted prices.
- Define a hypothetical house and use `predict()` to estimate the price of this hypothetical house.

# Generate predictions
houses$predicted_price <- predict(house_model_improved, newdata = houses)
# Visualize predicted vs. actual prices
plot(houses$price, houses$predicted_price,
main = "Actual vs. Predicted House Prices",
xlab = "Actual Prices",
ylab = "Predicted Prices",
pch = 16, col = "blue")
abline(a = 0, b = 1, col = "red", lwd = 2)
# Correlation between actual and predicted prices
cor(houses$price, houses$predicted_price)
## [1] 0.8302244
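The value 0.8302244 is consistent with the correlation between actual and predicted prices: its square, about 0.689, matches the Multiple R-squared reported for the improved model. A minimal sketch of this check, together with two common error measures, is given below. It uses synthetic data rather than the King County data, and the names `demo`, `fit`, `rmse`, and `mae` are illustrative only.

```r
set.seed(1)

# Synthetic stand-in data (not the King County data)
n <- 200
sqft <- runif(n, 500, 4000)
price <- 50000 + 150 * sqft + rnorm(n, sd = 40000)
demo <- data.frame(price = price, sqft = sqft)

# Fit a simple model and compute in-sample predictions
fit <- lm(price ~ sqft, data = demo)
pred <- predict(fit, newdata = demo)

# Root mean squared error and mean absolute error, in the same units as price
rmse <- sqrt(mean((demo$price - pred)^2))
mae <- mean(abs(demo$price - pred))

# For least squares with an intercept, the squared correlation between
# actual and fitted values equals the in-sample R-squared
r2_from_cor <- cor(demo$price, pred)^2

c(rmse = rmse, mae = mae, r2_from_cor = r2_from_cor,
  r2_summary = summary(fit)$r.squared)
```

These are in-sample figures; evaluating on a held-out subset of the data would give a more realistic picture of predictive performance.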
# Predict for a new house
new_house <- data.frame(
  sqft_living = 2500, # Example value for sqft_living
  sqft_living_sq = 2500^2, # Squared value for sqft_living
  bedrooms = 3, # Example value for bedrooms
  bathrooms = 3, # Example value for bathrooms
  bath_sqft_interaction = 2500 * 3, # Interaction term: bathrooms * sqft_living
  grade = factor(8, levels = levels(houses$grade)), # Example grade, as a factor
  waterfront = factor(0, levels = levels(houses$waterfront)), # Waterfront as factor (0 for no)
  view = factor(0, levels = levels(houses$view)), # View as factor (0 for no view)
  yr_built = 2000, # Example for year built
  sqft_lot = 5000 # Example value for sqft_lot
)
# condition is omitted: it is not a predictor in house_model_improved,
# and houses$condition was never converted to a factor, so it has no levels
# Predict price for the new house using the improved model
predicted_price_new <- predict(house_model_improved, newdata = new_house)
# Print predicted price for the new house
print(predicted_price_new)
## 1
## 542527.1
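A point prediction says nothing about its uncertainty. The sketch below, on synthetic data, shows two optional refinements: building the squared and interaction terms inside the formula with `I()` and `:`, so that new data only needs the raw predictors, and asking `predict()` for a 95% prediction interval. The data-generating coefficients and the names `demo`, `fit`, and `new_house_demo` are assumptions made for illustration, not part of the original analysis.

```r
set.seed(2)

# Synthetic stand-in data (not the King County data)
n <- 300
demo <- data.frame(
  sqft_living = runif(n, 500, 4000),
  bathrooms = sample(c(1, 1.5, 2, 2.5, 3), n, replace = TRUE)
)
demo$price <- 80000 + 60 * demo$sqft_living + 0.02 * demo$sqft_living^2 +
  10000 * demo$bathrooms + 8 * demo$bathrooms * demo$sqft_living +
  rnorm(n, sd = 30000)

# I() builds the squared term and ":" the interaction inside the formula,
# so no helper columns have to be created by hand
fit <- lm(price ~ sqft_living + I(sqft_living^2) + bathrooms + bathrooms:sqft_living,
          data = demo)

# New data only needs the raw predictors
new_house_demo <- data.frame(sqft_living = 2500, bathrooms = 3)

# Point estimate ("fit") with a 95% prediction interval ("lwr", "upr")
pred <- predict(fit, newdata = new_house_demo, interval = "prediction", level = 0.95)
pred
```

Letting the formula create the derived terms avoids the risk of the hand-computed columns in the new data frame getting out of step with the raw values.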
Understanding how nonlinear transformations and interaction terms influence the model is crucial for interpreting the results accurately.
Effect of the squared term (`sqft_living_sq`):
Answer: The squared term `sqft_living_sq` (the square of the living area) is added to the model to account for a potential nonlinear relationship between living space and house price. Its coefficient describes the curvature of that relationship, that is, how the price effect of one additional square foot changes as the living area grows.
If the coefficient of `sqft_living_sq` is positive, each additional square foot adds more to the price in a large house than in a small one. This suggests increasing returns on additional square footage: prices rise more steeply for larger houses.
If the coefficient of `sqft_living_sq` is negative, it suggests diminishing returns: beyond a certain point, additional square footage increases the house price at a slower rate.
In our case, the coefficient is positive (about 0.020, p < 2e-16), so larger houses tend to gain value at an accelerating rate.
Effect of the interaction term (`bath_sqft_interaction`):
Answer: The interaction term between `bathrooms` and `sqft_living` captures how the effect of one variable depends on the value of the other. Specifically, it describes how the number of bathrooms and the living area jointly affect the price of a house.
Positive coefficient for the interaction: more bathrooms amplify the price impact of more living space. In other words, larger homes with more bathrooms tend to be priced noticeably higher than larger homes with fewer bathrooms. This could be the case for luxury or family-sized homes, where more bathrooms are expected to go with larger living areas.
Negative coefficient for the interaction: more bathrooms have less of a price impact in larger homes, possibly indicating that for homes with very large square footage, additional bathrooms do not add much value.
In our model the coefficient is positive (about 9.36, p = 0.00027), so the interaction is most relevant for larger homes, where the number of bathrooms matters more for functionality and perceived luxury.
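As a worked illustration using the rounded estimates from the improved model's summary, the marginal effects implied by the squared and interaction terms can be evaluated at the hypothetical house from the prediction step (2,500 square feet of living area, 3 bathrooms). The subscripted symbols are shorthand introduced here, and all other predictors are held fixed.

```latex
\begin{aligned}
\frac{\partial\,\widehat{\text{price}}}{\partial\,\text{sqft\_living}}
  &= \hat\beta_{\text{sqft}} + 2\,\hat\beta_{\text{sq}}\,\text{sqft\_living}
     + \hat\beta_{\text{int}}\,\text{bathrooms} \\
  &= -6.037 + 2(0.02001)(2500) + 9.356(3) \approx 122.1 \\[4pt]
\frac{\partial\,\widehat{\text{price}}}{\partial\,\text{bathrooms}}
  &= \hat\beta_{\text{bath}} + \hat\beta_{\text{int}}\,\text{sqft\_living} \\
  &= 41540 + 9.356(2500) = 64930
\end{aligned}
```

Under these estimates, one extra square foot is associated with roughly $122, and one extra bathroom with roughly $64,900, for a house of that size. These are model-based associations, not causal effects.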
Answer: Comparing the adjusted R-squared of the initial and improved models:
Initial model: the adjusted R-squared is 0.6154. This is the baseline for how well a model with only linear terms explains the variation in house prices.
Improved model: the adjusted R-squared rises to 0.6889, so the improved model explains more of the variation in house prices. The improved model also adds `view`, `sqft_lot`, and `yr_built`, so the gain cannot be attributed to the squared and interaction terms alone. Both added terms are nonetheless statistically significant, which suggests that capturing nonlinear relationships and interactions between predictors helps the model.
Adding nonlinear and interaction terms typically improves the fit when relationships between variables are more complex than simple linear patterns. However, it is important to evaluate the significance of each added term to ensure that the increase in performance is not just due to overfitting or unnecessary complexity.
In conclusion, adding these terms improves the model's ability to capture how different factors combine to influence house prices, and the increase in adjusted R-squared reflects this improvement.
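One way to check whether added terms earn their place is to compare nested models with a partial F-test alongside the adjusted R-squared. The sketch below does this on synthetic data; the variables `x`, `z`, `y` and the model names `base_fit` and `full_fit` are hypothetical stand-ins, not the homework's models.

```r
set.seed(3)

# Synthetic data with a true quadratic effect and a true interaction
n <- 300
demo <- data.frame(x = runif(n, 0, 10), z = runif(n, 0, 5))
demo$y <- 5 + 2 * demo$x + 0.5 * demo$x^2 + 1.5 * demo$z +
  0.8 * demo$x * demo$z + rnorm(n, sd = 3)

# Baseline model with linear terms only, and an extended model that nests it
base_fit <- lm(y ~ x + z, data = demo)
full_fit <- lm(y ~ x + I(x^2) + z + x:z, data = demo)

# Adjusted R-squared penalizes extra terms, so it can be compared across models
adj_r2 <- c(base = summary(base_fit)$adj.r.squared,
            full = summary(full_fit)$adj.r.squared)
adj_r2

# Partial F-test: do the added terms jointly reduce the residual sum of squares
# by more than chance would explain?
f_test <- anova(base_fit, full_fit)
f_test
p_value <- f_test[["Pr(>F)"]][2]
```

Because the initial and improved house price models are fitted to the same rows and the improved model contains every term of the initial one, the same `anova()` comparison could in principle be applied to them.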