#HW: Find a data set to which you can fit a multiple linear regression, and interpret your results
The purpose of this analysis is to investigate factors associated with housing prices. Multiple linear regression is used to model the relationship between house price and five housing characteristics: area, bedrooms, bathrooms, stories, and parking spaces.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.1 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.3 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
housing <- read.csv("C:/Users/justine/Downloads/Housing.csv")
head(housing)
##      price area bedrooms bathrooms stories mainroad guestroom basement
## 1 13300000 7420        4         2       3      yes        no       no
## 2 12250000 8960        4         4       4      yes        no       no
## 3 12250000 9960        3         2       2      yes        no      yes
## 4 12215000 7500        4         2       2      yes        no      yes
## 5 11410000 7420        4         1       2      yes       yes      yes
## 6 10850000 7500        3         3       1      yes        no      yes
##   hotwaterheating airconditioning parking prefarea furnishingstatus
## 1              no             yes       2      yes        furnished
## 2              no             yes       3       no        furnished
## 3              no              no       2      yes   semi-furnished
## 4              no             yes       3      yes        furnished
## 5              no             yes       2       no        furnished
## 6              no             yes       2      yes   semi-furnished
str(housing)
## 'data.frame': 545 obs. of 13 variables:
## $ price : int 13300000 12250000 12250000 12215000 11410000 10850000 10150000 10150000 9870000 9800000 ...
## $ area : int 7420 8960 9960 7500 7420 7500 8580 16200 8100 5750 ...
## $ bedrooms : int 4 4 3 4 4 3 4 5 4 3 ...
## $ bathrooms : int 2 4 2 2 1 3 3 3 1 2 ...
## $ stories : int 3 4 2 2 2 1 4 2 2 4 ...
## $ mainroad : chr "yes" "yes" "yes" "yes" ...
## $ guestroom : chr "no" "no" "no" "no" ...
## $ basement : chr "no" "no" "yes" "yes" ...
## $ hotwaterheating : chr "no" "no" "no" "no" ...
## $ airconditioning : chr "yes" "yes" "no" "yes" ...
## $ parking : int 2 3 2 3 2 2 2 0 2 1 ...
## $ prefarea : chr "yes" "no" "yes" "yes" ...
## $ furnishingstatus: chr "furnished" "furnished" "semi-furnished" "furnished" ...
housing <- housing %>%
  select(price, area, bedrooms, bathrooms, stories, parking)
summary(housing)
##      price               area          bedrooms       bathrooms
##  Min.   : 1750000   Min.   : 1650   Min.   :1.000   Min.   :1.000
##  1st Qu.: 3430000   1st Qu.: 3600   1st Qu.:2.000   1st Qu.:1.000
##  Median : 4340000   Median : 4600   Median :3.000   Median :1.000
##  Mean   : 4766729   Mean   : 5151   Mean   :2.965   Mean   :1.286
##  3rd Qu.: 5740000   3rd Qu.: 6360   3rd Qu.:3.000   3rd Qu.:2.000
##  Max.   :13300000   Max.   :16200   Max.   :6.000   Max.   :4.000
##     stories         parking
##  Min.   :1.000   Min.   :0.0000
##  1st Qu.:1.000   1st Qu.:0.0000
##  Median :2.000   Median :0.0000
##  Mean   :1.806   Mean   :0.6936
##  3rd Qu.:2.000   3rd Qu.:1.0000
##  Max.   :4.000   Max.   :3.0000
pairs(housing)
model <- lm(price ~ area + bedrooms + bathrooms + stories + parking,
            data = housing)
summary(model)
##
## Call:
## lm(formula = price ~ area + bedrooms + bathrooms + stories +
## parking, data = housing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3396744 -731825 -64056 601486 5651126
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -145734.5   246634.5  -0.591   0.5548
## area             331.1       26.6  12.448  < 2e-16 ***
## bedrooms      167809.8    82932.7   2.023   0.0435 *
## bathrooms    1133740.2   118828.3   9.541  < 2e-16 ***
## stories       547939.8    68894.5   7.953 1.07e-14 ***
## parking       377596.3    66804.1   5.652 2.57e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1244000 on 539 degrees of freedom
## Multiple R-squared: 0.5616, Adjusted R-squared: 0.5575
## F-statistic: 138.1 on 5 and 539 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model)
housing$predicted_price <- predict(model)
head(housing)
## price area bedrooms bathrooms stories parking predicted_price
## 1 13300000 7420 4 2 3 2 7648874
## 2 12250000 8960 4 4 4 3 11351808
## 3 12250000 9960 3 2 2 2 7774158
## 4 12215000 7500 4 2 2 3 7505020
## 5 11410000 7420 4 1 2 2 5967194
## 6 10850000 7500 3 3 1 2 7545414
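As a check on where these fitted values come from, the first one can be approximately reproduced by hand. The sketch below copies the rounded coefficients from the `summary(model)` output and the first row of the data; the names `coefs`, `house1`, and `pred1` are introduced here only for illustration. Because the printed coefficients are rounded, the result agrees with `predict()` only to within a few hundred price units.

```r
# Rounded coefficients copied from the summary(model) output
coefs <- c(intercept = -145734.5, area = 331.1, bedrooms = 167809.8,
           bathrooms = 1133740.2, stories = 547939.8, parking = 377596.3)

# Predictor values of the first house, with a leading 1 for the intercept
house1 <- c(1, area = 7420, bedrooms = 4, bathrooms = 2, stories = 3, parking = 2)

# Fitted value = sum of (coefficient * predictor value)
pred1 <- sum(coefs * house1)
pred1
```

The small gap from the printed value 7648874 comes mainly from the area coefficient, which is shown to only one decimal place but multiplied by 7420.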
plot(housing$price,
     housing$predicted_price,
     xlab = "Actual Price",
     ylab = "Predicted Price",
     main = "Actual vs Predicted Housing Prices")
abline(0, 1)  # reference line y = x: points on it are predicted exactly
In this analysis, house price was used as the dependent variable, while area, number of bedrooms, number of bathrooms, number of stories, and parking spaces were used as independent variables.
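The results show that all five estimated coefficients are positive. Holding the other predictors constant, the model associates a higher price with a larger area, more bedrooms, more bathrooms, more stories, or more parking spaces. For example, each additional bathroom is associated with an increase of about 1.13 million in predicted price, and each additional unit of area with an increase of about 331.
Among all the variables, area has the largest t value (12.448), so its association with price is the most clearly established. Its coefficient (331.1) is numerically the smallest only because area is measured in much finer units than the count variables, so the raw coefficients should not be compared directly across predictors. Bathrooms (t = 9.541), stories (t = 7.953), and parking spaces (t = 5.652) are also strongly associated with higher prices.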
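One rough way to put the coefficients on a comparable footing is to multiply each by the interquartile range of its predictor, which gives the change in predicted price for a move from the first to the third quartile. This scaling is an illustrative choice and not part of the original analysis; the numbers are read off the `summary(housing)` and `summary(model)` outputs, and the names `slopes`, `iqr`, and `effect_iqr` are hypothetical.

```r
# Rounded slope estimates from summary(model)
slopes <- c(area = 331.1, bedrooms = 167809.8, bathrooms = 1133740.2,
            stories = 547939.8, parking = 377596.3)

# Interquartile ranges (3rd Qu. minus 1st Qu.) read off summary(housing)
iqr <- c(area = 6360 - 3600, bedrooms = 3 - 2, bathrooms = 2 - 1,
         stories = 2 - 1, parking = 1 - 0)

# Change in predicted price for a move from the 1st to the 3rd quartile
effect_iqr <- slopes * iqr
sort(effect_iqr, decreasing = TRUE)
```

On this scale area (about 0.91 million) is of the same order as one additional bathroom (about 1.13 million), even though its per-unit coefficient is far smaller.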
The p-values for all five predictors are below 0.05, so each coefficient is statistically distinguishable from zero at the 5% level after adjusting for the other predictors. The evidence is very strong for area, bathrooms, stories, and parking (p < 0.001) but only marginal for bedrooms (p = 0.0435).
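The t value and p-value in the coefficient table follow directly from the estimate and its standard error. A minimal sketch for the bedrooms row, using the rounded figures from the output (the variable names are introduced here for illustration, and the confidence interval is an addition not shown in the original output):

```r
est <- 167809.8   # bedrooms estimate from summary(model)
se  <- 82932.7    # its standard error
df  <- 539        # residual degrees of freedom

t_val <- est / se                                   # t statistic
p_val <- 2 * pt(-abs(t_val), df = df)               # two-sided p-value
ci    <- est + c(-1, 1) * qt(0.975, df = df) * se   # 95% confidence interval

round(c(t = t_val, p = p_val), 4)
ci
```

The lower end of the interval sits only just above zero, which is another way of seeing that the evidence for bedrooms is marginal.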
The model has an R-squared of 0.5616 (adjusted R-squared 0.5575), which means that about 56% of the variation in house prices is explained by the five predictors. The remaining variation may be due to factors not included in this model, such as the categorical variables that were dropped from the data set (for example, prefarea and airconditioning), the age of the property, or market conditions.
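Both R-squared values can be recovered from two residual sums of squares that appear in the `step()` traces later in this document: the intercept-only model (1.9032e+15) and the five-predictor model (8.3440e+14). A short sketch under that assumption, with variable names chosen for illustration:

```r
n   <- 545         # observations, from str(housing)
p   <- 5           # number of predictors
tss <- 1.9032e15   # RSS of the intercept-only model (total sum of squares)
rss <- 8.3440e14   # RSS of the five-predictor model

r2     <- 1 - rss / tss                          # share of variation explained
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)   # penalised for p predictors
round(c(r2 = r2, adj_r2 = adj_r2), 4)
```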
Overall, the regression is statistically significant (F = 138.1 on 5 and 539 degrees of freedom, p < 2.2e-16), so the five predictors together explain significantly more variation in price than an intercept-only model.
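The overall F statistic can be approximately reconstructed from the reported R-squared and the degrees of freedom. This is a sketch using the rounded R-squared, so it matches the printed value only to about one decimal place:

```r
r2 <- 0.5616   # Multiple R-squared from summary(model)
n  <- 545      # observations
p  <- 5        # predictors

# F = (explained variation per predictor) / (unexplained variation per residual df)
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))
p_val  <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

round(f_stat, 1)
p_val
```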
This study examined factors associated with housing prices using multiple linear regression. The analysis found that area, bedrooms, bathrooms, stories, and parking spaces are all positively associated with house price.
The results suggest that larger houses with more facilities generally have higher prices. The model explains about 56% of the variation in housing prices, with a residual standard error of about 1.24 million, so the selected variables are useful predictors of house value even though individual predictions can still be far from the actual price.
Therefore, Multiple Linear Regression is a useful method for understanding the relationship between housing characteristics and house prices, as well as for making predictions about property values.
In multiple linear regression, not all predictor variables contribute equally to the model. Some variables may have little effect on the response variable and can make the model unnecessarily complex. Variable selection methods are used to identify the most important predictors and improve the quality of the model.
The Housing dataset is used to demonstrate three common variable selection methods: Forward Selection, Backward Elimination, and Stepwise Selection.
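# Note: step() is in the stats package; MASS (which provides stepAIC()) masks dplyr::select() when loaded
library(MASS)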
##
## Attaching package: 'MASS'
## The following object is masked _by_ '.GlobalEnv':
##
## housing
## The following object is masked from 'package:dplyr':
##
## select
# Create a multiple linear regression model
full_model <- lm(price ~ area + bedrooms + bathrooms +
                   stories + parking,
                 data = housing)
# Display model summary
summary(full_model)
##
## Call:
## lm(formula = price ~ area + bedrooms + bathrooms + stories +
## parking, data = housing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3396744 -731825 -64056 601486 5651126
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -145734.5 246634.5 -0.591 0.5548
## area 331.1 26.6 12.448 < 2e-16 ***
## bedrooms 167809.8 82932.7 2.023 0.0435 *
## bathrooms 1133740.2 118828.3 9.541 < 2e-16 ***
## stories 547939.8 68894.5 7.953 1.07e-14 ***
## parking 377596.3 66804.1 5.652 2.57e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1244000 on 539 degrees of freedom
## Multiple R-squared: 0.5616, Adjusted R-squared: 0.5575
## F-statistic: 138.1 on 5 and 539 DF, p-value: < 2.2e-16
The full model contains all selected predictor variables and serves as the starting point for the variable selection methods.
Forward Selection begins with a model that contains only the intercept. Variables are added one at a time; with step(), the variable added at each step is the one that lowers the AIC the most.
# Create a model with no predictors
null_model <- lm(price ~ 1, data = housing)
# Apply Forward Selection
forward_model <- step(null_model,
                      scope = formula(full_model),
                      direction = "forward")
## Start: AIC=15742.43
## price ~ 1
##
## Df Sum of Sq RSS AIC
## + area 1 5.4678e+14 1.3564e+15 15560
## + bathrooms 1 5.0978e+14 1.3934e+15 15574
## + stories 1 3.3687e+14 1.5663e+15 15638
## + parking 1 2.8122e+14 1.6220e+15 15657
## + bedrooms 1 2.5563e+14 1.6476e+15 15666
## <none> 1.9032e+15 15742
##
## Step: AIC=15559.85
## price ~ area
##
## Df Sum of Sq RSS AIC
## + bathrooms 1 3.3838e+14 1.0181e+15 15406
## + stories 1 2.7053e+14 1.0859e+15 15441
## + bedrooms 1 1.5835e+14 1.1981e+15 15494
## + parking 1 8.2837e+13 1.2736e+15 15528
## <none> 1.3564e+15 15560
##
## Step: AIC=15405.46
## price ~ area + bathrooms
##
## Df Sum of Sq RSS AIC
## + stories 1 1.2531e+14 8.9275e+14 15336
## + parking 1 4.8508e+13 9.6955e+14 15381
## + bedrooms 1 4.1866e+13 9.7619e+14 15385
## <none> 1.0181e+15 15406
##
## Step: AIC=15335.87
## price ~ area + bathrooms + stories
##
## Df Sum of Sq RSS AIC
## + parking 1 5.2007e+13 8.4074e+14 15305
## + bedrooms 1 8.8879e+12 8.8386e+14 15332
## <none> 8.9275e+14 15336
##
## Step: AIC=15305.16
## price ~ area + bathrooms + stories + parking
##
## Df Sum of Sq RSS AIC
## + bedrooms 1 6.3382e+12 8.3440e+14 15303
## <none> 8.4074e+14 15305
##
## Step: AIC=15303.04
## price ~ area + bathrooms + stories + parking + bedrooms
# Display results
summary(forward_model)
##
## Call:
## lm(formula = price ~ area + bathrooms + stories + parking + bedrooms,
## data = housing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3396744 -731825 -64056 601486 5651126
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -145734.5 246634.5 -0.591 0.5548
## area 331.1 26.6 12.448 < 2e-16 ***
## bathrooms 1133740.2 118828.3 9.541 < 2e-16 ***
## stories 547939.8 68894.5 7.953 1.07e-14 ***
## parking 377596.3 66804.1 5.652 2.57e-08 ***
## bedrooms 167809.8 82932.7 2.023 0.0435 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1244000 on 539 degrees of freedom
## Multiple R-squared: 0.5616, Adjusted R-squared: 0.5575
## F-statistic: 138.1 on 5 and 539 DF, p-value: < 2.2e-16
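At each step, Forward Selection evaluates every variable not yet in the model and adds the one that lowers the AIC the most. The process stops when no remaining addition lowers the AIC. Here the variables entered in the order area, bathrooms, stories, parking, bedrooms, and all five were retained, with a final AIC of 15303.04.
Backward Elimination starts with all predictor variables and, at each step, removes the variable whose removal lowers the AIC the most.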
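The AIC values in the trace can be reproduced from the printed residual sums of squares. For a linear model, `step()` reports n * log(RSS / n) + 2 * edf, where edf is the number of estimated coefficients including the intercept. The helper `step_aic` below is a hypothetical name introduced for this sketch; the RSS values are copied from the trace, so the results match the printed AICs only up to rounding.

```r
n <- 545   # observations

# AIC as printed by step() for a linear model
step_aic <- function(rss, edf, n) n * log(rss / n) + 2 * edf

aic_null <- step_aic(rss = 1.9032e15, edf = 1, n = n)   # price ~ 1
aic_area <- step_aic(rss = 1.3564e15, edf = 2, n = n)   # price ~ area
aic_full <- step_aic(rss = 8.3440e14, edf = 6, n = n)   # all five predictors

round(c(null = aic_null, area = aic_area, full = aic_full), 2)
```

Each added variable must reduce n * log(RSS / n) by more than 2 to lower the AIC, which is the penalty for one extra coefficient.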
# Apply Backward Elimination
backward_model <- step(full_model,
                       direction = "backward")
## Start: AIC=15303.04
## price ~ area + bedrooms + bathrooms + stories + parking
##
## Df Sum of Sq RSS AIC
## <none> 8.3440e+14 15303
## - bedrooms 1 6.3382e+12 8.4074e+14 15305
## - parking 1 4.9458e+13 8.8386e+14 15332
## - stories 1 9.7922e+13 9.3232e+14 15362
## - bathrooms 1 1.4092e+14 9.7532e+14 15386
## - area 1 2.3988e+14 1.0743e+15 15439
# Display results
summary(backward_model)
##
## Call:
## lm(formula = price ~ area + bedrooms + bathrooms + stories +
## parking, data = housing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3396744 -731825 -64056 601486 5651126
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -145734.5 246634.5 -0.591 0.5548
## area 331.1 26.6 12.448 < 2e-16 ***
## bedrooms 167809.8 82932.7 2.023 0.0435 *
## bathrooms 1133740.2 118828.3 9.541 < 2e-16 ***
## stories 547939.8 68894.5 7.953 1.07e-14 ***
## parking 377596.3 66804.1 5.652 2.57e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1244000 on 539 degrees of freedom
## Multiple R-squared: 0.5616, Adjusted R-squared: 0.5575
## F-statistic: 138.1 on 5 and 539 DF, p-value: < 2.2e-16
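Backward Elimination removes the variable whose removal lowers the AIC the most and stops when every possible removal would raise the AIC. Here no variable was removed: even dropping bedrooms, the weakest predictor, would raise the AIC from 15303 to 15305, so the full model was retained.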
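The decision about bedrooms can be checked from the two residual sums of squares in the trace. Dropping one coefficient changes the `step()` AIC by n * log(RSS_reduced / RSS_full) - 2; the variable is removed only if this change is negative. A sketch with illustrative variable names:

```r
n        <- 545
rss_full <- 8.3440e14   # all five predictors (row <none>)
rss_drop <- 8.4074e14   # bedrooms removed (row - bedrooms)

# Positive value: removing bedrooms would raise the AIC, so it stays in the model
delta_aic <- n * log(rss_drop / rss_full) - 2
delta_aic
```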
Stepwise Selection combines Forward Selection and Backward Elimination. Variables can be added or removed at different stages of the selection process.
# Apply Stepwise Selection
stepwise_model <- step(full_model,
                       direction = "both")
## Start: AIC=15303.04
## price ~ area + bedrooms + bathrooms + stories + parking
##
## Df Sum of Sq RSS AIC
## <none> 8.3440e+14 15303
## - bedrooms 1 6.3382e+12 8.4074e+14 15305
## - parking 1 4.9458e+13 8.8386e+14 15332
## - stories 1 9.7922e+13 9.3232e+14 15362
## - bathrooms 1 1.4092e+14 9.7532e+14 15386
## - area 1 2.3988e+14 1.0743e+15 15439
# Display results
summary(stepwise_model)
##
## Call:
## lm(formula = price ~ area + bedrooms + bathrooms + stories +
## parking, data = housing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3396744 -731825 -64056 601486 5651126
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -145734.5 246634.5 -0.591 0.5548
## area 331.1 26.6 12.448 < 2e-16 ***
## bedrooms 167809.8 82932.7 2.023 0.0435 *
## bathrooms 1133740.2 118828.3 9.541 < 2e-16 ***
## stories 547939.8 68894.5 7.953 1.07e-14 ***
## parking 377596.3 66804.1 5.652 2.57e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1244000 on 539 degrees of freedom
## Multiple R-squared: 0.5616, Adjusted R-squared: 0.5575
## F-statistic: 138.1 on 5 and 539 DF, p-value: < 2.2e-16
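Stepwise Selection looks for a balance between model simplicity and model fit by reconsidering at each step which variables to add or remove. Starting from the full model, no removal lowered the AIC, so the procedure stopped immediately and returned the same five-predictor model as the other two methods.
AIC (Akaike Information Criterion) is commonly used to compare models. Among models fitted to the same data, a lower AIC indicates a better trade-off between fit and complexity.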
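When `step()` is started from the full model without a `scope`, the starting model is also the largest model allowed, so the first move can only be a removal. A stepwise search can instead start from the intercept-only model with explicit lower and upper bounds. The sketch below uses the built-in `mtcars` data purely as a stand-in so that it runs without Housing.csv; the variables `mpg`, `wt`, `hp`, and `disp` are not part of the housing analysis.

```r
# Stand-in data: mtcars ships with R and replaces Housing.csv for illustration only
null_fit <- lm(mpg ~ 1, data = mtcars)

# Stepwise search from the intercept-only model; scope bounds the candidate models
both_fit <- step(null_fit,
                 scope = list(lower = ~ 1, upper = ~ wt + hp + disp),
                 direction = "both",
                 trace = 0)   # suppress the step-by-step printout

formula(both_fit)

# The selected model never has a higher AIC than the starting model
extractAIC(both_fit)[2] <= extractAIC(null_fit)[2]
```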
# Compare models using AIC
AIC(full_model,
    forward_model,
    backward_model,
    stepwise_model)
## df AIC
## full_model 7 16851.68
## forward_model 7 16851.68
## backward_model 7 16851.68
## stepwise_model 7 16851.68
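The model with the lowest AIC value is usually preferred because it provides a good fit while avoiding unnecessary complexity. Here all three selection methods returned the full model, so the four AIC values are identical. The value reported by AIC() (16851.68) differs from the one printed by step() (15303.04) because step() uses extractAIC(), which leaves out an additive constant that is the same for every model fitted to these data; the ranking of models is unaffected.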
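The two AIC figures in this document (15303.04 from `step()` and 16851.68 from `AIC()`) can be reconciled numerically. For a linear model, `AIC()` includes the constant n * (1 + log(2 * pi)) from the Gaussian log-likelihood and counts the error variance as one extra parameter, which adds 2. The second half of the sketch checks the same identity on the built-in `mtcars` data, used here only as a stand-in.

```r
n        <- 545
step_aic <- 15303.04   # final AIC printed by step()

# Add the likelihood constant and the penalty for the error variance
full_aic <- step_aic + n * (1 + log(2 * pi)) + 2
round(full_aic, 2)

# Same identity on a built-in data set (mtcars is a stand-in, not the housing data)
fit <- lm(mpg ~ wt + hp, data = mtcars)
m   <- nrow(mtcars)
same <- all.equal(AIC(fit), extractAIC(fit)[2] + m * (1 + log(2 * pi)) + 2)
same
```

This also accounts for the `df` column showing 7: five slopes, the intercept, and the error variance.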
This homework explored three commonly used variable selection methods: Forward Selection, Backward Elimination, and Stepwise Selection. These methods help identify the most important predictor variables in a regression model.
Using the Housing dataset, variable selection techniques were applied to evaluate the contribution of area, bedrooms, bathrooms, stories, and parking spaces to house prices. All three methods retained all five predictors, which indicates that each one improves the model under the AIC criterion. More generally, variable selection is a useful step in regression analysis because it can reduce model complexity and help researchers focus on the variables that have the greatest influence on the response variable.