HW: Find a data set to which you can fit a multiple linear regression model, and interpret your results

Introduction

The purpose of this analysis is to investigate factors that influence housing prices. Multiple Linear Regression is used to model the relationship between house price and five housing characteristics: area, bedrooms, bathrooms, stories, and parking spaces.

Load Libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.1     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.3     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Import Dataset

housing <- read.csv("C:/Users/justine/Downloads/Housing.csv")

head(housing)
##      price area bedrooms bathrooms stories mainroad guestroom basement
## 1 13300000 7420        4         2       3      yes        no       no
## 2 12250000 8960        4         4       4      yes        no       no
## 3 12250000 9960        3         2       2      yes        no      yes
## 4 12215000 7500        4         2       2      yes        no      yes
## 5 11410000 7420        4         1       2      yes       yes      yes
## 6 10850000 7500        3         3       1      yes        no      yes
##   hotwaterheating airconditioning parking prefarea furnishingstatus
## 1              no             yes       2      yes        furnished
## 2              no             yes       3       no        furnished
## 3              no              no       2      yes   semi-furnished
## 4              no             yes       3      yes        furnished
## 5              no             yes       2       no        furnished
## 6              no             yes       2      yes   semi-furnished
str(housing)
## 'data.frame':    545 obs. of  13 variables:
##  $ price           : int  13300000 12250000 12250000 12215000 11410000 10850000 10150000 10150000 9870000 9800000 ...
##  $ area            : int  7420 8960 9960 7500 7420 7500 8580 16200 8100 5750 ...
##  $ bedrooms        : int  4 4 3 4 4 3 4 5 4 3 ...
##  $ bathrooms       : int  2 4 2 2 1 3 3 3 1 2 ...
##  $ stories         : int  3 4 2 2 2 1 4 2 2 4 ...
##  $ mainroad        : chr  "yes" "yes" "yes" "yes" ...
##  $ guestroom       : chr  "no" "no" "no" "no" ...
##  $ basement        : chr  "no" "no" "yes" "yes" ...
##  $ hotwaterheating : chr  "no" "no" "no" "no" ...
##  $ airconditioning : chr  "yes" "yes" "no" "yes" ...
##  $ parking         : int  2 3 2 3 2 2 2 0 2 1 ...
##  $ prefarea        : chr  "yes" "no" "yes" "yes" ...
##  $ furnishingstatus: chr  "furnished" "furnished" "semi-furnished" "furnished" ...

Data Preparation

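# Keep the response (price) and the five numeric predictors.
# dplyr::select() is called explicitly because MASS, loaded later, also exports select().
housing <- housing %>%
  dplyr::select(price, area, bedrooms, bathrooms, stories, parking)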

summary(housing)
##      price               area          bedrooms       bathrooms    
##  Min.   : 1750000   Min.   : 1650   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 3430000   1st Qu.: 3600   1st Qu.:2.000   1st Qu.:1.000  
##  Median : 4340000   Median : 4600   Median :3.000   Median :1.000  
##  Mean   : 4766729   Mean   : 5151   Mean   :2.965   Mean   :1.286  
##  3rd Qu.: 5740000   3rd Qu.: 6360   3rd Qu.:3.000   3rd Qu.:2.000  
##  Max.   :13300000   Max.   :16200   Max.   :6.000   Max.   :4.000  
##     stories         parking      
##  Min.   :1.000   Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.:0.0000  
##  Median :2.000   Median :0.0000  
##  Mean   :1.806   Mean   :0.6936  
##  3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :4.000   Max.   :3.0000

Exploratory Data Analysis

pairs(housing)

Fit Multiple Linear Regression Model

model <- lm(price ~ area + bedrooms + bathrooms + stories + parking,
            data = housing)

summary(model)
## 
## Call:
## lm(formula = price ~ area + bedrooms + bathrooms + stories + 
##     parking, data = housing)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3396744  -731825   -64056   601486  5651126 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -145734.5   246634.5  -0.591   0.5548    
## area            331.1       26.6  12.448  < 2e-16 ***
## bedrooms     167809.8    82932.7   2.023   0.0435 *  
## bathrooms   1133740.2   118828.3   9.541  < 2e-16 ***
## stories      547939.8    68894.5   7.953 1.07e-14 ***
## parking      377596.3    66804.1   5.652 2.57e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1244000 on 539 degrees of freedom
## Multiple R-squared:  0.5616, Adjusted R-squared:  0.5575 
## F-statistic: 138.1 on 5 and 539 DF,  p-value: < 2.2e-16

Model Diagnostics

# Arrange the four standard lm diagnostic plots in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(model)

Predicted Values

# In-sample fitted values: predictions for the same rows used to fit the model
housing$predicted_price <- predict(model)

head(housing)
##      price area bedrooms bathrooms stories parking predicted_price
## 1 13300000 7420        4         2       3       2         7648874
## 2 12250000 8960        4         4       4       3        11351808
## 3 12250000 9960        3         2       2       2         7774158
## 4 12215000 7500        4         2       2       3         7505020
## 5 11410000 7420        4         1       2       2         5967194
## 6 10850000 7500        3         3       1       2         7545414
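
As a check on how the fitted values arise, the first prediction can be reproduced by hand from the coefficient table. This is a minimal sketch that assumes the rounded estimates printed by summary(model); the names coefs, house1, and pred1 are illustrative and not part of the original analysis.

```r
# Rounded coefficient estimates copied from the summary(model) output
coefs <- c(intercept = -145734.5, area = 331.1, bedrooms = 167809.8,
           bathrooms = 1133740.2, stories = 547939.8, parking = 377596.3)

# Predictor values of the first house; the leading 1 multiplies the intercept
house1 <- c(1, area = 7420, bedrooms = 4, bathrooms = 2, stories = 3, parking = 2)

# Fitted value = intercept + sum of (coefficient * predictor value)
pred1 <- sum(coefs * house1)
pred1
```

The result lands within a few hundred of the 7648874 reported by predict(model); the small gap comes from using coefficients rounded to one decimal place.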

Actual vs Predicted Prices

plot(housing$price,
     housing$predicted_price,
     xlab = "Actual Price",
     ylab = "Predicted Price",
     main = "Actual vs Predicted Housing Prices")

# Reference line y = x: points on this line are predicted exactly
abline(0, 1)

Interpretation of Results

In this analysis, house price was used as the dependent variable, while area, number of bedrooms, number of bathrooms, number of stories, and parking spaces were used as independent variables.

The estimated coefficients for all five predictors are positive. Holding the other predictors constant, houses tend to be more expensive when they have a larger area, more bedrooms, more bathrooms, more stories, or more parking spaces.

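Among all the variables, area showed the strongest statistical evidence of an association with price (t = 12.45): each additional unit of area is associated with an increase of about 331 in price, holding the other predictors constant. Because the predictors are measured in different units, the raw coefficients are not directly comparable. Per unit, an additional bathroom is associated with the largest increase (about 1.13 million), followed by an additional story (about 548,000), an additional parking space (about 378,000), and an additional bedroom (about 168,000).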

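The p-values for all five predictors were below 0.05, so each is statistically significant at the 5% level given the other predictors in the model. The evidence is weakest for bedrooms (p = 0.0435), while the other four predictors have p-values well below 0.001. The intercept is not significantly different from zero (p = 0.5548), which is of little practical interest because no house in the data has zero area.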
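
The t value and p-value reported for a coefficient follow directly from its estimate and standard error. The sketch below redoes that arithmetic for bedrooms, assuming the rounded figures printed by summary(model); the variable names are illustrative.

```r
# Estimate and standard error for bedrooms, copied from summary(model)
estimate <- 167809.8
std_error <- 82932.7
df_resid <- 539  # residual degrees of freedom: 545 observations - 6 coefficients

# t statistic and two-sided p-value from the t distribution
t_value <- estimate / std_error
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

round(c(t = t_value, p = p_value), 4)
```

The values agree with the 2.023 and 0.0435 shown in the coefficient table, up to rounding.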

The model produced an R-squared value of 0.5616 (adjusted R-squared 0.5575), which means that about 56% of the variation in house prices can be explained by the variables included in the model. The residual standard error of about 1,244,000 shows that individual predictions can still miss the actual price by a wide margin. The remaining variation may be due to other factors that were not included in this analysis, such as location, age of the property, or market conditions.
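
R-squared can be recomputed from two sums of squares. The sketch below assumes the rounded values that the step() output in HW2 prints: the RSS of the intercept-only model (which equals the total sum of squares) and the RSS of the five-predictor model. The variable names are illustrative.

```r
n <- 545          # observations
p <- 5            # predictors
tss <- 1.9032e15  # RSS of the intercept-only model (total sum of squares)
rss <- 8.3440e14  # RSS of the five-predictor model

# Share of total variation explained, and its degrees-of-freedom adjustment
r_squared <- 1 - rss / tss
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

round(c(r_squared = r_squared, adj_r_squared = adj_r_squared), 4)
```

Both figures match the 0.5616 and 0.5575 reported by summary(model).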

Overall, the regression model was statistically significant (F = 138.1 on 5 and 539 degrees of freedom, p < 2.2e-16), showing that the selected variables jointly help explain housing prices.
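
The overall F statistic is tied to R-squared, so it can be recovered from the numbers already reported. This is a rough sketch that assumes the rounded R-squared of 0.5616; the variable names are illustrative.

```r
r_squared <- 0.5616
n <- 545  # observations
p <- 5    # predictors

# F = (explained variation per predictor) / (unexplained variation per residual df)
f_stat <- (r_squared / p) / ((1 - r_squared) / (n - p - 1))

# Upper-tail probability of the F distribution with (p, n - p - 1) degrees of freedom
p_value <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

round(f_stat, 1)
p_value < 2.2e-16
```

The statistic comes out at roughly 138.1, in line with the summary(model) output.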

Conclusion

The purpose of this study was to examine the factors that influence housing prices using Multiple Linear Regression. The analysis found that area, bedrooms, bathrooms, stories, and parking spaces are all positively associated with house prices when the other predictors are held constant.

The results suggest that larger houses with more facilities generally have higher prices. The model explained about 56% of the variation in housing prices, and every selected variable was a statistically significant predictor of house value.

Therefore, Multiple Linear Regression is a useful method for understanding the relationship between housing characteristics and house prices, as well as for making predictions about property values.

HW2: Variable Selection Methods

Introduction

In multiple linear regression, not all predictor variables contribute equally to the model. Some variables may have little effect on the response variable and can make the model unnecessarily complex. Variable selection methods are used to identify the most important predictors and improve the quality of the model.

The Housing dataset is used to demonstrate three common variable selection methods: Forward Selection, Backward Elimination, and Stepwise Selection.

Load Required Library

# MASS provides stepAIC(); the step() function used below comes from the stats package
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked _by_ '.GlobalEnv':
## 
##     housing
## The following object is masked from 'package:dplyr':
## 
##     select

Create the Full Regression Model

# Create a multiple linear regression model
full_model <- lm(price ~ area + bedrooms + bathrooms +
                 stories + parking,
                 data = housing)

# Display model summary
summary(full_model)
## 
## Call:
## lm(formula = price ~ area + bedrooms + bathrooms + stories + 
##     parking, data = housing)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3396744  -731825   -64056   601486  5651126 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -145734.5   246634.5  -0.591   0.5548    
## area            331.1       26.6  12.448  < 2e-16 ***
## bedrooms     167809.8    82932.7   2.023   0.0435 *  
## bathrooms   1133740.2   118828.3   9.541  < 2e-16 ***
## stories      547939.8    68894.5   7.953 1.07e-14 ***
## parking      377596.3    66804.1   5.652 2.57e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1244000 on 539 degrees of freedom
## Multiple R-squared:  0.5616, Adjusted R-squared:  0.5575 
## F-statistic: 138.1 on 5 and 539 DF,  p-value: < 2.2e-16

The full model contains all selected predictor variables and serves as the starting point for the variable selection methods.

Forward Selection

Forward Selection begins with a model that contains only the intercept. Variables are added one at a time; at each step, step() adds the variable that lowers the AIC the most.

# Create a model with no predictors
null_model <- lm(price ~ 1, data = housing)

# Apply Forward Selection
forward_model <- step(null_model,
                      scope = formula(full_model),
                      direction = "forward")
## Start:  AIC=15742.43
## price ~ 1
## 
##             Df  Sum of Sq        RSS   AIC
## + area       1 5.4678e+14 1.3564e+15 15560
## + bathrooms  1 5.0978e+14 1.3934e+15 15574
## + stories    1 3.3687e+14 1.5663e+15 15638
## + parking    1 2.8122e+14 1.6220e+15 15657
## + bedrooms   1 2.5563e+14 1.6476e+15 15666
## <none>                    1.9032e+15 15742
## 
## Step:  AIC=15559.85
## price ~ area
## 
##             Df  Sum of Sq        RSS   AIC
## + bathrooms  1 3.3838e+14 1.0181e+15 15406
## + stories    1 2.7053e+14 1.0859e+15 15441
## + bedrooms   1 1.5835e+14 1.1981e+15 15494
## + parking    1 8.2837e+13 1.2736e+15 15528
## <none>                    1.3564e+15 15560
## 
## Step:  AIC=15405.46
## price ~ area + bathrooms
## 
##            Df  Sum of Sq        RSS   AIC
## + stories   1 1.2531e+14 8.9275e+14 15336
## + parking   1 4.8508e+13 9.6955e+14 15381
## + bedrooms  1 4.1866e+13 9.7619e+14 15385
## <none>                   1.0181e+15 15406
## 
## Step:  AIC=15335.87
## price ~ area + bathrooms + stories
## 
##            Df  Sum of Sq        RSS   AIC
## + parking   1 5.2007e+13 8.4074e+14 15305
## + bedrooms  1 8.8879e+12 8.8386e+14 15332
## <none>                   8.9275e+14 15336
## 
## Step:  AIC=15305.16
## price ~ area + bathrooms + stories + parking
## 
##            Df  Sum of Sq        RSS   AIC
## + bedrooms  1 6.3382e+12 8.3440e+14 15303
## <none>                   8.4074e+14 15305
## 
## Step:  AIC=15303.04
## price ~ area + bathrooms + stories + parking + bedrooms
# Display results
summary(forward_model)
## 
## Call:
## lm(formula = price ~ area + bathrooms + stories + parking + bedrooms, 
##     data = housing)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3396744  -731825   -64056   601486  5651126 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -145734.5   246634.5  -0.591   0.5548    
## area            331.1       26.6  12.448  < 2e-16 ***
## bathrooms   1133740.2   118828.3   9.541  < 2e-16 ***
## stories      547939.8    68894.5   7.953 1.07e-14 ***
## parking      377596.3    66804.1   5.652 2.57e-08 ***
## bedrooms     167809.8    82932.7   2.023   0.0435 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1244000 on 539 degrees of freedom
## Multiple R-squared:  0.5616, Adjusted R-squared:  0.5575 
## F-statistic: 138.1 on 5 and 539 DF,  p-value: < 2.2e-16

Interpretation

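Forward Selection evaluates each candidate variable and adds the one that lowers the AIC the most. The process stops when no remaining variable lowers the AIC. Here the variables entered in the order area, bathrooms, stories, parking, and bedrooms, and the AIC fell from 15742.43 for the intercept-only model to 15303.04. Because every addition lowered the AIC, the selected model is the same as the full model.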
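
The procedure can also be written out by hand. The following is a minimal sketch, not the internals of step(): it assumes the built-in mtcars data as a stand-in for the housing data, and the names candidates, selected, and current are illustrative.

```r
# Hand-rolled forward selection by AIC on the built-in mtcars data
candidates <- c("wt", "hp", "qsec", "drat")
selected <- character(0)
current <- lm(mpg ~ 1, data = mtcars)  # intercept-only starting model

repeat {
  remaining <- setdiff(candidates, selected)
  if (length(remaining) == 0) break

  # AIC (on the scale used by step()) of each model with one more predictor
  trial_aic <- sapply(remaining, function(v) {
    fit <- update(current, as.formula(paste(". ~ . +", v)))
    extractAIC(fit)[2]
  })

  # Stop when no candidate lowers the AIC of the current model
  if (min(trial_aic) >= extractAIC(current)[2]) break

  # Otherwise add the candidate with the lowest AIC and continue
  best <- names(which.min(trial_aic))
  selected <- c(selected, best)
  current <- update(current, as.formula(paste(". ~ . +", best)))
}

selected          # predictors in the order they entered
formula(current)  # final model
```

Calling step() with direction = "forward" and the same candidate set should select the same predictors, since both rely on extractAIC().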

Backward Elimination

Backward Elimination starts with all predictor variables and removes the least useful variable at each step.

# Apply Backward Elimination
backward_model <- step(full_model,
                       direction = "backward")
## Start:  AIC=15303.04
## price ~ area + bedrooms + bathrooms + stories + parking
## 
##             Df  Sum of Sq        RSS   AIC
## <none>                    8.3440e+14 15303
## - bedrooms   1 6.3382e+12 8.4074e+14 15305
## - parking    1 4.9458e+13 8.8386e+14 15332
## - stories    1 9.7922e+13 9.3232e+14 15362
## - bathrooms  1 1.4092e+14 9.7532e+14 15386
## - area       1 2.3988e+14 1.0743e+15 15439
# Display results
summary(backward_model)
## 
## Call:
## lm(formula = price ~ area + bedrooms + bathrooms + stories + 
##     parking, data = housing)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3396744  -731825   -64056   601486  5651126 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -145734.5   246634.5  -0.591   0.5548    
## area            331.1       26.6  12.448  < 2e-16 ***
## bedrooms     167809.8    82932.7   2.023   0.0435 *  
## bathrooms   1133740.2   118828.3   9.541  < 2e-16 ***
## stories      547939.8    68894.5   7.953 1.07e-14 ***
## parking      377596.3    66804.1   5.652 2.57e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1244000 on 539 degrees of freedom
## Multiple R-squared:  0.5616, Adjusted R-squared:  0.5575 
## F-statistic: 138.1 on 5 and 539 DF,  p-value: < 2.2e-16

Interpretation

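Backward Elimination removes, at each step, the variable whose removal lowers the AIC the most, and stops when no removal lowers it. Here no variable was removed, because dropping any predictor raises the AIC above 15303.04. Even dropping bedrooms, the weakest predictor, raises the AIC to 15305, so the full model is retained.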
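
The AIC values in the step() table can be reproduced from the residual sums of squares it prints. The sketch below assumes the formula used by extractAIC() for linear models, n * log(RSS / n) + 2 * (number of coefficients), and the rounded RSS values from the output; step_aic is a hypothetical helper.

```r
# AIC on the scale used by step() for a linear model
step_aic <- function(rss, n_coef, n) n * log(rss / n) + 2 * n_coef

n <- 545  # observations in the housing data

aic_full <- step_aic(rss = 8.3440e14, n_coef = 6, n = n)         # intercept + 5 predictors
aic_no_bedrooms <- step_aic(rss = 8.4074e14, n_coef = 5, n = n)  # bedrooms dropped

round(c(full = aic_full, no_bedrooms = aic_no_bedrooms), 2)
```

Dropping bedrooms saves one parameter (a penalty of 2) but increases the RSS enough that the AIC rises, which is why the procedure keeps it.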

Stepwise Selection

Stepwise Selection combines Forward Selection and Backward Elimination. Variables can be added or removed at different stages of the selection process.

# Apply Stepwise Selection
stepwise_model <- step(full_model,
                       direction = "both")
## Start:  AIC=15303.04
## price ~ area + bedrooms + bathrooms + stories + parking
## 
##             Df  Sum of Sq        RSS   AIC
## <none>                    8.3440e+14 15303
## - bedrooms   1 6.3382e+12 8.4074e+14 15305
## - parking    1 4.9458e+13 8.8386e+14 15332
## - stories    1 9.7922e+13 9.3232e+14 15362
## - bathrooms  1 1.4092e+14 9.7532e+14 15386
## - area       1 2.3988e+14 1.0743e+15 15439
# Display results
summary(stepwise_model)
## 
## Call:
## lm(formula = price ~ area + bedrooms + bathrooms + stories + 
##     parking, data = housing)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3396744  -731825   -64056   601486  5651126 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -145734.5   246634.5  -0.591   0.5548    
## area            331.1       26.6  12.448  < 2e-16 ***
## bedrooms     167809.8    82932.7   2.023   0.0435 *  
## bathrooms   1133740.2   118828.3   9.541  < 2e-16 ***
## stories      547939.8    68894.5   7.953 1.07e-14 ***
## parking      377596.3    66804.1   5.652 2.57e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1244000 on 539 degrees of freedom
## Multiple R-squared:  0.5616, Adjusted R-squared:  0.5575 
## F-statistic: 138.1 on 5 and 539 DF,  p-value: < 2.2e-16

Interpretation

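Stepwise Selection looks for a balance between model simplicity and model fit by re-evaluating, at each step, whether adding or removing a variable lowers the AIC. Starting from the full model, no removal lowered the AIC, so the procedure stopped immediately and again returned the full model.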
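
Because the run above starts from the full model, only removals could be considered at the first step. A stepwise search can also start from the intercept-only model, where both additions and removals come into play. The sketch below is an illustration under assumptions: it uses the built-in mtcars data rather than the housing data, and the object names are hypothetical.

```r
# Illustrative stand-in data: built-in mtcars, with four candidate predictors
null_fit <- lm(mpg ~ 1, data = mtcars)
full_fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Start from the intercept-only model; at each step a term may be added or dropped
both_from_null <- step(null_fit,
                       scope = list(lower = ~ 1, upper = formula(full_fit)),
                       direction = "both",
                       trace = 0)

# Start from the full model, as in the housing analysis
both_from_full <- step(full_fit, direction = "both", trace = 0)

formula(both_from_null)
formula(both_from_full)
```

The two starting points need not end at the same model in general, so comparing them is a cheap sanity check on the selection.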

Compare the Models

AIC (Akaike Information Criterion) is commonly used to compare models fitted to the same data. It rewards goodness of fit and penalizes the number of parameters, so lower AIC values generally indicate a better model.

# Compare models using AIC
AIC(full_model,
    forward_model,
    backward_model,
    stepwise_model)
##                df      AIC
## full_model      7 16851.68
## forward_model   7 16851.68
## backward_model  7 16851.68
## stepwise_model  7 16851.68

Interpretation

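The model with the lowest AIC value is usually preferred because it provides a good fit while avoiding unnecessary complexity. Here all four models have the same AIC (16851.68) because all three selection procedures returned the full five-predictor model, so the comparison does not favor any one of them.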
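
The values from AIC() are larger than the ones step() printed (16851.68 versus 15303.04). For linear models, step() relies on extractAIC(), which leaves out an additive constant that depends only on the sample size. The sketch below checks that relationship on a small model fitted to the built-in mtcars data (an assumption made for illustration) and then applies the same offset to the housing value.

```r
# Check the offset on a small model fitted to the built-in mtcars data
fit <- lm(mpg ~ wt + hp, data = mtcars)
n_cars <- nrow(mtcars)
offset_cars <- AIC(fit) - extractAIC(fit)[2]
all.equal(offset_cars, n_cars * (1 + log(2 * pi)) + 2)

# Apply the same offset to the housing value printed by step()
n_housing <- 545
aic_step <- 15303.04
aic_full_scale <- aic_step + n_housing * (1 + log(2 * pi)) + 2
round(aic_full_scale, 2)
```

Since the offset is identical for every model fitted to the same data, both scales rank the models in the same order.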

Advantages of Variable Selection

  • Reduces the number of unnecessary variables.
  • Improves model interpretability.
  • Can help reduce overfitting.
  • Can make predictions on new data more reliable.
  • Produces a simpler and more efficient model.

Conclusion

This homework explored three commonly used variable selection methods: Forward Selection, Backward Elimination, and Stepwise Selection. These methods help identify the most important predictor variables in a regression model.

Using the Housing dataset, the three techniques were applied to evaluate the contribution of area, bedrooms, bathrooms, stories, and parking spaces to house prices. All three retained all five predictors, so the selected model is the same as the full model. Variable selection remains an important step in regression analysis because it can reduce complexity and helps researchers focus on the variables that have the greatest influence on the response variable.