Question 2

Use the data set OxHome.csv from AIS to answer the questions.

Question 2: (a)

Using the four explanatory variables one at a time, build four simple linear regression models to predict house sales price. Discuss which of these models provide significant fit, and which one among them is the best.

Solution

Load the required libraries for the analysis.

library(ggplot2)
library(dplyr)
library(broom)

Now, read the data from the OxHome.csv file.

OxHome = read.csv("OxHome.csv", header = TRUE, stringsAsFactors = FALSE)
colnames(OxHome) <- c("Residence", "SalesPrice", "SquareFeet", "Rooms", "Bedrooms", "Age")

Check the data types of each of the columns in data.

str(OxHome)
## 'data.frame':    60 obs. of  6 variables:
##  $ Residence : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ SalesPrice: num  110 101 104 103 107 ...
##  $ SquareFeet: int  1008 1290 860 912 1204 1204 1764 1600 1255 864 ...
##  $ Rooms     : int  5 6 8 5 6 5 8 7 5 5 ...
##  $ Bedrooms  : int  2 3 2 3 3 3 4 3 3 3 ...
##  $ Age       : int  35 36 36 41 40 10 64 19 16 37 ...
head(OxHome)
##   Residence SalesPrice SquareFeet Rooms Bedrooms Age
## 1         1      110.0       1008     5        2  35
## 2         2      101.0       1290     6        3  36
## 3         3      104.0        860     8        2  36
## 4         4      102.8        912     5        3  41
## 5         5      107.0       1204     6        3  40
## 6         6      113.0       1204     5        3  10
attach(OxHome)

(i) Creating the first model using the formula SalesPrice ~ SquareFeet

Compute the correlation coefficient between the two variables using the R function cor():

cor(SalesPrice, SquareFeet)
## [1] 0.831543

The correlation coefficient measures the strength of the linear association between the two variables, sales price and square feet. A value of about 0.83 indicates a strong positive linear relationship.
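As an additional check not shown in the original workbook, the significance of this correlation can be tested with base R's cor.test():

# Test whether the correlation between sales price and square feet
# differs significantly from zero (Pearson test by default)
cor.test(OxHome$SalesPrice, OxHome$SquareFeet)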

Create a scatter plot displaying the sales price versus square feet of the home.

ggplot(OxHome, aes(x = SquareFeet, y = SalesPrice)) +
  geom_point(color = "Red") +
  theme(panel.background = element_blank(),
        axis.line = element_line(colour = "Black")) +
  xlab(label = "Square feet of Homes") +
  ylab("Sales Price (in $K)") +
  geom_smooth(method = lm,
              se = FALSE,
              colour = "Blue")

Model Computation

Simple linear regression finds the best-fitting line to predict sales price from the size (square feet) of the home.

model.squarefeet <- lm(SalesPrice ~ SquareFeet, data = OxHome)
model.squarefeet
## 
## Call:
## lm(formula = SalesPrice ~ SquareFeet, data = OxHome)
## 
## Coefficients:
## (Intercept)   SquareFeet  
##    -11.5379       0.1122

The results show the intercept and the beta coefficient for the SquareFeet variable.
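As a small illustrative sketch (not part of the original solution), the fitted coefficients can be applied by hand; the 1500 sq. ft. figure below is just a hypothetical example:

# Extract the fitted intercept and slope and predict, by hand, the price
# of a hypothetical 1500 sq. ft. home: -11.54 + 0.1122 * 1500
b <- coef(model.squarefeet)
unname(b["(Intercept)"] + b["SquareFeet"] * 1500)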

Model summary

We start by displaying the statistical summary of the model using the R function summary():

summary(model.squarefeet)
## 
## Call:
## lm(formula = SalesPrice ~ SquareFeet, data = OxHome)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -79.010 -17.411   0.281  12.838  85.577 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11.537932  14.929279  -0.773    0.443    
## SquareFeet    0.112158   0.009837  11.401   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.33 on 58 degrees of freedom
## Multiple R-squared:  0.6915, Adjusted R-squared:  0.6861 
## F-statistic:   130 on 1 and 58 DF,  p-value: < 2.2e-16

The summary output shows the following components:

  • Call: Shows the function call used to compute the regression model.
  • Residuals: Provides a quick view of the distribution of the residuals, which by definition have a mean of zero. Therefore, the median should not be far from zero, and the minimum and maximum should be roughly equal in absolute value.
  • Coefficients: Shows the regression beta coefficients and their statistical significance. Predictor variables that are significantly associated with the outcome variable are marked with stars.
  • Residual standard error (RSE), R-squared (R2) and the F-statistic: metrics used to check how well the model fits the data.

Coefficients significance

The coefficients table in the above statistical summary shows:

  • the estimates of the beta coefficients;
  • the Std. Error (SE), which measures the precision of the beta coefficients and can be used to compute confidence intervals and the t-statistic;
  • the t-statistic and the associated p-value, which indicate the statistical significance of the beta coefficients (a short sketch of these computations follows this list).
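As a hedged sketch of these quantities (not in the original output), the confidence intervals and the reported t value can be reproduced directly:

# 95% confidence intervals for the intercept and the SquareFeet slope
confint(model.squarefeet, level = 0.95)

# The t value reported by summary() is Estimate / Std. Error,
# e.g. 0.112158 / 0.009837, which is about 11.4 for SquareFeet
coefs <- coef(summary(model.squarefeet))
coefs["SquareFeet", "Estimate"] / coefs["SquareFeet", "Std. Error"]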

Model Evaluation

To produce R’s regression diagnostic plots, we pass the fitted model object to plot().

par(mfrow = c(2,2))
plot(model.squarefeet)

  • The first plot (residuals vs. fitted values) is a simple scatterplot of residuals against predicted values. It should look more or less random; it shows how the residuals vary with the fitted values.

  • The second plot (normal Q-Q) is a normal probability plot. It gives a straight line if the errors are normally distributed; here the tails deviate from the straight line, indicating some skewness in the residual distribution.

  • The third plot (Scale-Location), like the first, should look random. It is used to assess heteroscedasticity, which is a problem because ordinary least squares (OLS) regression assumes that all residuals come from a population with constant variance (homoscedasticity). Here, the variance looks almost constant.

  • The last plot (Residuals vs Leverage) tells us which points have the greatest influence on the regression model. (A numeric check for such points is sketched after this list.)
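The influence diagnostic in the last plot can also be checked numerically; this is a sketch rather than part of the original analysis:

# Cook's distance for each observation; the largest values flag the
# points with the greatest influence on the fitted line
cooks <- cooks.distance(model.squarefeet)
head(sort(cooks, decreasing = TRUE), 3)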

Now, let's store the summary statistics of each model in a data frame so that all the models can be compared at once and the best-fitting one identified.

df = data.frame()

df <- rbind(
  df,
  glance(model.squarefeet) %>%
    select(adj.r.squared, sigma, AIC, BIC, p.value) %>%
    mutate(Model.Name = "model.squarefeet")
)

(ii) Creating the second model using the formula SalesPrice ~ Rooms

Compute the correlation coefficient between the two variables using the R function cor():

cor(SalesPrice, Rooms)
## [1] 0.588339

The correlation coefficient measures the strength of the linear association between sales price and rooms. A value of about 0.59 indicates a moderate positive relationship.

Create a scatter plot displaying the sales price versus the number of rooms in the home.

ggplot(OxHome, aes(x = Rooms, y = SalesPrice)) +
  geom_point(color = "Blue") +
  theme(panel.background = element_blank(),
        axis.line = element_line(colour = "Black")) +
  xlab(label = "Number of Rooms") +
  ylab("Sales Price (in $K)") +
  geom_smooth(method = lm,
              se = FALSE,
              colour = "Orange")

Model Computation

Simple linear regression finds the best-fitting line to predict sales price from the number of rooms in the home.

model.Rooms <- lm(SalesPrice ~ Rooms, data = OxHome)
model.Rooms
## 
## Call:
## lm(formula = SalesPrice ~ Rooms, data = OxHome)
## 
## Coefficients:
## (Intercept)        Rooms  
##       13.13        22.56

The results show the intercept and the beta coefficient for the Rooms variable.

Model summary

We start by displaying the statistical summary of the model using the R function summary():

summary(model.Rooms)
## 
## Call:
## lm(formula = SalesPrice ~ Rooms, data = OxHome)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -89.621 -29.950  -6.719  19.660 136.940 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   13.134     25.766   0.510    0.612    
## Rooms         22.561      4.071   5.541 7.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47.07 on 58 degrees of freedom
## Multiple R-squared:  0.3461, Adjusted R-squared:  0.3349 
## F-statistic:  30.7 on 1 and 58 DF,  p-value: 7.673e-07

Model Evaluation

To produce R’s regression diagnostic plots, we pass the fitted model object to plot().

par(mfrow = c(2,2))
plot(model.Rooms)

Append this model's summary statistics to the comparison data frame.

df <- rbind(
  df,
  glance(model.Rooms) %>%
    select(adj.r.squared, sigma, AIC, BIC, p.value) %>%
    mutate(Model.Name = "model.Rooms")
)

(iii) Creating the third model using the formula SalesPrice ~ Bedrooms

Compute the correlation coefficient between the two variables using the R function cor():

cor(SalesPrice, Bedrooms)
## [1] 0.4544255

The correlation coefficient measures the strength of the linear association between sales price and bedrooms. A value of about 0.45 indicates a weak-to-moderate positive relationship.

Create a scatter plot displaying the sales price versus the number of bedrooms in the home.

ggplot(OxHome, aes(x = Bedrooms, y = SalesPrice)) +
  geom_point(color = "deeppink1") +
  theme(panel.background = element_blank(),
        axis.line = element_line(colour = "Black")) +
  xlab(label = "Number of Bedrooms") +
  ylab("Sales Price (in $K)") +
  geom_smooth(method = lm,
              se = FALSE,
              colour = "blue")

Model Computation

Simple linear regression finds the best-fitting line to predict sales price from the number of bedrooms in the home.

model.Bedrooms <- lm(SalesPrice ~ Bedrooms, data = OxHome)
model.Bedrooms
## 
## Call:
## lm(formula = SalesPrice ~ Bedrooms, data = OxHome)
## 
## Coefficients:
## (Intercept)     Bedrooms  
##       31.11        41.65

The results show the intercept and the beta coefficient for the Bedrooms variable.

Model summary

We start by displaying the statistical summary of the model using the R function summary():

summary(model.Bedrooms)
## 
## Call:
## lm(formula = SalesPrice ~ Bedrooms, data = OxHome)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -74.693 -36.814  -8.048  20.952 151.952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    31.11      31.80   0.978 0.331932    
## Bedrooms       41.65      10.72   3.885 0.000265 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.85 on 58 degrees of freedom
## Multiple R-squared:  0.2065, Adjusted R-squared:  0.1928 
## F-statistic: 15.09 on 1 and 58 DF,  p-value: 0.000265

Model Evaluation

To produce R’s regression diagnostic plots, we pass the fitted model object to plot().

par(mfrow = c(2,2))
plot(model.Bedrooms)

Append this model's summary statistics to the comparison data frame.

df <- rbind(
  df,
  glance(model.Bedrooms) %>%
    select(adj.r.squared, sigma, AIC, BIC, p.value) %>%
    mutate(Model.Name = "model.Bedrooms")
)

(iv) Creating the fourth model using the formula SalesPrice ~ Age

Compute the correlation coefficient between the two variables using the R function cor():

cor(SalesPrice, Age)
## [1] -0.294834

The correlation coefficient measures the strength of the linear association between sales price and age. A value of about -0.29 indicates a weak negative relationship: older homes tend to sell for slightly less.

Create a scatter plot displaying the sales price versus the age of the home.

ggplot(OxHome, aes(x = Age, y = SalesPrice)) +
  geom_point(color = "Green") +
  theme(panel.background = element_blank(),
        axis.line = element_line(colour = "Black")) +
  xlab(label = "Age of Homes (years)") +
  ylab("Sales Price (in $K)") +
  geom_smooth(method = lm,
              se = FALSE,
              colour = "Red")

Model Computation

Simple linear regression finds the best-fitting line to predict sales price from the age of the home.

model.Age <- lm(SalesPrice ~ Age, data = OxHome)
model.Age
## 
## Call:
## lm(formula = SalesPrice ~ Age, data = OxHome)
## 
## Coefficients:
## (Intercept)          Age  
##    174.9518      -0.7138

The results show the intercept and the beta coefficient for the Age variable.

Model summary

We start by displaying the statistical summary of the model using the R function summary():

summary(model.Age)
## 
## Call:
## lm(formula = SalesPrice ~ Age, data = OxHome)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -73.97 -41.36 -14.96  22.94 154.31 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 174.9518    12.1630   14.38   <2e-16 ***
## Age          -0.7138     0.3038   -2.35   0.0222 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 55.62 on 58 degrees of freedom
## Multiple R-squared:  0.08693,    Adjusted R-squared:  0.07118 
## F-statistic: 5.522 on 1 and 58 DF,  p-value: 0.02221

Model Evaluation

To produce R’s regression diagnostic plots, we pass the fitted model object to plot().

par(mfrow = c(2,2))
plot(model.Age)

Append this model's summary statistics to the comparison data frame.

df <- rbind(
  df,
  glance(model.Age) %>%
    select(adj.r.squared, sigma, AIC, BIC, p.value) %>%
    mutate(Model.Name = "model.Age")
)

Model Comparison

Now we compare the four models above to find out which one fits best, using the summary statistics collected in the data frame.

df.comparison <-
  df %>% select(Model.Name, adj.r.squared, sigma, AIC, BIC, p.value)
df.comparison
## # A tibble: 4 x 6
##   Model.Name       adj.r.squared sigma   AIC   BIC  p.value
##   <chr>                    <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1 model.squarefeet        0.686   32.3  591.  598. 1.93e-16
## 2 model.Rooms             0.335   47.1  636.  643. 7.67e- 7
## 3 model.Bedrooms          0.193   51.9  648.  654. 2.65e- 4
## 4 model.Age               0.0712  55.6  656.  663. 2.22e- 2

Choosing the best-fitting model

From the table above we can conclude that model 1, i.e. model.squarefeet, is the best fit among the four, for the following reasons:

  • model.squarefeet has the lowest AIC; the lower the AIC, the better the model.

  • Its adjusted R-squared is also the highest of the four models, so it explains the largest share of the variation in sales price. Adjusted R-squared is preferred here because it penalises extra predictors that add little explanatory power.

  • It also has the smallest residual standard error (sigma). Since the regression is fitted by ordinary least squares (OLS), a smaller residual error signifies a better fit. (A short programmatic check of this comparison is sketched below.)
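As a quick programmatic confirmation (a sketch using the df built above), the lowest-AIC row can be pulled out directly:

# The model with the lowest AIC is also the one with the highest
# adjusted R-squared and the smallest sigma in this comparison
df %>%
  filter(AIC == min(AIC)) %>%
  select(Model.Name, adj.r.squared, sigma, AIC)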

Question 2: (b)

Build a multiple linear regression model to predict house sales price using all four explanatory variables. Comment on the fit.

Solution

Model Computation

Multiple linear regression finds the best-fitting linear combination of all four variables (SquareFeet, Rooms, Bedrooms and Age) to predict the sales price of a home.

MLR.Model <-
  lm(SalesPrice ~ SquareFeet + Rooms + Bedrooms + Age, data = OxHome)

MLR.Model
## 
## Call:
## lm(formula = SalesPrice ~ SquareFeet + Rooms + Bedrooms + Age, 
##     data = OxHome)
## 
## Coefficients:
## (Intercept)   SquareFeet        Rooms     Bedrooms          Age  
##     19.4865       0.1079      11.8807     -25.7280      -0.7207

The results show the intercept and the beta coefficient for each variable in the OxHome data.

Model summary

We start by displaying the statistical summary of the model using the R function summary():

summary(MLR.Model)
## 
## Call:
## lm(formula = SalesPrice ~ SquareFeet + Rooms + Bedrooms + Age, 
##     data = OxHome)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -59.14 -14.68  -1.13  15.62  66.08 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19.48651   16.86523   1.155  0.25291    
## SquareFeet    0.10791    0.01228   8.788 4.62e-12 ***
## Rooms        11.88067    3.51702   3.378  0.00135 ** 
## Bedrooms    -25.72796    8.13677  -3.162  0.00255 ** 
## Age          -0.72074    0.15307  -4.709 1.73e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.21 on 55 degrees of freedom
## Multiple R-squared:  0.8078, Adjusted R-squared:  0.7938 
## F-statistic: 57.77 on 4 and 55 DF,  p-value: < 2.2e-16

The above summary shows that this model performs much better than the previous single-variable models: the adjusted R-squared rises to about 0.79 and the residual standard error falls to 26.21. Based on the p-values, all four predictors are significant at the 5% level; SquareFeet and Age are the most significant (marked '***' by the significance codes), followed by Rooms and Bedrooms (marked '**'). Note that Bedrooms enters with a negative coefficient once the other variables are held fixed.
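Because Bedrooms enters with a negative sign while Rooms is positive, a quick multicollinearity check is worthwhile. The following sketch assumes the car package is installed; it is not part of the original output:

# Variance inflation factors for the full model; values well below 5
# would suggest multicollinearity is not a serious concern
library(car)
vif(MLR.Model)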

Model Evaluation

To check how well the model fits, and whether to accept or reject it, we examine the diagnostic plots.

par(mfrow = c(2,2))
plot(MLR.Model)

From the charts above, the residuals look roughly centred and evenly spread across the fitted values, and the standardized residuals show no strong pattern. This suggests the model should perform well in predicting sales price from these variables.

Question 2: (c)

Using stepwise regression or otherwise, find the best model either in terms of adjusted R2, or AIC. Discuss the significance of the predictors in this model.

Solution

For stepwise regression, there are three search directions for choosing the best-fitting model based on AIC:

  • The backward method: begins with a general model that includes all variables, eliminates one variable at a time, and stops when no removal improves the model.
  • The forward method: begins with the simplest model (no predictors), adds one suitable variable at a time, and stops when no addition improves the model.
  • The both-directions method: combines the backward and forward procedures, with AIC used as the criterion for selecting the best model at each step. (Sketches of the backward and both-directions calls follow this list.)
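For completeness, here is a sketch (not part of the original output) of how the backward and both-directions searches would be called; trace = 0 suppresses the step-by-step log:

# Backward elimination: start from the full model and drop terms
step(lm(SalesPrice ~ SquareFeet + Rooms + Bedrooms + Age, data = OxHome),
     direction = "backward", trace = 0)

# Both directions: terms may be added or removed at each step
step(lm(SalesPrice ~ 1, data = OxHome),
     direction = "both", trace = 0,
     scope = ~ SquareFeet + Rooms + Bedrooms + Age)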

Model Computation

Let's run stepwise regression with the step() function using the forward method: we start from the constant model SalesPrice ~ 1, then add one variable at a time, recomputing the AIC at each step to find the model with the lowest AIC.

Step.Model <-
  step(
    lm(SalesPrice ~ 1, data = OxHome[, -1]),
    direction = "forward",
    scope =  ~ SquareFeet + Rooms + Bedrooms + Age
  )
## Start:  AIC=487.65
## SalesPrice ~ 1
## 
##              Df Sum of Sq    RSS    AIC
## + SquareFeet  1    135891  60636 419.10
## + Rooms       1     68026 128501 464.16
## + Bedrooms    1     40583 155944 475.77
## + Age         1     17084 179444 484.20
## <none>                    196527 487.65
## 
## Step:  AIC=419.1
## SalesPrice ~ SquareFeet
## 
##            Df Sum of Sq   RSS    AIC
## + Age       1   12173.5 48462 407.65
## + Bedrooms  1    4909.4 55726 416.03
## <none>                  60636 419.10
## + Rooms     1     369.7 60266 420.73
## 
## Step:  AIC=407.65
## SalesPrice ~ SquareFeet + Age
## 
##            Df Sum of Sq   RSS    AIC
## + Rooms     1    3813.8 44648 404.73
## + Bedrooms  1    2842.9 45619 406.02
## <none>                  48462 407.65
## 
## Step:  AIC=404.73
## SalesPrice ~ SquareFeet + Age + Rooms
## 
##            Df Sum of Sq   RSS    AIC
## + Bedrooms  1    6867.7 37781 396.71
## <none>                  44648 404.73
## 
## Step:  AIC=396.71
## SalesPrice ~ SquareFeet + Age + Rooms + Bedrooms
Step.Model
## 
## Call:
## lm(formula = SalesPrice ~ SquareFeet + Age + Rooms + Bedrooms, 
##     data = OxHome[, -1])
## 
## Coefficients:
## (Intercept)   SquareFeet          Age        Rooms     Bedrooms  
##     19.4865       0.1079      -0.7207      11.8807     -25.7280

Below is the summary of the selected model, showing the statistical significance of each variable's coefficient.

summary(Step.Model)
## 
## Call:
## lm(formula = SalesPrice ~ SquareFeet + Age + Rooms + Bedrooms, 
##     data = OxHome[, -1])
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -59.14 -14.68  -1.13  15.62  66.08 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19.48651   16.86523   1.155  0.25291    
## SquareFeet    0.10791    0.01228   8.788 4.62e-12 ***
## Age          -0.72074    0.15307  -4.709 1.73e-05 ***
## Rooms        11.88067    3.51702   3.378  0.00135 ** 
## Bedrooms    -25.72796    8.13677  -3.162  0.00255 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.21 on 55 degrees of freedom
## Multiple R-squared:  0.8078, Adjusted R-squared:  0.7938 
## F-statistic: 57.77 on 4 and 55 DF,  p-value: < 2.2e-16

Question 2: (d)

Forecast the price of the following house using the model fitted in part c): Area = 5000 sq. ft., 20 rooms with 10 bedrooms, 100 years old. Comment on the validity of the forecast.

Solution

We create a data frame with the desired inputs and then predict the sales price of this home using the stepwise-selected model and the predict() function.

test <- data.frame(SquareFeet = 5000, Rooms = 20, Bedrooms = 10, Age = 100)
home.pred <- predict(Step.Model, test)
home.pred
##        1 
## 467.3019

The predicted sales price of a house with 5000 sq. ft., 20 rooms, 10 bedrooms, and an age of 100 years comes out to about 467.3 (in $K). However, these inputs lie far outside the range of the observed data (for example, the homes shown above are roughly 860 to 1764 sq. ft. with 5 to 8 rooms), so the forecast is an extrapolation and its validity is questionable: the fitted relationships cannot be assumed to hold that far beyond the data, and the prediction should be treated with considerable caution.
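A useful extension (a sketch, not part of the original solution) is to request a prediction interval, which makes the uncertainty of this extrapolation explicit:

# 95% prediction interval for the same hypothetical house; expect a wide
# interval, reflecting how far these inputs lie from the observed data
predict(Step.Model, test, interval = "prediction", level = 0.95)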