Use the data set OxHome.csv from AIS to answer the questions.
Question 2: (a)
Using the four explanatory variables one at a time, build four simple linear regression models to predict house sales price. Discuss which of these models provide significant fit, and which one among them is the best.
Solution
Load the required libraries for the analysis.
library(ggplot2)
library(dplyr)
library(broom)
Now read the data from the OxHome.csv file.
OxHome = read.csv("OxHome.csv", header = TRUE, stringsAsFactors = FALSE)
colnames(OxHome) <- c("Residence", "SalesPrice", "SquareFeet", "Rooms", "Bedrooms", "Age")
Check the data type of each column in the data.
str(OxHome)
## 'data.frame': 60 obs. of 6 variables:
## $ Residence : int 1 2 3 4 5 6 7 8 9 10 ...
## $ SalesPrice: num 110 101 104 103 107 ...
## $ SquareFeet: int 1008 1290 860 912 1204 1204 1764 1600 1255 864 ...
## $ Rooms : int 5 6 8 5 6 5 8 7 5 5 ...
## $ Bedrooms : int 2 3 2 3 3 3 4 3 3 3 ...
## $ Age : int 35 36 36 41 40 10 64 19 16 37 ...
head(OxHome)
## Residence SalesPrice SquareFeet Rooms Bedrooms Age
## 1 1 110.0 1008 5 2 35
## 2 2 101.0 1290 6 3 36
## 3 3 104.0 860 8 2 36
## 4 4 102.8 912 5 3 41
## 5 5 107.0 1204 6 3 40
## 6 6 113.0 1204 5 3 10
attach(OxHome)
Compute the correlation coefficient between the two variables using the R function cor():
cor(SalesPrice, SquareFeet)
## [1] 0.831543
The correlation coefficient measures the strength of the linear association between sales price and square feet; a value of 0.83 indicates a strong positive relationship.
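In a simple linear regression with one predictor, the squared correlation equals the model's R-squared, so the correlation already indicates how much of the variation in sales price a straight-line fit can explain. The sketch below squares the value printed above and then checks the identity on R's built-in `cars` data, which is used here only as a stand-in so that the block runs without OxHome.csv.

```r
# Correlation copied from the cor() output above
r <- 0.831543
round(r^2, 4)   # share of variance explained by a straight-line fit

# Same identity checked on the built-in `cars` data (a stand-in, not OxHome)
fit <- lm(dist ~ speed, data = cars)
identity_holds <- isTRUE(all.equal(cor(cars$dist, cars$speed)^2,
                                   summary(fit)$r.squared))
identity_holds
```

The squared value, about 0.69, is the figure to compare with the Multiple R-squared reported by summary() for the SquareFeet model.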
Create a scatter plot displaying the sales price versus square feet of the home.
ggplot(OxHome, aes(x = SquareFeet, y = SalesPrice)) +
  geom_point(color = "Red") +
  theme(panel.background = element_blank(),
        axis.line = element_line(colour = "Black")) +
  xlab(label = "Square feet of Homes") +
  ylab("Sales Price (in $K)") +
  geom_smooth(method = lm,
              se = FALSE,
              colour = "Blue")
Model Computation
The simple linear regression finds the best-fitting line for predicting sales price from the size (square feet) of the home.
model.squarefeet <- lm(SalesPrice ~ SquareFeet, data = OxHome)
model.squarefeet
##
## Call:
## lm(formula = SalesPrice ~ SquareFeet, data = OxHome)
##
## Coefficients:
## (Intercept) SquareFeet
## -11.5379 0.1122
The results show the intercept and the slope (beta coefficient) for the SquareFeet variable.
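To make the fitted line concrete, the sketch below plugs a home size into the equation SalesPrice = intercept + slope × SquareFeet, using the coefficients printed above. The 1500 sq. ft. input is a hypothetical value chosen only for illustration.

```r
# Coefficients copied from the model output above
intercept <- -11.5379
slope     <- 0.1122      # $K per additional square foot

# Hypothetical home size, chosen only for illustration
square_feet <- 1500

predicted_price <- intercept + slope * square_feet
predicted_price          # in $K, roughly 156.8
```

Read this way, the slope says that each additional square foot is associated with an increase of about 0.11 $K (roughly $112) in the predicted sales price.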
Model summary
We start by displaying the statistical summary of the model using the R function summary():
summary(model.squarefeet)
##
## Call:
## lm(formula = SalesPrice ~ SquareFeet, data = OxHome)
##
## Residuals:
## Min 1Q Median 3Q Max
## -79.010 -17.411 0.281 12.838 85.577
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.537932 14.929279 -0.773 0.443
## SquareFeet 0.112158 0.009837 11.401 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.33 on 58 degrees of freedom
## Multiple R-squared: 0.6915, Adjusted R-squared: 0.6861
## F-statistic: 130 on 1 and 58 DF, p-value: < 2.2e-16
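The summary output shows six components: the call, the residuals, the coefficients, the residual standard error, the R-squared values, and the F-statistic.
Coefficients significance
The coefficients table in the above statistical summary shows that SquareFeet is highly significant (p < 2e-16), while the intercept is not significantly different from zero (p = 0.443).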
Model Evaluation
To use R’s regression diagnostic plots, we plot the regression model as an object.
par(mfrow = c(2,2))
plot(model.squarefeet)
The first plot (Residuals vs Fitted) is a scatterplot of the residuals against the fitted values. It should look more or less random; it shows how the errors vary with the fitted values.
The second plot (Normal Q-Q) is a normal probability plot. It gives a straight line if the errors are normally distributed. Here there is some deviation from the straight line at the ends, which suggests some skewness in the distribution.
The third plot (Scale-Location), like the first, should look random. It is used to check for heteroscedasticity, which is a problem because ordinary least squares (OLS) regression assumes that all residuals are drawn from a population with constant variance (homoscedasticity). Here, the variance appears almost constant.
The last plot (Residuals vs Leverage) shows which points have the greatest influence on the regression model.
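The influence information in the last plot can also be read off numerically. The sketch below uses base R's cooks.distance() and hatvalues() with two common rule-of-thumb cut-offs (4/n for Cook's distance and 2p/n for leverage); these thresholds are conventions rather than hard rules. The model is fitted on the built-in `cars` data as a stand-in, so the flagged rows are not OxHome results.

```r
# Stand-in model on built-in data; replace `fit` with model.squarefeet
# to apply the same checks to the analysis in this document
fit <- lm(dist ~ speed, data = cars)

n <- nrow(cars)            # number of observations
p <- length(coef(fit))     # number of estimated coefficients

cooks    <- cooks.distance(fit)   # influence of each observation on the fit
leverage <- hatvalues(fit)        # how unusual each observation's x value is

# Rule-of-thumb flags (assumed thresholds, not part of the original analysis)
names(cooks)[cooks > 4 / n]
names(leverage)[leverage > 2 * p / n]
```

Observations flagged by both checks are the ones worth inspecting before trusting the fitted coefficients.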
Now let's store the fit statistics in a data frame so that all the models can be compared at once and the best-fitting one identified.
df = data.frame()
df <- rbind(
df,
glance(model.squarefeet) %>%
select(adj.r.squared, sigma, AIC, BIC, p.value) %>%
mutate(Model.Name = "model.squarefeet")
)
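The same collect-and-compare pattern can be written as a loop over predictor names, which avoids repeating the block for every model. The sketch below is a base R variant that does not need broom; it runs on the built-in `mtcars` data as a stand-in, with `mpg` playing the role of SalesPrice and four arbitrary predictors, so the numbers it prints are not OxHome results.

```r
# Stand-in predictors from mtcars; for this analysis they would be
# c("SquareFeet", "Rooms", "Bedrooms", "Age") with response "SalesPrice"
predictors <- c("wt", "hp", "disp", "qsec")

rows <- lapply(predictors, function(v) {
  fit <- lm(reformulate(v, response = "mpg"), data = mtcars)
  s   <- summary(fit)
  f   <- s$fstatistic   # F value, numerator df, denominator df
  data.frame(
    Model.Name    = paste0("model.", v),
    adj.r.squared = s$adj.r.squared,
    sigma         = s$sigma,
    AIC           = AIC(fit),
    BIC           = BIC(fit),
    p.value       = unname(pf(f[1], f[2], f[3], lower.tail = FALSE))
  )
})

comparison <- do.call(rbind, rows)
comparison[order(comparison$AIC), ]   # best (lowest AIC) model first
```

Because every model here has one predictor and the same number of observations, ranking by AIC, adjusted R-squared, or sigma gives the same order.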
Compute the correlation coefficient between the two variables using the R function cor():
cor(SalesPrice, Rooms)
## [1] 0.588339
The correlation coefficient measures the strength of the linear association between sales price and Rooms; a value of 0.59 indicates a moderate positive relationship.
Create a scatter plot displaying the sales price versus the number of rooms in the home.
ggplot(OxHome, aes(x = Rooms, y = SalesPrice)) +
  geom_point(color = "Blue") +
  theme(panel.background = element_blank(),
        axis.line = element_line(colour = "Black")) +
  xlab(label = "Number of Rooms") +
  ylab("Sales Price (in $K)") +
  geom_smooth(method = lm,
              se = FALSE,
              colour = "Orange")
Model Computation
The simple linear regression finds the best-fitting line for predicting sales price from the number of rooms in the home.
model.Rooms <- lm(SalesPrice ~ Rooms, data = OxHome)
model.Rooms
##
## Call:
## lm(formula = SalesPrice ~ Rooms, data = OxHome)
##
## Coefficients:
## (Intercept) Rooms
## 13.13 22.56
The results show the intercept and the slope (beta coefficient) for the Rooms variable.
Model summary
We start by displaying the statistical summary of the model using the R function summary():
summary(model.Rooms)
##
## Call:
## lm(formula = SalesPrice ~ Rooms, data = OxHome)
##
## Residuals:
## Min 1Q Median 3Q Max
## -89.621 -29.950 -6.719 19.660 136.940
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.134 25.766 0.510 0.612
## Rooms 22.561 4.071 5.541 7.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.07 on 58 degrees of freedom
## Multiple R-squared: 0.3461, Adjusted R-squared: 0.3349
## F-statistic: 30.7 on 1 and 58 DF, p-value: 7.673e-07
Model Evaluation
To use R’s regression diagnostic plots, we plot the regression model as an object.
par(mfrow = c(2,2))
plot(model.Rooms)
Append the fit statistics to the comparison data frame.
df <- rbind(
df,
glance(model.Rooms) %>%
select(adj.r.squared, sigma, AIC, BIC, p.value) %>%
mutate(Model.Name = "model.Rooms")
)
Compute the correlation coefficient between the two variables using the R function cor():
cor(SalesPrice, Bedrooms)
## [1] 0.4544255
The correlation coefficient measures the strength of the linear association between sales price and Bedrooms; a value of 0.45 indicates a weaker positive relationship.
Create a scatter plot displaying the sales price versus the number of bedrooms in the home.
ggplot(OxHome, aes(x = Bedrooms, y = SalesPrice)) +
  geom_point(color = "deeppink1") +
  theme(panel.background = element_blank(),
        axis.line = element_line(colour = "Black")) +
  xlab(label = "Number of Bedrooms") +
  ylab("Sales Price (in $K)") +
  geom_smooth(method = lm,
              se = FALSE,
              colour = "blue")
Model Computation
The simple linear regression finds the best-fitting line for predicting sales price from the number of bedrooms in the home.
model.Bedrooms <- lm(SalesPrice ~ Bedrooms, data = OxHome)
model.Bedrooms
##
## Call:
## lm(formula = SalesPrice ~ Bedrooms, data = OxHome)
##
## Coefficients:
## (Intercept) Bedrooms
## 31.11 41.65
The results show the intercept and the slope (beta coefficient) for the Bedrooms variable.
Model summary
We start by displaying the statistical summary of the model using the R function summary():
summary(model.Bedrooms)
##
## Call:
## lm(formula = SalesPrice ~ Bedrooms, data = OxHome)
##
## Residuals:
## Min 1Q Median 3Q Max
## -74.693 -36.814 -8.048 20.952 151.952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.11 31.80 0.978 0.331932
## Bedrooms 41.65 10.72 3.885 0.000265 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.85 on 58 degrees of freedom
## Multiple R-squared: 0.2065, Adjusted R-squared: 0.1928
## F-statistic: 15.09 on 1 and 58 DF, p-value: 0.000265
Model Evaluation
To use R’s regression diagnostic plots, we plot the regression model as an object.
par(mfrow = c(2,2))
plot(model.Bedrooms)
Append the fit statistics to the comparison data frame.
df <- rbind(
df,
glance(model.Bedrooms) %>%
select(adj.r.squared, sigma, AIC, BIC, p.value) %>%
mutate(Model.Name = "model.Bedrooms")
)
Compute the correlation coefficient between the two variables using the R function cor():
cor(SalesPrice, Age)
## [1] -0.294834
The correlation coefficient measures the strength of the linear association between sales price and Age; a value of -0.29 indicates a weak negative relationship.
Create a scatter plot displaying the sales price versus the age of the home.
ggplot(OxHome, aes(x = Age, y = SalesPrice)) +
  geom_point(color = "Green") +
  theme(panel.background = element_blank(),
        axis.line = element_line(colour = "Black")) +
  xlab(label = "Age of Homes") +
  ylab("Sales Price (in $K)") +
  geom_smooth(method = lm,
              se = FALSE,
              colour = "Red")
Model Computation
The simple linear regression finds the best-fitting line for predicting sales price from the age of the home.
model.Age <- lm(SalesPrice ~ Age, data = OxHome)
model.Age
##
## Call:
## lm(formula = SalesPrice ~ Age, data = OxHome)
##
## Coefficients:
## (Intercept) Age
## 174.9518 -0.7138
The results show the intercept and the slope (beta coefficient) for the Age variable.
Model summary
We start by displaying the statistical summary of the model using the R function summary():
summary(model.Age)
##
## Call:
## lm(formula = SalesPrice ~ Age, data = OxHome)
##
## Residuals:
## Min 1Q Median 3Q Max
## -73.97 -41.36 -14.96 22.94 154.31
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 174.9518 12.1630 14.38 <2e-16 ***
## Age -0.7138 0.3038 -2.35 0.0222 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 55.62 on 58 degrees of freedom
## Multiple R-squared: 0.08693, Adjusted R-squared: 0.07118
## F-statistic: 5.522 on 1 and 58 DF, p-value: 0.02221
Model Evaluation
To use R’s regression diagnostic plots, we plot the regression model as an object.
par(mfrow = c(2,2))
plot(model.Age)
Append the fit statistics to the comparison data frame.
df <- rbind(
df,
glance(model.Age) %>%
select(adj.r.squared, sigma, AIC, BIC, p.value) %>%
mutate(Model.Name = "model.Age")
)
Now we compare the four models above to find out which one fits best, using the fit statistics collected in the data frame.
df.comparision <-
df %>% select(Model.Name, adj.r.squared, sigma, AIC, BIC, p.value)
df.comparision
## # A tibble: 4 x 6
## Model.Name adj.r.squared sigma AIC BIC p.value
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 model.squarefeet 0.686 32.3 591. 598. 1.93e-16
## 2 model.Rooms 0.335 47.1 636. 643. 7.67e- 7
## 3 model.Bedrooms 0.193 51.9 648. 654. 2.65e- 4
## 4 model.Age 0.0712 55.6 656. 663. 2.22e- 2
Choosing the best fit model
All four models provide a statistically significant fit at the 5% level, since each has a p-value below 0.05, although model.Age is only weakly significant (p = 0.022). From the above table we can conclude that model 1, i.e. model.squarefeet, is the best fit among them for the following reasons:
model.squarefeet has the lowest AIC (and the lowest BIC); the lower the AIC, the better the model.
Its adjusted R-squared is also the highest of all the models (0.686), so it explains the largest share of the variation in sales price. This metric penalises extra variables that add little to a model, although here every model has a single predictor.
This model also has the smallest residual standard error (sigma). Regression is based on ordinary least squares (OLS), so a smaller error signifies a better fit.
Question 2: (b)
Build a multiple linear regression model to predict house sales price using all four explanatory variables. Comment on the fit.
Solution
Model Computation
The multiple linear regression finds the best-fitting linear combination of all the available variables (SquareFeet, Rooms, Bedrooms and Age) to predict the sales price of the home.
MLR.Model <-
lm(SalesPrice ~ SquareFeet + Rooms + Bedrooms + Age, data = OxHome)
MLR.Model
##
## Call:
## lm(formula = SalesPrice ~ SquareFeet + Rooms + Bedrooms + Age,
## data = OxHome)
##
## Coefficients:
## (Intercept) SquareFeet Rooms Bedrooms Age
## 19.4865 0.1079 11.8807 -25.7280 -0.7207
The results show the intercept and the beta coefficient for each variable of OxHome data.
Model summary
We start by displaying the statistical summary of the model using the R function summary():
summary(MLR.Model)
##
## Call:
## lm(formula = SalesPrice ~ SquareFeet + Rooms + Bedrooms + Age,
## data = OxHome)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.14 -14.68 -1.13 15.62 66.08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.48651 16.86523 1.155 0.25291
## SquareFeet 0.10791 0.01228 8.788 4.62e-12 ***
## Rooms 11.88067 3.51702 3.378 0.00135 **
## Bedrooms -25.72796 8.13677 -3.162 0.00255 **
## Age -0.72074 0.15307 -4.709 1.73e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.21 on 55 degrees of freedom
## Multiple R-squared: 0.8078, Adjusted R-squared: 0.7938
## F-statistic: 57.77 on 4 and 55 DF, p-value: < 2.2e-16
The above statistics indicate that the model fits much better than the previous single-variable models: the adjusted R-squared rises to 0.794 and the residual standard error falls to 26.21. Based on the p-values of the coefficients, all four variables are significant at the 1% level. SquareFeet and Age are the most highly significant and contribute the most to the model, as denoted by the three-star significance code (***).
Model Evaluation
Check the diagnostic plots to decide whether the fitted model is acceptable.
par(mfrow = c(2,2))
plot(MLR.Model)
From the above charts, the residuals appear roughly evenly spread and the standardized residuals do not show much change in variance. This suggests that the model should predict the sales price reasonably well from these variables.
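One feature of this fit is worth a further check: Bedrooms has a positive slope (41.65) in its simple regression but a negative one (-25.73) here, which is typical when predictors are correlated with each other. A common way to quantify this is the variance inflation factor, VIF = 1 / (1 - R²), where R² comes from regressing one predictor on the others. The sketch below computes it in base R; `vif_manual` is a hypothetical helper name, and the built-in `mtcars` data is a stand-in for OxHome.

```r
# Hypothetical helper: VIF of each predictor, computed by regressing it
# on the remaining predictors and taking 1 / (1 - R^2)
vif_manual <- function(data, vars) {
  sapply(vars, function(v) {
    others <- setdiff(vars, v)
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)
  })
}

# Stand-in call; for this analysis the call would be
# vif_manual(OxHome, c("SquareFeet", "Rooms", "Bedrooms", "Age"))
vifs <- vif_manual(mtcars, c("wt", "hp", "disp", "qsec"))
vifs
```

A VIF near 1 means a predictor is nearly uncorrelated with the others; larger values mean its coefficient is estimated less precisely and its sign is harder to interpret in isolation.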
Question 2: (c)
Using stepwise regression or otherwise, find the best model either in terms of adjusted R2, or AIC. Discuss the significance of the predictors in this model.
Solution
For stepwise regression, there are three methods for choosing the best-fitting model based on AIC: forward selection, backward elimination, and a combination of both.
Model Computation
Let's run the stepwise regression with the step() function using the forward method. We start from the constant model SalesPrice ~ 1 and then add one variable at a time, computing the AIC at each step, to find the model with the lowest AIC.
Step.Model <-
step(
lm(SalesPrice ~ 1, data = OxHome[, -1]),
direction = "forward",
scope = ~ SquareFeet + Rooms + Bedrooms + Age
)
## Start: AIC=487.65
## SalesPrice ~ 1
##
## Df Sum of Sq RSS AIC
## + SquareFeet 1 135891 60636 419.10
## + Rooms 1 68026 128501 464.16
## + Bedrooms 1 40583 155944 475.77
## + Age 1 17084 179444 484.20
## <none> 196527 487.65
##
## Step: AIC=419.1
## SalesPrice ~ SquareFeet
##
## Df Sum of Sq RSS AIC
## + Age 1 12173.5 48462 407.65
## + Bedrooms 1 4909.4 55726 416.03
## <none> 60636 419.10
## + Rooms 1 369.7 60266 420.73
##
## Step: AIC=407.65
## SalesPrice ~ SquareFeet + Age
##
## Df Sum of Sq RSS AIC
## + Rooms 1 3813.8 44648 404.73
## + Bedrooms 1 2842.9 45619 406.02
## <none> 48462 407.65
##
## Step: AIC=404.73
## SalesPrice ~ SquareFeet + Age + Rooms
##
## Df Sum of Sq RSS AIC
## + Bedrooms 1 6867.7 37781 396.71
## <none> 44648 404.73
##
## Step: AIC=396.71
## SalesPrice ~ SquareFeet + Age + Rooms + Bedrooms
Step.Model
##
## Call:
## lm(formula = SalesPrice ~ SquareFeet + Age + Rooms + Bedrooms,
## data = OxHome[, -1])
##
## Coefficients:
## (Intercept) SquareFeet Age Rooms Bedrooms
## 19.4865 0.1079 -0.7207 11.8807 -25.7280
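The forward search ends with all four variables in the model, so the selected model is the same as the multiple regression in part (b). The summary below gives the statistical significance of each variable: all four predictors are significant at the 1% level, with SquareFeet and Age the most significant.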
summary(Step.Model)
##
## Call:
## lm(formula = SalesPrice ~ SquareFeet + Age + Rooms + Bedrooms,
## data = OxHome[, -1])
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.14 -14.68 -1.13 15.62 66.08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.48651 16.86523 1.155 0.25291
## SquareFeet 0.10791 0.01228 8.788 4.62e-12 ***
## Age -0.72074 0.15307 -4.709 1.73e-05 ***
## Rooms 11.88067 3.51702 3.378 0.00135 **
## Bedrooms -25.72796 8.13677 -3.162 0.00255 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.21 on 55 degrees of freedom
## Multiple R-squared: 0.8078, Adjusted R-squared: 0.7938
## F-statistic: 57.77 on 4 and 55 DF, p-value: < 2.2e-16
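The AIC values printed by step() (for example 419.10 for SalesPrice ~ SquareFeet) are on a different scale from those in the part (a) comparison table (591 for the same model). step() relies on extractAIC(), which drops a constant that AIC() keeps; for a linear model with n observations the gap is n(1 + log 2π) + 2. Since the gap is the same for every model fitted to the same data, both scales rank the models identically. The sketch below checks the arithmetic with the numbers in this document and then confirms the offset on the built-in `cars` data, used only as a stand-in.

```r
# Numbers copied from the outputs in this document
n        <- 60       # observations, from str(OxHome)
step_aic <- 419.10   # step() value for SalesPrice ~ SquareFeet

# Constant included by AIC() but left out by extractAIC(), which step() uses
offset <- n * (1 + log(2 * pi)) + 2
step_aic + offset    # about 591.4, in line with the 591 in the comparison table

# Same offset checked on the built-in `cars` data (a stand-in, not OxHome)
fit <- lm(dist ~ speed, data = cars)
gap <- AIC(fit) - extractAIC(fit)[2]
offset_matches <- isTRUE(all.equal(gap, nrow(cars) * (1 + log(2 * pi)) + 2))
offset_matches
```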
Question 2: (d)
Forecast the price of the following house using the model fitted in part c): Area = 5000 sq. ft., 20 rooms with 10 bedrooms, 100 years old. Comment on the validity of the forecast.
Solution
We create a data frame with the desired inputs and then predict the sales price of the home with the predict() function, using the model from the stepwise regression.
test <- data.frame(SquareFeet = 5000, Rooms = 20, Bedrooms = 10, Age = 100, SalesPrice = NA)
home.pred <- predict(Step.Model, test)
home.pred
## 1
## 467.3019
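The predicted sales price of a house with an area of 5000 sq. ft., 20 rooms including 10 bedrooms, and an age of 100 years comes out to about 467.30 (in $K). This forecast should be treated with caution. The inputs appear to lie well outside the homes seen in the data: the first ten homes listed by str() range from 860 to 1764 sq. ft., with 5 to 8 rooms, 2 to 4 bedrooms, and ages of 10 to 64 years. The model is therefore extrapolating beyond the range over which the linear fit has been checked.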
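One way to check whether a forecast is an extrapolation is to compare each new input with the range of that variable in the fitted data. The sketch below does this with a small helper; `outside_range` is a hypothetical name, and the built-in `mtcars` data with a made-up new row stands in for OxHome and the house in this question.

```r
# Hypothetical helper: TRUE for each predictor whose new value falls
# outside the range observed in the data used to fit the model
outside_range <- function(data, newdata, vars) {
  sapply(vars, function(v) {
    newdata[[v]] < min(data[[v]]) | newdata[[v]] > max(data[[v]])
  })
}

# Stand-in example: a made-up car that is heavier than any in mtcars
new_car <- data.frame(wt = 10, hp = 150)
flags <- outside_range(mtcars, new_car, c("wt", "hp"))
flags
```

For this analysis the corresponding call would be outside_range(OxHome, test, c("SquareFeet", "Rooms", "Bedrooms", "Age")). Calling predict(Step.Model, test, interval = "prediction") additionally reports a prediction interval, which widens as the inputs move away from the centre of the data.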