Boston housing data

Construct a working model capable of predicting the value of houses. The features ‘RM’, ‘LSTAT’ and ‘PTRATIO’ give quantitative information about each data point. The target variable, ‘MEDV’, is the variable we seek to predict.

Note:

  • RM is the average number of rooms among homes in the neighbourhood.

  • LSTAT is the percentage of homeowners in the neighbourhood considered lower class / working poor.

  • PTRATIO is the ratio of students to teachers in primary and secondary schools in the neighbourhood.

Use R Markdown to generate your report as either an HTML file or a Word file.

Load data

df <- read.table("D:/Great Lakes/Predictive Modeling/Multiple Linear Regression/housing.csv", sep = ",", header = TRUE)
summary(df)
##        RM            LSTAT          PTRATIO           MEDV        
##  Min.   :3.561   Min.   : 1.98   Min.   :12.60   Min.   : 105000  
##  1st Qu.:5.880   1st Qu.: 7.37   1st Qu.:17.40   1st Qu.: 350700  
##  Median :6.185   Median :11.69   Median :19.10   Median : 438900  
##  Mean   :6.240   Mean   :12.94   Mean   :18.52   Mean   : 454343  
##  3rd Qu.:6.575   3rd Qu.:17.12   3rd Qu.:20.20   3rd Qu.: 518700  
##  Max.   :8.398   Max.   :37.97   Max.   :22.00   Max.   :1024800

Load required libraries

library(ggplot2)

Useful functions

# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {

  library(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots <- length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

  if (numPlots == 1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}

Initial visualization

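# Predictor on the x-axis, MEDV (the response) on the y-axis, matching the titles.
# ggplot() is used instead of qplot(), which is deprecated in recent ggplot2 releases.
p1 <- ggplot(df, aes(RM, MEDV)) + geom_point() + geom_smooth() + ggtitle("MEDV by RM")
p2 <- ggplot(df, aes(LSTAT, MEDV)) + geom_point() + geom_smooth() + ggtitle("MEDV by LSTAT")
p3 <- ggplot(df, aes(PTRATIO, MEDV)) + geom_point() + geom_smooth() + ggtitle("MEDV by PTRATIO")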

multiplot(p1, p2, p3, cols=3)
## `geom_smooth()` using method = 'loess'
## `geom_smooth()` using method = 'loess'
## `geom_smooth()` using method = 'loess'

Observation

  • In the first plot, RM and MEDV are positively correlated: the higher the value of RM, the higher the value of MEDV tends to be. This is plausible, since homes with more rooms generally sell for more.

  • In the second plot, LSTAT and MEDV are negatively correlated: the higher the value of LSTAT, the lower the value of MEDV tends to be. A plausible explanation is that developers are less likely to build expensive housing in neighbourhoods where most residents could not afford it.

  • In the third plot, PTRATIO and MEDV are negatively correlated: the higher the value of PTRATIO, the lower the value of MEDV tends to be. A plausible explanation is that with more students per teacher, teachers have less time for each student, which may affect the quality of education. Neighbourhoods with a low PTRATIO therefore tend to have higher house prices.
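The signs read off the plots above can also be checked numerically with Pearson correlations. The following is only a minimal sketch: because housing.csv is not available here, it builds a small simulated stand-in data frame (`demo`, a hypothetical name, generated with made-up coefficients), so the printed numbers are not those of the report. With the real data, `demo` would simply be replaced by `df`.

```r
# Simulated stand-in for housing.csv (hypothetical values, illustration only)
set.seed(1)
n <- 200
demo <- data.frame(
  RM      = rnorm(n, mean = 6.2, sd = 0.7),
  LSTAT   = runif(n, min = 2, max = 38),
  PTRATIO = runif(n, min = 12.6, max = 22)
)
demo$MEDV <- 400000 + 90000 * demo$RM - 11000 * demo$LSTAT -
  19000 * demo$PTRATIO + rnorm(n, sd = 80000)

# Pearson correlation of each predictor with MEDV (3 x 1 matrix)
cors <- cor(demo[, c("RM", "LSTAT", "PTRATIO")], demo$MEDV)
print(round(cors, 2))
```

A positive entry for RM and negative entries for LSTAT and PTRATIO would agree with the direction of the trends seen in the three plots.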

Apply Multiple Linear Regression

names(df)
## [1] "RM"      "LSTAT"   "PTRATIO" "MEDV"
MEDV.lm <- lm(MEDV ~ RM + LSTAT + PTRATIO, data=df)

Define a Model Performance Metric

We use the coefficient of determination, R^2, to quantify our model’s performance. R^2 is a useful statistic in regression analysis: it measures the proportion of the variance in the target variable that the model explains, and so indicates how well the model fits the data. We print the summary of the object MEDV.lm to get this metric.

summary(MEDV.lm)
## 
## Call:
## lm(formula = MEDV ~ RM + LSTAT + PTRATIO, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -231330  -55228   -8137   41788  326444 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 415464.4    68845.7   6.035 3.17e-09 ***
## RM           86565.2     7888.9  10.973  < 2e-16 ***
## LSTAT       -10849.3      732.1 -14.819  < 2e-16 ***
## PTRATIO     -19492.1     2039.0  -9.559  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 88130 on 485 degrees of freedom
## Multiple R-squared:  0.7176, Adjusted R-squared:  0.7159 
## F-statistic: 410.9 on 3 and 485 DF,  p-value: < 2.2e-16
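For reference, the Multiple R-squared figure reported by summary() can be reproduced by hand as 1 - SS_res / SS_tot. The sketch below assumes a simulated stand-in data frame (`demo`, hypothetical, with made-up coefficients) in place of housing.csv, so its R^2 will differ from the 0.7176 shown above. The point is only that the manual calculation and summary() agree.

```r
# Simulated stand-in data (hypothetical; not the report's housing.csv)
set.seed(1)
n <- 200
demo <- data.frame(
  RM      = rnorm(n, mean = 6.2, sd = 0.7),
  LSTAT   = runif(n, min = 2, max = 38),
  PTRATIO = runif(n, min = 12.6, max = 22)
)
demo$MEDV <- 400000 + 90000 * demo$RM - 11000 * demo$LSTAT -
  19000 * demo$PTRATIO + rnorm(n, sd = 80000)

fit <- lm(MEDV ~ RM + LSTAT + PTRATIO, data = demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((demo$MEDV - mean(demo$MEDV))^2)
r2_manual <- 1 - ss_res / ss_tot

# Value computed by summary.lm, for comparison
r2_summary <- summary(fit)$r.squared
print(c(manual = r2_manual, summary = r2_summary))
```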

Observation

The coefficient of determination (Multiple R-squared) is 71.76% and the Adjusted R-squared is 71.59%. This means that about 72% of the variation in the target variable (MEDV) is explained by the predictor variables, which indicates a reasonably good fit for a model with only three predictors.
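The relationship between the two R-squared figures can be verified with a short calculation. This sketch uses only numbers taken from the summary output (R-squared 0.7176, 485 residual degrees of freedom, 3 predictors) and the standard adjustment formula; the variable names are illustrative.

```r
r2       <- 0.7176   # Multiple R-squared from the summary output
df_resid <- 485      # residual degrees of freedom from the summary output
p        <- 3        # number of predictors: RM, LSTAT, PTRATIO
n        <- df_resid + p + 1   # observations implied by the output (489)

# Adjusted R^2 penalises R^2 for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 4))   # 0.7159, matching the summary output
```

Because the sample is large relative to the number of predictors, the adjustment is very small.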

Conclusion

The fitted regression equation is MEDV = 415464.4 + 86565.2 * RM - 10849.3 * LSTAT - 19492.1 * PTRATIO

  • The coefficient of about 86565 for RM tells us that for a given LSTAT and PTRATIO, the predicted MEDV increases by about 86565 for every additional room (a 1.0 unit increase in RM).

  • The coefficient of about -10849 for LSTAT tells us that for a given RM and PTRATIO, the predicted MEDV decreases by about 10849 for every 1.0 unit increase in LSTAT.

  • The coefficient of about -19492 for PTRATIO tells us that for a given RM and LSTAT, the predicted MEDV decreases by about 19492 for every 1.0 unit increase in PTRATIO.
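The coefficient interpretations above can be illustrated by plugging values into the fitted equation. This is a minimal sketch: `predict_medv` is a hypothetical helper that hard-codes the coefficients from the report, and the input values describe a made-up neighbourhood near the sample medians.

```r
# Coefficients copied from the fitted regression equation in the report
predict_medv <- function(RM, LSTAT, PTRATIO) {
  415464.4 + 86565.2 * RM - 10849.3 * LSTAT - 19492.1 * PTRATIO
}

# Hypothetical neighbourhood (illustrative inputs, not a row of the data)
base          <- predict_medv(RM = 6, LSTAT = 12, PTRATIO = 19)
one_more_room <- predict_medv(RM = 7, LSTAT = 12, PTRATIO = 19)

print(base)                   # predicted MEDV for the baseline
print(one_more_room - base)   # equals the RM coefficient, 86565.2
```

With the fitted model object available, `predict(MEDV.lm, newdata = ...)` would give the same kind of prediction without copying coefficients by hand.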

For each of the coefficients, a test of H0: β = 0 versus Ha: β ≠ 0 has a p-value far below 0.001 (see the column headed Pr(>|t|), where all three slopes are reported as < 2e-16). These are conditional hypotheses: each test asks whether that explanatory variable needs to be in the model, given that the others are already there.

Therefore, in this example, the tests tell us that all 3 of the explanatory variables are useful (highly significant) in the model. Each p-value is far below the 5% significance level, so we reject H0 and conclude that the coefficients are not 0.
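The p-values discussed above can be pulled out of the model summary programmatically rather than read from the printed table. The sketch below again assumes a simulated stand-in data frame (`demo`, hypothetical, with made-up coefficients), so the values differ from the report. With the real model, `fit` would be replaced by `MEDV.lm`.

```r
# Simulated stand-in data (hypothetical; not the report's housing.csv)
set.seed(1)
n <- 200
demo <- data.frame(
  RM      = rnorm(n, mean = 6.2, sd = 0.7),
  LSTAT   = runif(n, min = 2, max = 38),
  PTRATIO = runif(n, min = 12.6, max = 22)
)
demo$MEDV <- 400000 + 90000 * demo$RM - 11000 * demo$LSTAT -
  19000 * demo$PTRATIO + rnorm(n, sd = 80000)

fit <- lm(MEDV ~ RM + LSTAT + PTRATIO, data = demo)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs    <- coef(summary(fit))
p_values <- coefs[, "Pr(>|t|)"]

print(signif(p_values, 3))
print(p_values < 0.05)       # TRUE where H0: beta = 0 is rejected at 5%

# 95% confidence intervals; an interval excluding 0 agrees with rejecting H0
ci <- confint(fit, level = 0.95)
print(ci)
```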