BOSTON HOUSING DATA ANALYSIS


Q1 Exploratory Data Analysis

This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. It was obtained from the StatLib archive (http://lib.stat.cmu.edu/datasets/boston) and has been used extensively throughout the literature to benchmark algorithms.

The data was originally published by Harrison, D. and Rubinfeld, D.L. “Hedonic prices and the demand for clean air”, J. Environ. Economics & Management, vol.5, 81-102, 1978.

The sample training dataset contains 354 observations, with 13 predictor variables and 1 response variable.
Below are the columns included in the data set, all of which are numeric:
1. crim - per capita crime rate by town
2. zn - proportion of residential land zoned for lots over 25,000 sq.ft.
3. indus - proportion of non-retail business acres per town
4. chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. nox - nitrogen oxides concentration (parts per 10 million)
6. rm - average number of rooms per dwelling
7. age - proportion of owner-occupied units built prior to 1940
8. dis - weighted mean of distances to five Boston employment centres
9. rad - index of accessibility to radial highways
10. tax - full-value property-tax rate per $10,000
11. ptratio - pupil-teacher ratio by town
12. black - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. lstat - lower status of the population (percent)
14. medv - median value of owner-occupied homes in $1000s

Note: We have dropped the black column from the dataset because it encodes a racially sensitive variable.
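A minimal sketch of dropping that column, assuming the data come from the MASS package's Boston data frame (the object name boston is illustrative, not the exact code used):

library(MASS)
boston <- subset(Boston, select = -black)  # drop the racially sensitive column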

COLUMN NAMES AND HEAD OF DATASET

##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "lstat"   "medv"

STRUCTURE OF DATASET

## 'data.frame':    506 obs. of  13 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

MISSING VALUES AND DUPLICATES:
Checking every observation for every covariate, we find that there are no missing values and no duplicate rows.

##    crim      zn   indus    chas     nox      rm     age     dis     rad     tax 
##       0       0       0       0       0       0       0       0       0       0 
## ptratio   lstat    medv 
##       0       0       0
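A minimal sketch of the checks reported above (the object name boston is illustrative):

colSums(is.na(boston))   # count of missing values per column
sum(duplicated(boston))  # number of duplicated rows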

Boxplot: We analysed boxplots to understand the distribution of every variable and to check for notable outliers. The distributions for age and tax look reasonable and show no outliers, which is a good starting point. There are many outliers in three variables in particular: crim, zn and black. The zn and crim distributions remain right-skewed even after standard normalization, while black appears left-skewed. The nox and rm distributions are roughly symmetric, and their boxplots reflect this. rad is also strongly skewed. Since chas is a dummy variable taking only the values 0 and 1, its boxplot is not informative.
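One possible way to produce these boxplots (the scaling step is an assumption, used so that all variables share a single axis):

# boxplots of the standardized covariates, one box per variable
boxplot(scale(boston), las = 2, main = "Standardized distributions of the covariates")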

Scatter plot and correlation matrix:

To further the analysis, we look at the correlation matrix and scatter plots to see which predictor variables have a positive or negative correlation with the response variable (medv). This gives a starting point for deciding which covariates to consider while building the regression model, and it also reveals covariates that are strongly correlated with each other, which would lead to collinearity. Examining these plots and values during the initial exploration sets the ground for model building and variable selection. Overall, the covariates show both positive and negative correlations with the response variable: crim, indus, nox, age, rad, tax, ptratio and lstat are negatively correlated with medv, with lstat showing the strongest negative correlation, while zn, chas, rm, dis and black are positively correlated, with rm showing the strongest positive correlation. rad and tax have a strong positive correlation of 0.91, implying that as accessibility to radial highways increases, the full-value property-tax rate per $10,000 also tends to increase. indus has a strong positive correlation with nox, indicating that nitrogen oxide concentrations are high in industrial areas.
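A minimal sketch of how these summaries can be produced (the object name boston is illustrative):

round(cor(boston), 2)               # pairwise correlation matrix, including medv
pairs(boston, pch = 20, cex = 0.5)  # scatter-plot matrix of all variables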

Linear Regression using all covariates (no transformation)

We fitted a linear regression model on the full Boston housing dataset (506 observations) using all 13 covariates, with medv as the response and no variable transformation; note that this baseline fit still includes black. Below are the summary statistics of this regression.

## 
## Call:
## lm(formula = medv ~ ., data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.595  -2.730  -0.518   1.777  26.199 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
## crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
## zn           4.642e-02  1.373e-02   3.382 0.000778 ***
## indus        2.056e-02  6.150e-02   0.334 0.738288    
## chas         2.687e+00  8.616e-01   3.118 0.001925 ** 
## nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
## rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
## age          6.922e-04  1.321e-02   0.052 0.958229    
## dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
## rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
## tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
## ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
## black        9.312e-03  2.686e-03   3.467 0.000573 ***
## lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7338 
## F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16

Based on the above summary statistics, indus and age appear to be statistically insignificant, as their p-values are greater than 0.05, whereas the other covariates are significant with p-values below 0.05. Since this is a multiple linear regression, we consider the adjusted R-squared, which is approximately 73%, indicating that about 73% of the variation in the response variable is explained by the model. The model therefore seems to be doing fairly well, but we can still improve it by trying different models and transformations and by analysing the residual plots.


There appears to be a curved pattern in the residuals vs. fitted values plot, which suggests that the linearity and constant variance assumptions do not hold. Most of the points also appear to be clustered around the centre, so we need to perform model transformations to improve the model.
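A minimal sketch of the residual diagnostics referred to above (the object name fit.full is illustrative):

fit.full <- lm(medv ~ ., data = Boston)
plot(fit.full, which = 1)  # residuals vs fitted values
plot(fit.full, which = 2)  # normal Q-Q plot of the residuals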

Finding the Best Linear Model

Best subset, stepwise and LASSO variable-selection techniques are used to arrive at the best linear regression model for the response variable medv.

SUMMARY OF FULL MODEL

## 
## Call:
## lm(formula = medv ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.3128  -2.8070  -0.5767   1.9120  24.0823 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  50.174847   5.938594   8.449 8.63e-16 ***
## crim         -0.106329   0.045861  -2.319  0.02101 *  
## zn            0.052258   0.016913   3.090  0.00217 ** 
## indus        -0.061985   0.074798  -0.829  0.40785    
## chas          3.222319   1.172301   2.749  0.00630 ** 
## nox         -21.220168   4.766714  -4.452 1.16e-05 ***
## rm            2.445294   0.497402   4.916 1.37e-06 ***
## age           0.014605   0.016398   0.891  0.37373    
## dis          -1.844692   0.254354  -7.252 2.77e-12 ***
## rad           0.336090   0.081706   4.113 4.89e-05 ***
## tax          -0.014132   0.004511  -3.133  0.00188 ** 
## ptratio      -0.791082   0.167241  -4.730 3.29e-06 ***
## lstat        -0.623497   0.060971 -10.226  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.936 on 341 degrees of freedom
## Multiple R-squared:  0.7018, Adjusted R-squared:  0.6913 
## F-statistic: 66.89 on 12 and 341 DF,  p-value: < 2.2e-16
## [1] "*****Full Model*****"
## Model MSE:  23.47246
## R-square is:  0.7018391
## Adjusted R-square is:  0.6913466
## AIC:  2149.772
## Mean Square Prediction Error:  21.08274
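A rough sketch of how these metrics can be obtained (the train/test split and object names are assumptions, not the exact code used):

fit.full  <- lm(medv ~ ., data = train)
model.mse <- mean(fit.full$residuals^2)        # in-sample model MSE
aic.full  <- AIC(fit.full)                     # Akaike information criterion
pred.test <- predict(fit.full, newdata = test)
mspe      <- mean((test$medv - pred.test)^2)   # mean square prediction error on held-out data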

VARIABLE SELECTION USING BEST SUBSET REGRESSION

## [1] "*****Best Subset Model*****"
## Model MSE:  25.03166
## R-square is:  0.6820333
## Adjusted R-square is:  0.6756004
## AIC:  2162.538
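A minimal sketch of best subset selection using the leaps package (the choice of BIC as the selection criterion is an assumption):

library(leaps)
subset.out <- regsubsets(medv ~ ., data = train, nvmax = 12)  # best model of each size
subset.sum <- summary(subset.out)
best.size  <- which.min(subset.sum$bic)   # size of the model with the lowest BIC
coef(subset.out, best.size)               # coefficients of the selected model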

VARIABLE SELECTION USING STEPWISE

##   parameter     RMSE  Rsquared      MAE   RMSESD RsquaredSD     MAESD
## 1      none 5.029353 0.6969678 3.649234 1.072075  0.1129444 0.5161122
## 
## Call:
## lm(formula = .outcome ~ crim + zn + chas + nox + rm + dis + rad + 
##     tax + ptratio + lstat, data = dat)
## 
## Coefficients:
## (Intercept)         crim           zn         chas          nox           rm  
##    49.44221     -0.10812      0.06321      2.76617    -25.52852      3.28926  
##         dis          rad          tax      ptratio        lstat  
##    -1.93374      0.35976     -0.01434     -0.94196     -0.53131
## 
## Call:
## lm(formula = .outcome ~ crim + zn + chas + nox + rm + dis + rad + 
##     tax + ptratio + lstat, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.0479  -3.0262  -0.6149   1.8695  25.0983 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  49.442214   6.094380   8.113 8.78e-15 ***
## crim         -0.108123   0.057050  -1.895 0.058894 .  
## zn            0.063205   0.016891   3.742 0.000214 ***
## chas          2.766172   1.044823   2.648 0.008481 ** 
## nox         -25.528516   4.439259  -5.751 1.96e-08 ***
## rm            3.289263   0.528039   6.229 1.36e-09 ***
## dis          -1.933741   0.243339  -7.947 2.75e-14 ***
## rad           0.359759   0.079041   4.552 7.39e-06 ***
## tax          -0.014336   0.004134  -3.468 0.000590 ***
## ptratio      -0.941964   0.161800  -5.822 1.33e-08 ***
## lstat        -0.531314   0.059661  -8.905  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.947 on 345 degrees of freedom
## Multiple R-squared:  0.7121, Adjusted R-squared:  0.7038 
## F-statistic: 85.35 on 10 and 345 DF,  p-value: < 2.2e-16
## [1]   11.000 1149.163
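The stepwise search above was run through caret; an equivalent idea can be sketched with step() from base R (shown here for illustration only):

fit.full <- lm(medv ~ ., data = train)
fit.step <- step(fit.full, direction = "both", trace = FALSE)  # stepwise selection by AIC
summary(fit.step)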

VARIABLE SELECTION USING LASSO

## 13 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept)  48.469114242
## crim         -0.101572321
## zn            0.061513439
## indus        -0.026860343
## chas          2.768221633
## nox         -24.873957187
## rm            3.271541811
## age           0.004609758
## dis          -1.885012174
## rad           0.332719829
## tax          -0.012839759
## ptratio      -0.930299286
## lstat        -0.535026514
##        MSE   Rsquare
## 1 20.91137 0.7671125

From the above table we see that the LASSO model has the lowest model MSE, so we take LASSO as our best linear model. Since we are using the caret package, we do not set the optimal lambda value manually; caret selects the best lambda automatically, and at that value the fitted model retains all the variables from the full model.
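A minimal sketch of a LASSO fit with a cross-validated lambda using glmnet (the actual analysis used caret; this is an equivalent illustration):

library(glmnet)
x.train <- model.matrix(medv ~ ., data = train)[, -1]  # predictor matrix without intercept
y.train <- train$medv
cv.out  <- cv.glmnet(x.train, y.train, alpha = 1)      # alpha = 1 gives the LASSO penalty
coef(cv.out, s = "lambda.min")                         # coefficients at the best lambda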

Q2 Simulation Study

DESCRIBING THE DATASET
To begin this exercise, the team generated data according to the following guidelines:
● Sample size of 200
● E(Y | X) = 5 + 1.2*X1 + 3*X2
● X1 ~ N(2, 0.16)
● X2 ~ N(-1, 0.01)
● X3 = X1*X2
● Error term ~ N(0, 1)

Using the randomly generated dataset, we conducted stepwise variable selection, choosing at each step the model with the lowest AIC.
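A minimal sketch of generating one such dataset and running stepwise selection by AIC (the seed and object names are illustrative, not the exact code used):

set.seed(1)
n  <- 200
x1 <- rnorm(n, mean = 2,  sd = 0.4)   # X1 ~ N(2, 0.4^2)
x2 <- rnorm(n, mean = -1, sd = 0.1)   # X2 ~ N(-1, 0.1^2)
x3 <- x1 * x2                         # X3 is the product of X1 and X2
y  <- 5 + 1.2 * x1 + 3 * x2 + rnorm(n, mean = 0, sd = 1)

fit.step <- step(lm(y ~ 1), scope = ~ x1 + x2 + x3, direction = "both")
summary(fit.step)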

For example, the simulation result for N = 200 is:
● Y = 5.9120 + 1.2697*X1 - 1.9583*X2
● MSE = 0.9875602
● R^2 = 0.2074085
● Adjusted R^2 = 0.1993618

## Start:  AIC=42.96
## y ~ 1
## 
##        Df Sum of Sq    RSS    AIC
## + x3    1    50.997 194.46 -1.616
## + x1    1    44.640 200.82  4.818
## + x2    1     9.795 235.66 36.819
## <none>              245.46 42.963
## 
## Step:  AIC=-1.62
## y ~ x3
## 
##        Df Sum of Sq    RSS    AIC
## <none>              194.46 -1.616
## + x2    1     0.284 194.18  0.093
## + x1    1     0.255 194.21  0.122
## - x3    1    50.997 245.46 42.963
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6858 -0.6590 -0.0388  0.6847  3.4782 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.9120     0.8431   7.012 3.66e-11 ***
## x1            1.2697     0.1968   6.452 8.35e-10 ***
## x2           -1.9583     0.7772  -2.520   0.0125 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9938 on 197 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1994 
## F-statistic: 25.78 on 2 and 197 DF,  p-value: 1.139e-10

Regression with Different Sample Sizes and Model Performance

Below is a table summarizing the results of running the same variable-selection technique with different sample sizes (N = 25, 100, 200, 500, 1000) and different noise levels for the error term. The results are displayed from the smallest to the largest error.

We ran random simulations with different sample sizes and different error terms; the output is shown in the table above. We had three variables, x1, x2 and x3, and used stepwise regression to find the variables that fit well for each randomly generated sample. The table clearly shows that as the standard deviation of the error term decreases, the R-squared of the model increases and the mean squared error decreases. For SE = 1 and sample size N = 25 we obtain a negative adjusted R^2. The adjusted R^2 is allowed to be negative: it is intended to approximate the true proportion of variance explained, so when the actual R^2 is close to zero the adjusted value can be slightly negative and is best interpreted as an estimate of zero.
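A rough sketch of how such a table can be produced (the grid of noise levels and the object names are assumptions, not the exact code used):

results <- data.frame()
for (n in c(25, 100, 200, 500, 1000)) {
  for (sigma in c(0.25, 0.5, 1)) {        # noise levels are illustrative
    x1 <- rnorm(n, mean = 2,  sd = 0.4)
    x2 <- rnorm(n, mean = -1, sd = 0.1)
    x3 <- x1 * x2
    y  <- 5 + 1.2 * x1 + 3 * x2 + rnorm(n, mean = 0, sd = sigma)
    fit <- step(lm(y ~ 1), scope = ~ x1 + x2 + x3,
                direction = "both", trace = FALSE)
    s   <- summary(fit)
    results <- rbind(results,
                     data.frame(n = n, sigma = sigma,
                                mse = mean(fit$residuals^2),
                                r2 = s$r.squared, adj.r2 = s$adj.r.squared))
  }
}
results[order(results$mse), ]   # display from smallest to largest error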

Q3 Monte-Carlo Simulation

PART 1:
Monte Carlo simulation is used to model the probability of different outcomes in a process. In this question, we perform a Monte Carlo simulation study of the mean function E(y|x) = 5 + 1.2*x1 + 3*x2 with x1 ~ N(2, 0.4^2), x2 ~ N(-1, 0.1^2) and sample size n = 200. Keeping the predictors x fixed across simulations, we report the bias, variance and MSE of the estimates.

n <- 200                        # sample size
m <- 100                        # number of simulation iterations
coef <- c(5, 1.2, 3)            # true coefficients of the mean function
betaMatrix <- matrix(NA, m, 3)  # one row of estimated coefficients per iteration
listMSE <- c()                  # residual variance estimate per iteration

# part i) simulate the predictors X once and keep them fixed
x1 <- rnorm(n = n, mean = 2, sd = 0.4)
x2 <- rnorm(n = n, mean = -1, sd = 0.1)

PART 2:
By fitting the linear regression model m = 100 times with e ~ N(0, σ^2), where σ^2 = 1, we obtain the bias, variance and MSE as follows:

for (j in 1:m) {
  # simulate the error term and generate the response vector y for this iteration
  error <- rnorm(n = n, mean = 0, sd = 1)
  y <- 5 + 1.2 * x1 + 3 * x2 + error
  lm.out <- lm(y ~ x1 + x2)
  betaMatrix[j, ] <- lm.out$coefficients   # store the estimated coefficients
  mysummary <- summary(lm.out)
  listMSE[j] <- mysummary$sigma^2          # residual variance (model MSE) per iteration
}
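A sketch of how the summaries reported below can be computed from the stored estimates (the exact code used is not shown, so this is an assumption):

colMeans(betaMatrix)         # average estimated coefficients over the m iterations
colMeans(betaMatrix) - coef  # bias: average estimate minus the true values (5, 1.2, 3)
apply(betaMatrix, 2, var)    # variance of each estimated coefficient
mean(listMSE)                # average model MSE, expected to be close to sigma^2 = 1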

PART 3:
Interpretation of the result:

Estimation Bias

## [1] 4.918727 1.214877 2.957856

The values above are the averages of the estimated coefficients over the 100 simulations; subtracting the true values (5, 1.2 and 3) gives biases of roughly -0.08, 0.01 and -0.04, all close to zero.

Variance

## [1] 0.59474576 0.02699447 0.53522885

MSE

## [1] 1.006445

We also obtain a model MSE of 1.01, which is approximately equal to sigma^2 = 1, as expected.

SUMMARY

## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.63677 -0.57054 -0.04064  0.58262  2.72738 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.7945     0.7061   6.791 1.29e-10 ***
## x1            1.2231     0.1563   7.823 3.07e-13 ***
## x2            2.9373     0.6390   4.597 7.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9352 on 197 degrees of freedom
## Multiple R-squared:  0.291,  Adjusted R-squared:  0.2838 
## F-statistic: 40.43 on 2 and 197 DF,  p-value: 1.944e-15

From the model summary, we get the following fitted model: Y = 4.79 + 1.22*X1 + 2.94*X2

So we see that this result is close to the true model, though not exactly the same. The simulation demonstrates that no fitted model is perfect: even with simulated data, the random error prevents us from recovering the true coefficients exactly, so there will always be an element of error in the model.