Boston Housing

I am very proud to share this case with you because this exercise taught me again to respect the rules. You can find many ways to use the dataset on Internet, I show someone this way of playing with them and the other did not have enough creativity to change it and made a copy. I believe in open culture and education and that one of the multiple ways to learn is by copying or learning by doing it. Lots of love

I will use the Boston Housing Data file, which represents data on the housing characteristics of 506 the corrected data have been taken from the library mlbench or Statlib athttp://lib.stat.cmu.edu/datasets/

Boston neighborhoods from the 1970 census. In it we can highlight the following variables:

CRIM represents crime per capita by city.
ZN The proportion of residential areas in a given area.
INDUS The proportion of acres dedicated to retail in the city.
CHAS Binary variable (= 1 if the tracks cross the river and 0 otherwise).
NOX Nitric oxide concentration (parts per million).
RM Average number of rooms per dwelling.
AGE Proportion of owner-occupied buildings, built before 1940.
DIS Represents the weighted distance to five employment centers in Boston.
RAD Radial road accessibility index.
TAX Total value of the tax rate per 10,000 dollars.**
PTRATIO Represents the ratio of students per teacher by city.**
B Value defined as 1000 (Bk0.63) 2 where Bk is the proportion of African Americans in the city.**
LSTAT Percentage of lower class in the population.**
MEDV Median value of owner-occupied homes at $ 1,000.**
library(LearnBayes)
library(dplyr)
library(mlbench)

data(BostonHousing)   #To call data from a library (like mlblench), use "data()"
BHD<- BostonHousing
colnames(BHD)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "b"       "lstat"   "medv"
names(BHD) = c("CRIM","ZN","INDUS","CHAS","NOX","RM","AGE","DIS","RAD","TAX","PTRATIO","B","LSTAT","MEDV")
dim(BHD)
## [1] 506  14
head(BHD)
##      CRIM ZN INDUS CHAS   NOX    RM  AGE    DIS RAD TAX PTRATIO      B LSTAT
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   MEDV
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
summary(BHD)# With the summary I observe that there are NO missing values
##       CRIM                ZN             INDUS       CHAS         NOX        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   0:471   Min.   :0.3850  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1: 35   1st Qu.:0.4490  
##  Median : 0.25651   Median :  0.00   Median : 9.69           Median :0.5380  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14           Mean   :0.5547  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10           3rd Qu.:0.6240  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74           Max.   :0.8710  
##        RM             AGE              DIS              RAD        
##  Min.   :3.561   Min.   :  2.90   Min.   : 1.130   Min.   : 1.000  
##  1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100   1st Qu.: 4.000  
##  Median :6.208   Median : 77.50   Median : 3.207   Median : 5.000  
##  Mean   :6.285   Mean   : 68.57   Mean   : 3.795   Mean   : 9.549  
##  3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188   3rd Qu.:24.000  
##  Max.   :8.780   Max.   :100.00   Max.   :12.127   Max.   :24.000  
##       TAX           PTRATIO            B              LSTAT      
##  Min.   :187.0   Min.   :12.60   Min.   :  0.32   Min.   : 1.73  
##  1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38   1st Qu.: 6.95  
##  Median :330.0   Median :19.05   Median :391.44   Median :11.36  
##  Mean   :408.2   Mean   :18.46   Mean   :356.67   Mean   :12.65  
##  3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23   3rd Qu.:16.95  
##  Max.   :711.0   Max.   :22.00   Max.   :396.90   Max.   :37.97  
##       MEDV      
##  Min.   : 5.00  
##  1st Qu.:17.02  
##  Median :21.20  
##  Mean   :22.53  
##  3rd Qu.:25.00  
##  Max.   :50.00

I confirm that there are no missing values and the dimensions are as described, then I build some initial graphs and do the exploratory analysis of the data. I do the descriptive analysis

par(mfrow=c(1,1))
pairs(~ ., data = BHD, main = "BHD Data")

LSTAT, DIS and RM are good linear variables but the CRIM type is not linear, and in fact the relationship is complicated.

plot(MEDV~LSTAT, BHD)

plot(MEDV~DIS, BHD)

plot(MEDV~RM, BHD)

I see with these three graphs that the correlation of the independent variable with DIS is very weak, while with RM and LSTAT a stronger relationship is intuited, intuiting a parabola in LSTAT and RM. I do the correlation matrix to confirm the above.

#round(cor(BHD),1)

FIRST REGRESSION WITH ALL INDEPENDENT VARIABLES Now I analyze the linearity of the independent variable with the dependent variables that can explain the model. I perform a classical least squares fit using lm

fit1=lm(MEDV~.,BHD)
summary(fit1)
## 
## Call:
## lm(formula = MEDV ~ ., data = BHD)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.595  -2.730  -0.518   1.777  26.199 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
## CRIM        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
## ZN           4.642e-02  1.373e-02   3.382 0.000778 ***
## INDUS        2.056e-02  6.150e-02   0.334 0.738288    
## CHAS1        2.687e+00  8.616e-01   3.118 0.001925 ** 
## NOX         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
## RM           3.810e+00  4.179e-01   9.116  < 2e-16 ***
## AGE          6.922e-04  1.321e-02   0.052 0.958229    
## DIS         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
## RAD          3.060e-01  6.635e-02   4.613 5.07e-06 ***
## TAX         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
## PTRATIO     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
## B            9.312e-03  2.686e-03   3.467 0.000573 ***
## LSTAT       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7338 
## F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16

I see that the AGE and INDUS variables are not statistically significant since their p-value is very high and they do not explain the dependent variable that is MEDV. And the errors present a normal distribution in one part, but it can be adjusted much more, so that it is more decisive.

par(mfrow=c(2,2))
plot(fit1)

I remove these two variables to fit the model

fit2=update(fit1,~.-AGE-INDUS)
summary(fit2)
## 
## Call:
## lm(formula = MEDV ~ CRIM + ZN + CHAS + NOX + RM + DIS + RAD + 
##     TAX + PTRATIO + B + LSTAT, data = BHD)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.5984  -2.7386  -0.5046   1.7273  26.2373 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  36.341145   5.067492   7.171 2.73e-12 ***
## CRIM         -0.108413   0.032779  -3.307 0.001010 ** 
## ZN            0.045845   0.013523   3.390 0.000754 ***
## CHAS1         2.718716   0.854240   3.183 0.001551 ** 
## NOX         -17.376023   3.535243  -4.915 1.21e-06 ***
## RM            3.801579   0.406316   9.356  < 2e-16 ***
## DIS          -1.492711   0.185731  -8.037 6.84e-15 ***
## RAD           0.299608   0.063402   4.726 3.00e-06 ***
## TAX          -0.011778   0.003372  -3.493 0.000521 ***
## PTRATIO      -0.946525   0.129066  -7.334 9.24e-13 ***
## B             0.009291   0.002674   3.475 0.000557 ***
## LSTAT        -0.522553   0.047424 -11.019  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7348 
## F-statistic: 128.2 on 11 and 494 DF,  p-value: < 2.2e-16

The relationship of the dependent variable and the two most significant variables is not linear, as I have observed previously, these are parabolas. So I fit the determination model by adding these two variables squared. For the LSTAT variable:

fit3=lm(MEDV~LSTAT +I(LSTAT^2),BHD)
summary(fit3)
## 
## Call:
## lm(formula = MEDV ~ LSTAT + I(LSTAT^2), data = BHD)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2834  -3.8313  -0.5295   2.3095  25.4148 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 42.862007   0.872084   49.15   <2e-16 ***
## LSTAT       -2.332821   0.123803  -18.84   <2e-16 ***
## I(LSTAT^2)   0.043547   0.003745   11.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared:  0.6407, Adjusted R-squared:  0.6393 
## F-statistic: 448.5 on 2 and 503 DF,  p-value: < 2.2e-16

I observe the graph of the relationship of the dependent variable with LSTAT

par(mfrow=c(1,1)) plot(MEDV~LSTAT, BHD) points(LSTAT,fitted(fit3), col=“red”,pch=20) fit4=lm(MEDV~ RM +I(RM^2),BHD) summary(fit4)

par(mfrow=c(1,1)) plot(MEDV~RM, BHD) points(RM,fitted(fit4), col=“red”,pch=20)

MODELO

fit5=lm(MEDV~LSTAT+CRIM+RM+DIS+B+ZN+CHAS+NOX+RAD+TAX+PTRATIO+I(LSTAT2)+I(RM2),BHD) summary(fit5)

#ZN no es significativa al 5%

#MODELO FINAL

fit6=lm(MEDV~LSTAT+CRIM+RM+DIS+B+CHAS+NOX+RAD+TAX+PTRATIO+I(LSTAT2)+I(RM2),BHD) summary(fit6) plot(fit6) # Observo que el Coeficiente de determinación queda ajustado al máximo #Adjusted R-squared: 0.8115 MÁXIMO

REGRESIÓN BAYESIANA

Variable dependiente medv

y<-as.vector(fit6\(model[,1]) # Variables indep. más el intercept (los unos) x<-as.matrix(cbind(rep(1,506),fit6\)model[,-1]))

theta.sample=blinreg(y,x,5000) theta.sample\(beta S=sum(fit6\)residual^2) shape=fit6$df.residual/2 rate=S/2 sigma2=rigamma(1,shape,rate)

MSE=sum(fit6\(residuals^2)/fit6\)df.residual vbeta=vcov(fit6)/MSE beta=rmnorm(1,mean=fit6$coef,varcov=vbeta*sigma2)

#histogramas de coeficientes par(mfrow=c(2,2)) hist(theta.sample\(beta[,2],main='LSTAT',xlab=expression(beta[1])) hist(theta.sample\)beta[,3],main=‘CRIM’,xlab=expression(beta[2])) hist(theta.sample\(beta[,4],main='RM',xlab=expression(beta[3])) hist(theta.sample\)beta[,5],main=‘DIS’,xlab=expression(beta[4])) par(mfrow=c(2,2)) hist(theta.sample\(beta[,6],main='B',xlab=expression(beta[5])) hist(theta.sample\)beta[,7],main=‘CHAS’,xlab=expression(beta[6])) hist(theta.sample\(beta[,8],main='NOX',xlab=expression(beta[7])) hist(theta.sample\)beta[,9],main=‘RAD’,xlab=expression(beta[8])) par(mfrow=c(2,2)) hist(theta.sample\(beta[,10],main='TAX',xlab=expression(beta[9])) hist(theta.sample\)beta[,11],main=‘PTRATIO’,xlab=expression(beta[10])) hist(theta.sample\(beta[,12],main='LSTAT^2',xlab=expression(beta[11])) hist(theta.sample\)beta[,13],main=‘RM^2’,xlab=expression(beta[12])) par(mfrow=c(1,1)) hist(theta.sample$sigma,main=‘ERROR SD’,xlab=expression(sigma))

apply(theta.sample$beta,2,quantile,c(.05,.5,.95))

quantile(theta.sample$sigma,c(.05,.5,.95))

#Hago una predicción cov1=x[400,] cov2=x[356,] cov3=x[2,] cov4=x[476,] X1=rbind(cov1,cov2,cov3,cov4)

mean.draws=blinregexpected(X1,theta.sample) par(mfrow=c(2,2)) hist(mean.draws[,1],main=“Predicción 1 (esperado)”,xlab=“MEDV”) hist(mean.draws[,2],main=“Predicción 2 (esperado)”,xlab=“MEDV”) hist(mean.draws[,3],main=“Predicción 3 (esperado)”,xlab=“MEDV”) hist(mean.draws[,4],main=“Predicción 4 (esperado)”,xlab=“MEDV”)

#Predicción de forma análoga al estido anterior

mean.draws=blinregpred(X1,theta.sample) par(mfrow=c(2,2)) hist(mean.draws[,1],main=“Predicción 1 (Predicción)”,xlab=“MEDV”) hist(mean.draws[,2],main=“Predicción 2 (Predicción)”,xlab=“MEDV”) hist(mean.draws[,3],main=“Predicción 3 (Predicción)”,xlab=“MEDV”) hist(mean.draws[,4],main=“Predicción 4 (Predicción)”,xlab=“MEDV”)

prob.out=bayesresiduals(fit6,theta.sample,2) par(mfrow=c(1,1)) #plot(BHD\(LSTAT,prob.out) identify(BHD\)LSTAT,prob.out,n=4) prob.out

```