I am very proud to share this case with you because this exercise taught me again to respect the rules. You can find many ways to use the dataset on Internet, I show someone this way of playing with them and the other did not have enough creativity to change it and made a copy. I believe in open culture and education and that one of the multiple ways to learn is by copying or learning by doing it. Lots of love
I will use the Boston Housing Data file, which represents data on the housing characteristics of 506 the corrected data have been taken from the library mlbench or Statlib athttp://lib.stat.cmu.edu/datasets/
Boston neighborhoods from the 1970 census. In it we can highlight the following variables:
| CRIM | represents crime per capita by city. |
|---|---|
| ZN | The proportion of residential areas in a given area. |
| INDUS | The proportion of acres dedicated to retail in the city. |
| CHAS | Binary variable (= 1 if the tracks cross the river and 0 otherwise). |
| NOX | Nitric oxide concentration (parts per million). |
| RM | Average number of rooms per dwelling. |
| AGE | Proportion of owner-occupied buildings, built before 1940. |
| DIS | Represents the weighted distance to five employment centers in Boston. |
| RAD | Radial road accessibility index. |
| TAX | Total value of the tax rate per 10,000 dollars.** |
| PTRATIO | Represents the ratio of students per teacher by city.** |
| B | Value defined as 1000 (Bk0.63) 2 where Bk is the proportion of African Americans in the city.** |
| LSTAT | Percentage of lower class in the population.** |
| MEDV | Median value of owner-occupied homes at $ 1,000.** |
library(LearnBayes)
library(dplyr)
library(mlbench)
data(BostonHousing) #To call data from a library (like mlblench), use "data()"
BHD<- BostonHousing
colnames(BHD)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "b" "lstat" "medv"
names(BHD) = c("CRIM","ZN","INDUS","CHAS","NOX","RM","AGE","DIS","RAD","TAX","PTRATIO","B","LSTAT","MEDV")
dim(BHD)
## [1] 506 14
head(BHD)
## CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## MEDV
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
summary(BHD)# With the summary I observe that there are NO missing values
## CRIM ZN INDUS CHAS NOX
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 0:471 Min. :0.3850
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1: 35 1st Qu.:0.4490
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.5380
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.5547
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.6240
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :0.8710
## RM AGE DIS RAD
## Min. :3.561 Min. : 2.90 Min. : 1.130 Min. : 1.000
## 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100 1st Qu.: 4.000
## Median :6.208 Median : 77.50 Median : 3.207 Median : 5.000
## Mean :6.285 Mean : 68.57 Mean : 3.795 Mean : 9.549
## 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188 3rd Qu.:24.000
## Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.000
## TAX PTRATIO B LSTAT
## Min. :187.0 Min. :12.60 Min. : 0.32 Min. : 1.73
## 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38 1st Qu.: 6.95
## Median :330.0 Median :19.05 Median :391.44 Median :11.36
## Mean :408.2 Mean :18.46 Mean :356.67 Mean :12.65
## 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23 3rd Qu.:16.95
## Max. :711.0 Max. :22.00 Max. :396.90 Max. :37.97
## MEDV
## Min. : 5.00
## 1st Qu.:17.02
## Median :21.20
## Mean :22.53
## 3rd Qu.:25.00
## Max. :50.00
I confirm that there are no missing values and the dimensions are as described, then I build some initial graphs and do the exploratory analysis of the data. I do the descriptive analysis
par(mfrow=c(1,1))
pairs(~ ., data = BHD, main = "BHD Data")
LSTAT, DIS and RM are good linear variables but the CRIM type is not linear, and in fact the relationship is complicated.
plot(MEDV~LSTAT, BHD)
plot(MEDV~DIS, BHD)
plot(MEDV~RM, BHD)
I see with these three graphs that the correlation of the independent variable with DIS is very weak, while with RM and LSTAT a stronger relationship is intuited, intuiting a parabola in LSTAT and RM. I do the correlation matrix to confirm the above.
#round(cor(BHD),1)
FIRST REGRESSION WITH ALL INDEPENDENT VARIABLES Now I analyze the linearity of the independent variable with the dependent variables that can explain the model. I perform a classical least squares fit using lm
fit1=lm(MEDV~.,BHD)
summary(fit1)
##
## Call:
## lm(formula = MEDV ~ ., data = BHD)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## CRIM -1.080e-01 3.286e-02 -3.287 0.001087 **
## ZN 4.642e-02 1.373e-02 3.382 0.000778 ***
## INDUS 2.056e-02 6.150e-02 0.334 0.738288
## CHAS1 2.687e+00 8.616e-01 3.118 0.001925 **
## NOX -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## RM 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## AGE 6.922e-04 1.321e-02 0.052 0.958229
## DIS -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## RAD 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## TAX -1.233e-02 3.760e-03 -3.280 0.001112 **
## PTRATIO -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## B 9.312e-03 2.686e-03 3.467 0.000573 ***
## LSTAT -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
I see that the AGE and INDUS variables are not statistically significant since their p-value is very high and they do not explain the dependent variable that is MEDV. And the errors present a normal distribution in one part, but it can be adjusted much more, so that it is more decisive.
par(mfrow=c(2,2))
plot(fit1)
I remove these two variables to fit the model
fit2=update(fit1,~.-AGE-INDUS)
summary(fit2)
##
## Call:
## lm(formula = MEDV ~ CRIM + ZN + CHAS + NOX + RM + DIS + RAD +
## TAX + PTRATIO + B + LSTAT, data = BHD)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5984 -2.7386 -0.5046 1.7273 26.2373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.341145 5.067492 7.171 2.73e-12 ***
## CRIM -0.108413 0.032779 -3.307 0.001010 **
## ZN 0.045845 0.013523 3.390 0.000754 ***
## CHAS1 2.718716 0.854240 3.183 0.001551 **
## NOX -17.376023 3.535243 -4.915 1.21e-06 ***
## RM 3.801579 0.406316 9.356 < 2e-16 ***
## DIS -1.492711 0.185731 -8.037 6.84e-15 ***
## RAD 0.299608 0.063402 4.726 3.00e-06 ***
## TAX -0.011778 0.003372 -3.493 0.000521 ***
## PTRATIO -0.946525 0.129066 -7.334 9.24e-13 ***
## B 0.009291 0.002674 3.475 0.000557 ***
## LSTAT -0.522553 0.047424 -11.019 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7348
## F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16
The relationship of the dependent variable and the two most significant variables is not linear, as I have observed previously, these are parabolas. So I fit the determination model by adding these two variables squared. For the LSTAT variable:
fit3=lm(MEDV~LSTAT +I(LSTAT^2),BHD)
summary(fit3)
##
## Call:
## lm(formula = MEDV ~ LSTAT + I(LSTAT^2), data = BHD)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2834 -3.8313 -0.5295 2.3095 25.4148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.862007 0.872084 49.15 <2e-16 ***
## LSTAT -2.332821 0.123803 -18.84 <2e-16 ***
## I(LSTAT^2) 0.043547 0.003745 11.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
## F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16
I observe the graph of the relationship of the dependent variable with LSTAT
par(mfrow=c(1,1)) plot(MEDV~LSTAT, BHD) points(LSTAT,fitted(fit3), col=“red”,pch=20) fit4=lm(MEDV~ RM +I(RM^2),BHD) summary(fit4)
par(mfrow=c(1,1)) plot(MEDV~RM, BHD) points(RM,fitted(fit4), col=“red”,pch=20)
fit5=lm(MEDV~LSTAT+CRIM+RM+DIS+B+ZN+CHAS+NOX+RAD+TAX+PTRATIO+I(LSTAT2)+I(RM2),BHD) summary(fit5)
#ZN no es significativa al 5%
#MODELO FINAL
fit6=lm(MEDV~LSTAT+CRIM+RM+DIS+B+CHAS+NOX+RAD+TAX+PTRATIO+I(LSTAT2)+I(RM2),BHD) summary(fit6) plot(fit6) # Observo que el Coeficiente de determinación queda ajustado al máximo #Adjusted R-squared: 0.8115 MÁXIMO
y<-as.vector(fit6\(model[,1]) # Variables indep. más el intercept (los unos) x<-as.matrix(cbind(rep(1,506),fit6\)model[,-1]))
theta.sample=blinreg(y,x,5000) theta.sample\(beta S=sum(fit6\)residual^2) shape=fit6$df.residual/2 rate=S/2 sigma2=rigamma(1,shape,rate)
MSE=sum(fit6\(residuals^2)/fit6\)df.residual vbeta=vcov(fit6)/MSE beta=rmnorm(1,mean=fit6$coef,varcov=vbeta*sigma2)
#histogramas de coeficientes par(mfrow=c(2,2)) hist(theta.sample\(beta[,2],main='LSTAT',xlab=expression(beta[1])) hist(theta.sample\)beta[,3],main=‘CRIM’,xlab=expression(beta[2])) hist(theta.sample\(beta[,4],main='RM',xlab=expression(beta[3])) hist(theta.sample\)beta[,5],main=‘DIS’,xlab=expression(beta[4])) par(mfrow=c(2,2)) hist(theta.sample\(beta[,6],main='B',xlab=expression(beta[5])) hist(theta.sample\)beta[,7],main=‘CHAS’,xlab=expression(beta[6])) hist(theta.sample\(beta[,8],main='NOX',xlab=expression(beta[7])) hist(theta.sample\)beta[,9],main=‘RAD’,xlab=expression(beta[8])) par(mfrow=c(2,2)) hist(theta.sample\(beta[,10],main='TAX',xlab=expression(beta[9])) hist(theta.sample\)beta[,11],main=‘PTRATIO’,xlab=expression(beta[10])) hist(theta.sample\(beta[,12],main='LSTAT^2',xlab=expression(beta[11])) hist(theta.sample\)beta[,13],main=‘RM^2’,xlab=expression(beta[12])) par(mfrow=c(1,1)) hist(theta.sample$sigma,main=‘ERROR SD’,xlab=expression(sigma))
apply(theta.sample$beta,2,quantile,c(.05,.5,.95))
quantile(theta.sample$sigma,c(.05,.5,.95))
#Hago una predicción cov1=x[400,] cov2=x[356,] cov3=x[2,] cov4=x[476,] X1=rbind(cov1,cov2,cov3,cov4)
mean.draws=blinregexpected(X1,theta.sample) par(mfrow=c(2,2)) hist(mean.draws[,1],main=“Predicción 1 (esperado)”,xlab=“MEDV”) hist(mean.draws[,2],main=“Predicción 2 (esperado)”,xlab=“MEDV”) hist(mean.draws[,3],main=“Predicción 3 (esperado)”,xlab=“MEDV”) hist(mean.draws[,4],main=“Predicción 4 (esperado)”,xlab=“MEDV”)
#Predicción de forma análoga al estido anterior
mean.draws=blinregpred(X1,theta.sample) par(mfrow=c(2,2)) hist(mean.draws[,1],main=“Predicción 1 (Predicción)”,xlab=“MEDV”) hist(mean.draws[,2],main=“Predicción 2 (Predicción)”,xlab=“MEDV”) hist(mean.draws[,3],main=“Predicción 3 (Predicción)”,xlab=“MEDV”) hist(mean.draws[,4],main=“Predicción 4 (Predicción)”,xlab=“MEDV”)
prob.out=bayesresiduals(fit6,theta.sample,2) par(mfrow=c(1,1)) #plot(BHD\(LSTAT,prob.out) identify(BHD\)LSTAT,prob.out,n=4) prob.out
```