Each record in the database describes a Boston suburb or town.
library(mlbench)
data("BostonHousing")
head(BostonHousing, n=10)
## crim zn indus chas nox rm age dis rad tax ptratio b
## 1 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 0.02985 0.0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## 7 0.08829 12.5 7.87 0 0.524 6.012 66.6 5.5605 5 311 15.2 395.60
## 8 0.14455 12.5 7.87 0 0.524 6.172 96.1 5.9505 5 311 15.2 396.90
## 9 0.21124 12.5 7.87 0 0.524 5.631 100.0 6.0821 5 311 15.2 386.63
## 10 0.17004 12.5 7.87 0 0.524 6.004 85.9 6.5921 5 311 15.2 386.71
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7
## 7 12.43 22.9
## 8 19.15 27.1
## 9 29.93 16.5
## 10 17.10 18.9
dim(BostonHousing)
## [1] 506 14
The data frame has 506 rows, so we are looking at 506 towns or suburbs in the Boston area. It also has 14 columns, so each town or suburb is described by 14 characteristics.
summary(BostonHousing)
## crim zn indus chas nox
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 0:471 Min. :0.3850
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1: 35 1st Qu.:0.4490
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.5380
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.5547
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.6240
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :0.8710
## rm age dis rad
## Min. :3.561 Min. : 2.90 Min. : 1.130 Min. : 1.000
## 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100 1st Qu.: 4.000
## Median :6.208 Median : 77.50 Median : 3.207 Median : 5.000
## Mean :6.285 Mean : 68.57 Mean : 3.795 Mean : 9.549
## 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188 3rd Qu.:24.000
## Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.000
## tax ptratio b lstat
## Min. :187.0 Min. :12.60 Min. : 0.32 Min. : 1.73
## 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38 1st Qu.: 6.95
## Median :330.0 Median :19.05 Median :391.44 Median :11.36
## Mean :408.2 Mean :18.46 Mean :356.67 Mean :12.65
## 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23 3rd Qu.:16.95
## Max. :711.0 Max. :22.00 Max. :396.90 Max. :37.97
## medv
## Min. : 5.00
## 1st Qu.:17.02
## Median :21.20
## Mean :22.53
## 3rd Qu.:25.00
## Max. :50.00
Crim (per capita crime rate) has a very large range: the minimum is 0.00632 and the maximum is 88.97620.
Zn (proportion of residential land zoned for lots over 25,000 sq.ft.) is 0 for at least half of the cities/suburbs, since the median is 0. In addition, there are cities/suburbs where 100% of the residential land is zoned for large lots, which could be a sign of a food desert.
Rm (average number of rooms per dwelling) averages about 6.3 rooms, with a median of 6.2. I found this interesting.
Lstat (proportion of the population that is lower status, i.e. not educated) has a large range: some towns have very educated residents, while in others about 38% of the population didn’t finish high school.
Medv (median value of owner-occupied homes in $1000’s) also has a large range, with $5,000 being the median home value in some towns and $50,000 being the median in other neighborhoods.
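One rough way to put numbers on the "large range" remark for crim is to compare the maximum with the mean, and the mean with the median. The sketch below only reuses figures printed by summary() above; the variable names are made up for illustration.

```r
# Values copied from the crim column of the summary() output above.
crim_median <- 0.25651
crim_mean   <- 3.61352
crim_max    <- 88.97620

# A mean far above the median suggests a right-skewed distribution.
mean_to_median <- crim_mean / crim_median

# How many times larger than the mean is the most extreme town?
max_to_mean <- crim_max / crim_mean

round(c(mean_to_median = mean_to_median, max_to_mean = max_to_mean), 1)
```

The mean is about 14 times the median, and the maximum is about 25 times the mean, which is what a handful of very high-crime towns pulling the mean upward would look like.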
str(BostonHousing)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : num 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ b : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
par(mfrow=c(1,7))
for(i in 1:7) {
boxplot(BostonHousing[,i], main=names(BostonHousing)[i])
}
par(mfrow=c(1,7))
for(i in 8:14) {
boxplot(BostonHousing[,i], main=names(BostonHousing)[i])
}
Remember: for rad (the index of accessibility to radial highways), the larger the value, the better the accessibility.
unique(BostonHousing$rad)
## [1] 1 2 3 5 4 8 6 7 24
table(BostonHousing$rad)
##
## 1 2 3 4 5 6 7 8 24
## 20 24 38 110 115 26 17 24 132
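The counts are easier to compare as shares of all 506 towns. The sketch below rebuilds the table from the counts printed above, so it runs without the dataset; `rad_counts` and `rad_share` are illustrative names. With the data loaded, prop.table(table(BostonHousing$rad)) should give the same shares as proportions.

```r
# Counts copied from the table(BostonHousing$rad) output above.
rad_counts <- c("1" = 20, "2" = 24, "3" = 38, "4" = 110, "5" = 115,
                "6" = 26, "7" = 17, "8" = 24, "24" = 132)

# Share of towns at each index value, in percent.
rad_share <- round(100 * rad_counts / sum(rad_counts), 1)
rad_share

# Index value that occurs most often.
names(which.max(rad_counts))
```

The index jumps from 8 straight to 24, and 24 is also the most common value, covering roughly a quarter of the towns.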
Conclusion: We see that the majority of the towns in Boston have low crime rates, but there are a few outliers with rates roughly 25 times the mean (a maximum of 88.98 against a mean of 3.61). I am curious about those towns and what other characteristics they share.
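To make "outlier" concrete, the usual boxplot rule flags points above Q3 + 1.5 * IQR. The sketch below applies that rule to the crim quartiles printed by summary(). It is an approximation, because boxplot() works from hinges, which can differ slightly from the quartiles that summary() reports.

```r
# Quartiles of crim copied from the summary() output above.
q1 <- 0.08205
q3 <- 3.67708

# Boxplot rule: points beyond Q3 + 1.5 * IQR are drawn as outliers.
upper_fence <- q3 + 1.5 * (q3 - q1)
upper_fence
```

Under this rule, any town with a crime rate above roughly 9.07 would show up as an individual point in the crim boxplot. With the data loaded, sum(BostonHousing$crim > upper_fence) would count those towns.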
Remember: chas is the Charles River dummy variable (1 if tract bounds river; 0 otherwise). It is stored as a factor, so we drop it before computing correlations.
library(dplyr)
newboston = select(BostonHousing, -4)
library(corrplot)
correlations <-cor(newboston[,1:13])
corrplot(correlations, method="circle")
The plot shows strong correlations between the following pairs of variables:
Rad (highway accessibility index) and tax (full-value property tax rate)
Indus (proportion non-retail business) and nox (pollution)
Indus (proportion non-retail business) and tax (full-value property tax rate)
Age (proportion of owner-occupied units built prior to 1940) and nox (pollution)
Lstat (proportion of uneducated individuals) and medv (median value of owner-occupied homes in $1000s)
Nox (pollution) and dis (weighted distances to five Boston employment centers)
Dis (weighted distances to five Boston employment centers) and indus (proportion non-retail business)
Dis (weighted distances to five Boston employment centers) and age (proportion of owner-occupied units built prior to 1940)
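Pairs like these can also be pulled out of the correlation matrix programmatically instead of being read off the plot. The sketch below is one way to do it. The helper name `strong_pairs` and the 0.7 cutoff are assumptions for illustration, not part of the original analysis, and the demonstration uses only three columns from the first ten rows printed by head().

```r
# Return variable pairs whose absolute correlation is at least `cutoff`.
# `strong_pairs` and `cutoff` are illustrative names (assumptions).
strong_pairs <- function(corr, cutoff = 0.7) {
  # Look only above the diagonal so each pair is reported once.
  idx <- which(upper.tri(corr) & abs(corr) >= cutoff, arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(corr)[idx[, "row"]],
    var2 = colnames(corr)[idx[, "col"]],
    r    = corr[idx],
    stringsAsFactors = FALSE
  )
  # Strongest relationships first.
  out[order(-abs(out$r)), ]
}

# Small demonstration using values from the head() output above.
demo <- data.frame(
  lstat = c(4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10),
  rm    = c(6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631, 6.004),
  medv  = c(24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9)
)
strong_pairs(cor(demo), cutoff = 0.7)
```

With the full data, the same helper could be called as strong_pairs(correlations). The sign of r then shows whether a pair moves together or in opposite directions, which the list above does not distinguish.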
plot(newboston$lstat, newboston$medv, main="Uneducated towns and Average house price", xlab= "Proportion of uneducated individuals", ylab="Median value of home (thousands)")
# Regress medv (y-axis) on lstat (x-axis) so the fitted line matches the plot
abline(lm(newboston$medv~newboston$lstat), col="red")
plot(newboston$rm, newboston$medv, main="Average number of rooms and Average house price", xlab= "Average number of rooms in a home", ylab="Median value of home (thousands)")
abline(lm(newboston$medv~newboston$rm), col="red")
We can see that if a town has a high proportion of its population that is low status (uneducated, with blue-collar jobs), then the median home value tends to be low.
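The direction of this relationship can be checked numerically by fitting a line with medv as the response and lstat as the predictor. The sketch below uses only the first ten rows printed by head(), so the coefficient it produces is illustrative and will differ from a fit on all 506 rows.

```r
# First ten rows of lstat and medv, copied from the head() output above.
lstat <- c(4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10)
medv  <- c(24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9)

# Regress home value (y) on lower-status share (x), matching the plot axes.
fit <- lm(medv ~ lstat)
slope <- unname(coef(fit)["lstat"])

# A negative slope means home value falls as lstat rises.
slope < 0
```

On the full data the equivalent call would be lm(medv ~ lstat, data = newboston).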
high_crime %>% slice(501:506)
## crim zn indus nox rm age dis rad tax ptratio b lstat medv
## 405 41.5292 0 18.1 0.693 5.531 85.4 1.6074 24 666 20.2 329.46 27.38 8.5
## 415 45.7461 0 18.1 0.693 4.519 100.0 1.6582 24 666 20.2 88.27 36.98 7.0
## 411 51.1358 0 18.1 0.597 5.757 100.0 1.4130 24 666 20.2 2.60 10.11 15.0
## 406 67.9208 0 18.1 0.693 5.683 100.0 1.4254 24 666 20.2 384.97 22.98 5.0
## 419 73.5341 0 18.1 0.679 5.957 100.0 1.8026 24 666 20.2 16.45 20.62 8.8
## 381 88.9762 0 18.1 0.671 6.968 91.9 1.4165 24 666 20.2 396.90 17.21 10.4
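The code that created high_crime is not shown. One plausible construction, assuming it is newboston sorted by crim from lowest to highest (so that rows 501 to 506 are the six highest-crime towns), is sketched below in base R. It uses a handful of crim and medv values taken from the outputs above, and `sample_rows` and `high_crime_sketch` are hypothetical names.

```r
# Assumption: high_crime is the data sorted by crim, lowest to highest.
# A few (crim, medv) rows copied from the printed outputs above.
sample_rows <- data.frame(
  crim = c(0.00632, 88.9762, 0.02731, 41.5292, 73.5341),
  medv = c(24.0, 10.4, 21.6, 8.5, 8.8),
  row.names = c("1", "381", "2", "405", "419")
)

# order() returns the row positions that sort crim ascending.
high_crime_sketch <- sample_rows[order(sample_rows$crim), ]

# The last rows are then the highest-crime towns.
tail(high_crime_sketch, 3)
```

Applied to the full data, the same idea would be high_crime <- newboston[order(newboston$crim), ], which keeps the original row numbers as seen in the output above.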