Boston House Price Dataset

Spatial Data Visualization of Boston using Google API

Questions of interest

  1. Is crime highly correlated with other attributes of the city of Boston?
  2. Which characteristics of the city of Boston are highly correlated with each other?
  3. Is there a correlation between housing prices and other characteristics in the dataset?
  4. What other characteristics do the towns with the highest crime have in common?

Characteristics of Boston Housing Dataset

  • CRIM - per capita crime rate by town
  • ZN - proportion of residential land zoned for lots over 25,000 sq. ft.
  • INDUS - proportion of non-retail business acres per town
  • CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
  • NOX - nitric oxides concentration (parts per 10 million)
  • RM - average number of rooms per dwelling
  • AGE - proportion of owner-occupied units built prior to 1940
  • DIS - weighted distances to five Boston employment centers
  • RAD - index of accessibility to radial highways (the larger the value, the better the accessibility)
  • TAX - full-value property-tax rate per $10,000
  • PTRATIO - pupil-teacher ratio by town school district
  • B - 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
  • LSTAT - percentage of the population that is lower status
  • MEDV - median value of owner-occupied homes in $1000s

Each record in the dataset describes a census tract in the Boston area from the 1970 census. This analysis refers to these records loosely as towns or suburbs.

Importing Dataset

library(mlbench)
data("BostonHousing")
head(BostonHousing, n=10)
##       crim   zn indus chas   nox    rm   age    dis rad tax ptratio      b
## 1  0.00632 18.0  2.31    0 0.538 6.575  65.2 4.0900   1 296    15.3 396.90
## 2  0.02731  0.0  7.07    0 0.469 6.421  78.9 4.9671   2 242    17.8 396.90
## 3  0.02729  0.0  7.07    0 0.469 7.185  61.1 4.9671   2 242    17.8 392.83
## 4  0.03237  0.0  2.18    0 0.458 6.998  45.8 6.0622   3 222    18.7 394.63
## 5  0.06905  0.0  2.18    0 0.458 7.147  54.2 6.0622   3 222    18.7 396.90
## 6  0.02985  0.0  2.18    0 0.458 6.430  58.7 6.0622   3 222    18.7 394.12
## 7  0.08829 12.5  7.87    0 0.524 6.012  66.6 5.5605   5 311    15.2 395.60
## 8  0.14455 12.5  7.87    0 0.524 6.172  96.1 5.9505   5 311    15.2 396.90
## 9  0.21124 12.5  7.87    0 0.524 5.631 100.0 6.0821   5 311    15.2 386.63
## 10 0.17004 12.5  7.87    0 0.524 6.004  85.9 6.5921   5 311    15.2 386.71
##    lstat medv
## 1   4.98 24.0
## 2   9.14 21.6
## 3   4.03 34.7
## 4   2.94 33.4
## 5   5.33 36.2
## 6   5.21 28.7
## 7  12.43 22.9
## 8  19.15 27.1
## 9  29.93 16.5
## 10 17.10 18.9

Dimensions of dataset

dim(BostonHousing)
## [1] 506  14

Conclusion:

We have 506 rows, one per census tract in the Boston area (referred to here as towns or suburbs), and 14 columns, one per characteristic recorded for each of them.

Summary of Boston Housing Characteristics

summary(BostonHousing)
##       crim                zn             indus       chas         nox        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   0:471   Min.   :0.3850  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1: 35   1st Qu.:0.4490  
##  Median : 0.25651   Median :  0.00   Median : 9.69           Median :0.5380  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14           Mean   :0.5547  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10           3rd Qu.:0.6240  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74           Max.   :0.8710  
##        rm             age              dis              rad        
##  Min.   :3.561   Min.   :  2.90   Min.   : 1.130   Min.   : 1.000  
##  1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100   1st Qu.: 4.000  
##  Median :6.208   Median : 77.50   Median : 3.207   Median : 5.000  
##  Mean   :6.285   Mean   : 68.57   Mean   : 3.795   Mean   : 9.549  
##  3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188   3rd Qu.:24.000  
##  Max.   :8.780   Max.   :100.00   Max.   :12.127   Max.   :24.000  
##       tax           ptratio            b              lstat      
##  Min.   :187.0   Min.   :12.60   Min.   :  0.32   Min.   : 1.73  
##  1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38   1st Qu.: 6.95  
##  Median :330.0   Median :19.05   Median :391.44   Median :11.36  
##  Mean   :408.2   Mean   :18.46   Mean   :356.67   Mean   :12.65  
##  3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23   3rd Qu.:16.95  
##  Max.   :711.0   Max.   :22.00   Max.   :396.90   Max.   :37.97  
##       medv      
##  Min.   : 5.00  
##  1st Qu.:17.02  
##  Median :21.20  
##  Mean   :22.53  
##  3rd Qu.:25.00  
##  Max.   :50.00

Conclusion of interesting finds:

  • Crim has a very large range, from a minimum of 0.00632 to a maximum of 88.97620. The mean (3.61) is far above the median (0.26), so the distribution is strongly right-skewed.

  • Zn (proportion of residential land zoned for lots over 25,000 sq. ft.) is 0 for at least half of the towns, since the median is 0. At the other extreme, some towns have a value of 100, meaning all of their residential land is zoned for large lots. Such areas may have little retail nearby, which could be a sign of a food desert, although this dataset cannot confirm that.

  • Rm (average number of rooms per dwelling) has a mean of about 6.3 rooms and ranges from 3.6 to 8.8. I found this interesting.

  • Lstat (percentage of the population that is lower status, a measure tied to low education and laborer occupations) has a large range, from 1.73 to 37.97. Some towns have very few lower-status residents, while in others about 38% of the population falls into this category.

  • Medv (median value of owner-occupied homes in $1000s) also has a large range, with a median home value of $5,000 in some towns and $50,000 in other neighborhoods.

Finding Column Class Types

str(BostonHousing)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : num  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ b      : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Box plots of Characteristics

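# Draw the first seven columns side by side, one box plot per column.
# Note: column 4 (chas) is a factor, so its box plot is not informative.
par(mfrow = c(1, 7))
for (i in 1:7) {
  boxplot(BostonHousing[, i], main = names(BostonHousing)[i])
}

# Repeat for the remaining seven columns (dis through medv).
par(mfrow = c(1, 7))
for (i in 8:14) {
  boxplot(BostonHousing[, i], main = names(BostonHousing)[i])
}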

Pie Chart for Visualizing Index of Accessibility to Radial Highways

Remember: the larger the value, the better the accessibility.

unique(BostonHousing$rad)
## [1]  1  2  3  5  4  8  6  7 24
table(BostonHousing$rad)
## 
##   1   2   3   4   5   6   7   8  24 
##  20  24  38 110 115  26  17  24 132
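
The code that drew the pie chart itself is not shown above. A minimal sketch of one way to draw it from the same counts, assuming base R's pie() and labels of my own choosing (the original chart may have been produced differently):

```r
library(mlbench)
data("BostonHousing")

# Count how many records fall under each highway accessibility index
rad_counts <- table(BostonHousing$rad)

# Label each slice with its index value and share of all records
# (the label format is an illustrative choice, not taken from the original)
rad_share <- round(100 * as.numeric(prop.table(rad_counts)), 1)
rad_labels <- paste0("rad ", names(rad_counts), " (", rad_share, "%)")

pie(rad_counts, labels = rad_labels,
    main = "Index of accessibility to radial highways")
```

The largest slice is rad = 24, which covers 132 of the 506 records in the table above.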

Histogram of Crime Rate per Capita

Conclusion: We see that the majority of the towns have low crime rates, but there are a few outliers with crime rates up to about 25 times the mean (a maximum of 88.98 against a mean of 3.61). I am curious about those towns and what other characteristics they share.
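
The histogram code is not shown above. A minimal sketch, assuming base R's hist() with an arbitrary choice of 30 bins, that also computes how far the largest crime rate sits above the mean:

```r
library(mlbench)
data("BostonHousing")

# Histogram of the per capita crime rate; breaks = 30 is an illustrative choice
hist(BostonHousing$crim, breaks = 30,
     main = "Histogram of crime rate per capita",
     xlab = "Per capita crime rate (crim)")

# Ratio of the largest crime rate to the mean crime rate
crime_ratio <- max(BostonHousing$crim) / mean(BostonHousing$crim)
round(crime_ratio, 1)
```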

Creating a subset without CHAS column

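Remember that CHAS is the Charles River dummy variable (1 if tract bounds river; 0 otherwise). It is stored as a factor, and cor() requires numeric columns, so it is dropped before computing correlations.

library(dplyr)
# Drop chas (column 4) by name so that only numeric columns remain
newboston <- select(BostonHousing, -chas)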

Correlation Plot

library(corrplot)
# Pearson correlation matrix of the 13 numeric columns
correlations <- cor(newboston[, 1:13])
corrplot(correlations, method = "circle")

Creating an Ordered Heatmap

  • I wanted to order the correlation plot so that characteristics with high correlation are easier to detect. The plots show the process of arriving at the final heatmap.
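
The ordering code is not shown above. One possible way to get an ordered heatmap, sketched here under the assumption that corrplot's built-in hierarchical clustering order is acceptable (the original may have used a different ordering method):

```r
library(mlbench)
library(corrplot)
data("BostonHousing")

# Keep only the numeric columns (this drops the factor chas)
numeric_cols <- BostonHousing[, sapply(BostonHousing, is.numeric)]
correlations <- cor(numeric_cols)

# order = "hclust" reorders rows and columns so that strongly
# correlated variables end up next to each other
corrplot(correlations, method = "color", order = "hclust")
```

With this ordering, blocks of strongly correlated variables appear as contiguous patches of similar color, which makes them easier to spot than in the original column order.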

Question 1: Is crime highly correlated with other characteristics of the city of Boston?

  • The characteristics most strongly correlated with crime are rad and tax, and both correlations are only moderate.
    • Rad is the index of accessibility to radial highways.

    • Tax is the full-value property-tax rate per 10,000 USD.

  • I will test each correlation with cor.test() and use the p-value to check whether the linear correlation is significant.

Correlation between Crime & Highway Accessibility

cor(newboston$crim, newboston$rad)
## [1] 0.6255051
cor.test(newboston$crim, newboston$rad)
## 
##  Pearson's product-moment correlation
## 
## data:  newboston$crim and newboston$rad
## t = 17.998, df = 504, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5693817 0.6758248
## sample estimates:
##       cor 
## 0.6255051

Conclusion: The correlation is moderate (r ≈ 0.63), and the p-value is far below 0.01, which indicates a significant linear relationship at the 1% significance level. Thus the crime rate is positively correlated with the highway accessibility index.

Correlation between Crime & Property Tax

cor(newboston$crim, newboston$tax)
## [1] 0.5827643
cor.test(newboston$crim, newboston$tax)
## 
##  Pearson's product-moment correlation
## 
## data:  newboston$crim and newboston$tax
## t = 16.099, df = 504, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5221186 0.6375464
## sample estimates:
##       cor 
## 0.5827643

Conclusion: The correlation is moderate (r ≈ 0.58), and the p-value is far below 0.01, which indicates a significant linear relationship at the 1% significance level. Thus the crime rate is positively correlated with the full-value property tax rate.
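
To make the test output less of a black box, the t statistic for a Pearson correlation can be recomputed by hand as t = r * sqrt(n - 2) / sqrt(1 - r^2), and the confidence interval follows from the Fisher z-transform. A minimal sketch, using the r values printed above and a helper function name (cor_t_test) that is my own, not part of any package:

```r
# Hypothetical helper: recompute the statistics that cor.test() reports
# for a Pearson correlation r observed on n pairs.
cor_t_test <- function(r, n, conf = 0.95) {
  df <- n - 2
  t_stat <- r * sqrt(df) / sqrt(1 - r^2)       # test statistic
  p_value <- 2 * pt(-abs(t_stat), df = df)     # two-sided p-value
  # Fisher z-transform confidence interval
  half_width <- qnorm(1 - (1 - conf) / 2) / sqrt(n - 3)
  ci <- tanh(atanh(r) + c(-1, 1) * half_width)
  list(t = t_stat, df = df, p_value = p_value, ci = ci)
}

rad_test <- cor_t_test(0.6255051, 506)  # crim vs rad
tax_test <- cor_t_test(0.5827643, 506)  # crim vs tax

round(c(rad_t = rad_test$t, tax_t = tax_test$t), 3)
round(rbind(rad = rad_test$ci, tax = tax_test$ci), 4)
```

The recomputed values should agree with the t statistics (17.998 and 16.099) and the confidence intervals in the cor.test() output above, up to rounding.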

Question 2: What characteristics of the city of Boston have high correlation?

Positive correlation

  • Rad (Highway accessibility index) and Tax (full-value property tax rate)

  • Indus (proportion non-retail business) and nox (pollution)

  • Indus (proportion non-retail business) and tax (full-value property tax rate)

  • Age (proportion of owner-occupied units built prior to 1940) and Nox (pollution)

Negative correlation

  • Lstat (percentage of lower-status population) and medv (median value of owner-occupied homes in $1000s)

  • Nox (pollution) and dis (weighted distances to five Boston employment centers)

  • Dis (weighted distances to five Boston employment centers) and indus (proportion non-retail business)

  • Dis (weighted distances to five Boston employment centers) and age (proportion of owner-occupied units built prior to 1940)
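
The pairs listed above were read off the correlation plot. As a cross-check, the same ranking can be computed directly. A minimal sketch, assuming the numeric columns of BostonHousing and a cutoff of the eight strongest pairs (the variable names pairs_ranked, var1, var2, and r are illustrative choices):

```r
library(mlbench)
data("BostonHousing")

# Correlation matrix of the numeric columns (chas is a factor and is dropped)
numeric_cols <- BostonHousing[, sapply(BostonHousing, is.numeric)]
cm <- cor(numeric_cols)

# Keep each pair once: blank out the diagonal and the upper triangle
cm[upper.tri(cm, diag = TRUE)] <- NA

# Reshape to one row per pair and sort by absolute correlation
pairs_ranked <- as.data.frame(as.table(cm), stringsAsFactors = FALSE)
names(pairs_ranked) <- c("var1", "var2", "r")
pairs_ranked <- pairs_ranked[!is.na(pairs_ranked$r), ]
pairs_ranked <- pairs_ranked[order(-abs(pairs_ranked$r)), ]

head(pairs_ranked, 8)
```

Sorting by the absolute value keeps strong negative correlations, such as lstat and medv, in the same ranking as strong positive ones.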

Question 3: Is there a correlation between housing prices and other characteristics of dataset?

  • Yes. Lstat (percentage of lower-status population) and medv (median value of owner-occupied homes in $1000s) are negatively correlated. In addition, rm (average number of rooms) is positively correlated with housing prices.
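
The strength of these two relationships can be quantified with the same tools used for Question 1. A minimal sketch, assuming simple one-predictor linear fits are an adequate first summary (the object names are illustrative):

```r
library(mlbench)
data("BostonHousing")

# Pearson correlations of the two predictors with the median home value
cor_lstat <- cor(BostonHousing$lstat, BostonHousing$medv)
cor_rm <- cor(BostonHousing$rm, BostonHousing$medv)
round(c(lstat = cor_lstat, rm = cor_rm), 3)

# One-predictor linear fits; each slope is the change in medv
# (in $1000s) per one-unit change in the predictor
fit_lstat <- lm(medv ~ lstat, data = BostonHousing)
fit_rm <- lm(medv ~ rm, data = BostonHousing)
coef(fit_lstat)
coef(fit_rm)
```

The sign of each slope matches the sign of the corresponding correlation: negative for lstat and positive for rm.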

Scatterplots of Lower-Status Population and Number of Rooms against Median Home Value

plot(newboston$lstat, newboston$medv, main="Lower-status population and median home value", xlab="Percentage of lower-status population", ylab="Median value of home (thousands)")

# Regress the y-axis variable (medv) on the x-axis variable (lstat) so the line matches the plot
abline(lm(medv ~ lstat, data = newboston), col="red")

plot(newboston$rm, newboston$medv, main="Average number of rooms and median home value", xlab="Average number of rooms in a home", ylab="Median value of home (thousands)")

abline(lm(medv ~ rm, data = newboston), col="red")

Conclusion

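We can see that if a town has a high percentage of lower-status population (less educated residents in laborer occupations), then the median home value tends to be low. Conversely, towns with more rooms per dwelling tend to have higher median home values.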

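Question 4: What other characteristics do the towns with the highest crime have in common?

Order data by crime in ascending order

  • The table below shows the characteristics of the 6 towns with the highest crime rate.

# Assumption: the definition of high_crime is not shown in the original. Here it is
# taken to be newboston sorted by crim in ascending order, so rows 501 to 506 are the highest.
high_crime <- newboston[order(newboston$crim), ]
high_crime %>% slice(501:506)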
##        crim zn indus   nox    rm   age    dis rad tax ptratio      b lstat medv
## 405 41.5292  0  18.1 0.693 5.531  85.4 1.6074  24 666    20.2 329.46 27.38  8.5
## 415 45.7461  0  18.1 0.693 4.519 100.0 1.6582  24 666    20.2  88.27 36.98  7.0
## 411 51.1358  0  18.1 0.597 5.757 100.0 1.4130  24 666    20.2   2.60 10.11 15.0
## 406 67.9208  0  18.1 0.693 5.683 100.0 1.4254  24 666    20.2 384.97 22.98  5.0
## 419 73.5341  0  18.1 0.679 5.957 100.0 1.8026  24 666    20.2  16.45 20.62  8.8
## 381 88.9762  0  18.1 0.671 6.968  91.9 1.4165  24 666    20.2 396.90 17.21 10.4

Observations:

  • The towns with the highest crime rate have a value of 0 for zn. This means that none of the residential land in those towns is zoned for lots over 25,000 sq. ft.
  • These 6 towns have the same value of 18.1 for indus, which is the proportion of non-retail business acres per town. The value 18.1 equals the 75th percentile and is well above the mean of 11.14.
  • The nitric oxide (nox) concentration in all of these towns is above the mean of 0.5547.
  • The next interesting piece is age. Four of the six towns have 100% of their units built prior to 1940, and the other two are above 85%, so the housing stock is quite old.
  • All these towns have a radial highway accessibility (rad) index of 24, which is the highest value, meaning accessibility is high in these towns.
  • Ptratio for all these high-crime towns is 20.2, which is above the mean of 18.46. This means there are about 20 students to 1 teacher. This may not seem high today, but 20.2 equals the 75th percentile, so about 75% of the towns have a ratio at or below this.
  • The last one is lstat, the percentage of the population that is lower status, a measure tied to low education and laborer occupations. The median value of lstat is 11.36, and five of these six high-crime towns are above it, with three of them at two to three times the median.
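
The comparisons above can be tabulated in one step rather than read off the summary output. A minimal sketch, assuming the six highest-crime rows are compared against the overall median and third quartile of each numeric column (the names top6 and comparison are illustrative):

```r
library(mlbench)
data("BostonHousing")

# Numeric columns only (chas is a factor and is dropped)
numeric_cols <- BostonHousing[, sapply(BostonHousing, is.numeric)]

# The six records with the highest crime rate
top6 <- numeric_cols[order(numeric_cols$crim, decreasing = TRUE)[1:6], ]

# Median of each column within the top six, next to the overall
# median and third quartile across all 506 records
comparison <- data.frame(
  top6_median = vapply(top6, median, numeric(1)),
  overall_median = vapply(numeric_cols, median, numeric(1)),
  overall_q3 = vapply(numeric_cols,
                      function(x) unname(quantile(x, 0.75)),
                      numeric(1))
)
round(comparison, 2)
```

Rows where top6_median is at or above overall_q3 (or, for variables like dis and medv, far below overall_median) are the characteristics on which the high-crime towns stand out.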