The data set is standard and is easily obtained from the MASS library in R. Our goal is to perform exploratory data analysis (EDA) on this data set and see what insights we can gain.
We focus on the crime rate and on how the other features relate to it.
We begin by loading the data set into our RStudio environment. Since the data set ships with the MASS library, loading the library gives us access to it.
library(MASS)
data("Boston")
Now that the library is loaded, let’s find out the dimensions of this data set.
dim(Boston)
[1] 506 14
Our data set contains 506 observations and 14 features. Let’s find out if there are any missing values in the data set.
sum(is.na(Boston))
[1] 0
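The single total above does not say where missing values would sit if there were any. A minimal sketch of a per-column check, using only base R; the name `na_by_column` is introduced here for illustration and is not part of the original analysis:

```r
library(MASS)  # provides the Boston data frame

# Count missing values in each column separately
na_by_column <- colSums(is.na(Boston))
print(na_by_column)

# Names of columns with at least one missing value (empty if there are none)
names(na_by_column)[na_by_column > 0]
```

If a column ever did contain NA values, this would show which one before any rows were dropped or imputed.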
There are no values marked as NA in our data set. Let’s take a look at how our data set appears.
head(Boston)
summary(Boston)
crim zn indus chas
Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
nox rm age dis
Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
rad tax ptratio black
Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
Median : 5.000 Median :330.0 Median :19.05 Median :391.44
Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
lstat medv
Min. : 1.73 Min. : 5.00
1st Qu.: 6.95 1st Qu.:17.02
Median :11.36 Median :21.20
Mean :12.65 Mean :22.53
3rd Qu.:16.95 3rd Qu.:25.00
Max. :37.97 Max. :50.00
Let’s see how these features in our data set correlate. However, since we are only focused on crime rate, we will simply look at crime rate and features that have a correlation with it.
NOTE: If we were building a predictive model, we would also have to check for multicollinearity, so the correlation of each feature with every other feature would matter too. Here we are only performing EDA with crime rate as the focus.
cor(Boston, method="pearson")[1,]
crim zn indus chas nox rm
1.00000000 -0.20046922 0.40658341 -0.05589158 0.42097171 -0.21924670
age dis rad tax ptratio black
0.35273425 -0.37967009 0.62550515 0.58276431 0.28994558 -0.38506394
lstat medv
0.45562148 -0.38830461
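The printed vector is easier to read once it is sorted. The sketch below ranks the features by the absolute value of their Pearson correlation with crim while keeping the sign; the names `crim_cor` and `ranked` are illustrative and do not come from the original analysis:

```r
library(MASS)  # provides the Boston data frame

# Pearson correlation of every column with crim
crim_cor <- cor(Boston, method = "pearson")[, "crim"]

# Drop crim's correlation with itself
crim_cor <- crim_cor[names(crim_cor) != "crim"]

# Order by absolute value, strongest first, keeping the original sign
ranked <- crim_cor[order(abs(crim_cor), decreasing = TRUE)]
print(round(ranked, 3))
```

Sorting on the absolute value puts strong negative correlations next to strong positive ones, so neither is overlooked.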
We notice that crime rate has its strongest correlation with rad, the index of accessibility to radial highways. This could be because once a person has committed a crime, they can easily get away from the crime scene. Let’s check this by comparing the average crime rate where rad is at or above its median of 5 with the average where rad is below 5.
mean(Boston[Boston$rad >= 5, ]$crim)
[1] 5.664632
mean(Boston[Boston$rad < 5, ]$crim)
[1] 0.2591066
Our median rad is 5. Because rad is a discrete index with ties at the median, the two groups are not exactly equal in size, but the comparison is still informative: the lower-rad group has a much lower average crime rate (about 0.26 versus 5.66), and as rad rises, crime rate rises along with it. This suggests that accessibility to radial highways is strongly associated with the crime rate, although correlation alone does not establish a cause.
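A two-way split hides how the suburbs are spread across individual rad values. As a rough sketch, the counts and the average crime rate for each rad value can be tabulated with base R; the name `rad_summary` and its column names are made up for this example:

```r
library(MASS)  # provides the Boston data frame

# Number of suburbs at each rad value
rad_counts <- table(Boston$rad)

# Mean crime rate at each rad value
crim_by_rad <- tapply(Boston$crim, Boston$rad, mean)

# One row per rad value; both results are ordered by the same sorted rad levels
rad_summary <- data.frame(
  rad       = as.integer(names(crim_by_rad)),
  suburbs   = as.vector(rad_counts),
  mean_crim = as.vector(crim_by_rad)
)
print(rad_summary)
```

A table like this shows whether the difference in averages is spread across many rad values or driven by one of them.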
Let’s create some plots and graphs to visualize the information we have so far.
rad <- as.factor(Boston$rad)
plot(x = rad, y = Boston$crim,
     xlab = "Accessibility to Radial Highways",
     ylab = "Crime Rate",
     main = "Crime Rate and Accessibility to Radial Highways")
We have an outlying index value for rad in our data set. While most of the accessibility ratings are between 1 and 8, we have a rating of 24 that may be skewing our results. Let’s dig deeper and see how the data fares if we discount the rating of 24.
boston <- Boston[Boston$rad < 24, ]
mean(boston$crim)
[1] 0.3856057
boston <- Boston[Boston$rad == 24, ]
mean(boston$crim)
[1] 12.75929
Let’s find out how many suburbs actually have a rating of 24.
dim(boston)
[1] 132 14
Our data set contains 506 observations out of which 132 belong to rad rating of 24.
132/506
[1] 0.2608696
Approximately 26.09% of our data carries this rating. Let’s see what our correlations look like if we remove the rad rating of 24.
boston <- Boston[Boston$rad < 24, ]
cor(boston, method="pearson")[, 1]
crim zn indus chas nox rm age
1.0000000 -0.2887260 0.5006359 0.1355393 0.7465676 -0.2123193 0.4683864
dis rad tax ptratio black lstat medv
-0.4460934 0.1459919 0.2958862 -0.2378530 -0.5192336 0.3840303 -0.1716970
We no longer have such a high correlation with rad. Removing the high rad rating also reveals a high correlation between crime rate and nox, the nitrogen oxides concentration. EPA states the following about NO2:
Breathing air with a high concentration of NO2 can irritate airways in the human respiratory system. Such exposures over short periods can aggravate respiratory diseases, particularly asthma, leading to respiratory symptoms (such as coughing, wheezing or difficulty breathing), hospital admissions and visits to emergency rooms. Longer exposures to elevated concentrations of NO2 may contribute to the development of asthma and potentially increase susceptibility to respiratory infections. People with asthma, as well as children and the elderly are generally at greater risk for the health effects of NO2.
More information on NO2 is available from the EPA.
Once we discount accessibility to radial highways, nox is the feature most strongly correlated with crime rate.
Our top three positive correlations are with nox (nitrogen oxides), indus (industrial area), and age.
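To see how much each correlation moved once the rad rating of 24 was dropped, the two sets of correlations can be placed side by side. This is a sketch only; `low_rad`, `comparison`, and the column names are illustrative and not part of the original analysis:

```r
library(MASS)  # provides the Boston data frame

# Same filter as above: keep suburbs with a rad rating below 24
low_rad <- Boston[Boston$rad < 24, ]

cor_all     <- cor(Boston, method = "pearson")[, "crim"]
cor_low_rad <- cor(low_rad, method = "pearson")[, "crim"]

# One row per feature: correlation with crim before and after the filter
comparison <- data.frame(
  all_suburbs  = cor_all,
  rad_below_24 = cor_low_rad,
  change       = cor_low_rad - cor_all
)
comparison <- comparison[rownames(comparison) != "crim", ]

# Strongest correlations after the filter come first
comparison_sorted <- comparison[order(abs(comparison$rad_below_24),
                                      decreasing = TRUE), ]
print(round(comparison_sorted, 3))
```

The sort is by absolute value, so strong negative correlations also appear near the top of the table.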
boxplot(boston$nox, horizontal = TRUE, col = "#52BE80",
        xlab = "Levels of Nox",
        main = "Boxplot of Nox Levels in Suburbs of Boston")
There seems to be only one outlier in this group. Let’s go ahead and take a look at the range of nox.
range(boston$nox)
[1] 0.385 0.871
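A single point on a boxplot can stand for several rows that share the same value, so it may help to list the flagged observations before filtering. The sketch below uses `boxplot.stats()` from base R, which applies the same 1.5 × IQR whisker rule that `boxplot()` uses by default; the names `low_rad` and `nox_stats` are illustrative:

```r
library(MASS)  # provides the Boston data frame

# Same subset as above: suburbs with a rad rating below 24
low_rad <- Boston[Boston$rad < 24, ]

# Five-number summary used for the box and whiskers, plus flagged points
nox_stats <- boxplot.stats(low_rad$nox)

nox_stats$stats        # lower whisker, lower hinge, median, upper hinge, upper whisker
unique(nox_stats$out)  # distinct nox values drawn as outlier points
length(nox_stats$out)  # number of suburbs that carry those values
```

Comparing `unique(nox_stats$out)` with `length(nox_stats$out)` shows whether the single visible point is one suburb or several.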
boston <- boston[boston$nox < 0.8, ]
boxplot(boston$nox, horizontal = TRUE, col = "#BDC3C7",
        xlab = "Levels of Nox",
        main = "Boxplot of Nox Levels in Suburbs of Boston")
Let’s check if this changes our correlation or not.
cor(boston, method="pearson")[, 1]
crim zn indus chas nox rm
1.00000000 -0.29573551 0.39893305 0.05963264 0.54058368 -0.06586902
age dis rad tax ptratio black
0.43672998 -0.37937676 0.14113302 0.21731414 -0.05835157 -0.44310017
lstat medv pop_crim
0.20295390 -0.03481710 1.00000000
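So far we have removed two sets of outlying observations: suburbs with a rad rating of 24, and suburbs with nox of 0.8 or higher.
At this point we can say that a high accessibility rating to radial highways and high levels of nox in the environment are both associated with higher crime rates in Boston suburbs. Correlation alone does not tell us that either one causes crime.
Our correlations have changed; however, the same three features (nox, indus, and age) still have the strongest positive correlations with crime rate, although age now ranks ahead of indus.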
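One way to check which features stay on top is to recompute the strongest positive correlations for each version of the data. This is a minimal sketch: `top_positive()` is a hypothetical helper written for this example, and `low_rad` and `low_rad_nox` are illustrative names for the two filtered subsets:

```r
library(MASS)  # provides the Boston data frame

# Hypothetical helper: names of the n features most positively
# correlated with crim in the given data frame
top_positive <- function(df, n = 3) {
  r <- cor(df, method = "pearson")[, "crim"]
  r <- r[names(r) != "crim"]            # drop crim's correlation with itself
  names(sort(r, decreasing = TRUE))[seq_len(n)]
}

low_rad     <- Boston[Boston$rad < 24, ]      # rad rating of 24 removed
low_rad_nox <- low_rad[low_rad$nox < 0.8, ]   # high nox values also removed

top_positive(Boston)       # all suburbs
top_positive(low_rad)      # after the rad filter
top_positive(low_rad_nox)  # after both filters
```

Because the helper takes any data frame that has a crim column, the same check can be repeated after any further filtering step.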