Question-a

To begin, load in the Boston data set. The Boston data set is part of the MASS library in R. How many rows are in this data set? How many columns? What do the rows and columns represent?

library(MASS)
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7
dim(Boston)
## [1] 506  14

Each row represents one suburb of Boston, 506 in total. Each of the 14 columns is a variable recorded for every suburb, such as the per capita crime rate (crim) or the median home value (medv).
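The same counts can be read off without printing the whole data frame. A minimal sketch using only base R functions and the MASS data set:

```r
library(MASS)  # provides the Boston data frame

nrow(Boston)   # number of suburbs (rows)
ncol(Boston)   # number of variables (columns)
names(Boston)  # variable names; ?Boston describes each one
```

The help page opened by ?Boston is the authoritative description of what each column measures.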

Question-b

Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings:

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Boston$chas <- as.numeric(Boston$chas)
Boston$rad <- as.numeric(Boston$rad)
pairs(Boston)

Not much can be discerned other than the fact that some variables appear to be correlated. A correlation matrix would be more helpful and question-c gives us the opportunity to make one.
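With 14 variables the full scatterplot matrix has very small panels. One option is to plot only a handful of variables at a time. The sketch below assumes an illustrative selection (the vector vars is hypothetical and not part of the original analysis):

```r
library(MASS)

# Hypothetical selection of variables; any subset of names(Boston) works
vars <- c("crim", "rad", "tax", "lstat", "medv")

# Smaller, filled points keep the 506 observations readable
pairs(Boston[, vars], pch = 20, cex = 0.5)
```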

Question-c

Are any of the predictors associated with per capita crime rate? If so, explain the relationship.

##       i       j         cor            p
## 1  crim      zn -0.20046922 5.506472e-06
## 2  crim   indus  0.40658341 0.000000e+00
## 4  crim    chas -0.05589158 2.094345e-01
## 7  crim     nox  0.42097171 0.000000e+00
## 11 crim      rm -0.21924670 6.346703e-07
## 16 crim     age  0.35273425 2.220446e-16
## 22 crim     dis -0.37967009 0.000000e+00
## 29 crim     rad  0.62550515 0.000000e+00
## 37 crim     tax  0.58276431 0.000000e+00
## 46 crim ptratio  0.28994558 2.942924e-11
## 56 crim   black -0.38506394 0.000000e+00
## 67 crim   lstat  0.45562148 0.000000e+00
## 79 crim    medv -0.38830461 0.000000e+00

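Based on the correlation coefficients and their p-values, crim is significantly associated with every other variable except chas (p = 0.21). The strongest positive correlations are with rad (0.63), tax (0.58) and lstat (0.46): crime is higher where highway accessibility, property-tax rates and the share of lower-status residents are higher. The strongest negative correlations are with medv (-0.39), black (-0.39) and dis (-0.38): crime is lower where home values are higher and where suburbs lie farther from the employment centres.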
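The code that produced the table above is not shown. As a minimal sketch, a similar table can be built with base R's cor.test(), assuming Pearson correlation; the names predictors and crim_cor are introduced here for illustration, and the p-values may be formatted differently from those in the table.

```r
library(MASS)

predictors <- setdiff(names(Boston), "crim")

# Pearson correlation of crim with each other variable, with its p-value
crim_cor <- do.call(rbind, lapply(predictors, function(v) {
  test <- cor.test(Boston$crim, Boston[[v]])
  data.frame(i = "crim", j = v,
             cor = unname(test$estimate),
             p = test$p.value)
}))

# Sort by absolute correlation, strongest first
crim_cor[order(-abs(crim_cor$cor)), ]
```

Sorting by absolute value puts the variables most strongly tied to crim at the top, regardless of sign.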

Question-d

Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.

summary(Boston$crim)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00632  0.08204  0.25650  3.61400  3.67700 88.98000
summary(Boston$tax)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   187.0   279.0   330.0   408.2   666.0   711.0
summary(Boston$ptratio)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.60   17.40   19.05   18.46   20.20   22.00
library(ggplot2)
qplot(Boston$crim, binwidth=5 , xlab = "Crime rate", ylab="Number of Suburbs" )

qplot(Boston$tax, binwidth=50 , xlab = "Full-value property-tax rate per $10,000", ylab="Number of Suburbs")

qplot(Boston$ptratio, binwidth=5, xlab ="Pupil-teacher ratio by town", ylab="Number of Suburbs")

The median crime rate is about 0.26, while the maximum is about 89, so a few suburbs do have exceptionally high crime rates.

selection <- subset( Boston, crim > 10)
nrow(selection)/ nrow(Boston)
## [1] 0.1067194

About 11% of the suburbs have a crime rate above 10.

selection <- subset( Boston, crim > 50)
nrow(selection)/ nrow(Boston)
## [1] 0.007905138

About 0.8% of the suburbs have a crime rate above 50.
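The same proportions can be computed without creating an intermediate data frame. This sketch relies on the fact that the mean of a logical vector is the share of TRUE values:

```r
library(MASS)

# The mean of a logical vector is the proportion of TRUE values
mean(Boston$crim > 10)
mean(Boston$crim > 50)
```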

The histogram of tax rates shows that a minority of suburbs have relatively high rates. The median and mean full-value property-tax rates are $330 and $408.20 per $10,000, respectively.
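To see which tax values account for the tall bars in the histogram, one could tabulate the most frequent values. A minimal sketch; the name tax_counts is introduced here for illustration:

```r
library(MASS)

# Count how many suburbs share each distinct tax value
tax_counts <- table(Boston$tax)

# Five most common tax values, most frequent first
head(sort(tax_counts, decreasing = TRUE), 5)
```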

selection <- subset( Boston, tax< 600)
nrow(selection)/ nrow(Boston)
## [1] 0.729249

About 73% of the suburbs have a tax rate below $600 per $10,000.

selection <- subset( Boston, tax> 600) 
nrow(selection)/ nrow(Boston)
## [1] 0.270751

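About 27% of the suburbs have a tax rate above $600 per $10,000.

Pupil-teacher ratios span a comparatively narrow range, from 12.6 to 22.0 with a median of 19.05, so no suburb stands out as sharply as it does for crime.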

Question-e

How many of the suburbs in this data set bound the Charles river?

nrow(subset(Boston, chas ==1)) 
## [1] 35

There are 35 such suburbs.
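As a quick cross-check, table() shows both groups at once, so the count of suburbs that do not bound the river is visible alongside the 35 that do:

```r
library(MASS)

# Counts of suburbs by chas (1 = bounds the Charles River, 0 = does not)
chas_counts <- table(Boston$chas)
chas_counts
```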

Question-f

What is the median pupil-teacher ratio among the towns in this data set?

summary(Boston$ptratio)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.60   17.40   19.05   18.46   20.20   22.00

The median pupil-teacher ratio is 19.05 pupils per teacher.
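If only the median is needed, it can be requested directly rather than read from the summary() output:

```r
library(MASS)

# Median pupil-teacher ratio across all suburbs
median(Boston$ptratio)
```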

Question-g

Which suburb of Boston has the lowest median value of owner-occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors?

selection <- Boston[order(Boston$medv),]
selection[1,]
##        crim zn indus chas   nox    rm age    dis rad tax ptratio black
## 399 38.3518  0  18.1    0 0.693 5.453 100 1.4896  24 666    20.2 396.9
##     lstat medv
## 399 30.59    5

Suburb #399 has the lowest median home value, $5,000. The summary of the full data set gives the ranges needed for the second part of the question.

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Suburb #399 compared with the overall ranges:

* Crime rate (38.35) is very high compared with the median (0.26) and mean (3.61) across all suburbs.
* No residential land is zoned for lots over 25,000 sq. ft. (zn = 0), which is true of more than half of the suburbs.
* Proportion of non-retail business acres (18.1) sits at the third quartile.
* It does not bound the Charles River.
* Nitrogen oxides concentration (0.693 parts per 10 million) is above the third quartile.
* Average number of rooms per dwelling (5.453) is below the first quartile.
* Proportion of owner-occupied units built prior to 1940 is the maximum (100).
* Weighted mean of distances to five Boston employment centres (1.49) is below the first quartile.
* Index of accessibility to radial highways is the maximum (24).
* Full-value property-tax rate (666 per $10,000) sits at the third quartile.
* Pupil-teacher ratio (20.2) sits at the third quartile.
* 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town, is the maximum (396.9).
* Lower status of the population (30.59 percent) is well above the third quartile (16.95).
* Median value of owner-occupied homes is the minimum ($5,000).

Based on the list above, suburb 399 can be classified as one of the least desirable places to live in Boston.
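The comparison with the overall ranges can also be made numerically. The sketch below first lists every suburb that attains the minimum medv, since order() only shows the first one, and then computes the percentile rank of suburb 399 on each variable. The names lowest and pct_rank are introduced here for illustration, and the code assumes row 399 is still at position 399 of the data frame.

```r
library(MASS)

# All suburbs that attain the minimum medv (order() shows only the first)
lowest <- Boston[Boston$medv == min(Boston$medv), ]
lowest

# Percentile rank of suburb 399 on every variable:
# the share of suburbs whose value is less than or equal to its value
pct_rank <- sapply(Boston, function(x) mean(x <= x[399]))
round(pct_rank, 2)
```

A rank of 1 means no suburb has a larger value; a rank close to 0 means almost every suburb has a larger value.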

Question-h

In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.

rm_over_7 <- subset(Boston, rm>7)
nrow(rm_over_7)  
## [1] 64
rm("rm_over_7")

There are 64 suburbs with more than 7 rooms per dwelling.

rm_over_8 <- subset(Boston, rm>8)
nrow(rm_over_8)   
## [1] 13

There are 13 suburbs with more than 8 rooms per dwelling.
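Both counts can be obtained in one call by looping over the thresholds. A minimal sketch; thresholds and counts are names introduced here for illustration:

```r
library(MASS)

# Count suburbs above each room threshold in one call
thresholds <- c(7, 8)
counts <- sapply(thresholds, function(k) sum(Boston$rm > k))
setNames(counts, paste0("rm > ", thresholds))
```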

summary(rm_over_8)
##       crim               zn            indus             chas       
##  Min.   :0.02009   Min.   : 0.00   Min.   : 2.680   Min.   :0.0000  
##  1st Qu.:0.33147   1st Qu.: 0.00   1st Qu.: 3.970   1st Qu.:0.0000  
##  Median :0.52014   Median : 0.00   Median : 6.200   Median :0.0000  
##  Mean   :0.71879   Mean   :13.62   Mean   : 7.078   Mean   :0.1538  
##  3rd Qu.:0.57834   3rd Qu.:20.00   3rd Qu.: 6.200   3rd Qu.:0.0000  
##  Max.   :3.47428   Max.   :95.00   Max.   :19.580   Max.   :1.0000  
##       nox               rm             age             dis       
##  Min.   :0.4161   Min.   :8.034   Min.   : 8.40   Min.   :1.801  
##  1st Qu.:0.5040   1st Qu.:8.247   1st Qu.:70.40   1st Qu.:2.288  
##  Median :0.5070   Median :8.297   Median :78.30   Median :2.894  
##  Mean   :0.5392   Mean   :8.349   Mean   :71.54   Mean   :3.430  
##  3rd Qu.:0.6050   3rd Qu.:8.398   3rd Qu.:86.50   3rd Qu.:3.652  
##  Max.   :0.7180   Max.   :8.780   Max.   :93.90   Max.   :8.907  
##       rad              tax           ptratio          black      
##  Min.   : 2.000   Min.   :224.0   Min.   :13.00   Min.   :354.6  
##  1st Qu.: 5.000   1st Qu.:264.0   1st Qu.:14.70   1st Qu.:384.5  
##  Median : 7.000   Median :307.0   Median :17.40   Median :386.9  
##  Mean   : 7.462   Mean   :325.1   Mean   :16.36   Mean   :385.2  
##  3rd Qu.: 8.000   3rd Qu.:307.0   3rd Qu.:17.40   3rd Qu.:389.7  
##  Max.   :24.000   Max.   :666.0   Max.   :20.20   Max.   :396.9  
##      lstat           medv     
##  Min.   :2.47   Min.   :21.9  
##  1st Qu.:3.32   1st Qu.:41.7  
##  Median :4.14   Median :48.3  
##  Mean   :4.31   Mean   :44.2  
##  3rd Qu.:5.12   3rd Qu.:50.0  
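##  Max.   :7.44   Max.   :50.0

The 13 suburbs averaging more than eight rooms per dwelling are comparatively affluent: their median lstat is 4.14 against 11.36 overall, and their median medv is 48.3 against 21.2. None of them has a crime rate above 3.47, which is below the overall third quartile of 3.68.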
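Placing the medians of the two groups side by side makes the comparison easier than reading two separate summary() printouts. A minimal sketch; big and comparison are names introduced here for illustration:

```r
library(MASS)

big <- subset(Boston, rm > 8)

# Median of every variable: all suburbs vs. suburbs with rm > 8
comparison <- rbind(all = sapply(Boston, median),
                    rm_over_8 = sapply(big, median))
round(comparison, 2)
```

Medians are used here rather than means because several variables, crim in particular, are strongly skewed.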