Final Exam

Let’s turn to the Boston housing dataset, which contains 14 variables for 506 towns in the Boston area, collected by the US Census Service.

library(MASS)

To begin, load the Boston dataset. In R, the dataset ships with the MASS package and is loaded with data().

data("Boston")  # loads the Boston data frame from MASS
head(Boston)    # preview the first six rows

Make pairwise scatterplots of some predictors (columns) in this dataset. Since this dataset includes many predictors, avoid plotting all of them in a single scatterplot matrix (pairs() in R) to minimize run-time. Comment on your observations.

str(Boston)

## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Boston$chas <- as.numeric(Boston$chas)
Boston$rad <- as.numeric(Boston$rad)
pairs(Boston)

#Ans: The full scatterplot matrix shows little beyond the presence of potential correlations among variables. A correlation matrix would give clearer insight, and question (c) provides the chance to construct one.
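The two ideas in the answer above, fewer panels and a correlation matrix, can be sketched as follows. The columns in `keep` are an assumption made for illustration only; they are not part of the original answer, and `cor_mat` is a name introduced here.

```r
library(MASS)   # provides the Boston data frame
data("Boston")

# Hypothetical subset of variables, chosen only to keep the plot readable
keep <- c("crim", "nox", "rm", "dis", "lstat", "medv")
pairs(Boston[, keep], col = "steelblue", pch = 20)

# Pearson correlation matrix of all 14 variables, rounded for display
cor_mat <- round(cor(Boston), 2)

# Show only the block that matches the plotted variables
cor_mat[keep, keep]
```

Restricting the plot to a handful of columns keeps each panel large enough to read, while the matrix still covers every variable.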

Are any of the predictors associated with per capita crime rate? If so, explain the relationship. Q: Which predictor has the highest correlation with ‘CRIM’, besides ‘CRIM’ itself?

par(mfrow = c(2, 2))
plot(Boston$crim ~ Boston$zn,
     log = 'xy',
     col = 'steelblue')

## Warning in xy.coords(x, y, xlabel, ylabel, log): 372 x values <= 0 omitted from
## logarithmic plot

plot(Boston$crim ~ Boston$age,
     log = 'xy',
     col = 'steelblue')

plot(Boston$crim ~ Boston$dis,
     log = 'xy',
     col = 'black')

plot(Boston$crim ~ Boston$lstat,
     log = 'xy',
     col = 'black')

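#The scatterplots suggest that the per capita crime rate (crim) is associated with each of the four predictors plotted: crim tends to increase with age and lstat and to decrease with zn and dis. These plots show direction only; they do not give correlation coefficients or p-values.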
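Since the question asks which predictor is most correlated with crim, one way to check is to compute the correlations directly. This is a minimal sketch that assumes Pearson correlation on the untransformed variables is the intended measure; `crim_cor` and `top` are names introduced here.

```r
library(MASS)
data("Boston")

# Pearson correlation of every other variable with crim (named vector of length 13)
crim_cor <- cor(Boston[, names(Boston) != "crim"], Boston$crim)[, 1]

# Sort by absolute value so strong negative correlations are not overlooked
sort(abs(crim_cor), decreasing = TRUE)

# Name of the predictor with the largest absolute correlation
top <- names(which.max(abs(crim_cor)))
top

# Test whether that correlation differs from zero
cor.test(Boston$crim, Boston[[top]])
```

Ranking by absolute value treats a strong negative relationship as just as informative as a strong positive one.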

Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor. Q: Which town number has the highest crime rate? Which town number has the lowest tax rate?

summary(Boston$crim)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00632  0.08204  0.25651  3.61352  3.67708 88.97620

summary(Boston$tax)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   187.0   279.0   330.0   408.2   666.0   711.0

summary(Boston$ptratio)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.60   17.40   19.05   18.46   20.20   22.00

library(ggplot2)

# Histograms built with ggplot() because qplot() is deprecated as of ggplot2 3.4.0
ggplot(Boston, aes(x = crim)) + geom_histogram(binwidth = 5) + labs(x = "Crime rate", y = "Number of Suburbs")

ggplot(Boston, aes(x = tax)) + geom_histogram(binwidth = 50) + labs(x = "Full-value property-tax rate per $10,000", y = "Number of Suburbs")

ggplot(Boston, aes(x = ptratio)) + geom_histogram(binwidth = 5) + labs(x = "Pupil-teacher ratio by town", y = "Number of Suburbs")

#Only a few Boston suburbs have elevated crime rates: the histogram of per capita crime rate is concentrated near zero (median 0.26), with a long right tail reaching 88.98.
#The histogram of full-value property-tax rate per $10,000 shows two groups: many suburbs fall between 200 and 400, and a large cluster sits at the high end of the range (third quartile 666, maximum 711).
#The histogram of pupil-teacher ratio by town is more even across its range of 12.6 to 22.0, with a notable spike near 20.
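The two specific questions (which town has the highest crime rate, and which has the lowest tax rate) can be answered with which.max() and which.min(), which return the row index of the first extreme value. A minimal sketch, assuming that "town number" means the row index in the data frame; `max_crim_town` and `min_tax_town` are names introduced here.

```r
library(MASS)
data("Boston")

# Row index of the town with the highest per capita crime rate
max_crim_town <- which.max(Boston$crim)

# Row index of the (first) town with the lowest property-tax rate
min_tax_town <- which.min(Boston$tax)

# Show both towns with the three predictors discussed above
Boston[c(max_crim_town, min_tax_town), c("crim", "tax", "ptratio")]

# Range (min and max) of each of the three predictors
sapply(Boston[, c("crim", "tax", "ptratio")], range)
```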

Count the number of towns in this dataset that bound the Charles River and name this number as the variable charles.

charles <- nrow(subset(Boston, chas == 1))  # chas is 1 if the tract bounds the river
charles

## [1] 35

What is the median pupil-teacher ratio among the towns in this dataset? Name this number as the variable med_pt.

med_pt <- median(Boston$ptratio)
med_pt

## [1] 19.05

Which town of Boston has the lowest median value of owner-occupied homes? Name this index as the variable min_medv. If there are multiple towns with the same minimum median value, then choose the first such observation. What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your observations.

min_medv <- which.min(Boston$medv)  # index of the first town with the minimum medv
Boston[min_medv, ]

##        crim zn indus chas   nox    rm age    dis rad tax ptratio black lstat
## 399 38.3518  0  18.1    0 0.693 5.453 100 1.4896  24 666    20.2 396.9 30.59
##     medv
## 399    5
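To compare this suburb with the overall ranges, one option is to put its values next to the minimum and maximum of each variable, along with the share of suburbs at or below its value. This is a sketch only; `pct_rank` and `comparison` are names introduced here, and the percentile-rank definition is one of several reasonable choices.

```r
library(MASS)
data("Boston")

min_medv <- which.min(Boston$medv)   # first row with the minimum medv

# Fraction of suburbs whose value is <= this suburb's value, per variable
pct_rank <- sapply(names(Boston), function(v) {
  mean(Boston[[v]] <= Boston[min_medv, v])
})

# Side by side: the suburb's value, the overall min and max, and its rank
comparison <- data.frame(
  value    = unlist(Boston[min_medv, ]),
  min      = sapply(Boston, min),
  max      = sapply(Boston, max),
  pct_rank = round(pct_rank, 2)
)
comparison
```

A `pct_rank` close to 1 means the suburb sits near the top of the range for that variable; a value close to 0 means it sits near the bottom.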

In this data set, how many of the suburbs average more than eight rooms per dwelling? Name this number as the variable num_rooms. Comment on the suburbs that average more than eight rooms per dwelling.

rm_over_7 <- subset(Boston, rm > 7)
nrow(rm_over_7)

## [1] 64

#There are 64 suburbs with more than 7 rooms per dwelling.

rm_over_8 <- subset(Boston, rm > 8)
num_rooms <- nrow(rm_over_8)
num_rooms

## [1] 13

#There are 13 suburbs that average more than 8 rooms per dwelling.
summary(rm_over_8)

##       crim               zn            indus             chas       
##  Min.   :0.02009   Min.   : 0.00   Min.   : 2.680   Min.   :0.0000  
##  1st Qu.:0.33147   1st Qu.: 0.00   1st Qu.: 3.970   1st Qu.:0.0000  
##  Median :0.52014   Median : 0.00   Median : 6.200   Median :0.0000  
##  Mean   :0.71879   Mean   :13.62   Mean   : 7.078   Mean   :0.1538  
##  3rd Qu.:0.57834   3rd Qu.:20.00   3rd Qu.: 6.200   3rd Qu.:0.0000  
##  Max.   :3.47428   Max.   :95.00   Max.   :19.580   Max.   :1.0000  
##       nox               rm             age             dis       
##  Min.   :0.4161   Min.   :8.034   Min.   : 8.40   Min.   :1.801  
##  1st Qu.:0.5040   1st Qu.:8.247   1st Qu.:70.40   1st Qu.:2.288  
##  Median :0.5070   Median :8.297   Median :78.30   Median :2.894  
##  Mean   :0.5392   Mean   :8.349   Mean   :71.54   Mean   :3.430  
##  3rd Qu.:0.6050   3rd Qu.:8.398   3rd Qu.:86.50   3rd Qu.:3.652  
##  Max.   :0.7180   Max.   :8.780   Max.   :93.90   Max.   :8.907  
##       rad              tax           ptratio          black      
##  Min.   : 2.000   Min.   :224.0   Min.   :13.00   Min.   :354.6  
##  1st Qu.: 5.000   1st Qu.:264.0   1st Qu.:14.70   1st Qu.:384.5  
##  Median : 7.000   Median :307.0   Median :17.40   Median :386.9  
##  Mean   : 7.462   Mean   :325.1   Mean   :16.36   Mean   :385.2  
##  3rd Qu.: 8.000   3rd Qu.:307.0   3rd Qu.:17.40   3rd Qu.:389.7  
##  Max.   :24.000   Max.   :666.0   Max.   :20.20   Max.   :396.9  
##      lstat           medv     
##  Min.   :2.47   Min.   :21.9  
##  1st Qu.:3.32   1st Qu.:41.7  
##  Median :4.14   Median :48.3  
##  Mean   :4.31   Mean   :44.2  
##  3rd Qu.:5.12   3rd Qu.:50.0  
##  Max.   :7.44   Max.   :50.0
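To comment on these suburbs, it helps to set their medians next to the medians over all 506 suburbs. The sketch below assumes the median is a suitable summary for the comparison; `med_compare` is a name introduced here.

```r
library(MASS)
data("Boston")

rm_over_8 <- subset(Boston, rm > 8)

# Median of each variable in the subset next to the median over all suburbs
med_compare <- data.frame(
  rm_over_8   = sapply(rm_over_8, median),
  all_suburbs = sapply(Boston, median)
)

# Positive values mean the large-dwelling suburbs sit above the overall median
med_compare$difference <- med_compare$rm_over_8 - med_compare$all_suburbs
med_compare
```

The sign of `difference` for variables such as crim, lstat and medv gives a quick basis for describing how these suburbs differ from the rest.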

ANLY 560

8/3/2023