Introduction to Statistical Learning

Boston Housing Analysis

You are a data scientist at Awesome Business Analytics (ABA), a public company traded on the stock exchange. The CEO of ABA wants to invest in real estate in the Boston area. You are given a dataset containing housing values in the suburbs of Boston in the file Boston.csv.

Data Source:

Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102.
Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.

Data Dictionary

Attributes	Description	Data Type	Constraints / Rules
`crim`	Per capita crime rate by town	`float64`	Must be ≥ 0
`zn`	Proportion of residential land zoned for lots over 25,000 sq.ft	`float64`	Range: 0–100
`indus`	Proportion of non-retail business acres per town	`float64`	Must be ≥ 0
`chas`	Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)	`int64`	Binary: 0 or 1
`nox`	Nitric oxides concentration (parts per 10 million)	`float64`	Range: Typically 0.3–1.0
`rm`	Average number of rooms per dwelling	`float64`	Typically between 3 and 9
`age`	Proportion of owner-occupied units built before 1940	`float64`	Range: 0–100
`dis`	Weighted distances to five Boston employment centers	`float64`	Must be > 0
`rad`	Accessibility index to radial highways	`int64`	Discrete integer; values typically range from 1 to 24
`tax`	Property-tax rate per $10,000	`int64`	Must be > 0
`ptratio`	Pupil-teacher ratio by town	`float64`	Typical range: 12–22
`lstat`	% of lower status population	`float64`	Range: 0–100
`medv`	Median home value (in $1000s)	`float64`	Typical range: 5–50 (capped at 50)
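
The hard constraints in the table can be checked programmatically once the data is loaded. The following is a minimal sketch, assuming a helper named `check_boston` (hypothetical, not part of the original analysis). It covers only the strict rules; the "typical" ranges are left out because they are guidelines rather than limits. The stand-in data frame holds a few columns of the first five rows, copied from the `str()` output shown under Question 1.

```r
# Hypothetical helper: one named TRUE/FALSE entry per hard rule in the
# data dictionary. TRUE means every row satisfies the rule.
check_boston <- function(df) {
  c(
    crim_nonneg  = all(df$crim >= 0),
    zn_0_100     = all(df$zn >= 0 & df$zn <= 100),
    indus_nonneg = all(df$indus >= 0),
    chas_binary  = all(df$chas %in% c(0, 1)),
    age_0_100    = all(df$age >= 0 & df$age <= 100),
    dis_positive = all(df$dis > 0),
    tax_positive = all(df$tax > 0),
    lstat_0_100  = all(df$lstat >= 0 & df$lstat <= 100),
    medv_capped  = all(df$medv <= 50)
  )
}

# Stand-in data: first five rows as printed by str(Boston)
toy <- data.frame(
  crim  = c(0.00632, 0.02731, 0.02729, 0.03237, 0.06905),
  zn    = c(18, 0, 0, 0, 0),
  indus = c(2.31, 7.07, 7.07, 2.18, 2.18),
  chas  = c(0L, 0L, 0L, 0L, 0L),
  age   = c(65.2, 78.9, 61.1, 45.8, 54.2),
  dis   = c(4.09, 4.97, 4.97, 6.06, 6.06),
  tax   = c(296L, 242L, 242L, 222L, 222L),
  lstat = c(4.98, 9.14, 4.03, 2.94, 5.33),
  medv  = c(24, 21.6, 34.7, 33.4, 36.2)
)

checks <- check_boston(toy)
print(checks)
```

With the full dataset loaded, `check_boston(Boston)` would run the same rules on all 506 rows.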

Question 1

Read the dataset in Boston.csv into R. Call the loaded data Boston. Make sure that you have the directory set to the correct location for the data.

Read the data set into R memory

Boston <- read.csv("data/Boston.csv")
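
A wrong working directory is the most common reason this step fails. As a hedged sketch, a small wrapper (the name `read_boston` is hypothetical) can stop early with a message that includes the current working directory. The demonstration writes a two-row temporary CSV so that it runs without the real file.

```r
# Hypothetical helper: fail early with a readable message if the path is wrong.
read_boston <- function(path) {
  if (!file.exists(path)) {
    stop("File not found: ", path, " (working directory: ", getwd(), ")")
  }
  read.csv(path)
}

# Demonstration on a temporary CSV so the sketch runs anywhere
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(tax = c(296L, 242L), medv = c(24, 21.6)),
          tmp, row.names = FALSE)
demo <- read_boston(tmp)
print(dim(demo))
```

In this report the call would be `read_boston("data/Boston.csv")`.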

Check the dimension of the data frame

dim(Boston)

## [1] 506  13

Display the column names

colnames(Boston)

##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "lstat"   "medv"

Display the structure of the data frame

str(Boston)

## 'data.frame':    506 obs. of  13 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : int  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Display the statistical summary of the data frame

summary(Boston)

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          lstat      
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   : 1.73  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.: 6.95  
##  Median : 5.000   Median :330.0   Median :19.05   Median :11.36  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :12.65  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:16.95  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :37.97  
##       medv      
##  Min.   : 5.00  
##  1st Qu.:17.02  
##  Median :21.20  
##  Mean   :22.53  
##  3rd Qu.:25.00  
##  Max.   :50.00

Question 2

How many rows are in the data frame? How many columns? What do the rows and columns represent?

There are 506 rows in the data frame.
There are 13 columns in the data frame.
Each row represents one suburb (town) in the Boston area, and each column holds one attribute of that suburb, as listed in the data dictionary.

Question 3

Select the 1st, 100th, and 500th rows with columns tax and medv.

Extract 1st, 100th, and 500th rows with columns ‘tax’ and ‘medv’

selected_rows <- Boston[c(1, 100, 500), c("tax", "medv")]
print(selected_rows)

##     tax medv
## 1   296 24.0
## 100 276 33.2
## 500 391 17.5

Question 4

Look at the data using the cor function. Are any of the predictors associated with per capita crime rate? If so, explain the relationship based on correlation coefficients.

Correlation analysis to identify associations with crime rate.

cor_matrix <- cor(Boston)

Print the correlations between crim and every variable

print(cor_matrix["crim", ])

##        crim          zn       indus        chas         nox          rm 
##  1.00000000 -0.20046922  0.40658341 -0.05589158  0.42097171 -0.21924670 
##         age         dis         rad         tax     ptratio       lstat 
##  0.35273425 -0.37967009  0.62550515  0.58276431  0.28994558  0.45562148 
##        medv 
## -0.38830461

The crim row of the correlation matrix shows how strongly each variable is linearly associated with the per capita crime rate:
- Positive correlation: crim is most strongly correlated with rad (0.63) and tax (0.58), followed by lstat (0.46), nox (0.42), indus (0.41), age (0.35), and ptratio (0.29). Suburbs with better access to radial highways, higher property-tax rates, and more industrial land tend to have higher crime rates.
- Negative correlation: crim is negatively correlated with medv (-0.39), dis (-0.38), rm (-0.22), and zn (-0.20), so suburbs farther from the employment centers and with higher home values tend to have lower crime rates. The correlation with chas (-0.06) is negligible.
Correlation values close to 1 or -1 indicate strong linear relationships, while values near 0 indicate a weak or absent linear relationship. None of these coefficients exceeds 0.63 in absolute value, so the associations are moderate at best, and a correlation by itself does not show that a predictor causes higher crime.
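
Sorting the coefficients by absolute value makes the strongest associations easier to spot. The sketch below re-enters the values printed above as a named vector (the names `crim_cor` and `ranked` are illustrative) so that it runs without the data file.

```r
# Correlations with crim, copied from the printed output above
crim_cor <- c(
  zn = -0.20046922, indus = 0.40658341, chas = -0.05589158,
  nox = 0.42097171, rm = -0.21924670, age = 0.35273425,
  dis = -0.37967009, rad = 0.62550515, tax = 0.58276431,
  ptratio = 0.28994558, lstat = 0.45562148, medv = -0.38830461
)

# Order from strongest to weakest association, ignoring the sign
ranked <- crim_cor[order(abs(crim_cor), decreasing = TRUE)]
print(round(ranked, 2))
```

Ranked this way, rad, tax, and lstat come first and chas comes last. With the data frame in memory, the same vector can be obtained from `cor_matrix["crim", ]` after dropping the `crim` entry.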

Question 5

Make some pairwise scatterplots of the predictors, crim, rad, tax, indus, and lstat in this data set. Describe your findings.

Pairwise scatter plots for selected predictors

pairs(Boston[, c("crim", "rad", "tax", "indus", "lstat")], 
      main = "Pairwise Scatterplots of Selected Predictors")

From the pairwise scatterplot matrix, we can examine the relationships between the predictors: crim, rad, tax, indus, and lstat. Below is a description of each pairwise relationship along with key observations:

crim vs. rad:
Most points are concentrated at low values of both variables, so many suburbs combine a low crime rate with a low highway-accessibility index (rad). The higher-crime suburbs, however, appear at the higher rad values, which is consistent with the positive correlation of 0.63 found in Question 4.
crim vs. tax:
The spread of crime rates widens as tax increases, suggesting that the higher crime rates coincide with higher tax rates (correlation 0.58 in Question 4). Most observations nevertheless remain clustered at low tax and low crime values.
crim vs. indus:
A positive trend is visible, implying that areas with a higher proportion of industrial activities (non-retail business) tend to experience higher crime rates. Notably, there are several significant outliers in the high-crime and high-industry range, which warrant further investigation.
crim vs. lstat:
There appears to be a moderate positive correlation between crime rates (crim) and the percentage of the population with lower socioeconomic status (lstat). This suggests that areas with a higher proportion of low-income residents tend to experience higher crime rates, potentially reflecting a socioeconomic influence on crime.
rad vs. tax:
A strong positive relationship is evident between radial highway accessibility (rad) and tax rates, suggesting that areas with better highway access tend to have higher tax rates. This could be reflective of urban infrastructure and property value dynamics.
rad vs. indus:
There is some clustering where areas with higher rad values are associated with higher industrial activity (indus), suggesting that areas with better access to radial highways are more industrialized.
rad vs. lstat:
The relationship between rad and lstat is not immediately clear, though a few cases show higher lstat values paired with mid-range rad values.
tax vs. indus:
A positive trend between tax rates and industrial activity suggests that areas with a higher proportion of industrial activities tend to have higher tax rates. This relationship appears relatively strong compared to other pairs.
tax vs. lstat:
No clear linear trend emerges between tax and lstat. At most there is a weak positive association between higher tax rates and a higher proportion of lower-status residents.
indus vs. lstat:
A moderate positive correlation is apparent between industrial activity and lower socioeconomic status, suggesting that areas with a higher concentration of industry tend to have higher proportions of low-income residents.

Key Takeaways:

The clearest relationships are rad vs. tax (strong) and tax vs. indus (relatively strong), while indus vs. lstat shows a moderate positive association. These patterns suggest shared underlying factors rather than independent predictors.
The associations between crim and lstat, indus, and rad suggest that socioeconomic conditions and infrastructure are related to crime rates, although scatterplots alone cannot establish cause.
Outliers are present in several scatterplots. Such extreme values could skew a linear model and should be examined before fitting one.
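
Because these findings are read off the plots by eye, and because skewed variables with outliers can distort the usual (Pearson) correlation, a rank-based coefficient is one possible cross-check. The sketch below uses simulated stand-in data, not the Boston values, to show how the two coefficients differ when a relationship is monotone but far from linear. The variable names are illustrative.

```r
# Simulated stand-in data (NOT the Boston values)
set.seed(42)
rad_like  <- runif(100, min = 1, max = 24)
crim_like <- exp(rad_like / 4)   # increases with rad_like, strongly right-skewed

# Pearson measures linear association; Spearman measures monotone association
pearson  <- cor(rad_like, crim_like)
spearman <- cor(rad_like, crim_like, method = "spearman")

cat("Pearson: ", round(pearson, 3), "\n")
cat("Spearman:", round(spearman, 3), "\n")
```

On the real data, the equivalent call would be `cor(Boston[, c("crim", "rad", "tax", "indus", "lstat")], method = "spearman")`.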

Question 6

Do any of the suburbs of Boston appear to have particularly high crime rates by looking at the histogram of crim? What is the range of crim by using range() function in R?

Histogram of the Crime Rate (crim)

# Histogram of 'crim'
hist(
  Boston$crim,
  main = "Histogram of Crime Rate (crim)",
  xlab = "Crime Rate",
  labels = TRUE,
  col = "red",
  border = "black"
)

Based on the histogram, the distribution of the crime rate (crim) variable is heavily right skewed. Most suburbs have relatively low crime rates, with the highest frequency concentrated in the first bin, indicating that most suburbs in Boston experience very low crime rates.

However, the histogram also has sparsely populated bins beyond a crime rate of 20, extending past 80. These bins represent suburbs with unusually high crime rates relative to the rest of the data. About five or six suburbs have notably higher crime rates, and at least one exceeds 80, consistent with the maximum of 88.98 in the summary.

Key Observations:

Low crime rates dominate:
Most of the data points are clustered in the first bin: approximately 452 of the 506 suburbs fall there. The summary statistics agree, with a median crime rate of 0.26 and a third quartile of 3.68.
Outliers with high crime rates:
The few higher bins, especially those beyond 20, highlight suburbs with significantly higher crime rates. These suburbs are clear outliers in the distribution.

In summary, while most suburbs exhibit low crime rates, a small number of suburbs with much higher crime rates stand out as distinct outliers in this dataset.
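
Bin counts and the number of suburbs above a threshold can also be computed directly instead of being read off the plot. This is a minimal sketch on a short made-up vector (not the real crim values). The helper `n_above` is hypothetical.

```r
# Made-up, right-skewed stand-in for Boston$crim (NOT the real values)
crim_like <- c(0.006, 0.03, 0.07, 0.25, 0.9, 3.7, 9.2, 25.0, 41.5, 88.98)

# hist(plot = FALSE) returns the bin edges and counts without drawing
h <- hist(crim_like, plot = FALSE)
bins <- data.frame(
  lower = head(h$breaks, -1),
  upper = h$breaks[-1],
  count = h$counts
)
print(bins)

# Hypothetical helper: how many values exceed a threshold
n_above <- function(x, threshold) sum(x > threshold)
cat("Values above 20:", n_above(crim_like, 20), "\n")
```

Applied to the report's data, `n_above(Boston$crim, 20)` would give the exact number of high-crime suburbs.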

Range of Crime Rate crim

crime_range <- range(Boston$crim)
cat("Range of crime:", crime_range)

## Range of crime: 0.00632 88.9762

Question 7

How many of the suburbs in this dataset bound the Charles River?

Number of suburbs bordering the Charles River

charles_suburbs <- sum(Boston$chas == 1)
cat("Number of suburbs bordering the Charles River:",
    charles_suburbs, "\n")

## Number of suburbs bordering the Charles River: 35

Question 8

What is the median pupil-teacher ratio among the towns in this dataset? What’s the mean?

Calculate median and mean pupil-teacher ratio

median_ptratio <- median(Boston$ptratio)
mean_ptratio <- mean(Boston$ptratio)
cat("Median pupil-teacher ratio:", median_ptratio, "\n")

## Median pupil-teacher ratio: 19.05

cat("Mean pupil-teacher ratio:", mean_ptratio, "\n")

## Mean pupil-teacher ratio: 18.45553

Question 9

In this dataset, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.

Calculate the number of suburbs averaging more than 7 and 8 rooms per dwelling

more_than_7_rooms <- sum(Boston$rm > 7)
more_than_8_rooms <- sum(Boston$rm > 8)
cat("Suburbs averaging more than 7 rooms:", more_than_7_rooms, "\n")

## Suburbs averaging more than 7 rooms: 64

cat("Suburbs averaging more than 8 rooms:", more_than_8_rooms, "\n")

## Suburbs averaging more than 8 rooms: 13

64 suburbs average more than seven rooms per dwelling, and 13 of them average more than eight.
Suburbs averaging more than eight rooms per dwelling are likely to be more affluent, since larger homes generally go with higher property values and wealthier populations. They may represent attractive investment opportunities because of their spacious housing, but that reading should be confirmed by comparing medv and lstat for these 13 suburbs with the rest of the data.
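
One way to back up a comment like this is to compare medians for the selected suburbs against the whole dataset. The following sketch assumes a helper named `compare_subset` (hypothetical) and runs on five made-up rows, not the real Boston values.

```r
# Hypothetical helper: medians of selected columns for a subset of rows
# and for all rows, stacked as a two-row matrix.
compare_subset <- function(df, keep, cols) {
  rbind(
    subset  = sapply(df[keep, cols, drop = FALSE], median),
    overall = sapply(df[, cols, drop = FALSE], median)
  )
}

# Made-up stand-in rows (NOT the real Boston values)
toy <- data.frame(
  rm    = c(5.9, 6.2, 6.6, 8.1, 8.4),
  crim  = c(3.2, 0.9, 0.3, 0.5, 0.1),
  lstat = c(18, 12, 9, 4, 3),
  medv  = c(20, 21, 24, 45, 50)
)

cmp <- compare_subset(toy, toy$rm > 8, c("crim", "lstat", "medv"))
print(cmp)
```

On the real data the call would be `compare_subset(Boston, Boston$rm > 8, c("crim", "lstat", "medv"))`.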

Question 10

Convert chas to a factor. Boxplot the medv against chas. Are houses around the Charles River more expensive?

Convert chas to a factor

Boston$chas <- as.factor(Boston$chas)
str(Boston$chas)

##  Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

Plot boxplot of medv against chas

boxplot(
  medv ~ chas,
  data = Boston,
  main = "Boxplot of Median Value (medv) vs Charles River (chas)",
  xlab = "Charles River (chas)",
  ylab = "Median Value (medv)",
  col = c("blue", "green")
)

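The boxplot shows that suburbs bounding the Charles River (chas = 1) have a higher median medv than those that do not (chas = 0), so homes near the river do tend to be more expensive. The two distributions overlap considerably, however, and only 35 of the 506 suburbs bound the river, so the difference is a tendency rather than a rule. A plausible explanation is that proximity to the river is perceived as an amenity, which raises property values in those suburbs.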
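
The visual comparison can be complemented with group medians and a simple two-sample test. The sketch below uses ten made-up rows (not the real Boston values) and a Wilcoxon rank-sum test, which is one reasonable choice here because medv is capped at 50 and not normally distributed. Other tests would also be defensible.

```r
# Made-up stand-in rows (NOT the real Boston values)
toy <- data.frame(
  chas = factor(c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1)),
  medv = c(16.5, 18.9, 20.1, 21.6, 22.9, 24.0, 23.3, 27.1, 33.4, 36.2)
)

# Median home value within each chas group
group_medians <- tapply(toy$medv, toy$chas, median)
print(group_medians)

# Rank-based test of whether the two groups differ in location
test <- wilcox.test(medv ~ chas, data = toy)
cat("p-value:", round(test$p.value, 4), "\n")
```

After the conversion to a factor above, the same two calls work on the report's data with `Boston` in place of `toy`.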