Introduction to Statistical Learning
Boston Housing Analysis
You are a data scientist at Awesome Business Analytics (ABA), a publicly traded company. The CEO of ABA wants to invest in real estate in the Boston area. You are given a dataset of housing values in the suburbs of Boston in the file Boston.csv.
Data Source:
- Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102.
- Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.
Data Dictionary
| Attribute | Description | Data Type | Constraints / Rules |
|---|---|---|---|
| crim | Per capita crime rate by town | float64 | Must be ≥ 0 |
| zn | Proportion of residential land zoned for lots over 25,000 sq. ft. | float64 | Range: 0–100 |
| indus | Proportion of non-retail business acres per town | float64 | Must be ≥ 0 |
| chas | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) | int64 | Binary: 0 or 1 |
| nox | Nitric oxides concentration (parts per 10 million) | float64 | Typical range: 0.3–1.0 |
| rm | Average number of rooms per dwelling | float64 | Typically between 3 and 9 |
| age | Proportion of owner-occupied units built before 1940 | float64 | Range: 0–100 |
| dis | Weighted distances to five Boston employment centers | float64 | Must be > 0 |
| rad | Accessibility index to radial highways | int64 | Discrete integer; values typically range from 1 to 24 |
| tax | Property-tax rate per $10,000 | int64 | Must be > 0 |
| ptratio | Pupil-teacher ratio by town | float64 | Typical range: 12–22 |
| lstat | Percentage of lower-status population | float64 | Range: 0–100 |
| medv | Median home value (in $1000s) | float64 | Typical range: 5–50 (capped at 50) |
Question 1
Read the dataset in Boston.csv into R. Call the loaded data Boston. Make sure that you have the directory set to the correct location for the data.
Read the data set into R memory
Check the dimension of the data frame
## [1] 506 13
Display the column names
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "lstat" "medv"
Display the structure of the data frame
## 'data.frame': 506 obs. of 13 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : int 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Display the statistical summary of the data frame
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio lstat
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 1.73
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.: 6.95
## Median : 5.000 Median :330.0 Median :19.05 Median :11.36
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :12.65
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:16.95
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :37.97
## medv
## Min. : 5.00
## 1st Qu.:17.02
## Median :21.20
## Mean :22.53
## 3rd Qu.:25.00
## Max. :50.00
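The code behind the outputs in this question is not shown. A minimal sketch that would produce output of this kind is given here, assuming Boston.csv sits in the current working directory. The helper name `load_and_inspect` and its `path` argument are illustrative and not part of the original analysis.

```r
# Hypothetical helper: read a CSV file and print the four overviews used in
# this question (dimensions, column names, structure, statistical summary).
load_and_inspect <- function(path) {
  df <- read.csv(path)     # base R reader; the first line is treated as the header
  print(dim(df))           # number of rows, then number of columns
  print(names(df))         # column names
  str(df)                  # type and first few values of each column
  print(summary(df))       # min, quartiles, mean, and max of each column
  invisible(df)            # return the data frame without printing it again
}

# Assumption: Boston.csv is in the working directory (check with getwd(),
# change with setwd()). Nothing is loaded if the file is not found.
if (file.exists("Boston.csv")) {
  Boston <- load_and_inspect("Boston.csv")
}
```

If the file lives elsewhere, pass its full path to `load_and_inspect()` instead of changing the working directory.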
Question 2
How many rows are in the data frame? How many columns? What do the rows and columns represent?
- The data frame has 506 rows.
- The data frame has 13 columns.
- Each row represents one suburb in the Boston area, and each column is an attribute of that suburb.
Question 3
Select the 1st, 100th, and 500th rows with columns tax and medv.
Extract 1st, 100th, and 500th rows with columns ‘tax’ and ‘medv’
## tax medv
## 1 296 24.0
## 100 276 33.2
## 500 391 17.5
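The extraction code is not shown above. One way to get this result with base R indexing is sketched here, assuming `Boston` was already loaded as in Question 1. The variable names `rows` and `cols` are illustrative.

```r
rows <- c(1, 100, 500)      # row positions requested in the question
cols <- c("tax", "medv")    # columns requested in the question

# Assumes Boston was loaded as in Question 1; skipped otherwise.
if (exists("Boston")) {
  print(Boston[rows, cols])  # data-frame indexing takes [rows, columns]
}
```

Selecting columns by name rather than by position keeps the code correct even if the column order in the file changes.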
Question 4
Look at the data using the cor function. Are any of the predictors associated with per capita crime rate? If so, explain the relationship based on correlation coefficients.
Correlation analysis to identify associations with crime rate.
Print the correlations between crim and every variable
## crim zn indus chas nox rm
## 1.00000000 -0.20046922 0.40658341 -0.05589158 0.42097171 -0.21924670
## age dis rad tax ptratio lstat
## 0.35273425 -0.37967009 0.62550515 0.58276431 0.28994558 0.45562148
## medv
## -0.38830461
- The correlations show how each variable moves with the per capita crime rate (crim):
- Positive correlation: crim is most strongly correlated with rad (0.63, accessibility to radial highways) and tax (0.58, property-tax rate), followed by lstat (0.46), nox (0.42), indus (0.41), age (0.35), and ptratio (0.29). Suburbs with better highway access, higher tax rates, and more non-retail business acreage tend to have higher crime rates.
- Negative correlation: crim is negatively correlated with medv (-0.39), dis (-0.38), rm (-0.22), and zn (-0.20), so suburbs farther from the employment centers and with higher home values tend to have lower crime rates. The correlation with chas (-0.06) is close to zero.
- Values close to 1 or -1 indicate strong linear relationships, while values near 0 indicate a weak or absent linear relationship. No correlation with crim exceeds 0.63 in absolute value, so these associations are moderate at best, and on their own they do not show that any variable causes crime.
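The correlations discussed above can be computed and ranked as in the following sketch. It assumes `Boston` was loaded as in Question 1 and that every column is still numeric, so it should run before chas is converted to a factor. The helper name `rank_correlations` is illustrative.

```r
# Hypothetical helper: correlation of one column with every other column,
# sorted from strongest to weakest by absolute value.
rank_correlations <- function(df, target) {
  r <- cor(df)[target, ]          # row of the correlation matrix for the target
  r <- r[names(r) != target]      # drop the trivial self-correlation of 1
  r[order(abs(r), decreasing = TRUE)]
}

# Assumes Boston was loaded as in Question 1; skipped otherwise.
if (exists("Boston")) {
  print(round(rank_correlations(Boston, "crim"), 2))
}
```

Sorting by absolute value puts strong negative and strong positive associations side by side, which makes the ranking easier to read than the raw output.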
Question 5
Make some pairwise scatterplots of the predictors, crim, rad, tax, indus, and lstat in this data set. Describe your findings.
Pairwise scatter plots for selected predictors
pairs(Boston[, c("crim", "rad", "tax", "indus", "lstat")],
main = "Pairwise Scatterplots of Selected Predictors")
From the pairwise scatterplot matrix, we can examine the relationships between the predictors: crim, rad, tax, indus, and lstat. Below is a description of each pairwise relationship along with key observations:
crim vs. rad:
Most points sit at low values of both variables, so many suburbs combine low crime rates with limited access to radial highways (rad). The higher-crime suburbs appear at high rad values, which suggests a link between highway accessibility and crime rates.
crim vs. tax:
Crime rates spread out as tax increases, suggesting that higher crime rates tend to coincide with higher tax rates. Many observations still cluster at low tax and low crime values, so the pattern is not uniform.
crim vs. indus:
A positive trend is visible: suburbs with a higher proportion of non-retail business acreage (indus) tend to have higher crime rates. Several outliers in the high-crime, high-indus range warrant further investigation.
crim vs. lstat:
There appears to be a moderate positive relationship between crime rate (crim) and the percentage of lower-status population (lstat). Suburbs with a higher proportion of low-income residents tend to have higher crime rates, which may reflect a socioeconomic influence on crime.
rad vs. tax:
A strong positive relationship is evident between radial highway accessibility (rad) and tax rate: suburbs with better highway access tend to have higher tax rates. This may reflect urban infrastructure and property value dynamics.
rad vs. indus:
There is some clustering in which higher rad values go with higher indus values, suggesting that suburbs with better access to radial highways are more industrialized.
rad vs. lstat:
No clear relationship between rad and lstat stands out, though a few suburbs pair higher lstat values with mid-range rad values.
tax vs. indus:
A positive trend between tax and indus suggests that suburbs with more non-retail business acreage tend to have higher tax rates. This relationship looks relatively strong compared with the other pairs.
tax vs. lstat:
There is at most a mild positive association between tax and lstat, with no clear linear trend, so higher taxes are only weakly related to a higher proportion of low-income residents.
indus vs. lstat:
A moderate positive relationship is apparent: suburbs with a higher concentration of industry tend to have a higher proportion of low-income residents.
Key Takeaways:
- The strongest relationship is between rad and tax, and indus vs. lstat shows a moderate one, suggesting co-dependencies or shared underlying factors within the dataset.
- The positive associations between crim and lstat, indus, and rad suggest that crime rates are related to socioeconomic conditions and infrastructure, although scatterplots alone cannot establish causation.
- Outliers are present in multiple scatterplots. These extreme values could skew linear models and may require separate analysis.
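The visual impressions from the scatterplots can be cross-checked against numeric correlations for the same five variables. The sketch below assumes `Boston` was loaded as in Question 1. The helper name `cor_subset` and the `digits` argument are illustrative.

```r
vars <- c("crim", "rad", "tax", "indus", "lstat")  # variables plotted above

# Hypothetical helper: rounded correlation matrix for a subset of columns.
cor_subset <- function(df, columns, digits = 2) {
  round(cor(df[, columns]), digits)
}

# Assumes Boston was loaded as in Question 1; skipped otherwise.
if (exists("Boston")) {
  print(cor_subset(Boston, vars))
}
```

Pearson correlation only measures linear association, so a pair that looks clustered or curved in the scatterplot can still show a modest coefficient here.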
Question 6
Do any of the suburbs of Boston appear to have particularly high crime rates by looking at the histogram of crim? What is the range of crim by using range() function in R?
Histogram of the Crime Rate (crim)
# Histogram of 'crim'
hist(
Boston$crim,
main = "Histogram of Crime Rate (crim)",
xlab = "Crime Rate",
labels = TRUE,
col = "red",
border = "black"
)
The histogram shows that the crime rate (crim) is heavily right-skewed. The highest frequency is in the first bin, so most suburbs have relatively low crime rates.
However, a few bins lie beyond a crime rate of 20, and at least one suburb exceeds 80. About five or six suburbs stand out with notably higher crime rates than the rest of the data.
Key Observations:
Low crime rates dominate:
Approximately 452 suburbs fall in the first bin, the lowest range of crime rates.
Outliers with high crime rates:
The few bins beyond 20 contain suburbs with much higher crime rates. These suburbs are clear outliers in the distribution.
In summary, most suburbs have low crime rates, and a small number with much higher crime rates stand out as distinct outliers.
Range of Crime Rate crim
## Range of crime: 0.00632 88.9762
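The code for the range is not shown above. A minimal sketch follows, assuming `Boston` was loaded as in Question 1. It also counts the suburbs above a cutoff, which is a more reliable way to size the upper tail than reading bar heights. The helper name `range_and_tail` is illustrative, and the cutoff of 20 is taken from the discussion of the histogram.

```r
# Hypothetical helper: range of a numeric vector plus the number of values
# that exceed a cutoff.
range_and_tail <- function(x, cutoff) {
  list(range = range(x), n_above = sum(x > cutoff))
}

# Assumes Boston was loaded as in Question 1; skipped otherwise.
if (exists("Boston")) {
  res <- range_and_tail(Boston$crim, cutoff = 20)
  cat("Range of crime:", res$range, "\n")
  cat("Suburbs with crim > 20:", res$n_above, "\n")
}
```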
Question 7
How many of the suburbs in this dataset bound the Charles River?
Number of suburbs bordering the Charles River
charles_suburbs <- sum(Boston$chas == 1)
cat("Number of suburbs bordering the Charles River:",
charles_suburbs, "\n")
## Number of suburbs bordering the Charles River: 35
Question 8
What is the median pupil-teacher ratio among the towns in this dataset? What’s the mean?
Calculate median and mean pupil-teacher ratio
median_ptratio <- median(Boston$ptratio)
mean_ptratio <- mean(Boston$ptratio)
cat("Median pupil-teacher ratio:", median_ptratio, "\n")
cat("Mean pupil-teacher ratio:", mean_ptratio, "\n")
## Median pupil-teacher ratio: 19.05
## Mean pupil-teacher ratio: 18.45553
Question 9
In this dataset, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.
Calculate the number of suburbs averaging more than 7 and 8 rooms per dwelling
more_than_7_rooms <- sum(Boston$rm > 7)
more_than_8_rooms <- sum(Boston$rm > 8)
cat("Suburbs averaging more than 7 rooms:", more_than_7_rooms, "\n")
cat("Suburbs averaging more than 8 rooms:", more_than_8_rooms, "\n")
## Suburbs averaging more than 7 rooms: 64
## Suburbs averaging more than 8 rooms: 13
- Suburbs averaging more than 7 rooms: 64
- Suburbs averaging more than 8 rooms: 13
- Only 13 of the 506 suburbs average more than 8 rooms per dwelling. Larger homes usually go with higher property values and wealthier populations, so these suburbs are likely among the more affluent in the data. They may be attractive investment candidates because of their spacious housing and potential for long-term appreciation, although the room count alone does not confirm this.
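The comment about affluence can be checked against the data by comparing the large-home suburbs with all suburbs. The sketch below assumes `Boston` was loaded as in Question 1. The helper name `compare_large_homes` and the choice of comparison columns (medv, lstat, crim) are illustrative.

```r
# Hypothetical helper: medians of selected columns for rows where rm exceeds
# a cutoff, shown next to the medians for the whole data frame.
compare_large_homes <- function(df, cutoff = 8,
                                columns = c("medv", "lstat", "crim")) {
  big <- df[df$rm > cutoff, columns, drop = FALSE]
  rbind(
    above_cutoff = sapply(big, median),
    all_suburbs  = sapply(df[, columns, drop = FALSE], median)
  )
}

# Assumes Boston was loaded as in Question 1; skipped otherwise.
if (exists("Boston")) {
  print(compare_large_homes(Boston))
}
```

Medians are used rather than means because crim is heavily skewed and medv is capped at 50.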
Question 10
Convert chas to a factor. Boxplot medv against chas. Are houses around the Charles River more expensive?
Convert chas to a factor
## Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
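The conversion code is not shown above. A minimal sketch that would give this structure follows, assuming `Boston` was loaded as in Question 1. The helper name `chas_as_factor` is illustrative.

```r
# Hypothetical helper: return a copy of df with chas converted to a factor
# with levels "0" (does not bound the river) and "1" (bounds the river).
chas_as_factor <- function(df) {
  df$chas <- factor(df$chas, levels = c(0, 1))
  df
}

# Assumes Boston was loaded as in Question 1; skipped otherwise.
if (exists("Boston")) {
  Boston <- chas_as_factor(Boston)
  str(Boston$chas)
}
```

Once chas is a factor, functions that require numeric input, such as `cor()` on the whole data frame, will no longer accept it without converting back.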
Boxplot of medv against chas
boxplot(
medv ~ chas,
data = Boston,
main = "Boxplot of Median Value (medv) vs Charles River (chas)",
xlab = "Charles River (chas)",
ylab = "Median Value (medv)",
col = c("blue", "green")
)
The boxplot suggests that homes in tracts bounding the Charles River (chas = 1) have a higher median value than homes in tracts that do not (chas = 0). This is plausible, since proximity to a major body of water such as the Charles River is often perceived as an amenity that raises property values.
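The comparison in the boxplot can be backed by the group medians themselves. The sketch below assumes `Boston` was loaded as in Question 1; it works whether chas is numeric or already a factor. The helper name `median_medv_by_chas` is illustrative.

```r
# Hypothetical helper: median of medv within each level of chas.
median_medv_by_chas <- function(df) {
  tapply(df$medv, df$chas, median)
}

# Assumes Boston was loaded as in Question 1; skipped otherwise.
if (exists("Boston")) {
  print(median_medv_by_chas(Boston))
}
```

The two values returned correspond to the thick center lines of the two boxes, so they give a direct numeric answer to the question.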