# Run this only if you haven't installed the package
# install.packages("ISLR2")
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.4.2
library(ggplot2)
(a) Which of the predictors are quantitative, and which are qualitative?
Quantitative predictors include:
mpg
cylinders
displacement
horsepower
weight
acceleration
year
Qualitative predictors include:
origin (stored as the integer codes 1, 2, and 3, which label the region of origin rather than measure a quantity)
name
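A quick check of how R stores each column can support this classification. The sketch below assumes the ISLR2 package is installed; `numeric_cols` and `origin_label` are names introduced here for illustration, and the region labels follow the ISLR2 documentation for `Auto`.

```r
library(ISLR2)  # assumes the ISLR2 package is installed

# Columns that R stores as numbers
numeric_cols <- names(Auto)[sapply(Auto, is.numeric)]
numeric_cols

# origin is numeric in storage but categorical in meaning, so it can be
# viewed as a factor. origin_label is a hypothetical name for this sketch.
origin_label <- factor(Auto$origin, levels = 1:3,
                       labels = c("American", "European", "Japanese"))
table(origin_label)
```

Being stored as a number does not make a variable quantitative, which is why `origin` appears in `numeric_cols` even though it is treated as qualitative.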
data(Auto) # Load the dataset
Auto <- na.omit(Auto) # Remove missing values
attach(Auto)
## The following object is masked from package:ggplot2:
##
## mpg
range_mpg <- range(mpg)
range_cylinders <- range(cylinders)
range_displacement <- range(displacement)
range_horsepower <- range(horsepower)
range_weight <- range(weight)
range_acceleration <- range(acceleration)
range_model_year <- range(year)
range_origin <- range(origin) # origin is a category code, so this only shows which codes occur
range_mpg
## [1] 9.0 46.6
range_cylinders
## [1] 3 8
range_displacement
## [1] 68 455
range_horsepower
## [1] 46 230
range_weight
## [1] 1613 5140
range_acceleration
## [1] 8.0 24.8
range_model_year
## [1] 70 82
range_origin
## [1] 1 3
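The ranges can also be computed in a single call rather than one variable at a time. This is a minimal sketch, assuming ISLR2 is installed; `quant` and `ranges` are illustrative names, and the list of quantitative columns follows the classification in part (a).

```r
library(ISLR2)  # assumes the ISLR2 package is installed

# Quantitative columns, per the classification in part (a)
quant <- c("mpg", "cylinders", "displacement", "horsepower",
           "weight", "acceleration", "year")

# range() returns c(min, max); sapply() stacks the results column by column
ranges <- sapply(na.omit(Auto)[, quant], range)
rownames(ranges) <- c("min", "max")
ranges
```

Selecting columns by name avoids attaching the data frame, so nothing is masked from other packages.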
sapply(Auto[, sapply(Auto, is.numeric)], function(x) c(mean = mean(x), sd = sd(x)))
## mpg cylinders displacement horsepower weight acceleration
## mean 23.445918 5.471939 194.412 104.46939 2977.5842 15.541327
## sd 7.805007 1.705783 104.644 38.49116 849.4026 2.758864
## year origin
## mean 75.979592 1.5765306
## sd 3.683737 0.8055182
Auto_subset <- Auto[-(10:85), ]
sapply(Auto_subset[, sapply(Auto_subset, is.numeric)], function(x) c(range = range(x), mean = mean(x), sd = sd(x)))
## mpg cylinders displacement horsepower weight acceleration
## range1 11.000000 3.000000 68.00000 46.00000 1649.0000 8.500000
## range2 46.600000 8.000000 455.00000 230.00000 4997.0000 24.800000
## mean 24.404430 5.373418 187.24051 100.72152 2935.9715 15.726899
## sd 7.867283 1.654179 99.67837 35.70885 811.3002 2.693721
## year origin
## range1 70.000000 1.000000
## range2 82.000000 3.000000
## mean 77.145570 1.601266
## sd 3.106217 0.819910
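To see how much the summary changes once observations 10 through 85 are dropped, the means of the two versions can be subtracted directly. This is only a sketch, assuming ISLR2 is installed; `quant` and `mean_shift` are names introduced for illustration.

```r
library(ISLR2)  # assumes the ISLR2 package is installed

quant <- c("mpg", "cylinders", "displacement", "horsepower",
           "weight", "acceleration", "year")

# Negative indexing drops rows 10 through 85 (76 rows)
Auto_subset <- Auto[-(10:85), ]

# Positive values mean the subset average is higher than the full-data average
mean_shift <- colMeans(Auto_subset[, quant]) - colMeans(Auto[, quant])
round(mean_shift, 2)
```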
pairs(Auto[, sapply(Auto, is.numeric)], main = "Scatterplot Matrix of Auto Dataset")
The scatterplots show that mpg decreases as weight, horsepower, and displacement increase, and that it tends to increase with year. These are associations rather than causal effects, but they suggest that lighter cars with smaller, less powerful engines, and newer models, tend to be more fuel efficient.
library(ggplot2)
ggplot(Auto, aes(x = horsepower, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", col = "red") +
  ggtitle("MPG vs. Horsepower")
## `geom_smooth()` using formula = 'y ~ x'
The plots show which variables are associated with mpg. For example:
Weight vs. mpg: Heavier cars generally have lower mpg.
Horsepower vs. mpg: Cars with higher horsepower usually have lower mpg.
Year vs. mpg: Newer cars tend to have higher mpg, possibly because of advances in technology.
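The visual impressions above can be checked numerically with pairwise correlations. The sketch below assumes ISLR2 is installed; `quant` and `mpg_cor` are illustrative names. Pearson correlation only measures linear association, so it may understate curved relationships.

```r
library(ISLR2)  # assumes the ISLR2 package is installed

quant <- c("mpg", "cylinders", "displacement", "horsepower",
           "weight", "acceleration", "year")

# Correlation of mpg with every other quantitative variable
# (the "mpg" row of the correlation matrix, minus mpg itself)
mpg_cor <- cor(Auto[, quant])["mpg", -1]
sort(mpg_cor)
```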
To begin, load in the Boston data set. The Boston data set is part of the ISLR2 library.
library(ISLR2)
Now the data set is contained in the object Boston.
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio lstat medv
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 4.98 24.0
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 9.14 21.6
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 4.03 34.7
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 2.94 33.4
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 5.33 36.2
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 5.21 28.7
How many rows are in this data set? How many columns? What do the rows and columns represent?
data(Boston)
str(Boston) # Check structure
## 'data.frame': 506 obs. of 13 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston) # Number of rows and columns
## [1] 506 13
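The data set contains 506 rows and 13 columns. Each row is one census tract of Boston, and each column is a variable measured on that tract: 12 predictors plus medv, the median home value, which is usually treated as the response.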
Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.
pairs(Boston, main = "Pairwise Scatterplots of Boston Dataset")
Crime rate (crim) and median home value (medv) show a negative correlation (higher crime → lower home value).
Rooms per dwelling (rm) and median home value (medv) show a positive correlation (more rooms → higher home value).
Lower status population (lstat) and median home value (medv) show a strong negative correlation (higher lstat → lower home value).
Some variables (like tax, ptratio, and crim) show clusters of extreme values, indicating potential outliers.
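The directions read off the scatterplots can be confirmed with a few correlations against medv. This is a minimal sketch, assuming ISLR2 is installed; `medv_cor` is a name introduced for illustration.

```r
library(ISLR2)  # assumes the ISLR2 package is installed

# cor() with a vector and a data frame returns a 1-row matrix;
# [1, ] turns it into a named vector
medv_cor <- cor(Boston$medv, Boston[, c("crim", "rm", "lstat")])[1, ]
round(medv_cor, 2)
```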
attach(Boston)
Are any of the predictors associated with per capita crime rate? If so, explain the relationship.
cor(crim, Boston[,-1]) # Correlation of crime rate with all other variables
## zn indus chas nox rm age dis
## [1,] -0.2004692 0.4065834 -0.05589158 0.4209717 -0.2192467 0.3527343 -0.3796701
## rad tax ptratio lstat medv
## [1,] 0.6255051 0.5827643 0.2899456 0.4556215 -0.3883046
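Index of accessibility to radial highways (rad) (+ve correlation, 0.63) → The strongest association: tracts with better highway access tend to have more crime.
Tax rate (tax) (+ve correlation, 0.58) → High-tax tracts tend to have more crime.
Lower status population (lstat) (+ve correlation, 0.46) → Higher poverty levels → More crime.
Proportion of non-retail business (indus) (+ve correlation, 0.41) and nitrogen oxides concentration (nox) (+ve correlation, 0.42) → Industrial, more polluted areas tend to have more crime.
Crime is lower in tracts with higher home values (medv, -0.39), greater distance from employment centres (dis, -0.38), more rooms per dwelling (rm, -0.22), and more land zoned for large residential lots (zn, -0.20). The Charles River indicator (chas, -0.06) shows almost no linear association. These are correlations, so they describe association rather than cause.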
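Ordering the correlations by absolute size makes the strongest associations easier to pick out. The sketch assumes ISLR2 is installed; `crim_cor` and `crim_cor_sorted` are illustrative names.

```r
library(ISLR2)  # assumes the ISLR2 package is installed

# Correlation of crim with every other column, as a named vector
crim_cor <- cor(Boston$crim, Boston[, names(Boston) != "crim"])[1, ]

# Strongest associations first, regardless of sign
crim_cor_sorted <- crim_cor[order(abs(crim_cor), decreasing = TRUE)]
round(crim_cor_sorted, 2)
```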
Do any of the census tracts of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.
summary(crim) # Crime rate statistics
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00632 0.08204 0.25651 3.61352 3.67708 88.97620
summary(tax) # Tax rate statistics
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 187.0 279.0 330.0 408.2 666.0 711.0
summary(ptratio) # Pupil-teacher ratio statistics
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.60 17.40 19.05 18.46 20.20 22.00
Boston[crim > quantile(crim, 0.95), ] # Top 5% crime rate
## crim zn indus chas nox rm age dis rad tax ptratio lstat medv
## 375 18.4982 0 18.1 0 0.668 4.138 100.0 1.1370 24 666 20.2 37.97 13.8
## 376 19.6091 0 18.1 0 0.671 7.313 97.9 1.3163 24 666 20.2 13.44 15.0
## 379 23.6482 0 18.1 0 0.671 6.380 96.2 1.3861 24 666 20.2 23.69 13.1
## 380 17.8667 0 18.1 0 0.671 6.223 100.0 1.3861 24 666 20.2 21.78 10.2
## 381 88.9762 0 18.1 0 0.671 6.968 91.9 1.4165 24 666 20.2 17.21 10.4
## 382 15.8744 0 18.1 0 0.671 6.545 99.1 1.5192 24 666 20.2 21.08 10.9
## 385 20.0849 0 18.1 0 0.700 4.368 91.2 1.4395 24 666 20.2 30.63 8.8
## 386 16.8118 0 18.1 0 0.700 5.277 98.1 1.4261 24 666 20.2 30.81 7.2
## 387 24.3938 0 18.1 0 0.700 4.652 100.0 1.4672 24 666 20.2 28.28 10.5
## 388 22.5971 0 18.1 0 0.700 5.000 89.5 1.5184 24 666 20.2 31.99 7.4
## 399 38.3518 0 18.1 0 0.693 5.453 100.0 1.4896 24 666 20.2 30.59 5.0
## 401 25.0461 0 18.1 0 0.693 5.987 100.0 1.5888 24 666 20.2 26.77 5.6
## 404 24.8017 0 18.1 0 0.693 5.349 96.0 1.7028 24 666 20.2 19.77 8.3
## 405 41.5292 0 18.1 0 0.693 5.531 85.4 1.6074 24 666 20.2 27.38 8.5
## 406 67.9208 0 18.1 0 0.693 5.683 100.0 1.4254 24 666 20.2 22.98 5.0
## 407 20.7162 0 18.1 0 0.659 4.138 100.0 1.1781 24 666 20.2 23.34 11.9
## 411 51.1358 0 18.1 0 0.597 5.757 100.0 1.4130 24 666 20.2 10.11 15.0
## 413 18.8110 0 18.1 0 0.597 4.628 100.0 1.5539 24 666 20.2 34.37 17.9
## 414 28.6558 0 18.1 0 0.597 5.155 100.0 1.5894 24 666 20.2 20.08 16.3
## 415 45.7461 0 18.1 0 0.693 4.519 100.0 1.6582 24 666 20.2 36.98 7.0
## 416 18.0846 0 18.1 0 0.679 6.434 100.0 1.8347 24 666 20.2 29.05 7.2
## 418 25.9406 0 18.1 0 0.679 5.304 89.1 1.6475 24 666 20.2 26.64 10.4
## 419 73.5341 0 18.1 0 0.679 5.957 100.0 1.8026 24 666 20.2 20.62 8.8
## 426 15.8603 0 18.1 0 0.679 5.896 95.4 1.9096 24 666 20.2 24.39 8.3
## 428 37.6619 0 18.1 0 0.679 6.202 78.7 1.8629 24 666 20.2 14.52 10.9
## 441 22.0511 0 18.1 0 0.740 5.818 92.4 1.8662 24 666 20.2 22.11 10.5
Boston[tax > quantile(tax, 0.95), ] # Top 5% tax rate
## crim zn indus chas nox rm age dis rad tax ptratio lstat medv
## 489 0.15086 0 27.74 0 0.609 5.454 92.7 1.8209 4 711 20.1 18.06 15.2
## 490 0.18337 0 27.74 0 0.609 5.414 98.3 1.7554 4 711 20.1 23.97 7.0
## 491 0.20746 0 27.74 0 0.609 5.093 98.0 1.8226 4 711 20.1 29.68 8.1
## 492 0.10574 0 27.74 0 0.609 5.983 98.8 1.8681 4 711 20.1 18.07 13.6
## 493 0.11132 0 27.74 0 0.609 5.983 83.5 2.1099 4 711 20.1 13.35 20.1
Boston[ptratio > quantile(ptratio, 0.95), ] # Top 5% pupil-teacher ratio
## crim zn indus chas nox rm age dis rad tax ptratio lstat medv
## 55 0.01360 75 4.00 0 0.410 5.888 47.6 7.3197 3 469 21.1 14.80 18.9
## 128 0.25915 0 21.89 0 0.624 5.693 96.0 1.7883 4 437 21.2 17.19 16.2
## 129 0.32543 0 21.89 0 0.624 6.431 98.8 1.8125 4 437 21.2 15.39 18.0
## 130 0.88125 0 21.89 0 0.624 5.637 94.7 1.9799 4 437 21.2 18.34 14.3
## 131 0.34006 0 21.89 0 0.624 6.458 98.9 2.1185 4 437 21.2 12.60 19.2
## 132 1.19294 0 21.89 0 0.624 6.326 97.7 2.2710 4 437 21.2 12.26 19.6
## 133 0.59005 0 21.89 0 0.624 6.372 97.9 2.3274 4 437 21.2 11.12 23.0
## 134 0.32982 0 21.89 0 0.624 5.822 95.4 2.4699 4 437 21.2 15.03 18.4
## 135 0.97617 0 21.89 0 0.624 5.757 98.4 2.3460 4 437 21.2 17.31 15.6
## 136 0.55778 0 21.89 0 0.624 6.335 98.2 2.1107 4 437 21.2 16.96 18.1
## 137 0.32264 0 21.89 0 0.624 5.942 93.5 1.9669 4 437 21.2 16.90 17.4
## 138 0.35233 0 21.89 0 0.624 6.454 98.4 1.8498 4 437 21.2 14.59 17.1
## 139 0.24980 0 21.89 0 0.624 5.857 98.2 1.6686 4 437 21.2 21.32 13.3
## 140 0.54452 0 21.89 0 0.624 6.151 97.9 1.6687 4 437 21.2 18.46 17.8
## 141 0.29090 0 21.89 0 0.624 6.174 93.6 1.6119 4 437 21.2 24.16 14.0
## 142 1.62864 0 21.89 0 0.624 5.019 100.0 1.4394 4 437 21.2 34.41 14.4
## 355 0.04301 80 1.91 0 0.413 5.663 21.9 10.5857 4 334 22.0 8.05 18.2
## 356 0.10659 80 1.91 0 0.413 5.936 19.5 10.5857 4 334 22.0 5.57 20.6
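Crime rate: the distribution is strongly right-skewed. The median is 0.26 and the third quartile is 3.68, but the maximum is 88.98, so a small number of census tracts have notably elevated crime rates. Every tract in the top 5% has rad = 24, tax = 666, and ptratio = 20.2.
Tax rate: values range from 187 to 711. The third quartile is already 666, so a large group of tracts shares that high value, and only five tracts exceed the 95th percentile, all at 711.
Pupil-teacher ratio: values range from 12.6 to 22.0, a much narrower spread with no extreme values. The tracts in the top 5% have ratios between 21.1 and 22.0.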
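One way to make "particularly high" concrete is the common boxplot convention of flagging values more than 1.5 interquartile ranges above the third quartile. That threshold is an assumption of this sketch, not part of the exercise, and `flag_high` and `high_counts` are hypothetical names. ISLR2 is assumed to be installed.

```r
library(ISLR2)  # assumes the ISLR2 package is installed

# TRUE for values above Q3 + 1.5 * IQR (a convention, not a formal test)
flag_high <- function(x) x > quantile(x, 0.75) + 1.5 * IQR(x)

# Number of flagged census tracts for each variable
high_counts <- sapply(Boston[, c("crim", "tax", "ptratio")],
                      function(x) sum(flag_high(x)))
high_counts
```

Under this rule only crim has flagged tracts, because the maxima of tax and ptratio lie within 1.5 interquartile ranges of their third quartiles.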
How many of the census tracts in this data set bound the Charles river?
sum(chas == 1)
## [1] 35
What is the median pupil-teacher ratio among the towns in this data set?
median(ptratio)
## [1] 19.05
Which census tract of Boston has lowest median value of owner-occupied homes? What are the values of the other predictors for that census tract, and how do those values compare to the overall ranges for those predictors? Comment on your findings.
Boston[medv == min(medv), ]
## crim zn indus chas nox rm age dis rad tax ptratio lstat medv
## 399 38.3518 0 18.1 0 0.693 5.453 100 1.4896 24 666 20.2 30.59 5
## 406 67.9208 0 18.1 0 0.693 5.683 100 1.4254 24 666 20.2 22.98 5
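Two census tracts, 399 and 406, tie for the lowest median home value (medv = 5, i.e. $5,000, since medv is recorded in $1000s).
Compared with the overall ranges, both tracts sit at the unfavorable end of most predictors. Crime rate (crim) is 38.35 and 67.92, against a median of 0.26 and a maximum of 88.98. Age is 100, so every owner-occupied unit was built before 1940. Tax (666) and ptratio (20.2) are both at their third quartiles, rad is 24, and the lower status percentage (lstat) is 30.59 and 22.98. Both tracts have no land zoned for large lots (zn = 0), do not bound the Charles River (chas = 0), and average fewer than six rooms per dwelling (rm).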
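The comparison with the overall ranges can be made systematic by computing, for each variable, the share of tracts at or below the value seen in the lowest-medv tracts. This is a sketch under the assumption that ISLR2 is installed; `low` and `pct` are illustrative names.

```r
library(ISLR2)  # assumes the ISLR2 package is installed

# Tracts tied for the minimum median home value
low <- Boston[Boston$medv == min(Boston$medv), ]

# ecdf(x)(v) is the proportion of tracts with x <= v.
# The result has one row per low-medv tract and one column per variable.
pct <- sapply(names(Boston), function(v) ecdf(Boston[[v]])(low[[v]]))
round(pct, 2)
```

Values near 1 mean the tract is near the top of that variable's distribution, and values near 0 mean it is near the bottom.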
In this data set, how many of the census tracts average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the census tracts that average more than eight rooms per dwelling.
sum(Boston$rm > 7) # Count tracts with more than 7 rooms
## [1] 64
sum(Boston$rm > 8) # Count tracts with more than 8 rooms
## [1] 13
Boston[Boston$rm > 8, ] # Display details of these census tracts
## crim zn indus chas nox rm age dis rad tax ptratio lstat medv
## 98 0.12083 0 2.89 0 0.4450 8.069 76.0 3.4952 2 276 18.0 4.21 38.7
## 164 1.51902 0 19.58 1 0.6050 8.375 93.9 2.1620 5 403 14.7 3.32 50.0
## 205 0.02009 95 2.68 0 0.4161 8.034 31.9 5.1180 4 224 14.7 2.88 50.0
## 225 0.31533 0 6.20 0 0.5040 8.266 78.3 2.8944 8 307 17.4 4.14 44.8
## 226 0.52693 0 6.20 0 0.5040 8.725 83.0 2.8944 8 307 17.4 4.63 50.0
## 227 0.38214 0 6.20 0 0.5040 8.040 86.5 3.2157 8 307 17.4 3.13 37.6
## 233 0.57529 0 6.20 0 0.5070 8.337 73.3 3.8384 8 307 17.4 2.47 41.7
## 234 0.33147 0 6.20 0 0.5070 8.247 70.4 3.6519 8 307 17.4 3.95 48.3
## 254 0.36894 22 5.86 0 0.4310 8.259 8.4 8.9067 7 330 19.1 3.54 42.8
## 258 0.61154 20 3.97 0 0.6470 8.704 86.9 1.8010 5 264 13.0 5.12 50.0
## 263 0.52014 20 3.97 0 0.6470 8.398 91.5 2.2885 5 264 13.0 5.91 48.8
## 268 0.57834 20 3.97 0 0.5750 8.297 67.0 2.4216 5 264 13.0 7.44 50.0
## 365 3.47428 0 18.10 1 0.7180 8.780 82.9 1.9047 24 666 20.2 5.29 21.9
Only 13 of the 506 census tracts average more than eight rooms per dwelling.
These tracts look affluent: lstat is low (2.47 to 7.44), the crime rate is below 3.5 in every one, and medv is at least 37.6 in all but one tract (tract 365, at 21.9), with five tracts at 50.0.
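Comparing averages for these tracts against the full data set gives a compact check on this reading. The sketch assumes ISLR2 is installed; `big` and `compare` are names introduced for illustration.

```r
library(ISLR2)  # assumes the ISLR2 package is installed

# Tracts averaging more than eight rooms per dwelling
big <- Boston[Boston$rm > 8, ]

# Column means for those tracts next to the means for all tracts
compare <- rbind(rm_over_8 = colMeans(big), all_tracts = colMeans(Boston))
round(compare[, c("crim", "lstat", "medv")], 2)
```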