Lab1

# install.packages("ISLR2")  # Run this only if you haven't installed the package
library(ISLR2)

## Warning: package 'ISLR2' was built under R version 4.4.2

library(ggplot2)

This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.

(a) Which of the predictors are quantitative, and which are qualitative?

Quantitative vs. Qualitative Predictors:

Quantitative predictors include:

mpg
cylinders
displacement
horsepower
weight
acceleration
year

Qualitative predictors include:

origin
name
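
One caveat with this classification: in the ISLR2 copy of Auto, origin is stored as a number even though it is qualitative, so range() or mean() will run on it without complaint. The following is a minimal sketch of how one might check and recode such a column. The data frame `toy` and its values are made up as a stand-in for Auto, and the labels follow the coding described in ?Auto (worth confirming there before relying on them).

```r
# Hypothetical stand-in for Auto: 'origin' is a category stored as a number.
toy <- data.frame(
  mpg    = c(18, 26, 31, 24),
  origin = c(1, 2, 3, 3)
)

# Numeric storage means numeric summaries run, even if they are not meaningful.
sapply(toy, is.numeric)

# Recode as a factor so R treats the column as qualitative.
# Labels are assumed from ?Auto (1 = American, 2 = European, 3 = Japanese).
toy$origin <- factor(toy$origin, levels = 1:3,
                     labels = c("American", "European", "Japanese"))

sapply(toy, is.numeric)   # origin is no longer reported as numeric
table(toy$origin)         # counts per category
```

With Auto itself, the same factor() call could be applied to Auto$origin before computing any numeric summaries.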

What is the range of each quantitative predictor? You can answer this using the range() function.

data(Auto)  # Load the dataset
Auto <- na.omit(Auto)  # Remove missing values

attach(Auto)

## The following object is masked from package:ggplot2:
## 
##     mpg

range_mpg <- range(mpg)
range_cylinders <- range(cylinders)
range_displacement <- range(displacement)
range_weight <- range(weight)
range_acceleration <- range(acceleration)
range_model_year <- range(year)
range_origin <- range(origin)
range_mpg

## [1]  9.0 46.6

range_cylinders

## [1] 3 8

range_displacement

## [1]  68 455

range(horsepower)

## [1]  46 230

range_weight

## [1] 1613 5140

range_acceleration

## [1]  8.0 24.8

range_model_year

## [1] 70 82

range_origin

## [1] 1 3
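
Storing each range in its own variable works, but the set of variables can drift out of sync with the list of quantitative predictors. A more compact sketch is shown here, assuming the quantitative columns are named explicitly. The data frame `toy` and the vector `quant_vars` are hypothetical; with the real data one would pass Auto and the seven quantitative column names.

```r
# Hypothetical stand-in for Auto with a few of its columns.
toy <- data.frame(
  mpg        = c(18, 26, 31, 24),
  horsepower = c(130, 90, 65, 95),
  origin     = c(1, 2, 3, 3)     # qualitative code, deliberately excluded below
)

# Name the quantitative columns explicitly instead of relying on is.numeric().
quant_vars <- c("mpg", "horsepower")

# One column per variable; first row is the minimum, second the maximum.
ranges <- sapply(toy[quant_vars], range)
rownames(ranges) <- c("min", "max")
ranges
```

Listing the columns by name keeps numerically coded categories such as origin out of the summary.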

What is the mean and standard deviation of each quantitative predictor?

sapply(Auto[, sapply(Auto, is.numeric)], function(x) c(mean = mean(x), sd = sd(x)))

##            mpg cylinders displacement horsepower    weight acceleration
## mean 23.445918  5.471939      194.412  104.46939 2977.5842    15.541327
## sd    7.805007  1.705783      104.644   38.49116  849.4026     2.758864
##           year    origin
## mean 75.979592 1.5765306
## sd    3.683737 0.8055182

Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?

Auto_subset <- Auto[-(10:85), ]
sapply(Auto_subset[, sapply(Auto_subset, is.numeric)], function(x) c(range = range(x), mean = mean(x), sd = sd(x)))

##              mpg cylinders displacement horsepower    weight acceleration
## range1 11.000000  3.000000     68.00000   46.00000 1649.0000     8.500000
## range2 46.600000  8.000000    455.00000  230.00000 4997.0000    24.800000
## mean   24.404430  5.373418    187.24051  100.72152 2935.9715    15.726899
## sd      7.867283  1.654179     99.67837   35.70885  811.3002     2.693721
##             year   origin
## range1 70.000000 1.000000
## range2 82.000000 3.000000
## mean   77.145570 1.601266
## sd      3.106217 0.819910
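
Note that Auto[-(10:85), ] drops rows by position, not by row name. The two can differ when row names have gaps, as can happen after na.omit(). The following small sketch uses a made-up data frame (`toy` is hypothetical) to show the difference.

```r
# Hypothetical data with one missing value in row 2.
toy <- data.frame(id = 1:6, x = c(10, NA, 30, 40, 50, 60))

toy <- na.omit(toy)     # drops row 2; the remaining rows keep their old names
rownames(toy)           # "1" "3" "4" "5" "6"

# Negative indices refer to positions in the current data frame,
# so this removes the 2nd and 3rd remaining rows (ids 3 and 4).
kept <- toy[-(2:3), ]
kept$id                 # 1 5 6
```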

Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.

pairs(Auto[, sapply(Auto, is.numeric)], main = "Scatterplot Matrix of Auto Dataset")

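The scatterplot matrix shows that mpg falls as weight, horsepower, and displacement increase, and tends to rise with year. Weight, horsepower, and displacement are also strongly and positively related to one another, so they carry overlapping information. These are associations in the data, not evidence that changing one variable would change another.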

Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.

library(ggplot2)
ggplot(Auto, aes(x = horsepower, y = mpg)) + geom_point() + geom_smooth(method = "lm", col = "red") + ggtitle("MPG vs. Horsepower")

## `geom_smooth()` using formula = 'y ~ x'

Yes. Several of the other variables show a clear association with mpg in the plots:

Weight vs. mpg: Heavier cars generally have lower mpg.
Horsepower vs. mpg: Cars with higher horsepower generally have lower mpg. The points follow a curve, so the straight-line fit is only a rough summary.
Year vs. mpg: Newer model years tend to have higher mpg.
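
The visual impressions above can be backed with numbers by computing the correlation of mpg with each candidate predictor. The following is a minimal sketch on made-up values (`toy` is a hypothetical stand-in for Auto; only the method carries over, not the numbers).

```r
# Hypothetical stand-in for a few Auto columns.
toy <- data.frame(
  mpg    = c(18, 26, 31, 24, 15),
  weight = c(3500, 2600, 2000, 2800, 4000),
  year   = c(70, 76, 80, 74, 71)
)

# cor() with a vector and a data frame returns a 1-row matrix,
# one column per predictor.
cors <- cor(toy$mpg, toy[, c("weight", "year")])

# Order predictors by the strength of their linear association with mpg.
cors[1, order(abs(cors[1, ]), decreasing = TRUE)]
```

Pearson correlation only measures linear association, so a curved relationship like the one between horsepower and mpg can be understated by this number.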

This exercise involves the Boston housing data set.

To begin, load in the Boston data set. The Boston data set is part of the ISLR2 library.

library(ISLR2)

Now the data set is contained in the object Boston.

head(Boston)

##      crim zn indus chas   nox    rm  age    dis rad tax ptratio lstat medv
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3  4.98 24.0
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8  9.14 21.6
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8  4.03 34.7
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7  2.94 33.4
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7  5.33 36.2
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7  5.21 28.7

How many rows are in this data set? How many columns? What do the rows and columns represent?

data(Boston)
str(Boston)  # Check structure

## 'data.frame':    506 obs. of  13 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

dim(Boston)  # Number of rows and columns

## [1] 506  13

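The data set contains 506 rows and 13 columns. Each row is one census tract in the Boston area, and each column is a variable measured on that tract: 12 predictors plus medv, the median home value.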

Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.
```
pairs(Boston, main = "Pairwise Scatterplots of Boston Dataset")
```
- Crime rate (crim) and median home value (medv) show a negative correlation (higher crime → lower home value).
- Rooms per dwelling (rm) and median home value (medv) show a positive correlation (more rooms → higher home value).
- Lower status population (lstat) and median home value (medv) show a strong negative correlation (higher lstat → lower home value).
- Some variables (like tax, ptratio, and crim) show clusters of extreme values, indicating potential outliers.
```
attach(Boston)
```
Are any of the predictors associated with per capita crime rate? If so, explain the relationship.
```
cor(crim, Boston[,-1])  # Correlation of crime rate with all other variables
```
```
##              zn     indus        chas       nox         rm       age        dis
## [1,] -0.2004692 0.4065834 -0.05589158 0.4209717 -0.2192467 0.3527343 -0.3796701
##            rad       tax   ptratio     lstat       medv
## [1,] 0.6255051 0.5827643 0.2899456 0.4556215 -0.3883046
```
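- Accessibility to radial highways (rad) has the strongest correlation with crim (0.63), followed by the property-tax rate (tax, 0.58).
- Lower status population (lstat, 0.46), nitrogen oxides concentration (nox, 0.42), proportion of non-retail business (indus, 0.41), age (0.35), and ptratio (0.29) are also positively correlated with crime.
- Median home value (medv, -0.39), distance to employment centres (dis, -0.38), rooms per dwelling (rm, -0.22), and residential zoning (zn, -0.20) are negatively correlated with crime. chas is essentially uncorrelated (-0.06).
- Overall, crime is higher in tracts that are industrial, polluted, highly taxed, close to radial highways, and poorer, and lower in tracts with more expensive homes farther from employment centres. These are pairwise correlations, so they describe association rather than cause.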
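
Because crim is strongly right-skewed (its maximum is far above its third quartile), a few extreme tracts can dominate a Pearson correlation. One common cross-check is a rank-based correlation through the method argument of cor(). The following sketch uses made-up vectors `x` and `y`, not the Boston data.

```r
# Hypothetical values: x has one extreme observation, as crim does.
x <- c(1, 2, 3, 4, 100)
y <- c(1, 2, 3, 4, 5)

pearson  <- cor(x, y)                       # affected by the extreme value
spearman <- cor(x, y, method = "spearman")  # uses ranks only

c(pearson = pearson, spearman = spearman)
```

Applied to the lab, the same call would be cor(crim, Boston[, -1], method = "spearman"). If the ordering of predictors changes noticeably, the Pearson values are being driven by the extreme tracts.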

Do any of the census tracts of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.

summary(crim)  # Crime rate statistics

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00632  0.08204  0.25651  3.61352  3.67708 88.97620

summary(tax)   # Tax rate statistics

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   187.0   279.0   330.0   408.2   666.0   711.0

summary(ptratio)  # Pupil-teacher ratio statistics

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.60   17.40   19.05   18.46   20.20   22.00

Boston[crim > quantile(crim, 0.95), ]  # Top 5% crime rate

##        crim zn indus chas   nox    rm   age    dis rad tax ptratio lstat medv
## 375 18.4982  0  18.1    0 0.668 4.138 100.0 1.1370  24 666    20.2 37.97 13.8
## 376 19.6091  0  18.1    0 0.671 7.313  97.9 1.3163  24 666    20.2 13.44 15.0
## 379 23.6482  0  18.1    0 0.671 6.380  96.2 1.3861  24 666    20.2 23.69 13.1
## 380 17.8667  0  18.1    0 0.671 6.223 100.0 1.3861  24 666    20.2 21.78 10.2
## 381 88.9762  0  18.1    0 0.671 6.968  91.9 1.4165  24 666    20.2 17.21 10.4
## 382 15.8744  0  18.1    0 0.671 6.545  99.1 1.5192  24 666    20.2 21.08 10.9
## 385 20.0849  0  18.1    0 0.700 4.368  91.2 1.4395  24 666    20.2 30.63  8.8
## 386 16.8118  0  18.1    0 0.700 5.277  98.1 1.4261  24 666    20.2 30.81  7.2
## 387 24.3938  0  18.1    0 0.700 4.652 100.0 1.4672  24 666    20.2 28.28 10.5
## 388 22.5971  0  18.1    0 0.700 5.000  89.5 1.5184  24 666    20.2 31.99  7.4
## 399 38.3518  0  18.1    0 0.693 5.453 100.0 1.4896  24 666    20.2 30.59  5.0
## 401 25.0461  0  18.1    0 0.693 5.987 100.0 1.5888  24 666    20.2 26.77  5.6
## 404 24.8017  0  18.1    0 0.693 5.349  96.0 1.7028  24 666    20.2 19.77  8.3
## 405 41.5292  0  18.1    0 0.693 5.531  85.4 1.6074  24 666    20.2 27.38  8.5
## 406 67.9208  0  18.1    0 0.693 5.683 100.0 1.4254  24 666    20.2 22.98  5.0
## 407 20.7162  0  18.1    0 0.659 4.138 100.0 1.1781  24 666    20.2 23.34 11.9
## 411 51.1358  0  18.1    0 0.597 5.757 100.0 1.4130  24 666    20.2 10.11 15.0
## 413 18.8110  0  18.1    0 0.597 4.628 100.0 1.5539  24 666    20.2 34.37 17.9
## 414 28.6558  0  18.1    0 0.597 5.155 100.0 1.5894  24 666    20.2 20.08 16.3
## 415 45.7461  0  18.1    0 0.693 4.519 100.0 1.6582  24 666    20.2 36.98  7.0
## 416 18.0846  0  18.1    0 0.679 6.434 100.0 1.8347  24 666    20.2 29.05  7.2
## 418 25.9406  0  18.1    0 0.679 5.304  89.1 1.6475  24 666    20.2 26.64 10.4
## 419 73.5341  0  18.1    0 0.679 5.957 100.0 1.8026  24 666    20.2 20.62  8.8
## 426 15.8603  0  18.1    0 0.679 5.896  95.4 1.9096  24 666    20.2 24.39  8.3
## 428 37.6619  0  18.1    0 0.679 6.202  78.7 1.8629  24 666    20.2 14.52 10.9
## 441 22.0511  0  18.1    0 0.740 5.818  92.4 1.8662  24 666    20.2 22.11 10.5

Boston[tax > quantile(tax, 0.95), ]  # Top 5% tax rate

##        crim zn indus chas   nox    rm  age    dis rad tax ptratio lstat medv
## 489 0.15086  0 27.74    0 0.609 5.454 92.7 1.8209   4 711    20.1 18.06 15.2
## 490 0.18337  0 27.74    0 0.609 5.414 98.3 1.7554   4 711    20.1 23.97  7.0
## 491 0.20746  0 27.74    0 0.609 5.093 98.0 1.8226   4 711    20.1 29.68  8.1
## 492 0.10574  0 27.74    0 0.609 5.983 98.8 1.8681   4 711    20.1 18.07 13.6
## 493 0.11132  0 27.74    0 0.609 5.983 83.5 2.1099   4 711    20.1 13.35 20.1

Boston[ptratio > quantile(ptratio, 0.95), ]  # Top 5% pupil-teacher ratio

##        crim zn indus chas   nox    rm   age     dis rad tax ptratio lstat medv
## 55  0.01360 75  4.00    0 0.410 5.888  47.6  7.3197   3 469    21.1 14.80 18.9
## 128 0.25915  0 21.89    0 0.624 5.693  96.0  1.7883   4 437    21.2 17.19 16.2
## 129 0.32543  0 21.89    0 0.624 6.431  98.8  1.8125   4 437    21.2 15.39 18.0
## 130 0.88125  0 21.89    0 0.624 5.637  94.7  1.9799   4 437    21.2 18.34 14.3
## 131 0.34006  0 21.89    0 0.624 6.458  98.9  2.1185   4 437    21.2 12.60 19.2
## 132 1.19294  0 21.89    0 0.624 6.326  97.7  2.2710   4 437    21.2 12.26 19.6
## 133 0.59005  0 21.89    0 0.624 6.372  97.9  2.3274   4 437    21.2 11.12 23.0
## 134 0.32982  0 21.89    0 0.624 5.822  95.4  2.4699   4 437    21.2 15.03 18.4
## 135 0.97617  0 21.89    0 0.624 5.757  98.4  2.3460   4 437    21.2 17.31 15.6
## 136 0.55778  0 21.89    0 0.624 6.335  98.2  2.1107   4 437    21.2 16.96 18.1
## 137 0.32264  0 21.89    0 0.624 5.942  93.5  1.9669   4 437    21.2 16.90 17.4
## 138 0.35233  0 21.89    0 0.624 6.454  98.4  1.8498   4 437    21.2 14.59 17.1
## 139 0.24980  0 21.89    0 0.624 5.857  98.2  1.6686   4 437    21.2 21.32 13.3
## 140 0.54452  0 21.89    0 0.624 6.151  97.9  1.6687   4 437    21.2 18.46 17.8
## 141 0.29090  0 21.89    0 0.624 6.174  93.6  1.6119   4 437    21.2 24.16 14.0
## 142 1.62864  0 21.89    0 0.624 5.019 100.0  1.4394   4 437    21.2 34.41 14.4
## 355 0.04301 80  1.91    0 0.413 5.663  21.9 10.5857   4 334    22.0  8.05 18.2
## 356 0.10659 80  1.91    0 0.413 5.936  19.5 10.5857   4 334    22.0  5.57 20.6

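Crime rate: yes. The distribution is heavily right-skewed. The median is 0.26 and the third quartile is 3.68, but the maximum is 88.98 (tract 381), and all 26 tracts in the top 5% have crim above 15. These tracts all share indus = 18.1, rad = 24, tax = 666, and ptratio = 20.2.
Tax rate: the range is 187 to 711. The third quartile is 666, so at least a quarter of the tracts sit at 666 or above, and only five tracts (489 to 493) reach the maximum of 711.
Pupil-teacher ratio: the range is 12.6 to 22.0 with a median of 19.05. The spread is narrow compared with crim and tax, and no tract stands out as extreme.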
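
Filtering on the 95th percentile always returns roughly the top 5% of rows, whether or not they are unusual. A common alternative is to flag values above the upper boxplot fence, Q3 + 1.5 × IQR. The sketch below uses a made-up vector (`crim_toy` is hypothetical), and the 1.5 multiplier is a convention, not something the exercise prescribes.

```r
# Hypothetical crime rates with one extreme tract.
crim_toy <- c(0.1, 0.2, 0.3, 0.4, 0.5, 50)

# First and third quartiles, then the upper fence Q3 + 1.5 * IQR.
q <- quantile(crim_toy, c(0.25, 0.75))
upper_fence <- q[[2]] + 1.5 * (q[[2]] - q[[1]])

is_high <- crim_toy > upper_fence
which(is_high)   # positions of flagged values
sum(is_high)     # how many were flagged
```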

How many of the census tracts in this data set bound the Charles river?
```
sum(chas == 1)
```
```
## [1] 35
```
What is the median pupil-teacher ratio among the towns in this data set?
```
median(ptratio)
```
```
## [1] 19.05
```
Which census tract of Boston has the lowest median value of owner-occupied homes? What are the values of the other predictors for that census tract, and how do those values compare to the overall ranges for those predictors? Comment on your findings.
```
Boston[medv == min(medv), ]
```
```
##        crim zn indus chas   nox    rm age    dis rad tax ptratio lstat medv
## 399 38.3518  0  18.1    0 0.693 5.453 100 1.4896  24 666    20.2 30.59    5
## 406 67.9208  0  18.1    0 0.693 5.683 100 1.4254  24 666    20.2 22.98    5
```
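Findings:
- Two census tracts, 399 and 406, tie for the lowest median home value (medv = 5).
- Both have very high crime rates (crim = 38.35 and 67.92, against a median of 0.26 and a maximum of 88.98), a tax rate of 666 and a pupil-teacher ratio of 20.2 (both at the third quartile of their distributions), and high lstat (30.59 and 22.98).
- In both tracts age is 100, the highest value a percentage can take, zn is 0, and neither tract bounds the Charles River (chas = 0).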
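One way to compare a tract with the rest of the data is to compute, for each variable, the share of tracts at or below the tract's value. The sketch below uses a made-up data frame (`toy` is a hypothetical stand-in for Boston). When several tracts tie for the minimum, it uses the first one.
```r
# Hypothetical stand-in for a few Boston columns.
toy <- data.frame(
  crim  = c(0.1, 0.3, 0.5, 4.0, 40.0),
  lstat = c(5, 8, 12, 20, 30),
  medv  = c(35, 28, 22, 14, 5)
)

# Row(s) with the smallest medv.
lowest <- toy[toy$medv == min(toy$medv), ]

# Share of tracts at or below the selected tract's value, per column.
pct_rank <- sapply(names(toy), function(v) mean(toy[[v]] <= lowest[[v]][1]))
round(pct_rank, 2)
```
A share close to 1 means the tract is near the top of that variable's range, and a share close to 0 means it is near the bottom.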
In this data set, how many of the census tracts average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the census tracts that average more than eight rooms per dwelling.

sum(Boston$rm > 7)  # Count tracts with more than 7 rooms

## [1] 64

sum(Boston$rm > 8)  # Count tracts with more than 8 rooms

## [1] 13

Boston[Boston$rm > 8, ]  # Display details of these census tracts

##        crim zn indus chas    nox    rm  age    dis rad tax ptratio lstat medv
## 98  0.12083  0  2.89    0 0.4450 8.069 76.0 3.4952   2 276    18.0  4.21 38.7
## 164 1.51902  0 19.58    1 0.6050 8.375 93.9 2.1620   5 403    14.7  3.32 50.0
## 205 0.02009 95  2.68    0 0.4161 8.034 31.9 5.1180   4 224    14.7  2.88 50.0
## 225 0.31533  0  6.20    0 0.5040 8.266 78.3 2.8944   8 307    17.4  4.14 44.8
## 226 0.52693  0  6.20    0 0.5040 8.725 83.0 2.8944   8 307    17.4  4.63 50.0
## 227 0.38214  0  6.20    0 0.5040 8.040 86.5 3.2157   8 307    17.4  3.13 37.6
## 233 0.57529  0  6.20    0 0.5070 8.337 73.3 3.8384   8 307    17.4  2.47 41.7
## 234 0.33147  0  6.20    0 0.5070 8.247 70.4 3.6519   8 307    17.4  3.95 48.3
## 254 0.36894 22  5.86    0 0.4310 8.259  8.4 8.9067   7 330    19.1  3.54 42.8
## 258 0.61154 20  3.97    0 0.6470 8.704 86.9 1.8010   5 264    13.0  5.12 50.0
## 263 0.52014 20  3.97    0 0.6470 8.398 91.5 2.2885   5 264    13.0  5.91 48.8
## 268 0.57834 20  3.97    0 0.5750 8.297 67.0 2.4216   5 264    13.0  7.44 50.0
## 365 3.47428  0 18.10    1 0.7180 8.780 82.9 1.9047  24 666    20.2  5.29 21.9

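Of the 506 census tracts, 64 average more than seven rooms per dwelling and 13 average more than eight.
The 13 tracts with more than eight rooms have low lstat (2.47 to 7.44) and, with one exception, high median home values (37.6 to 50.0, five of them at 50.0) and crime rates below 2. The exception is tract 365, with crim = 3.47 and medv = 21.9.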
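
The comparison can be summarised by putting the column means of the large-home tracts next to those of the remaining tracts. The following is a minimal sketch on made-up values (`toy` is a hypothetical stand-in for Boston).

```r
# Hypothetical stand-in for a few Boston columns.
toy <- data.frame(
  rm    = c(5.9, 6.2, 6.8, 8.1, 8.4),
  lstat = c(18, 12, 9, 4, 3),
  medv  = c(17, 22, 27, 45, 50)
)

big <- toy$rm > 8   # TRUE for tracts averaging more than eight rooms

# One row of column means per group.
group_means <- rbind(
  over_8 = colMeans(toy[big, ]),
  others = colMeans(toy[!big, ])
)
group_means
```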

Mounya

2025-01-13
