9. This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load data
auto <- read.csv("/Users/saransh/Downloads/Auto.csv", na.strings = "?")

# Remove missing values
auto <- na.omit(auto)
str(auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : chr  "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
##  - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
##   ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
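
The effect of na.strings = "?" can be seen on a small inline CSV. This is a minimal sketch: csv_text and its values are hypothetical stand-ins for Auto.csv, not rows from the real file.

```r
# Hypothetical three-row CSV in which a missing horsepower is written as "?"
csv_text <- "mpg,horsepower\n18,130\n25,?\n15,165"

# Without na.strings, the "?" prevents horsepower from being read as numeric
raw <- read.csv(text = csv_text)

# With na.strings = "?", the "?" becomes NA and the column is read as integer
parsed <- read.csv(text = csv_text, na.strings = "?")

# na.omit() drops every row that contains at least one NA
clean <- na.omit(parsed)

str(raw)
str(parsed)
nrow(clean)
```

If the "?" entries were left in place, later calls such as mean() or range() on that column would fail or return NA.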

(a) Which of the predictors are quantitative, and which are qualitative?

quantitative_vars <- sapply(auto, is.numeric)
qualitative_vars <- !quantitative_vars

list(
  Quantitative = names(auto)[quantitative_vars],
  Qualitative = names(auto)[qualitative_vars]
)
## $Quantitative
## [1] "mpg"          "cylinders"    "displacement" "horsepower"   "weight"      
## [6] "acceleration" "year"         "origin"      
## 
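## $Qualitative
## [1] "name"

Findings:

  • is.numeric() only checks how a column is stored, so origin is listed as quantitative. It is a numeric code for region of origin (1 = American, 2 = European, 3 = Japanese) and is better treated as qualitative, together with name.
  • cylinders takes only a handful of integer values (3 to 8), so it can reasonably be treated as either quantitative or qualitative.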
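
One way to make the storage-type check agree with the meaning of the variables is to convert coded categories to factors first. The sketch below assumes a hypothetical toy data frame (auto_toy, with made-up rows) and the origin coding 1 = American, 2 = European, 3 = Japanese.

```r
# Hypothetical toy rows standing in for the Auto data
auto_toy <- data.frame(
  mpg    = c(18, 15, 24, 26),
  origin = c(1L, 1L, 3L, 2L),
  name   = c("car a", "car b", "car c", "car d")
)

# Recode origin as a labelled factor so it is no longer treated as a number
auto_toy$origin <- factor(auto_toy$origin,
                          levels = 1:3,
                          labels = c("American", "European", "Japanese"))

# is.numeric() is FALSE for factors, so origin now falls on the qualitative side
is_quant <- sapply(auto_toy, is.numeric)
quantitative <- names(auto_toy)[is_quant]
qualitative <- names(auto_toy)[!is_quant]

list(Quantitative = quantitative, Qualitative = qualitative)
```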

(b) What is the range of each quantitative predictor? You can answer this using the range() function.

range_values <- sapply(auto[, quantitative_vars], range)
print(range_values)
##       mpg cylinders displacement horsepower weight acceleration year origin
## [1,]  9.0         3           68         46   1613          8.0   70      1
## [2,] 46.6         8          455        230   5140         24.8   82      3

(c) What is the mean and standard deviation of each quantitative predictor?

summary_stats <- data.frame(
  Mean = sapply(auto[, quantitative_vars], mean),
  Std_Dev = sapply(auto[, quantitative_vars], sd)
)
print(summary_stats)
##                     Mean     Std_Dev
## mpg            23.445918   7.8050075
## cylinders       5.471939   1.7057832
## displacement  194.411990 104.6440039
## horsepower    104.469388  38.4911599
## weight       2977.584184 849.4025600
## acceleration   15.541327   2.7588641
## year           75.979592   3.6837365
## origin          1.576531   0.8055182

(d) Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?

auto_subset <- auto[-c(10:85), ]

range_subset <- sapply(auto_subset[, quantitative_vars], range)
summary_stats_subset <- data.frame(
  Mean = sapply(auto_subset[, quantitative_vars], mean),
  Std_Dev = sapply(auto_subset[, quantitative_vars], sd)
)

list(Range = range_subset, Summary = summary_stats_subset)
## $Range
##       mpg cylinders displacement horsepower weight acceleration year origin
## [1,] 11.0         3           68         46   1649          8.5   70      1
## [2,] 46.6         8          455        230   4997         24.8   82      3
## 
## $Summary
##                     Mean    Std_Dev
## mpg            24.404430   7.867283
## cylinders       5.373418   1.654179
## displacement  187.240506  99.678367
## horsepower    100.721519  35.708853
## weight       2935.971519 811.300208
## acceleration   15.726899   2.693721
## year           77.145570   3.106217
## origin          1.601266   0.819910
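
Parts (b) through (d) repeat the same three summaries, so they could be wrapped in one helper. This is a sketch only: describe() is a hypothetical helper, and toy holds just the first ten mpg and horsepower values shown in the str() output above, with rows 3 to 5 dropped in place of rows 10 to 85.

```r
# Hypothetical helper: one row per variable with min, max, mean and sd
describe <- function(df) {
  t(sapply(df, function(x) {
    c(min = min(x), max = max(x), mean = mean(x), sd = sd(x))
  }))
}

# First ten values of mpg and horsepower from the str() output
toy <- data.frame(
  mpg        = c(18, 15, 18, 16, 17, 15, 14, 14, 14, 15),
  horsepower = c(130, 165, 150, 150, 140, 198, 220, 215, 225, 190)
)

full_stats <- describe(toy)

# Negative indices drop rows by position, as in auto[-c(10:85), ]
subset_stats <- describe(toy[-(3:5), ])

full_stats
subset_stats
```

Because negative indexing works on row positions rather than row names, the rows removed are the 10th through 85th that remain after na.omit().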

(e) Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.

pairs(auto[, quantitative_vars], main = "Scatterplot Matrix of Quantitative Predictors")

ggplot(auto, aes(x = as.factor(cylinders), y = mpg)) +
  geom_boxplot() +
  labs(title = "MPG vs Cylinders", x = "Cylinders", y = "MPG")

Findings:

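  • MPG decreases as horsepower and weight increase.
  • Cars with more cylinders tend to have lower MPG.
  • Displacement is also negatively related to MPG.
  • Displacement, horsepower, and weight are strongly positively related to one another, so they carry overlapping information.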
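
A single pairwise relationship can be inspected more closely than the scatterplot matrix allows. The sketch below uses base graphics and only the first ten weight and mpg values from the str() output above (toy is a hypothetical name), so the picture is illustrative rather than a summary of the full data.

```r
# First ten weight and mpg values from the str() output
toy <- data.frame(
  weight = c(3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 3850),
  mpg    = c(18, 15, 18, 16, 17, 15, 14, 14, 14, 15)
)

# Scatterplot of mpg against weight
plot(toy$weight, toy$mpg,
     xlab = "Weight", ylab = "MPG", main = "MPG vs Weight")

# lowess() gives a smoothed trend that does not assume a straight line
trend <- lowess(toy$weight, toy$mpg)
lines(trend)

# The sign of the correlation summarises the direction of the relationship
cor(toy$weight, toy$mpg)
```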

(f) Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.

corr_matrix <- cor(auto[, quantitative_vars])
corr_mpg <- sort(corr_matrix["mpg", ], decreasing = TRUE)
print(corr_mpg)
##          mpg         year       origin acceleration    cylinders   horsepower 
##    1.0000000    0.5805410    0.5652088    0.4233285   -0.7776175   -0.7784268 
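## displacement       weight 
##   -0.8051269   -0.8322442

Findings:

  • Yes. weight (-0.83), displacement (-0.81), horsepower (-0.78), and cylinders (-0.78) are all strongly negatively correlated with mpg, so each looks useful for predicting it.
  • year (0.58) and origin (0.57) are moderately positively correlated with mpg; acceleration (0.42) has the weakest association.
  • origin is a numeric code for a category, so its correlation coefficient is only a rough guide.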
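
Sorting by signed value places the strongest negative correlations last. A ranking by absolute value may be easier to read; the sketch below re-enters the correlations printed above by hand (without the mpg self-correlation) rather than recomputing them.

```r
# Correlations with mpg, copied from the printed output above
corr_mpg <- c(
  year = 0.5805410, origin = 0.5652088, acceleration = 0.4233285,
  cylinders = -0.7776175, horsepower = -0.7784268,
  displacement = -0.8051269, weight = -0.8322442
)

# Rank by strength of association, ignoring the sign
ranked <- corr_mpg[order(abs(corr_mpg), decreasing = TRUE)]
print(ranked)
```

With the full data, the same idea would be corr_matrix["mpg", colnames(corr_matrix) != "mpg"] followed by the order(abs(...)) step.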

10. This exercise involves the Boston housing data set.

# require() loads ISLR2 when it is installed and returns FALSE otherwise
if (!require(ISLR2)) {
  install.packages("ISLR2")
  library(ISLR2)
}
## Loading required package: ISLR2
data("Boston")
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio lstat medv
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3  4.98 24.0
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8  9.14 21.6
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8  4.03 34.7
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7  2.94 33.4
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7  5.33 36.2
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7  5.21 28.7

(a) How many rows are in this data set? How many columns? What do the rows and columns represent?

dim(Boston)
## [1] 506  13

Interpretation

  • The data set contains 506 rows and 13 columns.
  • Each row represents one census tract (suburb) of Boston.
  • Each column is a housing, socioeconomic, or environmental attribute measured for that tract (for example crim, nox, ptratio, and medv).

(b) Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.

pairs(Boston, main = "Pairwise Scatterplots of Boston Housing Data")

Findings:

  • Several predictors are related to one another. For example, crim tends to fall as dis (distance to employment centers) increases.
  • tax and ptratio (pupil-teacher ratio) are positively related.
  • medv rises with rm (average rooms per dwelling) and falls with lstat (percentage of lower-status population).

(c) Are any of the predictors associated with per capita crime rate? If so, explain the relationship.

corr_matrix <- cor(Boston)
corr_matrix["crim", ]
##        crim          zn       indus        chas         nox          rm 
##  1.00000000 -0.20046922  0.40658341 -0.05589158  0.42097171 -0.21924670 
##         age         dis         rad         tax     ptratio       lstat 
##  0.35273425 -0.37967009  0.62550515  0.58276431  0.28994558  0.45562148 
##        medv 
## -0.38830461

Findings:

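  • crim is most strongly correlated with rad (0.63, accessibility to radial highways) and tax (0.58), followed by lstat (0.46), nox (0.42, nitrogen oxide concentration), and indus (0.41).
  • crim is negatively correlated with medv (-0.39, median home value) and dis (-0.38, distance to employment centers): crime tends to be higher in tracts that are closer to employment centers and have lower home values.
  • chas (-0.06) shows essentially no linear association with crime.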

(d) Do any of the census tracts of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.

summary(Boston[, c("crim", "tax", "ptratio")])
##       crim               tax           ptratio     
##  Min.   : 0.00632   Min.   :187.0   Min.   :12.60  
##  1st Qu.: 0.08205   1st Qu.:279.0   1st Qu.:17.40  
##  Median : 0.25651   Median :330.0   Median :19.05  
##  Mean   : 3.61352   Mean   :408.2   Mean   :18.46  
##  3rd Qu.: 3.67708   3rd Qu.:666.0   3rd Qu.:20.20  
##  Max.   :88.97620   Max.   :711.0   Max.   :22.00

Findings:

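  • crim ranges from 0.006 to 88.98, but the median is only 0.26 and the third quartile 3.68. The distribution is heavily right-skewed, with a small number of tracts having crime rates far above the rest.
  • tax ranges from 187 to 711. The third quartile (666) is about twice the median (330) and close to the maximum, so at least a quarter of tracts have high tax rates.
  • ptratio ranges from 12.6 to 22.0 with a median of 19.05. The spread is narrow and the maximum is not far above the median, so no tract stands out as extreme.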
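
To decide which tracts count as "particularly high", one option is the boxplot rule of flagging values more than 1.5 interquartile ranges above the third quartile. That rule is an assumption brought in here, not something the exercise prescribes, and flag_high() is a hypothetical helper. The toy vector combines the six crim values from head(Boston) with the crim value of tract 399 shown in part (g).

```r
# Hypothetical helper: TRUE for values above Q3 + 1.5 * IQR
flag_high <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  x > q[2] + 1.5 * (q[2] - q[1])
}

# Six crim values from head(Boston) plus the value for tract 399
crim_toy <- c(0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 38.3518)

# Positions of the flagged values
high_idx <- which(flag_high(crim_toy))
high_idx
```

On the full data the same helper could be applied as sum(flag_high(Boston$crim)), and likewise for tax and ptratio.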

(e) How many of the census tracts in this data set bound the Charles river?

charles_bound <- sum(Boston$chas == 1)
charles_bound
## [1] 35

Findings:

  • 35 census tracts bound the Charles River.

(f) What is the median pupil-teacher ratio among the towns in this data set?

median_ptr <- median(Boston$ptratio)
median_ptr
## [1] 19.05

Findings:

  • The median pupil-teacher ratio among towns is 19.05.

(g) Which census tract of Boston has lowest median value of owner-occupied homes? What are the values of the other predictors for that census tract, and how do those values compare to the overall ranges for those predictors? Comment on your findings.

lowest_medv_tract <- Boston[which.min(Boston$medv), ]
print(lowest_medv_tract)
##        crim zn indus chas   nox    rm age    dis rad tax ptratio lstat medv
## 399 38.3518  0  18.1    0 0.693 5.453 100 1.4896  24 666    20.2 30.59    5

Findings:

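  • Tract 399 has the lowest medv, at 5 (medv is recorded in thousands of dollars).
  • Its crim of 38.35 is far above the third quartile of 3.68, while its tax (666) and ptratio (20.2) both sit at the third quartile of their distributions.
  • age is 100, meaning all owner-occupied units were built before 1940, and dis is only 1.49, so the tract is close to the employment centers.
  • Taken together, these values suggest poorer neighborhood conditions.
  • which.min() reports only the first row that attains the minimum, so any other tract tied at medv = 5 would not appear in this output.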
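
which.min() returns a single index, so a logical comparison is a safer way to pick up every tract at the minimum. The sketch below uses a hypothetical four-row data frame (toy, with made-up tract labels and values) and ecdf() to express each selected value as a fraction of tracts at or below it.

```r
# Hypothetical toy data with a tie at the lowest medv
toy <- data.frame(
  medv = c(24.0, 5.0, 34.7, 5.0),
  crim = c(0.006, 38.352, 0.027, 12.000),
  row.names = c("t1", "t2", "t3", "t4")
)

# which.min() keeps only the first row at the minimum
first_only <- toy[which.min(toy$medv), ]

# A logical comparison keeps every row at the minimum
all_lowest <- toy[toy$medv == min(toy$medv), ]

# Fraction of all tracts whose crim is at or below each selected tract's crim
crim_pct <- ecdf(toy$crim)(all_lowest$crim)

first_only
all_lowest
crim_pct
```

The same ecdf() step can be repeated for each predictor to place the selected tract within the overall range.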

(h) In this data set, how many of the census tracts average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the census tracts that average more than eight rooms per dwelling.

over_7_rooms <- sum(Boston$rm > 7)
over_8_rooms <- sum(Boston$rm > 8)
list(Over_7 = over_7_rooms, Over_8 = over_8_rooms)
## $Over_7
## [1] 64
## 
## $Over_8
## [1] 13

Findings:

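  • 64 census tracts average more than 7 rooms per dwelling.
  • 13 census tracts average more than 8 rooms per dwelling.
  • The tracts averaging more than 8 rooms likely correspond to wealthier areas; comparing their medv and lstat with the full data set is a direct way to check this.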
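
A comparison of the large-home tracts with the whole data set can be sketched as follows. The toy data frame holds only the six rows printed by head(Boston), none of which exceeds 8 rooms, so the cutoff of 7 is used here purely to make the example produce a non-empty subset.

```r
# rm, lstat and medv for the six rows shown by head(Boston)
boston_head <- data.frame(
  rm    = c(6.575, 6.421, 7.185, 6.998, 7.147, 6.430),
  lstat = c(4.98, 9.14, 4.03, 2.94, 5.33, 5.21),
  medv  = c(24.0, 21.6, 34.7, 33.4, 36.2, 28.7)
)

# Tracts averaging more than 7 rooms (use rm > 8 on the full data)
large <- boston_head[boston_head$rm > 7, ]

# Medians for the subset next to medians for all rows
comparison <- rbind(
  large_homes = sapply(large, median),
  all_tracts  = sapply(boston_head, median)
)
comparison
```

On the full data, Boston[Boston$rm > 8, ] would take the place of large.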