9. This exercise involves the Auto data set studied in the lab. Make
sure that the missing values have been removed from the data.
library(tidyverse)
# Load data
auto <- read.csv("/Users/saransh/Downloads/Auto.csv", na.strings = "?")
# Remove missing values
auto <- na.omit(auto)
str(auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
## ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
(a) Which of the predictors are quantitative, and which are
qualitative?
# is.numeric() flags columns stored as numbers (integer or double)
quantitative_vars <- sapply(auto, is.numeric)
qualitative_vars <- !quantitative_vars
list(
  Quantitative = names(auto)[quantitative_vars],
  Qualitative = names(auto)[qualitative_vars]
)
## $Quantitative
## [1] "mpg" "cylinders" "displacement" "horsepower" "weight"
## [6] "acceleration" "year" "origin"
##
## $Qualitative
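## [1] "name"
Findings:
- is.numeric() only checks how a column is stored, so this split needs
one correction: origin is stored as an integer but codes the region of
manufacture (1 = American, 2 = European, 3 = Japanese), which makes it
qualitative.
- name is qualitative. cylinders takes only a few distinct values and
could also be treated as qualitative, though it is usually kept as
quantitative.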
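One way to make the type check agree with the meaning of the variables is to recode integer-coded categories as factors before calling is.numeric(). A minimal sketch on a toy data frame (the values are hypothetical, and the labels assume the ISLR coding of origin):

```r
# Toy stand-in for a few Auto columns (illustrative values, not the real data)
toy <- data.frame(
  mpg    = c(18, 26, 31.5, 24),
  origin = c(1L, 2L, 3L, 1L),
  name   = c("car a", "car b", "car c", "car d")
)

# Recode origin as a factor; labels assume 1 = American, 2 = European, 3 = Japanese
toy$origin <- factor(toy$origin, levels = 1:3,
                     labels = c("American", "European", "Japanese"))

# A factor is not numeric, so origin now lands in the qualitative group
is_quant <- vapply(toy, is.numeric, logical(1))
quantitative <- names(toy)[is_quant]
qualitative  <- names(toy)[!is_quant]
print(list(Quantitative = quantitative, Qualitative = qualitative))
```

Applied to the real data, the same recoding would be done on auto$origin before the quantitative/qualitative split.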
(b) What is the range of each quantitative predictor? You can answer
this using the range() function.
range_values <- sapply(auto[, quantitative_vars], range)
print(range_values)
##       mpg cylinders displacement horsepower weight acceleration year origin
## [1,]  9.0         3           68         46   1613          8.0   70      1
## [2,] 46.6         8          455        230   5140         24.8   82      3
(c) What is the mean and standard deviation of each quantitative
predictor?
summary_stats <- data.frame(
  Mean = sapply(auto[, quantitative_vars], mean),
  Std_Dev = sapply(auto[, quantitative_vars], sd)
)
print(summary_stats)
## Mean Std_Dev
## mpg 23.445918 7.8050075
## cylinders 5.471939 1.7057832
## displacement 194.411990 104.6440039
## horsepower 104.469388 38.4911599
## weight 2977.584184 849.4025600
## acceleration 15.541327 2.7588641
## year 75.979592 3.6837365
## origin 1.576531 0.8055182
(d) Now remove the 10th through 85th observations. What is the
range, mean, and standard deviation of each predictor in the subset of
the data that remains?
# Negative indices drop rows by position in the cleaned data
auto_subset <- auto[-c(10:85), ]
range_subset <- sapply(auto_subset[, quantitative_vars], range)
summary_stats_subset <- data.frame(
  Mean = sapply(auto_subset[, quantitative_vars], mean),
  Std_Dev = sapply(auto_subset[, quantitative_vars], sd)
)
list(Range = range_subset, Summary = summary_stats_subset)
## $Range
##       mpg cylinders displacement horsepower weight acceleration year origin
## [1,] 11.0         3           68         46   1649          8.5   70      1
## [2,] 46.6         8          455        230   4997         24.8   82      3
##
## $Summary
## Mean Std_Dev
## mpg 24.404430 7.867283
## cylinders 5.373418 1.654179
## displacement 187.240506 99.678367
## horsepower 100.721519 35.708853
## weight 2935.971519 811.300208
## acceleration 15.726899 2.693721
## year 77.145570 3.106217
## origin 1.601266 0.819910
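One detail worth checking in part (d) is that negative indices drop rows by position, not by original row label. The str() output earlier shows that na.omit() removed original row 33, so positions 10 to 85 of the cleaned data correspond to original rows 10 to 86 with row 33 already gone. The sketch below illustrates the difference on a toy data frame (hypothetical values); which reading the exercise intends is not stated.

```r
# Toy data frame with one missing value in row 3 (hypothetical values)
toy <- data.frame(x = c(10, 20, NA, 40, 50, 60))
clean <- na.omit(toy)

# na.omit() keeps the original row names, so they no longer match positions
print(rownames(clean))

# Negative integer indices drop rows by position (positions 2 to 4 here)
by_position <- clean[-(2:4), , drop = FALSE]

# Dropping by original row label removes a different set of rows
by_label <- clean[!(rownames(clean) %in% as.character(2:4)), , drop = FALSE]

print(by_position$x)
print(by_label$x)
```

The two subsets differ whenever a row was removed before or inside the dropped range.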
(f) Suppose that we wish to predict gas mileage (mpg) on the basis
of the other variables. Do your plots suggest that any of the other
variables might be useful in predicting mpg? Justify your answer.
corr_matrix <- cor(auto[, quantitative_vars])
corr_mpg <- sort(corr_matrix["mpg", ], decreasing = TRUE)
print(corr_mpg)
## mpg year origin acceleration cylinders horsepower
## 1.0000000 0.5805410 0.5652088 0.4233285 -0.7776175 -0.7784268
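## displacement weight
## -0.8051269 -0.8322442
Findings:
- weight (-0.83), displacement (-0.81), horsepower (-0.78), and
cylinders (-0.78) are strongly negatively correlated with mpg, so
heavier cars with larger, more powerful engines tend to get lower
mileage.
- year (0.58), origin (0.57), and acceleration (0.42) are moderately
positively correlated with mpg.
- These correlations measure only linear association, and origin is a
coded category, so its coefficient should be read with caution.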
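Since the exercise asks what the plots suggest, scatterplots of mpg against each candidate predictor are a natural companion to the correlations. The sketch below uses the built-in mtcars data purely as a stand-in (an assumption, chosen so the code runs without Auto.csv); the column names wt, disp, hp, and qsec belong to mtcars, not to auto.

```r
# mtcars ships with base R and is used here only as a stand-in for auto
predictors <- c("wt", "disp", "hp", "qsec")

old_par <- par(mfrow = c(2, 2))   # 2 x 2 grid of panels
for (p in predictors) {
  plot(mtcars[[p]], mtcars$mpg, xlab = p, ylab = "mpg",
       main = paste("mpg vs", p))
  # lowess() adds a smooth trend line so curvature is easier to see
  lines(lowess(mtcars[[p]], mtcars$mpg), col = "red")
}
par(old_par)                      # restore the previous layout
```

For the Auto data, the same loop could be run with auto in place of mtcars and predictors such as weight, displacement, horsepower, and year.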
10. This exercise involves the Boston housing data set.
# require() attaches ISLR2 when it is installed; install it only if that fails
if (!require(ISLR2)) {
  install.packages("ISLR2")
  library(ISLR2)
}
## Loading required package: ISLR2
data("Boston")
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio lstat medv
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3  4.98 24.0
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8  9.14 21.6
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8  4.03 34.7
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7  2.94 33.4
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7  5.33 36.2
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7  5.21 28.7
(a) How many rows are in this data set? How many columns? What do
the rows and columns represent?
dim(Boston)
## [1] 506 13
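Interpretation:
- The data set contains 506 rows and 13 columns.
- Each row is one census tract in the Boston area (the ISLR2
documentation calls these suburbs).
- Each column is a housing, socioeconomic, or environmental variable
measured for that tract, such as crim (per capita crime rate), nox
(nitrogen oxide concentration), and medv (median home value).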
(b) Make some pairwise scatterplots of the predictors (columns) in
this data set. Describe your findings.
pairs(Boston, main = "Pairwise Scatterplots of Boston Housing Data")

Findings:
- Several pairs of variables are visibly related. For example, crim
(per capita crime rate) tends to fall as dis (distance to employment
centers) increases.
- tax (property-tax rate) and ptratio (pupil-teacher ratio) tend to
increase together.
(c) Are any of the predictors associated with per capita crime rate?
If so, explain the relationship.
corr_matrix <- cor(Boston)
corr_matrix["crim", ]
## crim zn indus chas nox rm
## 1.00000000 -0.20046922 0.40658341 -0.05589158 0.42097171 -0.21924670
## age dis rad tax ptratio lstat
## 0.35273425 -0.37967009 0.62550515 0.58276431 0.28994558 0.45562148
## medv
## -0.38830461
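Findings:
- The strongest positive correlations with crim are rad (accessibility
to radial highways, 0.63) and tax (property-tax rate, 0.58), followed
by lstat (0.46), nox (nitrogen oxide concentration, 0.42), and indus
(0.41).
- ptratio (pupil-teacher ratio) is only weakly positively correlated
with crim (0.29).
- The strongest negative correlations are with medv (median home value,
-0.39) and dis (distance to employment centers, -0.38).
- chas is essentially uncorrelated with crim (-0.06).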
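Reading a long correlation vector by eye makes it easy to miss the strongest predictors, so it can help to order the coefficients by absolute value. A minimal sketch follows; rank_correlations() is a hypothetical helper written for this illustration, and the built-in mtcars data stands in for Boston so the code runs without ISLR2.

```r
# Hypothetical helper: correlation of every numeric column with `target`,
# ordered from strongest to weakest by absolute value
rank_correlations <- function(df, target, method = "pearson") {
  num <- df[vapply(df, is.numeric, logical(1))]
  r <- cor(num[setdiff(names(num), target)], num[[target]], method = method)
  r <- setNames(as.vector(r), rownames(r))
  r[order(abs(r), decreasing = TRUE)]
}

# Demonstrated on mtcars as a stand-in; the target column here is mpg
ranked <- rank_correlations(mtcars, "mpg")
print(round(ranked, 2))
```

With the Boston data attached, the equivalent call would be rank_correlations(Boston, "crim"). Passing method = "spearman" gives a rank-based alternative that is less sensitive to skewed variables.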
(e) How many of the census tracts in this data set bound the Charles
river?
charles_bound <- sum(Boston$chas == 1)
charles_bound
## [1] 35
Findings:
- 35 of the 506 census tracts (about 6.9%) bound the Charles River.
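Because chas is a 0/1 indicator, the same comparison gives both a count and a proportion. A minimal sketch on a hypothetical indicator vector (not the real Boston column):

```r
# Hypothetical 0/1 indicator: 1 means the tract bounds the river
chas <- c(0, 1, 0, 0, 1, 0, 0, 0)

n_bound    <- sum(chas == 1)    # TRUE counts as 1, so sum() gives the count
prop_bound <- mean(chas == 1)   # mean() of a logical gives the proportion

print(n_bound)
print(prop_bound)
print(table(chas))              # counts for both levels at once
```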