Question 2

(a)

  • Type: Regression (CEO salary is quantitative)
  • Goal: Inference (we want to understand relationships)
  • n (observations): 500 (firms)
  • p (predictors): 3 (profit, number of employees, industry)

(b)

  • Type: Classification (success or failure is qualitative)
  • Goal: Prediction (we want to predict future product success)
  • n: 20 (products)
  • p: 13 (price, marketing budget, competition price, 10 others)

(c)

  • Type: Regression (the response, % change in the USD/Euro exchange rate, is quantitative)
  • Goal: Prediction
  • n: 52 (weeks in 2012)
  • p: 3 (% change in US, British, and German markets)

Question 5

Flexible vs. Less Flexible Approaches

  • Advantages of Flexibility:

    • Can capture complex patterns in the data.
    • May improve prediction accuracy when the true relationship is complex.
  • Disadvantages:

    • Risk of overfitting.
    • Harder to interpret.
    • Requires more data to estimate reliably.
  • Use Flexibility When:

    • The goal is prediction, not inference.
    • The true function is highly non-linear.
    • You have a large dataset.
  • Use Less Flexibility When:

    • The goal is inference.
    • You want interpretable models.
    • The dataset is small or the noise (variance of the error term) is high.
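
The trade-off above can be illustrated with a small simulation. This is only a sketch under assumptions that are not part of the exercise: the true function `f`, the noise level, the sample sizes, and the polynomial degrees are all chosen arbitrarily for illustration.

```r
# Illustrative simulation: rigid vs. flexible polynomial fits on a small,
# noisy training sample drawn from an assumed non-linear truth.
set.seed(1)
f <- function(x) sin(2 * x)               # assumed true function

n_train <- 30
train <- data.frame(x = runif(n_train, 0, 3))
train$y <- f(train$x) + rnorm(n_train, sd = 0.3)
test <- data.frame(x = runif(1000, 0, 3))
test$y <- f(test$x) + rnorm(1000, sd = 0.3)

# Mean squared error of a fitted model on a given data set
mse <- function(fit, data) mean((data$y - predict(fit, newdata = data))^2)

degrees <- c(1, 3, 10)                    # increasing flexibility
results <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(degree = d, train_mse = mse(fit, train), test_mse = mse(fit, test))
}))
print(results)
```

Training MSE can only fall as the degree grows, because each polynomial model contains the lower-degree ones. Test MSE need not follow, which is the overfitting risk listed above.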

Question 6

Parametric vs. Non-Parametric

  • Parametric:

    • Assumes a functional form (e.g., linear regression).

    • Advantages:

      • Simple to interpret.
      • Less data required.
      • Faster to compute.
    • Disadvantages:

      • Model misspecification risk.
      • Less flexible.
  • Non-Parametric:

    • No strict assumption about the form.

    • Advantages:

      • Can model complex patterns.
    • Disadvantages:

      • Requires more data.
      • Harder to interpret.
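
As a rough sketch of the contrast, the code below fits a parametric model (`lm`) and a non-parametric one (`loess`) to the same simulated data. The sine-shaped truth, the noise level, and the `span` value are assumptions made for illustration only.

```r
# Parametric vs. non-parametric fit on simulated data with a non-linear truth.
set.seed(2)
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd = 0.3)        # assumed true relationship plus noise
dat <- data.frame(x = x, y = y)

# Parametric: assumes y = b0 + b1 * x, so only two numbers are estimated
fit_param <- lm(y ~ x, data = dat)

# Non-parametric: local regression, no global functional form assumed
fit_nonpar <- loess(y ~ x, data = dat, span = 0.3)

print(coef(fit_param))                    # easy to interpret: intercept, slope
rss <- c(linear = sum(resid(fit_param)^2),
         loess  = sum(resid(fit_nonpar)^2))
print(rss)                                # training residual sum of squares
```

The linear fit is summarized by two coefficients but is misspecified for this curve. The loess fit follows the curve but has no comparably simple summary.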

Question 8: College Dataset

library(ISLR2)  # provides the College, Auto, and Boston data sets
data(College)
View(College)

(i) Summary

summary(College)
##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00

(ii) Scatterplot Matrix

pairs(College[, 1:10])

(iii) Boxplot: Outstate vs Private

plot(Outstate ~ Private, data = College)

(iv) Create Elite variable and plot

Elite <- rep("No", nrow(College))
Elite[College$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
College <- data.frame(College, Elite)
summary(College$Elite)
##  No Yes 
## 699  78
plot(Outstate ~ Elite, data = College)

(v) Histograms

par(mfrow = c(2, 2))
hist(College$Apps, breaks = 20, main = "Apps")
hist(College$Accept, breaks = 20, main = "Accept")
hist(College$Outstate, breaks = 20, main = "Outstate")
hist(College$PhD, breaks = 20, main = "PhD")

(vi) Data Summary

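  • Elite colleges (more than 50% of new students from the top 10% of their high school class) tend to charge higher out-of-state tuition; only 78 of the 777 colleges are elite.
  • Private colleges generally charge higher out-of-state tuition than public ones.
  • Apps is strongly right-skewed (median 1,558, maximum 48,094), and the percentage of faculty with a PhD varies widely (minimum 8, median 75).
  • The maxima of PhD (103) and Grad.Rate (118) exceed 100, which is impossible for a percentage and suggests data errors.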

Question 9: Auto Dataset

data(Auto)
Auto <- na.omit(Auto)

(a) Qualitative and Quantitative

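  • Quantitative: mpg, cylinders, displacement, horsepower, weight, acceleration, year
  • Qualitative: origin, name (origin is stored as the numeric codes 1 to 3, so it appears in the numeric summaries below even though its range, mean, and SD are not meaningful)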

(b) Range

sapply(Auto[, sapply(Auto, is.numeric)], range)
##       mpg cylinders displacement horsepower weight acceleration year origin
## [1,]  9.0         3           68         46   1613          8.0   70      1
## [2,] 46.6         8          455        230   5140         24.8   82      3

(c) Mean and SD

sapply(Auto[, sapply(Auto, is.numeric)], mean)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##    23.445918     5.471939   194.411990   104.469388  2977.584184    15.541327 
##         year       origin 
##    75.979592     1.576531
sapply(Auto[, sapply(Auto, is.numeric)], sd)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##    7.8050075    1.7057832  104.6440039   38.4911599  849.4025600    2.7588641 
##         year       origin 
##    3.6837365    0.8055182
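
The range, mean, and SD can also be collected into one table. The sketch below uses a hypothetical helper, `describe_numeric`, and runs on the built-in `mtcars` data as a stand-in so that it works without ISLR2. Neither the helper nor `mtcars` is part of the original solution.

```r
# Hypothetical helper: one row per numeric column with min, max, mean, and sd.
describe_numeric <- function(df) {
  num <- df[, sapply(df, is.numeric), drop = FALSE]  # keep numeric columns only
  t(sapply(num, function(col) {
    c(min = min(col), max = max(col), mean = mean(col), sd = sd(col))
  }))
}

# mtcars ships with base R and stands in for Auto here.
stats_table <- describe_numeric(mtcars)
print(round(stats_table, 2))
```

With ISLR2 loaded, `describe_numeric(Auto)` should give the same figures as the separate `sapply` calls, one variable per row.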

(d) Range, Mean, and SD with Rows 10 to 85 Removed

Auto_subset <- Auto[-(10:85), ]
sapply(Auto_subset[, sapply(Auto_subset, is.numeric)], range)
##       mpg cylinders displacement horsepower weight acceleration year origin
## [1,] 11.0         3           68         46   1649          8.5   70      1
## [2,] 46.6         8          455        230   4997         24.8   82      3
sapply(Auto_subset[, sapply(Auto_subset, is.numeric)], mean)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##    24.404430     5.373418   187.240506   100.721519  2935.971519    15.726899 
##         year       origin 
##    77.145570     1.601266
sapply(Auto_subset[, sapply(Auto_subset, is.numeric)], sd)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##     7.867283     1.654179    99.678367    35.708853   811.300208     2.693721 
##         year       origin 
##     3.106217     0.819910

(e) Plots

pairs(Auto[, 1:7])

(f) Predicting mpg

  • mpg has strong negative relationships with weight, horsepower, and displacement: as each of these increases, mpg falls.
  • Lighter, less powerful cars tend to get better mileage, so these three variables look like useful predictors of mpg.

Question 10: Boston Dataset

library(ISLR2)
data("Boston")

(a) Rows and Columns

dim(Boston)  # Rows: 506, Columns: 13
## [1] 506  13
  • Each row: a census tract
  • Each column: a variable about housing, crime, etc.

(b) Scatterplots

pairs(Boston[, 1:6])

(c) Predictors associated with crim

cor(Boston$crim, Boston[, -which(names(Boston) == "crim")])
##              zn     indus        chas       nox         rm       age        dis
## [1,] -0.2004692 0.4065834 -0.05589158 0.4209717 -0.2192467 0.3527343 -0.3796701
##            rad       tax   ptratio     lstat       medv
## [1,] 0.6255051 0.5827643 0.2899456 0.4556215 -0.3883046
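  • crim is most strongly correlated with rad (0.63) and tax (0.58), followed by lstat (0.46), nox (0.42), and indus (0.41).
  • The strongest negative correlations are with medv (-0.39) and dis (-0.38); chas is essentially uncorrelated with crim (-0.06).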
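
Sorting the correlations by absolute value makes the strongest associations easier to spot. The sketch below uses a hypothetical helper, `rank_correlations`, demonstrated on the built-in `mtcars` data so that it runs without ISLR2. The helper name and the stand-in data are assumptions, not part of the original solution.

```r
# Hypothetical helper: correlation of one column with every other numeric
# column, ordered from strongest to weakest by absolute value.
rank_correlations <- function(df, target) {
  num <- df[, sapply(df, is.numeric), drop = FALSE]
  others <- setdiff(names(num), target)
  r <- sapply(others, function(v) cor(num[[target]], num[[v]]))
  r[order(abs(r), decreasing = TRUE)]
}

# mtcars stands in for Boston here.
ranked <- rank_correlations(mtcars, "mpg")
print(round(ranked, 3))
```

With ISLR2 loaded, the equivalent call for this question would be `rank_correlations(Boston, "crim")`.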

(d) High Values

summary(Boston$crim)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00632  0.08204  0.25651  3.61352  3.67708 88.97620
summary(Boston$tax)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   187.0   279.0   330.0   408.2   666.0   711.0
summary(Boston$ptratio)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.60   17.40   19.05   18.46   20.20   22.00
  • crim is highly right-skewed: the median is 0.26 but the maximum is 88.98, so a few tracts have very high crime rates.
  • tax ranges from 187 to 711 with a third quartile of 666, and ptratio ranges from 12.6 to 22.0.

(e) Tracts on Charles River

sum(Boston$chas == 1)
## [1] 35

(f) Median PTRatio

median(Boston$ptratio)
## [1] 19.05

(g) Lowest Median Home Value

Boston[which.min(Boston$medv), ]
##        crim zn indus chas   nox    rm age    dis rad tax ptratio lstat medv
## 399 38.3518  0  18.1    0 0.693 5.453 100 1.4896  24 666    20.2 30.59    5

(h) Rooms per Dwelling

sum(Boston$rm > 7)
## [1] 64
sum(Boston$rm > 8)
## [1] 13
Boston[Boston$rm > 8, ]
##        crim zn indus chas    nox    rm  age    dis rad tax ptratio lstat medv
## 98  0.12083  0  2.89    0 0.4450 8.069 76.0 3.4952   2 276    18.0  4.21 38.7
## 164 1.51902  0 19.58    1 0.6050 8.375 93.9 2.1620   5 403    14.7  3.32 50.0
## 205 0.02009 95  2.68    0 0.4161 8.034 31.9 5.1180   4 224    14.7  2.88 50.0
## 225 0.31533  0  6.20    0 0.5040 8.266 78.3 2.8944   8 307    17.4  4.14 44.8
## 226 0.52693  0  6.20    0 0.5040 8.725 83.0 2.8944   8 307    17.4  4.63 50.0
## 227 0.38214  0  6.20    0 0.5040 8.040 86.5 3.2157   8 307    17.4  3.13 37.6
## 233 0.57529  0  6.20    0 0.5070 8.337 73.3 3.8384   8 307    17.4  2.47 41.7
## 234 0.33147  0  6.20    0 0.5070 8.247 70.4 3.6519   8 307    17.4  3.95 48.3
## 254 0.36894 22  5.86    0 0.4310 8.259  8.4 8.9067   7 330    19.1  3.54 42.8
## 258 0.61154 20  3.97    0 0.6470 8.704 86.9 1.8010   5 264    13.0  5.12 50.0
## 263 0.52014 20  3.97    0 0.6470 8.398 91.5 2.2885   5 264    13.0  5.91 48.8
## 268 0.57834 20  3.97    0 0.5750 8.297 67.0 2.4216   5 264    13.0  7.44 50.0
## 365 3.47428  0 18.10    1 0.7180 8.780 82.9 1.9047  24 666    20.2  5.29 21.9
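  • The 13 tracts averaging more than 8 rooms per dwelling mostly have low crime rates (crim below 1 in 11 of the 13), low lstat (all below 7.5), and high median home values (medv of 37.6 or more in 12 of the 13).
  • Tract 365 is the exception, with crim of 3.47 and medv of 21.9.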