Assignment 1

Question 2

(a)

Type: Regression (CEO salary is quantitative)
Goal: Inference (we want to understand which factors affect CEO salary)
n (observations): 500 (firms)
p (predictors): 3 (profit, number of employees, industry)

(b)

Type: Classification (success or failure is qualitative)
Goal: Prediction (we want to predict future product success)
n: 20 (products)
p: 13 (price, marketing budget, competition price, 10 others)

(c)

Type: Regression (predicting % change in exchange rate)
Goal: Prediction
n: 52 (weeks in 2012)
p: 3 (% change in US, British, and German markets)

Question 5

Flexible vs. Less Flexible Approaches

Advantages of Flexibility:
- Can capture complex patterns in the data.
- May improve prediction accuracy when the true relationship is complex.
Disadvantages:
- Risk of overfitting.
- Harder to interpret.
- Requires more data to estimate reliably.
Use Flexibility When:
- The goal is prediction, not inference.
- The true function is highly non-linear.
- You have a large dataset.
Use Less Flexibility When:
- The goal is inference.
- You want interpretable models.
- The dataset is small or the noise (variance of the error terms) is high.
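
A small simulation can make the overfitting risk concrete. The sketch below is only an illustration under assumed settings: the sine-shaped true function, the noise level, the sample sizes, and the polynomial degrees 1 and 10 are arbitrary choices, not part of the assignment. It fits a rigid and a flexible polynomial to the same small training set and compares training and test error.

```r
# Illustrative simulation; the true function, noise level, and degrees are assumptions.
set.seed(1)

true_f <- function(x) sin(2 * x)

# Small noisy training set and a larger test set drawn from the same process
train <- data.frame(x = runif(30, 0, 3))
train$y <- true_f(train$x) + rnorm(30, sd = 0.3)
test <- data.frame(x = runif(1000, 0, 3))
test$y <- true_f(test$x) + rnorm(1000, sd = 0.3)

# Mean squared error of a fitted model on a given data frame
mse <- function(fit, data) mean((data$y - predict(fit, newdata = data))^2)

fit_rigid    <- lm(y ~ poly(x, 1), data = train)   # less flexible
fit_flexible <- lm(y ~ poly(x, 10), data = train)  # more flexible

results <- data.frame(
  model     = c("degree 1", "degree 10"),
  train_mse = c(mse(fit_rigid, train), mse(fit_flexible, train)),
  test_mse  = c(mse(fit_rigid, test), mse(fit_flexible, test))
)
print(results)
```

Because the two models are nested, the degree-10 fit cannot have a larger training error than the degree-1 fit. The gap between each model's training and test error is the quantity to watch.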

Question 6

Parametric vs. Non-Parametric

Parametric:
- Assumes a functional form (e.g., linear regression).
- Advantages:
  - Simple to interpret.
  - Less data required.
  - Faster to compute.
- Disadvantages:
  - Model misspecification risk.
  - Less flexible.
Non-Parametric:
- Makes no explicit assumption about the functional form.
- Advantages:
  - Can model complex patterns.
- Disadvantages:
  - Requires more data.
  - Harder to interpret.
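
The trade-off can be sketched on simulated data. In the example below, the sine-shaped truth, the noise level, and the use of `loess` with its default settings are assumptions chosen for illustration. `lm` stands in for a parametric method and `loess` for a non-parametric one.

```r
# Illustrative comparison; the sine-shaped truth and the loess defaults are assumptions.
set.seed(2)

n <- 200
sim <- data.frame(x = runif(n, 0, 2 * pi))
sim$y <- sin(sim$x) + rnorm(n, sd = 0.3)

# Parametric: assumes y is linear in x, so only two coefficients are estimated
fit_param <- lm(y ~ x, data = sim)

# Non-parametric: local regression, no global functional form assumed
fit_nonparam <- loess(y ~ x, data = sim)

train_mse <- c(
  linear = mean(residuals(fit_param)^2),
  loess  = mean(residuals(fit_nonparam)^2)
)
print(train_mse)
print(coef(fit_param))  # the parametric fit is summarized by two numbers
```

The linear model is misspecified here, so its error stays high however much data is collected. The local fit tracks the curve, but it has no small set of coefficients to interpret.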

Question 8: College Dataset

library(ISLR2)  # College ships with ISLR2, the package loaded for Question 10
data(College)
View(College)

(i) Summary

summary(College)

##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00

(ii) Scatterplot Matrix

pairs(College[, 1:10])

(iii) Boxplot: Outstate vs Private

plot(Outstate ~ Private, data = College)

(iv) Create Elite variable and plot

Elite <- rep("No", nrow(College))
Elite[College$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
College <- data.frame(College, Elite)
summary(College$Elite)

##  No Yes 
## 699  78

plot(Outstate ~ Elite, data = College)

(v) Histograms

par(mfrow = c(2, 2))
hist(College$Apps, breaks = 20, main = "Apps")
hist(College$Accept, breaks = 20, main = "Accept")
hist(College$Outstate, breaks = 20, main = "Outstate")
hist(College$PhD, breaks = 20, main = "PhD")

(vi) Data Summary

Elite colleges tend to have higher Outstate tuition.
Private colleges usually charge higher out-of-state tuition than public ones.
Applications vary widely, and there are significant differences in PhD proportions among faculty.
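
The group comparisons read off the boxplots can also be checked numerically. The helper below is a hypothetical sketch (`group_medians` is an illustrative name, and the toy values are made up); it computes the median of one column within each level of a grouping factor.

```r
# Hypothetical helper; `group_medians` is not part of ISLR2 or base R.
group_medians <- function(df, value, group) {
  tapply(df[[value]], df[[group]], median)
}

# Toy data standing in for College (values are made up for illustration)
toy <- data.frame(
  Outstate = c(5000, 7000, 12000, 15000, 18000, 9000),
  Private  = factor(c("No", "No", "Yes", "Yes", "Yes", "No"))
)
toy_medians <- group_medians(toy, "Outstate", "Private")
print(toy_medians)

# With the assignment's data loaded, the same calls would be:
# group_medians(College, "Outstate", "Private")
# group_medians(College, "Outstate", "Elite")
```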

Question 9: Auto Dataset

data(Auto)
Auto <- na.omit(Auto)

(a) Qualitative and Quantitative

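Quantitative: mpg, cylinders, displacement, horsepower, weight, acceleration, year
Qualitative: origin, name (origin is stored as a numeric code, so it still appears in the numeric summaries in parts (b) to (d))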
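
Because origin is stored as a number, a filter based on `is.numeric` treats it as quantitative. One way to handle this, sketched below with a hypothetical helper (`quantitative_columns` is an illustrative name, and the toy rows are made up), is to exclude known qualitative columns by name and recode origin as a factor. The labels follow the ISLR2 documentation for Auto (1 = American, 2 = European, 3 = Japanese).

```r
# Hypothetical helper; `quantitative_columns` is an illustrative name.
quantitative_columns <- function(df, qualitative = character()) {
  is_num <- vapply(df, is.numeric, logical(1))
  names(df)[is_num & !(names(df) %in% qualitative)]
}

# Toy rows standing in for Auto (values are made up for illustration)
toy_auto <- data.frame(
  mpg = c(18, 27.5, 33),
  cylinders = c(8, 4, 4),
  origin = c(1, 2, 3),
  name = c("car a", "car b", "car c")
)

quant <- quantitative_columns(toy_auto, qualitative = "origin")
print(quant)

# Recoding the numeric code as a factor makes the qualitative status explicit
toy_auto$origin <- factor(toy_auto$origin, levels = 1:3,
                          labels = c("American", "European", "Japanese"))
print(sapply(toy_auto, class))

# With the assignment's data loaded, the same call would be:
# sapply(Auto[, quantitative_columns(Auto, "origin")], range)
```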

(b) Range

sapply(Auto[, sapply(Auto, is.numeric)], range)

##       mpg cylinders displacement horsepower weight acceleration year origin
## [1,]  9.0         3           68         46   1613          8.0   70      1
## [2,] 46.6         8          455        230   5140         24.8   82      3

(c) Mean and SD

sapply(Auto[, sapply(Auto, is.numeric)], mean)

##          mpg    cylinders displacement   horsepower       weight acceleration 
##    23.445918     5.471939   194.411990   104.469388  2977.584184    15.541327 
##         year       origin 
##    75.979592     1.576531

sapply(Auto[, sapply(Auto, is.numeric)], sd)

##          mpg    cylinders displacement   horsepower       weight acceleration 
##    7.8050075    1.7057832  104.6440039   38.4911599  849.4025600    2.7588641 
##         year       origin 
##    3.6837365    0.8055182

(d) Subset 10:85 removed

Auto_subset <- Auto[-(10:85), ]
sapply(Auto_subset[, sapply(Auto_subset, is.numeric)], range)

##       mpg cylinders displacement horsepower weight acceleration year origin
## [1,] 11.0         3           68         46   1649          8.5   70      1
## [2,] 46.6         8          455        230   4997         24.8   82      3

sapply(Auto_subset[, sapply(Auto_subset, is.numeric)], mean)

##          mpg    cylinders displacement   horsepower       weight acceleration 
##    24.404430     5.373418   187.240506   100.721519  2935.971519    15.726899 
##         year       origin 
##    77.145570     1.601266

sapply(Auto_subset[, sapply(Auto_subset, is.numeric)], sd)

##          mpg    cylinders displacement   horsepower       weight acceleration 
##     7.867283     1.654179    99.678367    35.708853   811.300208     2.693721 
##         year       origin 
##     3.106217     0.819910

(e) Plots

pairs(Auto[, 1:7])

(f) Predicting `mpg`

mpg has a strong inverse relationship with weight, horsepower, and displacement, so these variables are useful predictors of mpg.
Lighter, less powerful cars tend to have better mileage.

Question 10: Boston Dataset

library(ISLR2)
data("Boston")

(a) Rows and Columns

dim(Boston)  # Rows: 506, Columns: 13

## [1] 506  13

Each row: a census tract
Each column: a variable about housing, crime, etc.

(b) Scatterplots

pairs(Boston[, 1:6])

(c) Predictors associated with `crim`

cor(Boston$crim, Boston[, -which(names(Boston) == "crim")])

##              zn     indus        chas       nox         rm       age        dis
## [1,] -0.2004692 0.4065834 -0.05589158 0.4209717 -0.2192467 0.3527343 -0.3796701
##            rad       tax   ptratio     lstat       medv
## [1,] 0.6255051 0.5827643 0.2899456 0.4556215 -0.3883046

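crim is most strongly positively correlated with rad (0.63) and tax (0.58), followed by lstat (0.46), nox (0.42), and indus (0.41).
It is negatively correlated with medv (-0.39) and dis (-0.38), and shows almost no linear association with chas (-0.06).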
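
Sorting the correlations by absolute value makes the strongest associations easier to spot than scanning the printed matrix. The helper below is a hypothetical sketch (`rank_correlations` is an illustrative name, and the toy data are simulated), assuming every column of the data frame is numeric.

```r
# Hypothetical helper; `rank_correlations` is an illustrative name.
rank_correlations <- function(df, target) {
  predictors <- setdiff(names(df), target)
  r <- vapply(predictors, function(p) cor(df[[target]], df[[p]]), numeric(1))
  r[order(abs(r), decreasing = TRUE)]
}

# Toy data (simulated) with one strong positive, one strong negative, one weak predictor
set.seed(3)
toy <- data.frame(y = rnorm(100))
toy$strong_pos <- toy$y + rnorm(100, sd = 0.2)
toy$strong_neg <- -toy$y + rnorm(100, sd = 0.5)
toy$weak <- rnorm(100)
ranked <- rank_correlations(toy, "y")
print(round(ranked, 2))

# With the assignment's data loaded, the same call would be:
# rank_correlations(Boston, "crim")
```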

(d) High Values

summary(Boston$crim)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00632  0.08204  0.25651  3.61352  3.67708 88.97620

summary(Boston$tax)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   187.0   279.0   330.0   408.2   666.0   711.0

summary(Boston$ptratio)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
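##   12.60   17.40   19.05   18.46   20.20   22.00

crim is heavily right-skewed: the median is 0.26 but the maximum is 88.98, so a few tracts have very high crime rates.
tax ranges from 187 to 711, and its third quartile (666) is close to the maximum, so at least a quarter of tracts sit at the high end.
ptratio spans a narrow range (12.6 to 22.0) with no extreme high values.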
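
One way to put a number on "particularly high" is the boxplot convention of flagging values above the third quartile plus 1.5 times the interquartile range. The sketch below uses that convention as an assumption, not as the assignment's definition. `high_outliers` is a hypothetical helper and the toy vector is made up.

```r
# Hypothetical helper; the 1.5 * IQR cutoff is a common convention, not the assignment's rule.
high_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  cutoff <- q[2] + k * (q[2] - q[1])
  list(cutoff = cutoff, n_high = sum(x > cutoff))
}

# Toy vector (made up): mostly small values with two extreme ones
toy_rate <- c(0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.6, 25, 80)
res <- high_outliers(toy_rate)
print(res)

# With the assignment's data loaded, the same calls would be:
# high_outliers(Boston$crim)
# high_outliers(Boston$tax)
# high_outliers(Boston$ptratio)
```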

(e) Tracts on Charles River

sum(Boston$chas == 1)

## [1] 35

(f) Median PTRatio

median(Boston$ptratio)

## [1] 19.05

(g) Lowest Median Home Value

Boston[which.min(Boston$medv), ]

##        crim zn indus chas   nox    rm age    dis rad tax ptratio lstat medv
## 399 38.3518  0  18.1    0 0.693 5.453 100 1.4896  24 666    20.2 30.59    5

(h) Rooms per Dwelling

sum(Boston$rm > 7)

## [1] 64

sum(Boston$rm > 8)

## [1] 13

Boston[Boston$rm > 8, ]

##        crim zn indus chas    nox    rm  age    dis rad tax ptratio lstat medv
## 98  0.12083  0  2.89    0 0.4450 8.069 76.0 3.4952   2 276    18.0  4.21 38.7
## 164 1.51902  0 19.58    1 0.6050 8.375 93.9 2.1620   5 403    14.7  3.32 50.0
## 205 0.02009 95  2.68    0 0.4161 8.034 31.9 5.1180   4 224    14.7  2.88 50.0
## 225 0.31533  0  6.20    0 0.5040 8.266 78.3 2.8944   8 307    17.4  4.14 44.8
## 226 0.52693  0  6.20    0 0.5040 8.725 83.0 2.8944   8 307    17.4  4.63 50.0
## 227 0.38214  0  6.20    0 0.5040 8.040 86.5 3.2157   8 307    17.4  3.13 37.6
## 233 0.57529  0  6.20    0 0.5070 8.337 73.3 3.8384   8 307    17.4  2.47 41.7
## 234 0.33147  0  6.20    0 0.5070 8.247 70.4 3.6519   8 307    17.4  3.95 48.3
## 254 0.36894 22  5.86    0 0.4310 8.259  8.4 8.9067   7 330    19.1  3.54 42.8
## 258 0.61154 20  3.97    0 0.6470 8.704 86.9 1.8010   5 264    13.0  5.12 50.0
## 263 0.52014 20  3.97    0 0.6470 8.398 91.5 2.2885   5 264    13.0  5.91 48.8
## 268 0.57834 20  3.97    0 0.5750 8.297 67.0 2.4216   5 264    13.0  7.44 50.0
## 365 3.47428  0 18.10    1 0.7180 8.780 82.9 1.9047  24 666    20.2  5.29 21.9

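The 13 census tracts averaging more than 8 rooms per dwelling tend to have low crime (all but one have crim below 1.6), low lstat (all below 7.5), and high median home values (all but one have medv of at least 37.6, five of them at 50.0).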

2025-08-04
