Conceptual

1. For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.

(a) The sample size n is extremely large, and the number of predictors p is small.

Flexible is better: with a very large n, a flexible method can capture the underlying structure without over-fitting, and the small p keeps the variance of the fit manageable.

(b) The number of predictors p is extremely large, and the number of observations n is small.

Inflexible is better: with so few observations and many predictors, a flexible method would over-fit the noise, since there is not enough data to estimate a highly flexible fit reliably.

(c) The relationship between the predictors and response is highly non-linear.

Flexible is better because a flexible method can adapt to the non-linear relationship, whereas an inflexible method would suffer from high bias.

(d) The variance of the error terms, i.e. σ2 = Var(ε), is extremely high.

Inflexible is better, as a flexible model would risk over-fitting to the noise in the error terms rather than the underlying signal.

2. Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.

(a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.

Regression, Inference, n = 500, p = 3

(b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.

Classification, Prediction, n = 20, p = 13

(c) We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.

Regression, Prediction, n = 52, p = 3 (the % change in USD/Euro is the response, so the three market changes are the predictors)

3. We now revisit the bias-variance decomposition.

(a) Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Make sure to label each one.

Bias would start high, and decrease as flexibility increases.

Variance would start low, and increase as flexibility increases.

Training error would start high and decrease monotonically as flexibility increases, approaching zero at the highest levels of flexibility.

Test error would start high, decrease to a minimum at an intermediate level of flexibility, and then increase again to a high value.

Irreducible error is a constant flat line.
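
These five curves can be sketched in R. The functional forms below are made up purely to reproduce the typical shapes described above; they are not derived from any data set.

library(tidyverse)
flexibility <- seq(0, 10, length.out = 200)
curves <- tibble(
  flexibility = flexibility,
  `Squared bias`      = 5 * exp(-0.5 * flexibility),          # high, then decreasing
  Variance            = 0.05 * flexibility^2,                 # low, then increasing
  `Training error`    = 4 * exp(-0.4 * flexibility),          # decreases toward zero
  `Test error`        = 5 * exp(-0.5 * flexibility) +
                        0.05 * flexibility^2 + 1,             # U-shaped: bias + variance + irreducible
  `Irreducible error` = 1                                     # constant
)
curves |>
  pivot_longer(-flexibility, names_to = "curve", values_to = "error") |>
  ggplot(aes(flexibility, error, color = curve)) +
  geom_line() +
  labs(x = "Flexibility", y = "Error")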

(b) Explain why each of the five curves has the shape displayed in part (a).

Bias: an inflexible model will barely adjust to complex data, resulting in high bias. On the other hand, as the flexibility increases, so does its ability to capture a complex trend, thus reducing bias.

Variance: an inflexible model will change very little based on the data, making the variance (variability in predicted values) low. However with a flexible model, the model will change a lot for any change in the data, making the variance high.

Training Error: With a low flexibility model, the model will barely adjust to the data, resulting in high error, even in the training data. As flexibility increases, the model gets closer and closer to the training data, decreasing error. At extreme levels of flexibility the model perfectly matches the training data and has zero training error.

Test Error: With an inflexible model, the trend is barely captured, so test error is high. As flexibility increases the model captures the trend, decreasing test error. At some point the model starts to over-fit, capturing the idiosyncrasies of the training data rather than the underlying trend, which causes test error to increase again.

Irreducible Error: It comes from Var(ε), which no model can reduce, so the curve is a constant horizontal line; expected test error can never fall below it.

4. You will now think of some real-life applications for statistical learning.

(a) Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.

Spam email detection. The email's sender, subject line, and contents are the predictors. The response is whether the email is spam or not. The goal is prediction (classify each incoming email as spam or not spam).

Similarly, detecting fraudulent credit card transactions. The contents of the purchase, the cost, the location, the store, the time, and more are the predictors. The response is whether the transaction is fraudulent or not, and the goal is prediction.

Diagnosis of medical conditions is a classification task. The predictors are any number of health factors, and the goal is inference: understanding how these factors contribute to the response (the presence or absence of a medical condition).

(b) Describe three real-life applications in which regression might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.

Stock trading. The predictors are past price action as well as any other related data, and the response is the estimated stock price. This can be either prediction or inference: we can use past price data to infer how the predictors relate to the price, or to predict the future price.

Similarly, housing price prediction. The predictors might include the size of the house, the location, and the number of bedrooms. The response would be the predicted sale price of the house.

Electricity demand forecasting: The predictors might include the time of day, the day of the week, and the weather. The response would be the predicted electricity demand.

(c) Describe three real-life applications in which cluster analysis might be useful.

Genome analysis. Each gene is a feature, and ideally you could find trends in clusters of similar genes.

Species identification. Predictors might be location, time of activity, behaviors, diet, and other factors, and the clusters should correspond to different species of animals.

Finding the best position for a sport player given their stats. Clusters of players with similar stats would be best suited for similar positions.

5. What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?

A very flexible approach will be much more sensitive to trends in the training data compared to a less flexible approach. This will mean lower training error, and possibly lower test error (as long as you don’t over-fit). Inflexible models may not reach the same ceiling as flexible models, but also are much less likely to over-fit. Flexible models are preferred when you have lots of observations and a complex trend. Inflexible models are preferred when you have low observations or assume a simple trend, or want to avoid over-fitting.

6. Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a nonparametric approach)? What are its disadvantages?

Parametric approaches make assumptions about the underlying relationship between the inputs and outputs and use a fixed number of parameters to define this relationship (e.g., simple linear regression, whose parameters are the slope and intercept). Non-parametric approaches do not make such assumptions and instead use their flexibility to adapt to many types of relationships (e.g., a decision tree). Parametric approaches are easier to interpret (such as an easy-to-visualize line), whereas a non-parametric model could be a complex tree that is very hard to understand. The disadvantage is that if the assumed form of the relationship is incorrect, the parametric model may fit the data poorly.
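
A small illustration of the contrast, as a sketch on simulated data (the sine relationship and noise level here are made up): the parametric fit commits to two parameters, while the non-parametric smoother lets the data determine the shape.

set.seed(1)
sim <- tibble(
  x = runif(200, 0, 10),
  y = sin(x) + rnorm(200, sd = 0.3)
)
# Parametric: assume y = b0 + b1 * x, so only two numbers are estimated
lm(y ~ x, data = sim) |> coef()
# Non-parametric: loess makes no fixed-form assumption and follows the curve
ggplot(sim, aes(x, y)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_smooth(method = "loess", se = FALSE, color = "red")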

7. The table below provides a training data set containing six observations, three predictors, and one qualitative response variable.

Observation X1 X2 X3 Y
1 0 3 0 Red
2 2 0 0 Red
3 0 1 3 Red
4 0 1 2 Green
5 -1 0 1 Green
6 1 1 1 Red

Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using K-nearest neighbors.

(a) Compute the Euclidean distance between each observation and the test point, X1 = X2 = X3 = 0.

Observation Distance to (0, 0, 0)
1 3
2 2
3 3.162278
4 2.236068
5 1.414214
6 1.732051
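
These distances can be verified in R. The tibble below is typed in by hand from the table above (a minimal sketch, not data shipped with ISLR), assuming the tidyverse is attached.

train <- tibble(
  obs = 1:6,
  X1  = c(0, 2, 0, 0, -1, 1),
  X2  = c(3, 0, 1, 1, 0, 1),
  X3  = c(0, 0, 3, 2, 1, 1),
  Y   = c("Red", "Red", "Red", "Green", "Green", "Red")
)
# Euclidean distance from each observation to the test point (0, 0, 0)
train |>
  mutate(dist = sqrt(X1^2 + X2^2 + X3^2)) |>
  arrange(dist)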

(b) What is our prediction with K = 1? Why?

Green. When K = 1, we only look at the closest neighbor, in this case observation 5. It has Y = Green, so our prediction is Green.

(c) What is our prediction with K = 3? Why?

Red. When K = 3, we look at the 3 closest neighbors, in this case observations 5, 6, and 2. They have Y = Green, Red, and Red respectively. Because there are more Reds than Greens (2:1), we predict Red.
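
Using the hand-entered `train` tibble from the sketch above, the K = 3 vote can be tallied directly:

train |>
  mutate(dist = sqrt(X1^2 + X2^2 + X3^2)) |>
  slice_min(dist, n = 3) |>
  count(Y)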

(d) If the Bayes decision boundary in this problem is highly nonlinear, then would we expect the best value for K to be large or small? Why?

Small. A small K gives a flexible decision boundary that can follow a highly non-linear Bayes boundary; a large K averages over many neighbors, smoothing out the non-linearity and producing a more biased, less accurate boundary.

Applied

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ISLR)
library(moderndive)
library(skimr)

8.

glimpse(College)
## Rows: 777
## Columns: 18
## $ Private     <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes…
## $ Apps        <dbl> 1660, 2186, 1428, 417, 193, 587, 353, 1899, 1038, 582, 173…
## $ Accept      <dbl> 1232, 1924, 1097, 349, 146, 479, 340, 1720, 839, 498, 1425…
## $ Enroll      <dbl> 721, 512, 336, 137, 55, 158, 103, 489, 227, 172, 472, 484,…
## $ Top10perc   <dbl> 23, 16, 22, 60, 16, 38, 17, 37, 30, 21, 37, 44, 38, 44, 23…
## $ Top25perc   <dbl> 52, 29, 50, 89, 44, 62, 45, 68, 63, 44, 75, 77, 64, 73, 46…
## $ F.Undergrad <dbl> 2885, 2683, 1036, 510, 249, 678, 416, 1594, 973, 799, 1830…
## $ P.Undergrad <dbl> 537, 1227, 99, 63, 869, 41, 230, 32, 306, 78, 110, 44, 638…
## $ Outstate    <dbl> 7440, 12280, 11250, 12960, 7560, 13500, 13290, 13868, 1559…
## $ Room.Board  <dbl> 3300, 6450, 3750, 5450, 4120, 3335, 5720, 4826, 4400, 3380…
## $ Books       <dbl> 450, 750, 400, 450, 800, 500, 500, 450, 300, 660, 500, 400…
## $ Personal    <dbl> 2200, 1500, 1165, 875, 1500, 675, 1500, 850, 500, 1800, 60…
## $ PhD         <dbl> 70, 29, 53, 92, 76, 67, 90, 89, 79, 40, 82, 73, 60, 79, 36…
## $ Terminal    <dbl> 78, 30, 66, 97, 72, 73, 93, 100, 84, 41, 88, 91, 84, 87, 6…
## $ S.F.Ratio   <dbl> 18.1, 12.2, 12.9, 7.7, 11.9, 9.4, 11.5, 13.7, 11.3, 11.5, …
## $ perc.alumni <dbl> 12, 16, 30, 37, 2, 11, 26, 37, 23, 15, 31, 41, 21, 32, 26,…
## $ Expend      <dbl> 7041, 10527, 8735, 19016, 10922, 9727, 8861, 11487, 11644,…
## $ Grad.Rate   <dbl> 60, 56, 54, 59, 15, 55, 63, 73, 80, 52, 73, 76, 74, 68, 55…

(c)

iii.

ggplot(College, aes(y = Outstate)) +
  geom_boxplot() +
  facet_wrap(~Private)
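
An equivalent way to get side-by-side boxplots in a single panel is to map Private to the x aesthetic:

ggplot(College, aes(Private, Outstate)) +
  geom_boxplot()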

iv.

College |> 
  mutate(elite = Top10perc > 50) |> 
  group_by(elite) -> df_elite
df_elite |> count()
ggplot(df_elite, aes(y = Outstate)) +
  geom_boxplot() +
  facet_wrap(~elite)

v.

ggplot(College, aes(PhD)) +
  geom_histogram(binwidth = 5)

ggplot(College, aes(S.F.Ratio)) +
  geom_histogram(binwidth = 1)

ggplot(College, aes(perc.alumni)) +
  geom_histogram(binwidth = 4) +
  facet_grid(rows = vars(Private))

vi.

Question: Are Private colleges (on average) harder to get in to?

College |> 
  ggplot(aes(Apps, Accept, color = Private)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

College |>
  group_by(Private) |>
  summarise(avgTopTen = mean(Top10perc)) |>
ggplot(aes(Private, avgTopTen)) +
  geom_col()

College |>
  group_by(Private) |>
  summarise(avgTop25 = mean(Top25perc)) |>
ggplot(aes(Private, avgTop25)) +
  geom_col()

I conclude that in general Private schools are harder to get into. In the first plot the ratio of Accept to Apps (y to x) is the acceptance rate. We can see that the private school smooth has a lower slope, representing a lower average acceptance rate. The following two plots show that, on average, Private schools take more 'high level' (top 10% and top 25%) students. Taken together, these support the view that Private schools are harder to get into.
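
The acceptance-rate claim can also be checked numerically (a quick sketch; `accept_rate` is a variable created here, not a column in the data set):

College |>
  mutate(accept_rate = Accept / Apps) |>
  group_by(Private) |>
  summarise(mean_accept_rate = mean(accept_rate))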

9.

(a)

glimpse(Auto)
## Rows: 392
## Columns: 9
## $ mpg          <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 2…
## $ cylinders    <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, 4, …
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383, 34…
## $ horsepower   <dbl> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 16…
## $ weight       <dbl> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 385…
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8.5, …
## $ year         <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 7…
## $ origin       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 3, …
## $ name         <fct> chevrolet chevelle malibu, buick skylark 320, plymouth sa…

Quantitative: mpg, cylinders, displacement, horsepower, weight, acceleration, and year (cylinders is numeric but takes only a few distinct values, so it could also be treated as qualitative). Qualitative: origin and name.

(b)

Auto |> 
  select(-origin, -name) |> 
  skim() |> 
  group_by(skim_variable) |> 
  summarise(range = numeric.p100 - numeric.p0)
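
The same ranges can be computed without skimr, using only dplyr (an alternative sketch):

Auto |>
  select(-origin, -name) |>
  summarise(across(everything(), ~ max(.x) - min(.x)))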

(c)

Auto |> 
  select(-origin, -name) |> 
  skim() |>
  select(skim_variable, numeric.mean, numeric.sd)

(d)

Auto |> 
  mutate(id = row_number()) |>
  filter(id < 10) -> x
Auto |> 
  mutate(id = row_number()) |>
  filter(id > 85) -> y

bind_rows(x, y) |> 
  select(-origin, -name) |> 
  skim() |>
  select(skim_variable, numeric.mean, numeric.sd)
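
The same subset (observations 10 through 85 removed) can be produced more directly with slice(), assuming the row order matches the observation numbering:

Auto |>
  slice(-(10:85)) |>
  select(-origin, -name) |>
  skim() |>
  select(skim_variable, numeric.mean, numeric.sd)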

(e)

Auto |> group_by(cylinders) |> mutate(avgMpg = mean(mpg)) |>
ggplot(aes(cylinders, avgMpg)) +
  geom_col()

ggplot(Auto, aes(horsepower, displacement)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(Auto, aes(horsepower, acceleration)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

We can see that 4-cylinder engines are the most efficient. We also see that horsepower is strongly positively correlated with displacement, and negatively correlated with acceleration.
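
A quick numeric check of these relationships, computed directly from Auto:

Auto |>
  summarise(cor_hp_disp  = cor(horsepower, displacement),
            cor_hp_accel = cor(horsepower, acceleration))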

(f)

I think that year, displacement, and weight will be useful in predicting mpg. Year, because cars tend to get more efficient as technology improves over time, so a 2010 car will almost certainly have a higher mpg than a 1960 car. Displacement, because larger engines are more powerful, but at the cost of efficiency. Finally, weight, because a heavier car requires more power to travel the same distance as a lighter car, hurting mpg.
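
A quick way to check these hunches against the data is a correlation matrix for mpg and the candidate predictors:

Auto |>
  select(mpg, year, displacement, weight) |>
  cor() |>
  round(2)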