Introduction to Statistical Learning with Applications in R (ISLR)
1 Introduction
2 Statistical Learning
3 Linear Regression

Introduction to Statistical Learning with Applications in R (ISLR)

This is one of the most important books on statistical analysis with R. The goal of this document is to extract relevant details from it for study, and future reference.

Book website: https://hastie.su.domains/ISLR2/ISLRv2_website.pdf

The book is divided on the following chapters:

Introduction
Statistical Learning

2.1   What is Statistical Learning

2.2   Assessing Model Accuracy

2.3   Lab: Introduction to R

2.4   Exercises

Linear Regression

3.1   Simple Linear Regression

3.2   Multiple Linear Regression

3.3   Other Considerations in the Regression Model

3.4   The Marketing Plan

3.5   Comparison of Linear Regression with K-Nearest Neighbours

3.6   Lab: Linear Regression

3.7   Exercises

Classification

4.1   An Overview of Classification

4.2   Why Not Linear Regression?

4.3   Logistic Regression

4.4   Generative Models for Classification

4.5   A Comparison of Classification Methods

4.6   Generalized Linear Models

4.7   Lab: Classification Methods

4.8   Exercises

Ressampling Methods

5.1   Cross-Validation

5.2   The Bootstrap

5.3   Lab: Cross-Validation and the Bootstrap

5.4   Exercises

Linear Model Selection and Regularisation

6.1   Subset Selection

6.2   Shrinkage Methods

6.3   Dimension Reduction Methods

6.4   Considerations in High Dimensions

6.5   Lab: Linear Models and Regularisation Methods

6.6   Exercises

Moving Beyond Linearity

7.1   Polynomial Regression

7.2   Step Functions

7.3   Basis Functions

7.4   Regression Splines

7.5   Smoothing Splines

7.6   Local Regression

7.7   Generalised Additive Models

7.8   Lab: Non-linear Modeling

7.9   Exercises

Tree-Based Methods

8.1   The Basics of Decision Trees

8.2   Bagging, Random Forests, Boosting, and Bayesian Additive Regression Trees

8.3   Lab: Decision Trees

8.4   Exercises

Support Vector Machines

9.1   Maximal Margin Classifier

9.2   Support Vector Classifiers

9.3   Support Vector Machines

9.4   SVMs with More than Two Classes

9.5   Relationship to Logistic Regression

9.6   Lab: Support Vector Machines

9.7   Exercises

Deep Learning

10.1  Single Layer Neural Networks

10.2  Multilayer Neural Networks

10.3  Convolutional Neural Networks

10.4  Document Classification

10.5  Recurrent Neural Networks

10.6  When to Use Deep Learning

10.7  Fitting a Neural Network

10.8  Interpolation and Double Descent

10.9  Lab: Deep Learning

10.10 Exercises

Survival Analysis and Censored Data

11.1  Survival and Censoring Times

11.2  A Closer Look at Censoring

11.3  The Kaplan-Meier Survival Curve

11.4  The Log-Rank Test

11.5  Regression Models With a Survival Response

11.6  Shrinkage for the Cox Model

11.7  Additional Topics

11.8  Lab: Survival Analysis

11.9  Exercises

Unsupervised Learning

12.1  The Challenge of Unsupervised Learning

12.2  Principal Components Analysis

12.3  Missing Values and Matrix Completion

12.4  Clustering Methods

12.5  Lab: Unsupervised Learning

12.6 Exercises

Multiple Testing

13.1  A Quick Review of Hypothesis Testing

13.2  The Challenge of Multiple Testing

13.3  The Family-Wise Error Rate

13.4  The False Discovery Rate

13.5  A Re-Sampling Approach to p-values and False Discovery

13.6  Lab: Multiple Testing

13.7  Exercises

1 Introduction

The book begins setting a difference between supervised or unsupervised statistical learning: Supervised statistical learning - to build a statistical model for predicting or estimating an output based on one or more inputs. Unsupervised statistical learning - to build a statistical model to learn relationships and structure from the date with inputs but without supervised outputs.

The book follows two examples of supervised statistical learning using two datasets, the Wage data, and the Smarket data. On the two examples, the output data was part of the dataset and the goal was to predict (1st example) and classify (2nd example) something.

Then, it is presented a third example, using the NCI60 dataset. The goal was to not to predict an output, but rather understand the data. It used a clustering approach to assert whether there were identifiable groups.

Finally, the book covers what it will be dealt with in the following chapters.

2 Statistical Learning

2.1 What is Statistical Learning?

This chapter begins with a typical prediction case. Using the Advertising data set, it is intended to predict the output variable, sales, using the input variables, the advertisement on different markets, TV, radio, and newspaper.

With this example, it is presented the general case for a prediction formula: $Y = f(X) + \epsilon$ being Y the quantitative response we want to assess; $f(X)$ an unknown function of $X_{1}, X_{2}, ..., X_{p}$, which $p$ is the number of input variables; and e a random error independent of $X$ with a mean equal to zero. The goal of $f(X)$ is to provide systematic information that $X$ gives to $Y$.

2.1.1 Why Estimate $f()$?

It is outlined that there are two reasons to estimate $f()$: to predict something or to infer something.

Prediction - understand $Y$

Used when a set of input variables $X_{1}, X_{2}, ..., X_{p}$ is available, but the output $Y$ is not easily available. As the error e averages to zero, it is used the following formula: $Y^ = f^(X)$ It is important to notice that this method is treated as a black box, e.g. we are not concerned with the actual value of $f^()$ if it yields and accurate value of $X$. Additionally, the accuracy of the model will ultimately depend on two figures: the reducible error and the irreducible error.

Reducible error: associated with the natural inaccuracy of $f^()$, and the goal is to improve $\hat{f}()$ to better predict $Y$. It is defined by the squared difference between $f(X)$ and $\hat{(X)}$.
Irreducible error: associated with the variance of the error $\epsilon$. It can mean many things, including input variables not evaluated by the model and other non-measurable factors. It is defined by the variance of $\epsilon$, $Var(\epsilon)$. The goal of the book is to minimize the reducible error. As for the irreducible error, it is important to notice that it will always get in the way of the statistical models and it is almost always unknown in practice.

Inference - understand $f(X)$

By contrast, this approach is not trying to make predictions of the value of $Y$, but rather assert the actual value of $f(X)$. Thus, the black box approach used by the prediction method, $\hat{Y} = \hat{f}(X)$, cannot be used. The inference method might be interested in answering the following questions:

Which predictors are associated with the response? To understand the important predictors.
What is the relationship between the response and each predictor? To understand positive or negative relationships between predictors and Y.
Can the relationship between $Y$ and each predictor be adequately summarised using a linear equation, or is the relationship more complicated? To understand whether the traditional linear approach is suitable.

Depending on the model of choice, the goal will be of prediction, inferring, or a mix of the two.

2.1.2 How Do We Estimate $f()$?

There are many techniques that will estimate $f()$. However, they all will try to gather training data points which will be inserted into a statistical learning method. Thus, the goal is to find a function $\hat{f}(X)$ that will get closer to the actual value of $Y$ for any observation $(X, Y)$. Most statistical learning methods can be thus classified in parametric or non-parametric:

Parametric - two-step model-based approach

To make an assumption of what would be the shape or form of $f()$. For instance, one might assume that $f()$ is a linear formula.
To use the training data to fit or train a model. That is to estimate the parameters $B_{o}, B_{1}, ..., B_{p}$ such that

$Y \sim B_{o} + B_{1}X_{1} + B_{2}X_{2} + ... + B_{p}X_{p}$

This approach is called parametric because the goal is to estimate the parameters $B_{o}, B_{1}, ..., B_{p}$. It is widely used because it is more easy to estimate the parameters than it is to estimate $f()$. The main issues are that the model used might not fit to the actual value of $f()$, and that the parameters could be exaggerated and the model might follow the noise too closely, thus creating the overfitting problem.

Non-parametric - estimate $f()$ that gets as close to data points as possible

No assumption of $f()$ is made, thus there is great advantage towards parametric approaches. However, to succeed it is necessary to have a large amount of data. Additionally, it is necessary to define the smoothness of the fit, which will determine how closely the estimated $f()$ follows data points. A overly smooth decision might lead to an overfitting problem.

2.1.3 The Trade-Off Between Prediction Accuracy and Model Interpretability

Sometimes it is necessary to choose an approach that might be counter-intuitive. There is something called the trade-off between flexibility and interpretability. It can be the case that some more rigid models might be better than some flexible ones, even if it means that it is less accurate. The reason is because there are times when to be able to interpret a result will be more important than the accuracy. This will be the case for inference issues, where a more restrictive model will explain the results better. The flexible approach, even though potentially more accurate, will account for multiple variables, thus complicating the interpretation.

However, when the goal is prediction, the interpretability of the model will not be of interest. In those cases, the accuracy will be the most important figure, and the use of a more flexible approach is encouraged.

Additionally, it is worth mentioning that overly flexible models can end up in overfitting. Therefore, sometimes a more inflexible model in a prediction issue will be more accurate than the more flexible counterpart.

2.1.4 Supervised Versus Unsupervised Learning

The goal of most models is to use available data to predict or to infer something. This will be the case of many different approaches, including linear models, logistic regressions, etc. These models need a supervision, which is to input response data for a given number of predictors. The model will learn what is the expected response and thus it will be possible to predict or infer something.

However, in the unsupervised setting there are no inputs for responses. There are only data about the predictor variables. Hence, it is impossible to predict or to infer something because the model cannot learn what the expected response is. Then, the goal of the unsupervised models is to learn relationships between the predictors. Good examples are the clustering methods, which will try to set apart the data into different groups based on the available information.

2.1.5 Regression Versus Classification Problems

These two groups of methods fall into different purposes:

Regression: Used when the response variable is quantitative.
Classification: Used when the response variable is qualitative. It is important to understand first what are we trying to measure, to check if it is a regression or a classification problem, and then select the appropriate statistical learning method.

2.2 Assessing Model Accuracy

As there is no one best statistical learning method, it is necessary to evaluate a model accuracy for a given dataset.

2.2.1 Measuring the Quality of Fit

One widely-used measurement is the mean squared error (MSE), which is given by the following formula:

$MSE = \frac{1}{n}\sum_{i = 1}^{n}(y_{i} - \hat{f}(x_{i}))^{2}$

Thus, $MSE$ is calculated by the squared average between the actual value, $y_{i}$, and the estimated value, $\hat{f}(x_{i})$.

It is important to notice that the $MSE$ value only matters for the test set, which is the partition of the dataset that the model has not yet seen. The reason is that it does not matter if the model can predict data it has already seen, as we already know their respective $y_{i}$

This point yields another topic: the degrees of freedom (df). Essentially, $df$ is the level of flexibility a model has. A linear regression, for example, has a very low $df$, whereas more flexible models will have higher $df$. Then, it is relevant to notice that a model with a higher $df$ will have a lower $training$ $MSE$. The reason why is that with larger flexibility, the model will be really close to the data points contained in the training dataset, thus the $training$ $MSE$ will be quite low. However, the scenario will be really different with the $test$ $MSE$. As the model is following the noise, it will not accurately predict unseen values. Then, there will be a $df$ that will reduce the $test$ $MSE$, and will probably not be the highest, but somewhere in between.

When a model has a low $training$ $MSE$, but a high $test$ $MSE$, the model is considered to suffer from overfitting. The overfitting condition also implies the existence of a less flexible model that would have yielded a lower $test$ $MSE$. Sometimes, test data is not readily available, and some techniques are necessary to attempt to estimate the $test$ $MSE$.

Finally, there are methods that will seek to optimise $df$, as though to yield the model which will have the most effective $test$ $MSE$.

2.2.2 The Bias-Variance Trade-Off

To reach the most effective $test$ $MSE$, we have to evaluate our models based on the variance of the training dataset, and on the bias introduced by estimating a real-life problem.

Variance: How much $\hat{f}$ would change if we used a different training dataset. An ideal model would accurately predict a test dataset, thus a different training dataset would not make much of a difference.
Bias: How much error is accounted for $\hat{f}$. This will increase by as much as our estimate of $f$ is far from the actual $f$. As a general rule of thumb, the more flexible a model is, the more the variance will increase, and the less the bias will decrease. As models are evaluated, an increase in flexibility will reduce the bias faster than the variance is increased. However, at some point an increase in flexibility will significantly increase the variance with little change in bias. This can be understood with the following formula:

$test$ $MSE = Var(\hat{f}(x_{0})) + [Bias(\hat{f}(x_{0}))] + Var(\epsilon)$

Hence, we can decompose the $test$ $MSE$ by adding the variance, the bias and the $Var(\epsilon)$. This will be applied to every dataset and every model. As we have seen, as the $df$ increases, variance will increase and bias will decrease. The ultimate goal is to find a balance between the two that yields the best $test$ $MSE$.

2.2.3 The Classification Setting

With a classification method, there are a few different concepts that are applied. As the goal is to accurately classify a data point, we can evaluate our estimation of $f$ with the average of correct classifications. This is exemplified in the following formula:

$\frac{1}{nb}\sum^{n}_{i = 1}I(y_{i}\neq \hat{y}_{i})$

The more correct classifications there are, the more zeros this formula will yield, and thus the more accurate the model will be.

The Bayes Classifier - minimizing the average error.

The test error can be minimized if we assume the classification of a given observation based on the most likely class it could belong to given its predictor values. This conditional method is given by the following formula:

$Pr(Y = j|X = x_{0})$

This method, called the Bayes Classifier, is used in situations where there are only two categories, and it will predict class one if $Pr(Y = j|X = x_{0}) > 0.5$, and class two otherwise. When this equation equals exactly 0.5, it means that the given observation will fall into the Bayes decision boundary. The error that this method produces is analogous to the irreducible error and is given by the following formula:

$1 - E(max_{j}Pr(Y = j|X))$

K-Nearest Neighbours (KNN) - approximation of the Bayes Classifier

The Bayes Classifier is always impossible to use because, for its application, it is necessary to know the conditional distribution of $Y$ given $X$. Then, as it is the most effective method to reduce the average error, it is considered as an unattainable gold standard. The KNN approach uses a given positive $K$ integer and a test observation $x_{0}$; identifies the $K$ points that are closer to $x_{0}$, represented by $N_{0}$; estimates the conditional probability for class $j$ as the percentage of points in $N_{0}$ that belong to class $j$; and finally classifies the test observation $x_{0}$ as the class with the largest probability. This relationship is determined by the following formula:

$Pr(Y = j|X = x_{0}) = \frac{1}{K}\sum_{i\epsilon N_{0}} I(y_{i} = j)$

For example, for a given dataset and the value of K = 3, to classify a given data point, the model will search for the 3 nearest data points. Suppose 2 out of 3 are class a, and 1 is class b; then the model will predict that this data point belongs to class a.

It is important to notice that, as in the $df$ for the regression methods, the value of $K$ will determine the flexibility of the model. And, as it was true then, it is also true that the more flexible, the less bias and more variance the model will account for.

2.3 Lab: Introduction to R

The goal of this lab is to introduce the R language.

2.3.1 Basic Commands

Vectors

# creating a vector
x <- c(1, 6, 2)
x

## [1] 1 6 2

y <- c(1, 4, 3)
y

## [1] 1 4 3

# checking the length of a vector
length(x)

## [1] 3

length(y)

## [1] 3

x + y

## [1]  2 10  5

# checking and deleting objects
ls() #looks for all objects registered on the environment

## [1] "x" "y"

# rm() remove objects from the environment
rm(list = ls()) #removes a list of objects, which is set to all registered objects

Matrices

# creating a matrix
# the function matrix() takes at least three inputs: data, number of rows, and number of columns
x <- matrix(data = c(1, 2, 3, 4), nrow = 2, ncol = 2)
x

##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

# the input of the arguments data =, nrow =, and ncol = is not necessary
# the reason is because R will take the order of the input data as the order defined on the function's documentation
x <- matrix(c(1, 2, 3, 4), 2, 2)
x

##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

# the matrix() function will fill columns first, but the argument byrow = TRUE changes it to rows first
# it is also not necessary to assing the function to a value
# this will only yield the result of the function and will not store the result for future reference
matrix(c(1, 2, 3, 4), 2, 2, byrow = TRUE)

##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4

# calculations with matrices
# it is possible to compute several calculations, including square root and power
sqrt(x)

##          [,1]     [,2]
## [1,] 1.000000 1.732051
## [2,] 1.414214 2.000000

x^2

##      [,1] [,2]
## [1,]    1    9
## [2,]    4   16

Correlation of vectors

# the function rnorm() will generate a vector of random normal variables, being the first argument the sample size
# by definition, rnorm() will create random variables with mean of zero and standard deviation of 1
# as it is random, every time we call it we will get a different answer
x <- rnorm(50)

# it is possible to set other arguments, including mean and standard deviation
y <- x + rnorm(50, mean = 50, sd = 0.1)

# the function cor() will yield the correlation between the vectors
cor(x, y)

## [1] 0.9961915

# to make sure the result will always be the same, it is possible to set a seed with the function set.seed()
# this makes possible the reproduction of results
set.seed(1303)
rnorm(50)

##  [1] -1.1439763145  1.3421293656  2.1853904757  0.5363925179  0.0631929665
##  [6]  0.5022344825 -0.0004167247  0.5658198405 -0.5725226890 -1.1102250073
## [11] -0.0486871234 -0.6956562176  0.8289174803  0.2066528551 -0.2356745091
## [16] -0.5563104914 -0.3647543571  0.8623550343 -0.6307715354  0.3136021252
## [21] -0.9314953177  0.8238676185  0.5233707021  0.7069214120  0.4202043256
## [26] -0.2690521547 -1.5103172999 -0.6902124766 -0.1434719524 -1.0135274099
## [31]  1.5732737361  0.0127465055  0.8726470499  0.4220661905 -0.0188157917
## [36]  2.6157489689 -0.6931401748 -0.2663217810 -0.7206364412  1.3677342065
## [41]  0.2640073322  0.6321868074 -1.3306509858  0.0268888182  1.0406363208
## [46]  1.3120237985 -0.0300020767 -0.2500257125  0.0234144857  1.6598706557

Other calculations

# we can also compute the mean, variance, square root of variance (standard deviation), and directly the standard deviation
set.seed(3)
y <- rnorm(100)
mean(y)

## [1] 0.01103557

var(y)

## [1] 0.7328675

sqrt(var(y))

## [1] 0.8560768

sd(y)

## [1] 0.8560768

2.3.2 Graphics

The plot() function

x <- rnorm(100)
y <- rnorm(100)
plot(x, y)

# we can set x-axis and y-axis labels with xlab = and ylab =
# we can also set the title with main =
plot(x, y,
     xlab = 'this is the x-axis',
     ylab = 'this is the y-axis',
     main = 'Plot of X vs Y')

The seq() function

# we can use seq() to generate a vector between two numbers, imputing numbers equally separated
x <- seq(1, 10)
x

##  [1]  1  2  3  4  5  6  7  8  9 10

# we can add an argument to specify the length of the vector
# if length is not specified, the vector will contains values separated by 1
x <- seq(-pi, pi, length = 50)
x

##  [1] -3.14159265 -3.01336438 -2.88513611 -2.75690784 -2.62867957 -2.50045130
##  [7] -2.37222302 -2.24399475 -2.11576648 -1.98753821 -1.85930994 -1.73108167
## [13] -1.60285339 -1.47462512 -1.34639685 -1.21816858 -1.08994031 -0.96171204
## [19] -0.83348377 -0.70525549 -0.57702722 -0.44879895 -0.32057068 -0.19234241
## [25] -0.06411414  0.06411414  0.19234241  0.32057068  0.44879895  0.57702722
## [31]  0.70525549  0.83348377  0.96171204  1.08994031  1.21816858  1.34639685
## [37]  1.47462512  1.60285339  1.73108167  1.85930994  1.98753821  2.11576648
## [43]  2.24399475  2.37222302  2.50045130  2.62867957  2.75690784  2.88513611
## [49]  3.01336438  3.14159265

The contour() function

# generates a contour plot, which represents three-dimensional data
# it is similar to a topographical map
# it takes three arguments: vector of x values; vector of y values; a matrix that corresponds to a z value for every correlation of (x, y)
y <- x
f <- outer(x, y, function(x, y) cos(y) / (1 + x^2))
contour(x, y, f)
contour(x, y, f, nlevels = 45, add = TRUE)

fa <- (f - t(f)) / 2
contour(x, y, fa, nlevels = 15)

The image() and persp() functions

# image() works similarly to contour(), but adds colours to it, creating a heatmap
# the colours will ultimately depend on the z value
# alternatively, persp() produces three-dimensional plots
# it uses the arguments theta and phi to control the viewing angles
image(x, y, fa)

persp(x, y, fa)

persp(x, y, fa, theta = 30)

persp(x, y, fa, theta = 30, phi = 20)

persp(x, y, fa, theta = 30, phi = 70)

persp(x, y, fa, theta = 30, phi = 40)

2.3.3. Indexing Data

Indexing a matrix

# we can select parts of the matrix for inspection purposes
# the first number inside the brackets will always represent row, and the second the column
A <- matrix(1:16, 4, 4)
A

##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16

# a single element
A[2, 3]

## [1] 10

# a collection of elements
A[c(1, 3), c(2, 4)]

##      [,1] [,2]
## [1,]    5   13
## [2,]    7   15

# a partition of the matrix
A[1:3, 2:4]

##      [,1] [,2] [,3]
## [1,]    5    9   13
## [2,]    6   10   14
## [3,]    7   11   15

# it is also possible to leave the row or column argument empty
# this will select all the rows or all the columns
A[1:2, ]

##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14

A[, 1:2]

##      [,1] [,2]
## [1,]    1    5
## [2,]    2    6
## [3,]    3    7
## [4,]    4    8

# the use of the negative symbol (-) will tell R to include all rows and columns except those indicated
A[-c(1, 3), ]

##      [,1] [,2] [,3] [,4]
## [1,]    2    6   10   14
## [2,]    4    8   12   16

A[-c(1, 3), -c(1, 3, 4)]

## [1] 6 8

# the dim() function will yields the total number of rows and columns
dim(A)

## [1] 4 4

2.3.4 Loading Data

Reading a table

# loading the book's R package
library(ISLR2)

# reading a dataset
# it is necessary to input the correct directory
# Auto <- read.table(Auto)

# viewing the dataset in a separate window
View(Auto)

# viewing the first few rows of the data
head(Auto)

##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500

2.3.5 Additional Graphical and Numerical Summaries

A few techniques for plotting

# plotting cylinders and mpg
# as we are plotting from a dataframe, it is necessary to use $ to specify variables
plot(Auto$cylinders, Auto$mpg)

# it is possible to tell R to use Auto as a reference with the attach() function
# this will make possible using only the variables in functions
attach(Auto)
plot(cylinders, mpg)

Boxplots

# R will plot boxplots by default if the variable on the x-axis is qualitative
# when the Auto dataframe is loaded, R will understand cylinders as a numeric variable
# because it is represented by numbers
# however, in reality cylinders can be understood with different categories
# thus, we can tell R to treat it as a qualitative variable instead with as.factor()
cylinders <- as.factor(cylinders)

# plotting again
plot(cylinders, mpg)

# changing the colour
plot(cylinders, mpg, col = 'red')

# adjusting the size of each 'box' to the amount of observations in that category
plot(cylinders, mpg, col = 'red', varwidth = TRUE)

# changing the plot to horizontal
plot(cylinders, mpg, col = 'red', varwidth = TRUE, horizontal = TRUE)

# adding labels
plot(cylinders, mpg, col = 'red', varwidth = TRUE, xlab = 'cylinders', ylab = 'MPG')

Histograms

# instead of plot(), we use hist()
hist(mpg)

# setting col = 2 will also turn the colour red
hist(mpg, col = 2)

# adjusting the size of the bins
hist(mpg, col = 2, breaks = 15)

Scatterplots

# to plot a matrix of scatterplots, we use the pairs() function
# it will generate pairs of scatterplots for all the variables contained in the dataframe
pairs(Auto)

# we can also set which variables we want to see
pairs( ~ mpg + displacement + horsepower + weight + acceleration,
       data = Auto)

2.4 Exercises

2.4.1 Conceptual

For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.

The sample size $n$ is extremely large, and the number of predictors $p$ is small.

Answer: The flexible method is expected to outperform the inflexible because there are many observations. This will hinder overfitting, the main problem of flexible methods.

The number of predictors $p$ is extremely large, and the number of observations $n$ is small.

Answer: The flexible method is expected to perform worse than the inflexible. The reason is because there are not many observations and the method might follow the noise in the data, instead of the actual shape of $f()$.

The relationship between the predictors and response is highly non-linear.

Answer: The flexible method is expected to outperform the inflexible one. The reason is because $f()$ is highly non-linear, which means the model will likely benefit from a higher $df$.

The variance of the error terms, i.e. $\sigma^{2} = Var(\epsilon)$, is extremely high.

Answer: The $Var(\epsilon)$, also called the irreducible error, will be there independent of what the actual shape of $f()$ really is. Thus, we cannot be sure what to expect.

Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide $n$ and $p$.

Personal note: Classification is used when $Y$ is qualitative; regression is used when $Y$ is quantitative. Inference is about understanding the true $f()$, whereas prediction is about accurately predicting $Y$. The value of $n$ is the size of the data, whereas the value of $p$ is the total of predictors.

We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry, and the CEO salary. We are interested in understanding which factors affect CEO salary.

Answer: It is a regression problem because $Y$, the CEO salary, is quantitative. We are more interested in inferring, because we want to know the true $f()$, the relationships between the variables that will determine $Y$. The value of $n$ is 500, because there are 500 firms. The value of $p$ is profit, number of employees, and industry, which are all the predictors of $Y$, the CEO salary.

We are considering launching a new product and wish to know whether it will be a $success$ or a $failure$. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or a failure, price charged for the product, marketing budget, competition price, and ten other variables.

Answer: It is a classification problem because $Y$, $success$ or $failure$ is qualitative. We are more interested in prediction, because we want to know whether it will be a success or not. The value of $n$ is 20, because we have 20 products. The value of $n$ is price charged, marketing budget, competition price, and the other ten variables.

We are interested in predicting the % change in the USD/EUR exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/EUR, the % change in the US market, the % change in the British market, and the % change in the German market.

Answer: It is a regression problem because $Y$, the USD/EUR exchange rate is quantitative. We are more interested in prediction, because we want to predict the % change of $Y$. The value of $n$ is 52, because there the dataset has weekly information about a single year, 2012, and the year has 52 weeks. The value of $p$ is the % change in the US market, the % change in British market, and the % change in the German market.

We now revisit the bias-variance decomposition.

Provide a sketch of a typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The $x$-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Make sure to label each one.

Answer: Drawn separately

Explain why each of the five curves has the shaped displayed in part a).

Answer: The bias begins high and falls rapidly, as the $df$ is increased. The variance starts low and increases slowly as $df$ increases, but then exponentially increases as $df$ gets higher. The training error will start somewhere in the middle and then falling progressively. The test error will start a little above the train error, reducing a little as $df$ increases, bu then rising exponentially as the variance will get higher. The irreducible error, $Var(\epsilon)$ is a constant line, it remains unchanged.

You will now think of some real-life applications for statistical learning.

Describe three real-life applications in which a $classification$ might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.

Answer: i) whether it will rain tomorrow or not; the response $Y$ is whether it will rain tomorrow or not; the predictors $p$ are humidity, average rainfall for tomorrow’s date based on historical data, wind speed, and cloud coverage. ii) whether a credit card sale will be authorised or not; the response $Y$ is whether the sale will be approved or not; the predictors $p$ are time of transaction, average sale value for the credit card’s holder, whether the location of the transaction is rare for the credit card’s holder, average sale value of previous five credit card transactions. iii) whether a president will be reelected or not; the response $Y$ is the likely outcome of the election, if the politician is reelected or not; the predictors $p$ are unemployment rate, year-on-year % change of the population’s purchase power, and the president’s popularity.

Describe three real-life applications in which $regression$ might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.

Answer: i) predicting the value of a stock; the response $Y$ is the value of the stock; the predictors $p$ are the company’s earnings, its market share, and the average value of a stock in the company’s industry; ii) a data scientist salary; the response $Y$ is the salary of the data scientist; the predictors $p$ are years of experience, industry, city where the company is located, and the market valuation of the company. iii) credit score; the response $Y$ is the value of the credit score of someone; the predictors $p$ are whether the person has defaulted in the past five years, how many times the person has not payed their credit card on time in the past five years, monthly income, and the amount of debt the person has.

Describe three real-life applications in which $cluster analysis$ might be useful.

Answer: i) segmentation of a company’s customers. ii) segmentation of cancer cells. iii) segmentation of particles in a polluted river.

What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for a regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?

Answer: The best advantage of a very flexible model is that it has a very low bias compared to a less flexible one. On the other side, a major disadvantage is that overfitting is very common for flexible models. A more flexible approach will be best when the true shape of $f()$ is non-linear, whereas the less flexible will be best when the true shape of $f()$ is linear.

Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric approach)? What are its disadvantages?

Answer: A parametric approach estimates the value of $f()$, and then looks after what the parameters will be. The advantage is that it makes the problem a lot more easy and is able to perform well with less data. The disadvantage is that the estimation of $f()$ might be distant of the reality, thus resulting in great error. A non-parametric approach will not try to estimate $f()$, thus generating more accurate results. Its disadvantage is that it needs a lot of data, which in many cases is not possible.

The table below provides a training dataset containing six observations, three predictors, and one qualitative response variable.

Obs|$X_{1}$|$X_{2}$|$X_{3}$|$Y$

1 | 0 | 3 | 0 | Red

2 | 2 | 0 | 0 | Red

3 | 0 | 1 | 3 | Red

4 | 0 | 1 | 2 | Green

5 |-1 | 0 | 1 | Green

6 | 1 | 1 | 1 | Red

Suppose we wish to use this dataset to make a prediction for $Y$ when $X_{1} = X_{2} = X_{3} = 0$ using $K$-nearest neighbours.

Compute the Euclidean distance between each observation and the test point, $X_{1} = X_{2} = X_{3} = 0$.

Answer: 1, d = 3; 2, d = 2; 3, d = $\sqrt{10}$, roughly 3.162; 4, d = $\sqrt{5}$, roughly 2.236; 5, d = $\sqrt{2}$, roughly 1.414; 6, d = $\sqrt{3}$, roughly 1.732.

What is our predictor with $K = 1$? Why?

Answer: Green, as with $K = 1$ we gather only the nearest point, which will be observation 5, green.

What is our prediction with $K = 3$? Why?

Answer: Green, as with $K = 3$ we gather the three nearest points, which are observations 5, 6, and 2. As 5 and 6 are green, and 2 is red, the observation will be assigned green as green makes $\frac{2}{3}$ of the observations.

If the Bayes decision boundary in this problem is highly non-linear, then would we expect the $best$ value for $K$ to be large or small? Why?

Answer: We will expected it to be large. The more non-linear a problem is, the larger the adequate number of $K$ is expected to be.

2.4.2 Applied

This exercise relates to the College dataset, which can be found in the file College.csv on the book website. It contains a number of variables for 777 different universities and colleges in the US. The variables are:

Private: Public/private indicator
Apps: Number of applications received
Accept: Number of applicants accepted
Enroll: Number of new students enrolled
Top10perc: New students form top 10% high school class
Top25perc: New students from top 25% of high school class
F.Undergrad: Number of full-time undergraduates
P.Undergrad: Number of part-time undergraduates
Outstate: Out-of-state tuition
Room.Board: Room and Board costs
Books: Estimated book costs
Personal: Estimated personal spending
PhD: Percent of faculty with Ph.D.’s
Terminal: Percent of faculty with terminal degree
S.F.Ratio: Student/faculty ratio
perc.alumni: Percent of alumni who donate
Expend: Instructional expenditure per student
Grad.Rate: Graduation rate

Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data.

Answer:

college <- College

Look at the data using the View() function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later. Use the following command: rownames(college) <- college[, 1]. You should see that there is now a row.names column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on the row names. However, we still need to eliminate the first column in the data where the names are stored. Try: college <- college[, -1]. Now you should see that the first data column is Private. Note that another column labeled row.names now appears before the Private column. However, this is not a data column but rather the name that R is giving to each row.

Answer:

# rownames(college) <- college[, 1]
# college <- college[, -1]
head(college)

##                              Private Apps Accept Enroll Top10perc Top25perc
## Abilene Christian University     Yes 1660   1232    721        23        52
## Adelphi University               Yes 2186   1924    512        16        29
## Adrian College                   Yes 1428   1097    336        22        50
## Agnes Scott College              Yes  417    349    137        60        89
## Alaska Pacific University        Yes  193    146     55        16        44
## Albertson College                Yes  587    479    158        38        62
##                              F.Undergrad P.Undergrad Outstate Room.Board Books
## Abilene Christian University        2885         537     7440       3300   450
## Adelphi University                  2683        1227    12280       6450   750
## Adrian College                      1036          99    11250       3750   400
## Agnes Scott College                  510          63    12960       5450   450
## Alaska Pacific University            249         869     7560       4120   800
## Albertson College                    678          41    13500       3335   500
##                              Personal PhD Terminal S.F.Ratio perc.alumni Expend
## Abilene Christian University     2200  70       78      18.1          12   7041
## Adelphi University               1500  29       30      12.2          16  10527
## Adrian College                   1165  53       66      12.9          30   8735
## Agnes Scott College               875  92       97       7.7          37  19016
## Alaska Pacific University        1500  76       72      11.9           2  10922
## Albertson College                 675  67       73       9.4          11   9727
##                              Grad.Rate
## Abilene Christian University        60
## Adelphi University                  56
## Adrian College                      54
## Agnes Scott College                 59
## Alaska Pacific University           15
## Albertson College                   55

1. Use the summary() function to produce numerical summary of the variables in the dataset.
2. Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10].
3. Use the plot() function to produce side-by-side boxplots of Outstate versus Private.
4. Create a new qualitative variable, called Elite, by $binning$ the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%. Use the summary() function to see how many elite universities there are. Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.
5. Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command par(mfrow = c(2, 2)) useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will dive the screen in other ways.
6. Continue exploring the data, and provide a brief summary of what you discover.

Answer:

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

summary(college)

##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00

pairs(~ Private + Apps + Accept + Enroll + Top10perc + Top25perc + F.Undergrad +
        P.Undergrad + Outstate + Room.Board, college)

plot(college$Private, college$Outstate)

college <- college %>%
  mutate(Elite = case_when(Top10perc >= 50 ~ 'Yes',
                           Top10perc < 50 ~ 'No')) %>%
  mutate(Elite = as.factor(Elite)); summary(college)

##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate      Elite    
##  Min.   : 10.00   No :694  
##  1st Qu.: 53.00   Yes: 83  
##  Median : 65.00            
##  Mean   : 65.46            
##  3rd Qu.: 78.00            
##  Max.   :118.00

plot(college$Elite, college$Outstate)

par(mfrow = c(2, 2))
hist(college$Apps, breaks = 100)
hist(college$Accept, breaks = 100)
hist(college$Enroll, breaks = 100)
hist(college$PhD, breaks = 100)

This exercise involves the Auto dataset studied in the lab. Make sure that the missing values have been removed from the data.

Which of the predictors are quantitative, and which are qualitative?

Answer: Quantitative: mpg, displacement, horsepower, weight, acceleration; Quantitative: cylinders, year, origin, name.

auto <- Auto
summary(auto)

##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##                                                                                
##   acceleration        year           origin                      name    
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5  
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5  
##  Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4  
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4  
##  Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4  
##                                                  (Other)           :365

What is the $range$ of each predictor? You can answer this using the range() function.

Answer: mpg, 9.0-46.6; cylinders, 3-8; displacement 68.0-455.0; horsepower, 46.0-230.0; weight, 1613-5140; acceleration, 8.00-24.80; year, 70-82; origin 1-3. Predictor name, as it is a factor, does not have a meaningful range.

range(auto$mpg)

## [1]  9.0 46.6

range(auto$cylinders)

## [1] 3 8

range(auto$displacement)

## [1]  68 455

range(auto$horsepower)

## [1]  46 230

range(auto$weight)

## [1] 1613 5140

range(auto$acceleration)

## [1]  8.0 24.8

range(auto$year)

## [1] 70 82

range(auto$origin)

## [1] 1 3

What is the mean and standard deviation of each quantitative predictor?

Answer: mpg, 23.44592, 7.805007; displacement, 194.412, 104.644; horsepower, 104.4694, 38.49116; weight, 2977.584, 849.4026; acceleration, 15.54133, 2.758864.

mean(auto$mpg)

## [1] 23.44592

sd(auto$mpg)

## [1] 7.805007

mean(auto$displacement)

## [1] 194.412

sd(auto$displacement)

## [1] 104.644

mean(auto$horsepower)

## [1] 104.4694

sd(auto$horsepower)

## [1] 38.49116

mean(auto$weight)

## [1] 2977.584

sd(auto$weight)

## [1] 849.4026

mean(auto$acceleration)

## [1] 15.54133

sd(auto$acceleration)

## [1] 2.758864

Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?

Answer: mpg, 24.40443, 7.867283; displacement, 187.2405, 99.67837; horsepower, 100.7215, 35.70885; weight, 2935.972, 811.3002; acceleration, 15.72.7269, 2.693721.

auto2 <- auto[-c(10:85),]; head(auto2)

##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500

mean(auto2$mpg)

## [1] 24.40443

sd(auto2$mpg)

## [1] 7.867283

mean(auto2$displacement)

## [1] 187.2405

sd(auto2$displacement)

## [1] 99.67837

mean(auto2$horsepower)

## [1] 100.7215

sd(auto2$horsepower)

## [1] 35.70885

mean(auto2$weight)

## [1] 2935.972

sd(auto2$weight)

## [1] 811.3002

mean(auto2$acceleration)

## [1] 15.7269

sd(auto2$acceleration)

## [1] 2.693721

Using the full dataset, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.

Answer: There are some interesting insights. From a consumer perspective, the mpg is an important figure, because it is a cost a consumer will have for how long they will have the vehicle. Then, it is clear that higher weight, displacement, and horsepower will lower the mpg. A surprising relationship, however, is between horsepower and acceleration. At least from my perspective, I thought that higher horsepower would mean higher acceleration, but it is actually the opposite.

pairs(~ auto$mpg + auto$cylinders + auto$displacement + auto$horsepower +
        auto$weight + auto$acceleration + auto$year + auto$origin)

Suppose that we wish to predict gas mileage (mpg) on the basis of other variables. Do your plots suggest that any of the other variables might be useful in predictor mpg? Justify your answer.

Answer: Based on the plots, the three best predictors would be displacement, horsepower, and weight. Year would somewhat help, but might year does not have an as strong relationship with mpg as the other three.

This exercise involves the Boston housing dataset.

To begin, load the Boston dataset. The Boston dataset is part of the ISLR2 library. Read about the dataset with ?Boston. How many rows are in this dataset? How many columns? What do rows and columns represent?

Answer: There are 506 rows and 13 columns. Rows represent suburbs and columns represent information about predictors of house values, being the 13th column the median value of owner-occupied homes in $1000s.

Make some pairwise scatterplots of the predictors (columns) in this dataset. Describe your findings.

Answer: There are two variables that completely split the dataset, which are chas, which indicates position relative to Charles River; and rad, which indicates accessibility of radial highways.

boston <- Boston
summary(boston)

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          lstat      
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   : 1.73  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.: 6.95  
##  Median : 5.000   Median :330.0   Median :19.05   Median :11.36  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :12.65  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:16.95  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :37.97  
##       medv      
##  Min.   : 5.00  
##  1st Qu.:17.02  
##  Median :21.20  
##  Mean   :22.53  
##  3rd Qu.:25.00  
##  Max.   :50.00

pairs(~ boston$crim + boston$zn + boston$indus + boston$chas + boston$nox +
        boston$rm + boston$age + boston$dis + boston$rad + boston$tax +
        boston$ptratio + boston$lstat + boston$medv)

Are any of the predictors associated with per capta crime rate? If so, explain the relationship.

Answer: Being in a location favored by the river will dim crime rate, whereas proximity to radial highways will spike crime rate. Higher crime rates will also decrease the median value of owner-occupied homes.

Do any of the census tracts of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.

Answer: Predictors rad (1-24), age(2.9-100.0), tax (187-711), and ptratio (12.6-22.0) will increase the crime rate.

range(boston$rad)

## [1]  1 24

range(boston$age)

## [1]   2.9 100.0

range(boston$tax)

## [1] 187 711

range(boston$ptratio)

## [1] 12.6 22.0

How many of the census tracts in this dataset bound the Charles river?

Answer: 35 suburbs.

library(dplyr)
boston %>%
  filter(chas == 1) %>%
  count()

##    n
## 1 35

What is the median pupil-teacher ratio among the towns in this dataset?

Answer: The median is 19.05.

median(boston$ptratio)

## [1] 19.05

Which census tract of Boston has lowest median value of owner-occupied homes? What are the values of the other predictors for that census tract, and how do those values compare to the overall ranges for those predictors? Comment on your findings.

Answer: There are two: rows 399 and 406. They both have elevated crime rates, which row #406 very high, high proportion of non-retail business acres, high nitrogen oxides concentration rates, maximum proportion of owner-occupied units built prior to 1940, low mean distance to five Boston employment centres, maximum rate of accessibility to radial highways, high property-tax rates, very high pupil-teacher ratio, and elevated lower percentage status of the population.

library(dplyr)
boston %>%
  arrange(medv) %>%
  head(5)

##       crim zn indus chas   nox    rm   age    dis rad tax ptratio lstat medv
## 1 38.35180  0  18.1    0 0.693 5.453 100.0 1.4896  24 666    20.2 30.59  5.0
## 2 67.92080  0  18.1    0 0.693 5.683 100.0 1.4254  24 666    20.2 22.98  5.0
## 3 25.04610  0  18.1    0 0.693 5.987 100.0 1.5888  24 666    20.2 26.77  5.6
## 4  9.91655  0  18.1    0 0.693 5.852  77.8 1.5004  24 666    20.2 29.97  6.3
## 5 45.74610  0  18.1    0 0.693 4.519 100.0 1.6582  24 666    20.2 36.98  7.0

range(boston$crim)

## [1]  0.00632 88.97620

range(boston$indus)

## [1]  0.46 27.74

range(boston$nox)

## [1] 0.385 0.871

range(boston$rm)

## [1] 3.561 8.780

range(boston$age)

## [1]   2.9 100.0

range(boston$dis)

## [1]  1.1296 12.1265

range(boston$rad)

## [1]  1 24

range(boston$tax)

## [1] 187 711

range(boston$ptratio)

## [1] 12.6 22.0

range(boston$lstat)

## [1]  1.73 37.97

In this dataset, how many of the census tracts average more than seven rooms per dwelling? Comment on the census tracts that average more than eight rooms per dwelling.

Answer: 64 have an average higher than 7. 13 have an average higher than 8.

library(dplyr)
boston %>%
  filter(rm > 8) %>%
  count()

##    n
## 1 13

3 Linear Regression

Linear regression has been widely used for many years. Even though it is more simple in comparison to newer models, the concepts used by linear regression models are applied in the newer ones. Thus, it is really important to understand the concept behind linear regression to follow the study on statistical learning.

When we approach an issue, there are many statistical-related questions that are necessary to be answered before a model is developed:

Is there a relationship between the predictors $X_{1}, ..., X_{n}$ and the response $Y$?
How strong is this relationship?
Which predictor is associated with the response variable?
How large is the association between each predictor and response variable?
How accurately can we predict $Y$?
Is this relationship linear?
Is there any $synergy$ or an $interaction$ $effect$ among the predictors?

Often, a linear regression can answer all of these questions.

3.1 Simple Linear Regression

The simple linear regression is the most straightforward form of the linear regression. It is defined by a linear model where there is only one predictor variable:

$Y \sim \beta_{0} + \beta_{1}X$

Whereas $\beta_{0}$ and $\beta_{1}$ are two unknown constants, and the former is called the intercept, and the latter the slope. After we fit a model to the training data, we are left with the following equation:

$\hat{Y} \sim \hat{\beta}_{0} + \hat{\beta}_{1}x$

3.1.1 Estimating the Coefficients

In order to fit a linear regression model, it is necessary to estimate the coefficients $\beta_{0}$ and $\beta_{1}$. To do so, we collect data points from the training dataset $(x_{1}, y_{1}), ..., (x_{n}, y_{n})$ where each pair consists of a different observation of the dataset. Then, we calculate a intercept and a slope that would generate a line that would be closest to all data points. The most common method for that is the least squares.

The least squares method will use the $residual$ $sum$ $of$ $squares$ (RSS) to determine the intercept and the slope. In essence, the calculation will square the difference between $y_{i}$ and $\hat{y}_{i}$, where $y_{i}$ is the $i$th observed response value, and $\hat{y}_{i}$ is the predicted response of the model for the $i$th observation. Thus, the squared residual error for the $i$th observation will be: $\epsilon_{i}^{2} = (y_{i} - \hat{y}_{i})^{2}$.

Finally, the RSS will be calculated by all the squared residual errors:

$RSS = \epsilon_{1}^{2} + ... + \epsilon_{n}^{2}$

3.1.2 Assessing the Accuracy of the Coefficient Estimates

To understand the linear model, we can take as an example the following formula:

$Y = \beta_{0}X^0 + \beta_{1}X^{1} + ... + \beta_{n}X^{n} + \epsilon$

Where $X^{0}$ equals 1, meaning the coefficient $\beta_{0}$ is the intercept; the pairs of $\beta_{1}X_{1} + ... + \beta_{n}X_{n}$ are the coefficients and $X_{i}$ the value of $X$ elevated to the ${i}$th power; and $\epsilon$ the irreducible error. $\beta_{0}$ is called the intercept because it is the value of $Y$ when $X = 1$. This formula will be used for all linear models, and the simple linear model is the example when there are only two coefficients, $\beta_{0}$ and $\beta_{1}$.

The result of this equation is the $population$ $regression$ $line$, which is the best linear approximation of the true relationship between $X$ and $Y$, unknown in real applications - as it is the case with the coefficients. When we fit a linear model to a given dataset, using the $RSS$ to estimate the coefficients, our model’s regression line will be an approximation of the actual population regression line. If we could have access to multiple datasets, and thus generate multiple regression lines, the average of these models would be equal to the actual population regression line.

This concept originates from the idea that when we don’t have access to the whole population of $Y$, but only a sample, we could estimate the the population mean $\mu$ of $Y$. The sample mean will be $\hat{\mu}$ and unbiased, as we expect that the value of $\hat{\mu}$ will be a good approximation of $\mu$. As $\hat{\mu}$ is an estimation of $\mu$, it can overestimate or underestimate $\mu$. However, if we could have various datasets, the average of all $\hat{\mu}$ found in these samples would ultimately be a very close approximation of $\mu$.

It is possible to assert how good a single $\hat{\mu}$ is in terms of closeness to the real $\mu$ using the $standard$ $error$ of $\hat{\mu}$, $SE(\hat{\mu})$, given by the following formula:

$Var(\hat{\mu}) = SE(\hat{\mu})^{2} = \frac{\sigma^{2}}{n}$

Where $\sigma$ is the standard deviation of each observation $y_{i}$ of $Y$, and holds true provided that all $n$ observations are uncorrelated. Essentially, it tells us the average by which $\hat{\mu}$ differs from $\mu$, and we can conclude that the difference will be reduced as the number of observations $n$ increases.

To estimate coefficients, we have a similar formula to compute the respective standard errors:

$SE(\hat{\beta_{0}})^{2} = \sigma^{2}[\frac{1}{n} + \frac{\overline{x}^{2}}{\sum_{i = 1}^{n}(x_{i} - \overline{x})^{2}}]$

$SE(\hat{\beta_{1}})^{2} = \frac{\sigma^{2}}{\sum_{i = 1}^{n}(x_{i} - \overline{x})^{2}}$

Where $\sigma^{2} = Var(\epsilon)$. To use these formulas, we have to guarantee that the errors $\epsilon_{i}$ have a common variance $\sigma^{2}$ and are uncorrelated. Sometimes we cannot guarantee this relationship, but estimates might still be useful. Analysing the formulas, we can see that for $SE(\hat{\beta_{1}})^{2}$ we can reduce the error if the values of $x_{i}$ are more spread out. Additionally, $SE(\hat{\beta_{0}})^{2}$ would equal $SE(\hat{\mu})$ if $\overline{x}$ were zero, a situation where the coefficient $\hat{\beta}_{0}$ would equal $\overline{y}$. Finally, generally we do not know the value of $\sigma^{2}$, but we can estimate it calculating the $residual$ $standard$ $error$ $RSE$, given by $\sqrt{\frac{RSS}{n - 2}}$.

With the standard errors, we get the notion of a confidence interval. In essence, a confidence interval is the chance that the true value is within the range of the values we present with our model. For our coefficients $\hat{\beta}_{0}$ and $\hat{\beta}_{1}$, we use the following formulas to discover the 95% confidence interval:

$\hat{\beta}_{i} ± 2 \cdot SE(\hat{\beta_{i}})^{2}$

Another important aspect of the standard error is to perform hypothesis tests. Usually, the null hypothesis ($H_{0}$) will say that $X$ has no relationship with $Y$, and the alternative hypothesis ($H_{\alpha}$) will say that $X$ has a relationship with $Y$. To test the hypothesis, we will try to debunk $H_{0}$, that is, we will try to prove our estimation of the coefficient $\beta_{1}$, $\hat{\beta}_{1}$, is sufficiently far from zero that $\hat{\beta}_{1}X_{1}$ will yield a significant value that proves the relationship between $X$ and $Y$. To understand how far from zero $\hat{\beta}_{1}$ has to be, we look at the value of $SE(\hat{\beta_{1}})$: if it is small, small values of $\hat{\beta}_{1}$ will be relevant, and also if the standard error is large, we need $\hat{\beta}_{1}$ to be large too. To calculate this, we compute a $t-statistic$, given by:

$t = \frac{\hat{\beta}_{1} - 0}{SE(\hat{\beta_{1}})}$

This will measure how many standard deviations $\hat{\beta}_{1}$ is far from zero. With this result, we can reach the $p$-value. If the $p$-value is small enough, we can conclude that it is highly unlikely that the data is a result of chance, thus rejecting $H_{0}$ and stating that there is a relationship between $X$ and $Y$.

3.1.3 Assessing the Accuracy of the Model

Once we are able to reject the null hypothesis, it is ideal to understand how good the model explains the data. For simple linear regressions, two methods stand out: Residual Standard Error ($RSE$), and $R^{2}$.

$RSE$: Estimation of the standard deviation of $\epsilon$. It will estimate the average amount that the response will deviate from the true regression line. In other words, it will tell us how many units of measurement the model’s prediction will deviate on average. It is given by the following formula:

$RSE = \sqrt{\frac{1}{n - 2}RSS} = \sqrt{\frac{1}{n - 2}\sum_{i = 1}^{n}(y_{i} - \hat{y_{i}})^{2}}$

$R^{2}$ Statistic: Proportion of much of the data is explained by the model. It takes a value between 0 and 1, where the 1 indicates the model explains 100% of the data. When its value is closer to 0, it is an indication that something is wrong and needs to be investigated. To calculate it, we take the difference between the total sum of squares ($TSS$) and $RSS$, and divide it by $TSS$. In essence, we are left with the sum of squares that are able to explain the data. The $R^{2}$ is given by the following formula:

$R^{2} = \frac{TSS - RSS}{TSS}$

$r$ or $correlation$ between $X$ and $Y$: As $R^{2}$ is a measurement of the linear relationship between $X$ and $Y$, it is possible to use $Cor(X,Y)$ to assess the fit of a linear model. In simple linear regression settings, we find that $R^{2} = r^{2}$. It is given by the following formula:

$r = Cor(X,Y) = \frac{\sum_{i = 1}^{n}(x_{i} - \overline{x})(y_{i} - \overline{y})}{\sqrt{\sum_{i = 1}^{n}(x_{i} - \overline{x})^{2}}\sqrt{\sum_{i = 1}^{n}(y_{i} - \overline{y})^{2}}}$

3.2 Multiple Linear Regression

The multiple linear regression will take into account multiple coefficients, and it is a generally more used approach. It uses the following formula:

$Y = \beta_{0}X^0 + \beta_{1}X^{1} + ... + \beta_{n}X^{n} + \epsilon$

Where each $\beta_{j}$ is defined as the average effect on $Y$ of a single unit increase in the predictor $X_{j}$, given that all the other predictors remain unchanged.

3.2.1 Estimating the Regression Coefficients

The method for estimating the coefficients are the same of the simple linear regression, to use the least squares approach. Thus, we will choose coefficients $\beta_{0}, \beta_{1}, ..., \beta_{p}$ that will minimize the sum of the squared residuals. This will use the following formula:

$RSS = \sum_{i = 1}^{n}(y_{i} - \hat{y}_{i})^{2} = \sum_{i = 1}^{n}(y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{i1} - ... - \hat{\beta}_{p}x_{ip})^{2}$

One factor that will come up when comparing simple linear regressions with a multiple linear regression is that a predictor might appear to correlate with the outcome, but when there are more predictors its correlation disappears. The reason is that when we choose a simple linear regression, we might be ignoring another relationship that is only possible when we consider other predictors. An explanation to this is that the variable that seemed to perform well does not correlates with the outcome, but with another predictor that does correlates. To assess this, we can run a correlation matrix that comprises the outcome and the predictors.

A common example of this situation is the supposed correlation between ice cream consumption and shark attacks. The explanation is that there is another predictor, temperature, that explains the issue. Due to the increase of temperature, there is an increased consumption of ice creams, and more people go to the beach, thus increasing the occurrence of shark attacks.

3.2.2 Some Important Questions

Performing a multiple linear regression often prompts four questions:

1: Is there a relationship between the response and predictors? As it was the case with the simple linear regression, we will try to test that the estimation of the coefficients is sufficiently far from zero that we can assert that there is a relationship between $X$ and $Y$. Essentially, we will have that the null hypothesis, $H_{0}$, is saying that all coefficients $\beta_{1}, ..., \beta_{p}$ are equal to $0$. Alternatively, the hypothesis $H_{a}$ will say that at least one coefficient $\beta_{j}$ is non-zero. To compute this, we calculate the $F-statistic$, given by the following formula:

$F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)}$

If the null hypothesis $H_{0}$ is correct, the value of $F$ will be closer to one, and, if the hypothesis $H_{a}$ is correct, the value of $F$ will be greater than one The amount by how much $F$ has to be larger than one will depend on the number of observations $n$ and the number of predictors $p$. If $n$ is large, the value can be just a little larger than one; but if $n$ is small, $F$ has to be very larger than one. At the end of the day, the value of $F$ will be used together with the $p-value$ to reject the null hypothesis. This is specially useful when the number of predictors $p$ is large and the $p-values$ are small by chance, which would wrongly suggest a relationship between a predictor and the response variable.
2: Deciding on important variables After checking for $F-statistic$ and $p-values$, the next step is to select which variables out of the total of variables are the best to be included into the model. The ideal approach would be to test every single possibility, but this becomes impractical with more predictors, as the total of necessary model fits is equal to $2^{p}$, which means that with $p = 2$, we would have 4 models; and with $p = 30$ we would have 1,073,741,824 models. Thus, we turn into what is called model selection, and there are three ways to do so.
The first is called forward selection, where we start with the intercept and add coefficients one by one starting on whichever has the smallest $p-value$; the second is called backward selection, where we take all coefficients and take out coefficients one by one, starting with whichever has the highest $p-value$; the third one is called mixed selection, and it is a mix of the two approaches, starting with just the intercept.
3: The model fit There are two main ways to assert the usefulness of a linear model, $RSE$ and $R^{2}$, which is the same as it was for the simple linear model. Essentially, we are targeting for a $R^{2}$ closer to 1, as it is the measure that tells how much the data is explained by the model; and a low $RSE$, which measures how many units the fitted line deviates from the true regression line.
With this measurements, we can evaluate if a predictor adds to the model. For example, if we have two predictors that together have an $R^{2}$ of 0.850, and the addition of a third predictor only elevates this number to 0.851, this means that the predictor is not important for the model. We can make similar assumptions taking into consideration the $p-value$ and the result in the $RSE$ that the addition of the variable would generate.
Lastly, it is important to plot the data, as some patterns are only spotted visually.
4: Predictions To cast a prediction about unseen data, there is a great uncertainty about the actual value that $Y$ is. This is explained by the fact that the coefficients $\hat{\beta_{1}}, ..., \hat{\beta_{p}}$ are only an estimation of the true coefficients $\beta_{1}, ..., \beta_{p}$, that is, there is the reducible error. that has to be accounted for. Additionally, a linear model is an approximation of reality, so the bias of the model also needs to be accounted for. Lastly, there is the irreducible error, $\epsilon$, that has to be accounted for. Hence, there needs to be an interval where the true value of $f(X)$ lies, which will be bigger than, but centered at the same number of, the confidence interval. This is called the prediction interval.

Study - Introduction to Statistical Learning with R (ISLR)

Arthur Braga