This is one of the most important books on statistical analysis with R. The goal of this document is to extract relevant details from it for study and future reference.
Book website: https://hastie.su.domains/ISLR2/ISLRv2_website.pdf
The book is divided into the following chapters:
Introduction
Statistical Learning
2.1 What is Statistical Learning
2.2 Assessing Model Accuracy
2.3 Lab: Introduction to R
2.4 Exercises
3.1 Simple Linear Regression
3.2 Multiple Linear Regression
3.3 Other Considerations in the Regression Model
3.4 The Marketing Plan
3.5 Comparison of Linear Regression with K-Nearest Neighbours
3.6 Lab: Linear Regression
3.7 Exercises
4.1 An Overview of Classification
4.2 Why Not Linear Regression?
4.3 Logistic Regression
4.4 Generative Models for Classification
4.5 A Comparison of Classification Methods
4.6 Generalized Linear Models
4.7 Lab: Classification Methods
4.8 Exercises
5.1 Cross-Validation
5.2 The Bootstrap
5.3 Lab: Cross-Validation and the Bootstrap
5.4 Exercises
6.1 Subset Selection
6.2 Shrinkage Methods
6.3 Dimension Reduction Methods
6.4 Considerations in High Dimensions
6.5 Lab: Linear Models and Regularisation Methods
6.6 Exercises
7.1 Polynomial Regression
7.2 Step Functions
7.3 Basis Functions
7.4 Regression Splines
7.5 Smoothing Splines
7.6 Local Regression
7.7 Generalised Additive Models
7.8 Lab: Non-linear Modeling
7.9 Exercises
8.1 The Basics of Decision Trees
8.2 Bagging, Random Forests, Boosting, and Bayesian Additive Regression Trees
8.3 Lab: Decision Trees
8.4 Exercises
9.1 Maximal Margin Classifier
9.2 Support Vector Classifiers
9.3 Support Vector Machines
9.4 SVMs with More than Two Classes
9.5 Relationship to Logistic Regression
9.6 Lab: Support Vector Machines
9.7 Exercises
10.1 Single Layer Neural Networks
10.2 Multilayer Neural Networks
10.3 Convolutional Neural Networks
10.4 Document Classification
10.5 Recurrent Neural Networks
10.6 When to Use Deep Learning
10.7 Fitting a Neural Network
10.8 Interpolation and Double Descent
10.9 Lab: Deep Learning
10.10 Exercises
11.1 Survival and Censoring Times
11.2 A Closer Look at Censoring
11.3 The Kaplan-Meier Survival Curve
11.4 The Log-Rank Test
11.5 Regression Models With a Survival Response
11.6 Shrinkage for the Cox Model
11.7 Additional Topics
11.8 Lab: Survival Analysis
11.9 Exercises
12.1 The Challenge of Unsupervised Learning
12.2 Principal Components Analysis
12.3 Missing Values and Matrix Completion
12.4 Clustering Methods
12.5 Lab: Unsupervised Learning
12.6 Exercises
13.1 A Quick Review of Hypothesis Testing
13.2 The Challenge of Multiple Testing
13.3 The Family-Wise Error Rate
13.4 The False Discovery Rate
13.5 A Re-Sampling Approach to p-values and False Discovery
13.6 Lab: Multiple Testing
13.7 Exercises
The book begins by drawing a distinction between supervised and unsupervised statistical learning: Supervised statistical learning - building a statistical model for predicting or estimating an output based on one or more inputs. Unsupervised statistical learning - building a statistical model to learn relationships and structure from data that has inputs but no supervised output.
The book follows with two examples of supervised statistical learning, using the Wage data and the Smarket data. In both examples, the output variable was part of the dataset, and the goal was to predict it (1st example) or to classify it (2nd example).
A third example is then presented, using the NCI60 dataset. Here the goal was not to predict an output, but rather to understand the data: a clustering approach was used to assess whether there were identifiable groups.
Finally, the book outlines what will be covered in the following chapters.
This chapter begins with a typical prediction case. Using the Advertising data set, the aim is to predict the output variable, sales, from the input variables, the advertising budgets in three media: TV, radio, and newspaper.
With this example, the general form of the prediction model is presented: \(Y = f(X) + \epsilon\), where \(Y\) is the quantitative response we want to assess; \(f(X)\) is an unknown function of \(X_{1}, X_{2}, ..., X_{p}\), where \(p\) is the number of input variables; and \(\epsilon\) is a random error independent of \(X\) with mean zero. The role of \(f(X)\) is to capture the systematic information that \(X\) provides about \(Y\).
It is outlined that there are two reasons to estimate \(f()\): to predict something or to infer something.
Prediction - understand \(Y\)
Used when a set of input variables \(X_{1}, X_{2}, ..., X_{p}\) is available, but the output \(Y\) is not easily obtained. As the error \(\epsilon\) averages to zero, the prediction takes the form \(\hat{Y} = \hat{f}(X)\). It is important to notice that this method treats \(\hat{f}\) as a black box, i.e. we are not concerned with its exact form, provided it yields accurate predictions of \(Y\). Additionally, the accuracy of the model will ultimately depend on two quantities: the reducible error and the irreducible error.
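The book decomposes the average squared prediction error into these two parts (for a fixed \(\hat{f}\) and \(X\)):
\(E(Y - \hat{Y})^{2} = [f(X) - \hat{f}(X)]^{2} + Var(\epsilon)\)
where the first term is the reducible error, which a better estimate of \(f\) can shrink, and \(Var(\epsilon)\) is the irreducible error, which no model can remove.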
Inference - understand \(f(X)\)
By contrast, this approach is not trying to predict the value of \(Y\), but rather to understand the actual form of \(f(X)\). Thus, the black box approach used for prediction, \(\hat{Y} = \hat{f}(X)\), cannot be used. Inference is typically interested in questions such as which predictors are associated with the response, and what the nature of that relationship is.
Depending on the model of choice, the goal will be prediction, inference, or a mix of the two.
There are many techniques for estimating \(f()\). They all, however, use training data points fed into a statistical learning method. The goal is to find a function \(\hat{f}(X)\) that gets close to the actual value of \(Y\) for any observation \((X, Y)\). Most statistical learning methods can thus be classified as parametric or non-parametric:
Parametric - two-step model-based approach
\(Y \approx \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + ... + \beta_{p}X_{p}\)
This approach is called parametric because the goal is to estimate the parameters \(\beta_{0}, \beta_{1}, ..., \beta_{p}\). It is widely used because it is easier to estimate a set of parameters than to estimate an arbitrary function \(f()\). The main issues are that the chosen form might not match the true \(f()\), and that a very flexible model with many parameters may follow the noise too closely, creating the overfitting problem.
Non-parametric - estimate \(f()\) that gets as close to data points as possible
No assumption about the form of \(f()\) is made, which is an advantage over parametric approaches. However, to succeed, a large number of observations is required. Additionally, it is necessary to choose the smoothness of the fit, which determines how closely the estimated \(f()\) follows the data points. Choosing too little smoothness, i.e. a fit that follows the data points too closely, can lead to an overfitting problem.
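As a rough illustration of the two approaches (not from the book's labs; the simulated data and variable names here are made up), a parametric fit assumes a fixed functional form, while a non-parametric one lets the data determine the shape:
# illustrative sketch: parametric (linear) vs non-parametric (loess) fits to simulated data
set.seed(1)
x <- runif(100, 0, 10)
y <- sin(x) + rnorm(100, sd = 0.3)        # the true f() is non-linear
fit_param <- lm(y ~ x)                    # parametric: assumes Y is linear in X
fit_nonparam <- loess(y ~ x, span = 0.3)  # non-parametric: local regression, no fixed form
grid <- data.frame(x = seq(min(x), max(x), length.out = 5))
cbind(linear = predict(fit_param, grid), loess = predict(fit_nonparam, grid))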
Sometimes it is necessary to choose an approach that might seem counter-intuitive, because of the trade-off between flexibility and interpretability. A more rigid model can be preferable to a more flexible one, even if it is less accurate, since there are times when interpreting the result matters more than accuracy. This is typically the case in inference problems, where a more restrictive model explains the results better. The flexible approach, even though potentially more accurate, involves many parameters, which complicates the interpretation.
However, when the goal is prediction, the interpretability of the model will not be of interest. In those cases, the accuracy will be the most important figure, and the use of a more flexible approach is encouraged.
Additionally, it is worth mentioning that overly flexible models can end up overfitting. Therefore, a more inflexible model can sometimes give more accurate predictions than its more flexible counterpart.
The goal of most models is to use available data to predict or to infer something. This is the case for many different approaches, including linear models, logistic regression, etc. These models require supervision: for each observation of the predictors, a response value is supplied. The model learns what the expected response is, and so it becomes possible to predict or infer something.
In the unsupervised setting, however, there is no response variable; only data on the predictor variables are available. Hence, it is impossible to predict or infer a response, because the model cannot learn what the expected response is. The goal of unsupervised models is instead to learn relationships among the observations or the predictors. Good examples are clustering methods, which try to partition the data into distinct groups based on the available information.
These two groups of methods therefore serve different purposes: supervised methods are used for prediction and inference, while unsupervised methods are used to discover structure in the data.
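As a concrete instance of the unsupervised side, a minimal clustering sketch on simulated data (K-means stands in for the clustering methods mentioned above; the data are made up):
# illustrative sketch: K-means finds two groups without any response variable
set.seed(3)
x <- matrix(rnorm(100 * 2), ncol = 2)
x[1:25, ] <- x[1:25, ] + 3          # shift the first 25 observations to create a second group
km <- kmeans(x, centers = 2, nstart = 20)
table(km$cluster[1:25])             # most of the shifted observations end up in the same cluster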
As there is no single best statistical learning method for every problem, it is necessary to evaluate a model's accuracy on a given dataset.
One widely-used measurement is the mean squared error (MSE), which is given by the following formula:
\(MSE = \frac{1}{n}\sum_{i = 1}^{n}(y_{i} - \hat{f}(x_{i}))^{2}\)
Thus, the \(MSE\) is the average of the squared differences between the actual values, \(y_{i}\), and the estimated values, \(\hat{f}(x_{i})\).
It is important to notice that what really matters is the \(MSE\) on the test set, the partition of the dataset that the model has not yet seen. The reason is that it matters little whether the model can predict data it has already seen, as we already know the respective \(y_{i}\).
This point leads to another topic: the degrees of freedom (\(df\)). Essentially, \(df\) measures the level of flexibility of a model. A linear regression, for example, has a very low \(df\), whereas more flexible models have higher \(df\). A model with a higher \(df\) will have a lower training \(MSE\): with more flexibility, the model gets very close to the data points in the training dataset, so the training \(MSE\) will be quite low. The scenario is very different for the test \(MSE\). As the model follows the noise, it will not accurately predict unseen values. There will therefore be a value of \(df\) that minimizes the test \(MSE\), and it will usually not be the highest one, but somewhere in between.
When a model has a low training \(MSE\) but a high test \(MSE\), it is said to suffer from overfitting. Overfitting also implies that a less flexible model would have yielded a lower test \(MSE\). Test data is not always readily available, and some techniques are needed to estimate the test \(MSE\).
Finally, there are methods, such as cross-validation, that seek to choose the level of flexibility yielding the lowest test \(MSE\).
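A small simulation (not from the book; the cubic true function and the sample size are arbitrary choices) illustrates how the training MSE keeps falling with flexibility while the test MSE bottoms out at an intermediate level:
# illustrative sketch: training vs test MSE as flexibility (polynomial degree) increases
set.seed(2)
n <- 200
dat <- data.frame(x = runif(n, -2, 2))
dat$y <- dat$x^3 - 2 * dat$x + rnorm(n)   # the true f() is a cubic
train <- sample(n, n / 2)
mse <- sapply(1:10, function(d) {
  fit <- lm(y ~ poly(x, d), data = dat, subset = train)
  pred <- predict(fit, dat)
  c(train = mean((dat$y[train] - pred[train])^2),
    test  = mean((dat$y[-train] - pred[-train])^2))
})
round(mse, 2)   # training MSE decreases with degree; test MSE is lowest at an intermediate degree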
To reach the lowest expected test \(MSE\), we evaluate our models in terms of two competing quantities: the variance of the estimate \(\hat{f}\) and the bias introduced by approximating a real-life problem with a much simpler model.
\(E(y_{0} - \hat{f}(x_{0}))^{2} = Var(\hat{f}(x_{0})) + [Bias(\hat{f}(x_{0}))]^{2} + Var(\epsilon)\)
Hence, the expected test \(MSE\) decomposes into the variance of \(\hat{f}(x_{0})\), the squared bias of \(\hat{f}(x_{0})\), and the irreducible error \(Var(\epsilon)\). This holds for every dataset and every model. As we have seen, as the \(df\) increases, variance increases and bias decreases. The ultimate goal is to find the balance between the two that yields the lowest test \(MSE\).
With a classification method, a few different concepts apply. As the goal is to accurately classify each data point, we can evaluate our estimate of \(f\) with the error rate, the proportion of incorrect classifications, given by the following formula:
\(\frac{1}{n}\sum^{n}_{i = 1}I(y_{i}\neq \hat{y}_{i})\)
For each correct classification the indicator \(I(y_{i}\neq \hat{y}_{i})\) equals zero, so the more correct classifications there are, the lower this error rate and the more accurate the model.
The Bayes Classifier - minimizing the average error.
The test error rate is minimized, on average, by assigning each observation to its most likely class given its predictor values. This conditional probability is given by the following formula:
\(Pr(Y = j|X = x_{0})\)
This method is called the Bayes Classifier. In a two-class setting, it predicts class one if \(Pr(Y = 1|X = x_{0}) > 0.5\), and class two otherwise. When this probability equals exactly 0.5, the observation falls on the Bayes decision boundary. The error rate this method produces, the Bayes error rate, is analogous to the irreducible error and is given by the following formula:
\(1 - E(max_{j}Pr(Y = j|X))\)
K-Nearest Neighbours (KNN) - approximation of the Bayes Classifier
For real data, the Bayes Classifier is impossible to use, because it requires knowing the conditional distribution of \(Y\) given \(X\). As it is the method that minimizes the test error rate, it serves as an unattainable gold standard. The KNN approach takes a positive integer \(K\) and a test observation \(x_{0}\); identifies the \(K\) training points closest to \(x_{0}\), represented by \(N_{0}\); estimates the conditional probability of class \(j\) as the fraction of points in \(N_{0}\) that belong to class \(j\); and finally classifies the test observation \(x_{0}\) into the class with the largest probability. This is expressed by the following formula:
\(Pr(Y = j|X = x_{0}) = \frac{1}{K}\sum_{i \in N_{0}} I(y_{i} = j)\)
For example, for a given dataset and the value of K = 3, to classify a given data point, the model will search for the 3 nearest data points. Suppose 2 out of 3 are class a, and 1 is class b; then the model will predict that this data point belongs to class a.
It is important to notice that, just as the \(df\) does for regression methods, the value of \(K\) determines the flexibility of the model: smaller values of \(K\) give more flexible fits. And, as before, the more flexible the model, the lower its bias and the higher its variance.
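A quick sketch of KNN in R using the class package (this package shows up in later labs of the book; the simulated data here are made up):
# illustrative sketch: KNN classification with K = 3 on simulated two-class data
library(class)
set.seed(4)
train_x <- matrix(rnorm(100 * 2), ncol = 2)
train_y <- factor(ifelse(train_x[, 1] + train_x[, 2] > 0, "a", "b"))
test_x  <- matrix(rnorm(10 * 2), ncol = 2)
knn(train_x, test_x, train_y, k = 3)   # predicted class for each of the 10 test points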
The goal of this lab is to introduce the R language.
Vectors
# creating a vector
x <- c(1, 6, 2)
x
## [1] 1 6 2
y <- c(1, 4, 3)
y
## [1] 1 4 3
# checking the length of a vector
length(x)
## [1] 3
length(y)
## [1] 3
x + y
## [1] 2 10 5
# checking and deleting objects
ls() #looks for all objects registered on the environment
## [1] "x" "y"
# rm() remove objects from the environment
rm(list = ls()) #removes a list of objects, which is set to all registered objects
Matrices
# creating a matrix
# the function matrix() takes at least three inputs: data, number of rows, and number of columns
x <- matrix(data = c(1, 2, 3, 4), nrow = 2, ncol = 2)
x
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
# naming the arguments data =, nrow =, and ncol = is not necessary
# R will match unnamed arguments by position, following the order defined in the function's documentation
x <- matrix(c(1, 2, 3, 4), 2, 2)
x
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
# the matrix() function will fill columns first, but the argument byrow = TRUE changes it to rows first
# it is also not necessary to assign the result of the function to an object
# this will only yield the result of the function and will not store the result for future reference
matrix(c(1, 2, 3, 4), 2, 2, byrow = TRUE)
## [,1] [,2]
## [1,] 1 2
## [2,] 3 4
# calculations with matrices
# it is possible to compute several calculations, including square root and power
sqrt(x)
## [,1] [,2]
## [1,] 1.000000 1.732051
## [2,] 1.414214 2.000000
x^2
## [,1] [,2]
## [1,] 1 9
## [2,] 4 16
Correlation of vectors
# the function rnorm() generates a vector of random normal variables, with the first argument giving the sample size
# by default, rnorm() creates random variables with mean 0 and standard deviation 1
# as it is random, every time we call it we will get a different answer
x <- rnorm(50)
# it is possible to set other arguments, including mean and standard deviation
y <- x + rnorm(50, mean = 50, sd = 0.1)
# the function cor() will yield the correlation between the vectors
cor(x, y)
## [1] 0.9961915
# to make sure the result will always be the same, it is possible to set a seed with the function set.seed()
# this makes possible the reproduction of results
set.seed(1303)
rnorm(50)
## [1] -1.1439763145 1.3421293656 2.1853904757 0.5363925179 0.0631929665
## [6] 0.5022344825 -0.0004167247 0.5658198405 -0.5725226890 -1.1102250073
## [11] -0.0486871234 -0.6956562176 0.8289174803 0.2066528551 -0.2356745091
## [16] -0.5563104914 -0.3647543571 0.8623550343 -0.6307715354 0.3136021252
## [21] -0.9314953177 0.8238676185 0.5233707021 0.7069214120 0.4202043256
## [26] -0.2690521547 -1.5103172999 -0.6902124766 -0.1434719524 -1.0135274099
## [31] 1.5732737361 0.0127465055 0.8726470499 0.4220661905 -0.0188157917
## [36] 2.6157489689 -0.6931401748 -0.2663217810 -0.7206364412 1.3677342065
## [41] 0.2640073322 0.6321868074 -1.3306509858 0.0268888182 1.0406363208
## [46] 1.3120237985 -0.0300020767 -0.2500257125 0.0234144857 1.6598706557
Other calculations
# we can also compute the mean, variance, square root of variance (standard deviation), and directly the standard deviation
set.seed(3)
y <- rnorm(100)
mean(y)
## [1] 0.01103557
var(y)
## [1] 0.7328675
sqrt(var(y))
## [1] 0.8560768
sd(y)
## [1] 0.8560768
The plot() function
x <- rnorm(100)
y <- rnorm(100)
plot(x, y)
# we can set x-axis and y-axis labels with xlab = and ylab =
# we can also set the title with main =
plot(x, y,
xlab = 'this is the x-axis',
ylab = 'this is the y-axis',
main = 'Plot of X vs Y')
The seq() function
# we can use seq() to generate a vector between two numbers, imputing numbers equally separated
x <- seq(1, 10)
x
## [1] 1 2 3 4 5 6 7 8 9 10
# we can add an argument to specify the length of the vector
# if length is not specified, the vector will contain values separated by 1
x <- seq(-pi, pi, length = 50)
x
## [1] -3.14159265 -3.01336438 -2.88513611 -2.75690784 -2.62867957 -2.50045130
## [7] -2.37222302 -2.24399475 -2.11576648 -1.98753821 -1.85930994 -1.73108167
## [13] -1.60285339 -1.47462512 -1.34639685 -1.21816858 -1.08994031 -0.96171204
## [19] -0.83348377 -0.70525549 -0.57702722 -0.44879895 -0.32057068 -0.19234241
## [25] -0.06411414 0.06411414 0.19234241 0.32057068 0.44879895 0.57702722
## [31] 0.70525549 0.83348377 0.96171204 1.08994031 1.21816858 1.34639685
## [37] 1.47462512 1.60285339 1.73108167 1.85930994 1.98753821 2.11576648
## [43] 2.24399475 2.37222302 2.50045130 2.62867957 2.75690784 2.88513611
## [49] 3.01336438 3.14159265
The contour() function
# generates a contour plot, which represents three-dimensional data
# it is similar to a topographical map
# it takes three arguments: a vector of x values; a vector of y values; and a matrix of z values, one for every pair of (x, y)
y <- x
f <- outer(x, y, function(x, y) cos(y) / (1 + x^2))
contour(x, y, f)
contour(x, y, f, nlevels = 45, add = TRUE)
fa <- (f - t(f)) / 2
contour(x, y, fa, nlevels = 15)
The image() and persp() functions
# image() works similarly to contour(), but adds colours to it, creating a heatmap
# the colours will ultimately depend on the z value
# alternatively, persp() produces three-dimensional plots
# it uses the arguments theta and phi to control the viewing angles
image(x, y, fa)
persp(x, y, fa)
persp(x, y, fa, theta = 30)
persp(x, y, fa, theta = 30, phi = 20)
persp(x, y, fa, theta = 30, phi = 70)
persp(x, y, fa, theta = 30, phi = 40)
Indexing a matrix
# we can select parts of the matrix for inspection purposes
# the first number inside the brackets will always represent row, and the second the column
A <- matrix(1:16, 4, 4)
A
## [,1] [,2] [,3] [,4]
## [1,] 1 5 9 13
## [2,] 2 6 10 14
## [3,] 3 7 11 15
## [4,] 4 8 12 16
# a single element
A[2, 3]
## [1] 10
# a collection of elements
A[c(1, 3), c(2, 4)]
## [,1] [,2]
## [1,] 5 13
## [2,] 7 15
# a partition of the matrix
A[1:3, 2:4]
## [,1] [,2] [,3]
## [1,] 5 9 13
## [2,] 6 10 14
## [3,] 7 11 15
# it is also possible to leave the row or column argument empty
# this will select all the rows or all the columns
A[1:2, ]
## [,1] [,2] [,3] [,4]
## [1,] 1 5 9 13
## [2,] 2 6 10 14
A[, 1:2]
## [,1] [,2]
## [1,] 1 5
## [2,] 2 6
## [3,] 3 7
## [4,] 4 8
# the use of the negative symbol (-) will tell R to include all rows and columns except those indicated
A[-c(1, 3), ]
## [,1] [,2] [,3] [,4]
## [1,] 2 6 10 14
## [2,] 4 8 12 16
A[-c(1, 3), -c(1, 3, 4)]
## [1] 6 8
# the dim() function yields the number of rows and columns
dim(A)
## [1] 4 4
Reading a table
# loading the book's R package
library(ISLR2)
# reading a dataset
# it is necessary to provide the correct file path to the raw data file
# Auto <- read.table("Auto.data", header = TRUE, na.strings = "?", stringsAsFactors = TRUE)
# viewing the dataset in a separate window
View(Auto)
# viewing the first few rows of the data
head(Auto)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
A few techniques for plotting
# plotting cylinders and mpg
# as we are plotting from a dataframe, it is necessary to use $ to specify variables
plot(Auto$cylinders, Auto$mpg)
# it is possible to tell R to use Auto as a reference with the attach() function
# this makes it possible to refer to the variables by name inside functions
attach(Auto)
plot(cylinders, mpg)
Boxplots
# R will plot boxplots by default if the variable on the x-axis is qualitative
# when the Auto dataframe is loaded, R will understand cylinders as a numeric variable
# because it is represented by numbers
# however, in reality cylinders can be understood with different categories
# thus, we can tell R to treat it as a qualitative variable instead with as.factor()
cylinders <- as.factor(cylinders)
# plotting again
plot(cylinders, mpg)
# changing the colour
plot(cylinders, mpg, col = 'red')
# adjusting the size of each 'box' to the amount of observations in that category
plot(cylinders, mpg, col = 'red', varwidth = TRUE)
# changing the plot to horizontal
plot(cylinders, mpg, col = 'red', varwidth = TRUE, horizontal = TRUE)
# adding labels
plot(cylinders, mpg, col = 'red', varwidth = TRUE, xlab = 'cylinders', ylab = 'MPG')
Histograms
# instead of plot(), we use hist()
hist(mpg)
# setting col = 2 will also turn the colour red
hist(mpg, col = 2)
# adjusting the size of the bins
hist(mpg, col = 2, breaks = 15)
Scatterplots
# to plot a matrix of scatterplots, we use the pairs() function
# it will generate pairs of scatterplots for all the variables contained in the dataframe
pairs(Auto)
# we can also set which variables we want to see
pairs( ~ mpg + displacement + horsepower + weight + acceleration,
data = Auto)
Obs|\(X_{1}\)|\(X_{2}\)|\(X_{3}\)|\(Y\)
1 | 0 | 3 | 0 | Red
2 | 2 | 0 | 0 | Red
3 | 0 | 1 | 3 | Red
4 | 0 | 1 | 2 | Green
5 |-1 | 0 | 1 | Green
6 | 1 | 1 | 1 | Red
Suppose we wish to use this dataset to make a prediction for \(Y\) when \(X_{1} = X_{2} = X_{3} = 0\) using \(K\)-nearest neighbours.
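A short sketch of how one might answer this by hand in R (the table above is re-entered as a data frame):
# compute the Euclidean distance between each observation and the test point (0, 0, 0)
obs <- data.frame(X1 = c(0, 2, 0, 0, -1, 1),
                  X2 = c(3, 0, 1, 1, 0, 1),
                  X3 = c(0, 0, 3, 2, 1, 1),
                  Y  = c("Red", "Red", "Red", "Green", "Green", "Red"))
obs$dist <- sqrt(obs$X1^2 + obs$X2^2 + obs$X3^2)
obs[order(obs$dist), ]
# with K = 1 the single nearest point is observation 5, so the prediction is Green;
# with K = 3 the nearest three are observations 5, 6, and 2 (Green, Red, Red), so the prediction is Red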
college <- College
# rownames(college) <- college[, 1]
# college <- college[, -1]
head(college)
## Private Apps Accept Enroll Top10perc Top25perc
## Abilene Christian University Yes 1660 1232 721 23 52
## Adelphi University Yes 2186 1924 512 16 29
## Adrian College Yes 1428 1097 336 22 50
## Agnes Scott College Yes 417 349 137 60 89
## Alaska Pacific University Yes 193 146 55 16 44
## Albertson College Yes 587 479 158 38 62
## F.Undergrad P.Undergrad Outstate Room.Board Books
## Abilene Christian University 2885 537 7440 3300 450
## Adelphi University 2683 1227 12280 6450 750
## Adrian College 1036 99 11250 3750 400
## Agnes Scott College 510 63 12960 5450 450
## Alaska Pacific University 249 869 7560 4120 800
## Albertson College 678 41 13500 3335 500
## Personal PhD Terminal S.F.Ratio perc.alumni Expend
## Abilene Christian University 2200 70 78 18.1 12 7041
## Adelphi University 1500 29 30 12.2 16 10527
## Adrian College 1165 53 66 12.9 30 8735
## Agnes Scott College 875 92 97 7.7 37 19016
## Alaska Pacific University 1500 76 72 11.9 2 10922
## Albertson College 675 67 73 9.4 11 9727
## Grad.Rate
## Abilene Christian University 60
## Adelphi University 56
## Adrian College 54
## Agnes Scott College 59
## Alaska Pacific University 15
## Albertson College 55
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
summary(college)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
pairs(~ Private + Apps + Accept + Enroll + Top10perc + Top25perc + F.Undergrad +
        P.Undergrad + Outstate + Room.Board, college)
plot(college$Private, college$Outstate)
college <- college %>%
  mutate(Elite = case_when(Top10perc >= 50 ~ 'Yes',
                           Top10perc < 50 ~ 'No')) %>%
  mutate(Elite = as.factor(Elite)); summary(college)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate Elite
## Min. : 10.00 No :694
## 1st Qu.: 53.00 Yes: 83
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
plot(college$Elite, college$Outstate)
par(mfrow = c(2, 2))
hist(college$Apps, breaks = 100)
hist(college$Accept, breaks = 100)
hist(college$Enroll, breaks = 100)
hist(college$PhD, breaks = 100)
auto <- Auto
summary(auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
## Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
## Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
## Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
## (Other) :365
range(auto$mpg)
## [1] 9.0 46.6
range(auto$cylinders)
## [1] 3 8
range(auto$displacement)
## [1] 68 455
range(auto$horsepower)
## [1] 46 230
range(auto$weight)
## [1] 1613 5140
range(auto$acceleration)
## [1] 8.0 24.8
range(auto$year)
## [1] 70 82
range(auto$origin)
## [1] 1 3
mean(auto$mpg)
## [1] 23.44592
sd(auto$mpg)
## [1] 7.805007
mean(auto$displacement)
## [1] 194.412
sd(auto$displacement)
## [1] 104.644
mean(auto$horsepower)
## [1] 104.4694
sd(auto$horsepower)
## [1] 38.49116
mean(auto$weight)
## [1] 2977.584
sd(auto$weight)
## [1] 849.4026
mean(auto$acceleration)
## [1] 15.54133
sd(auto$acceleration)
## [1] 2.758864
auto2 <- auto[-c(10:85), ]; head(auto2)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
mean(auto2$mpg)
## [1] 24.40443
sd(auto2$mpg)
## [1] 7.867283
mean(auto2$displacement)
## [1] 187.2405
sd(auto2$displacement)
## [1] 99.67837
mean(auto2$horsepower)
## [1] 100.7215
sd(auto2$horsepower)
## [1] 35.70885
mean(auto2$weight)
## [1] 2935.972
sd(auto2$weight)
## [1] 811.3002
mean(auto2$acceleration)
## [1] 15.7269
sd(auto2$acceleration)
## [1] 2.693721
pairs(~ auto$mpg + auto$cylinders + auto$displacement + auto$horsepower +
        auto$weight + auto$acceleration + auto$year + auto$origin)
boston <- Boston
summary(boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio lstat
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 1.73
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.: 6.95
## Median : 5.000 Median :330.0 Median :19.05 Median :11.36
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :12.65
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:16.95
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :37.97
## medv
## Min. : 5.00
## 1st Qu.:17.02
## Median :21.20
## Mean :22.53
## 3rd Qu.:25.00
## Max. :50.00
pairs(~ boston$crim + boston$zn + boston$indus + boston$chas + boston$nox +
        boston$rm + boston$age + boston$dis + boston$rad + boston$tax +
        boston$ptratio + boston$lstat + boston$medv)
range(boston$rad)
## [1] 1 24
range(boston$age)
## [1] 2.9 100.0
range(boston$tax)
## [1] 187 711
range(boston$ptratio)
## [1] 12.6 22.0
library(dplyr)
boston %>%
  filter(chas == 1) %>%
  count()
## n
## 1 35
median(boston$ptratio)
## [1] 19.05
library(dplyr)
boston %>%
  arrange(medv) %>%
  head(5)
## crim zn indus chas nox rm age dis rad tax ptratio lstat medv
## 1 38.35180 0 18.1 0 0.693 5.453 100.0 1.4896 24 666 20.2 30.59 5.0
## 2 67.92080 0 18.1 0 0.693 5.683 100.0 1.4254 24 666 20.2 22.98 5.0
## 3 25.04610 0 18.1 0 0.693 5.987 100.0 1.5888 24 666 20.2 26.77 5.6
## 4 9.91655 0 18.1 0 0.693 5.852 77.8 1.5004 24 666 20.2 29.97 6.3
## 5 45.74610 0 18.1 0 0.693 4.519 100.0 1.6582 24 666 20.2 36.98 7.0
range(boston$crim)
## [1] 0.00632 88.97620
range(boston$indus)
## [1] 0.46 27.74
range(boston$nox)
## [1] 0.385 0.871
range(boston$rm)
## [1] 3.561 8.780
range(boston$age)
## [1] 2.9 100.0
range(boston$dis)
## [1] 1.1296 12.1265
range(boston$rad)
## [1] 1 24
range(boston$tax)
## [1] 187 711
range(boston$ptratio)
## [1] 12.6 22.0
range(boston$lstat)
## [1] 1.73 37.97
library(dplyr)
boston %>%
  filter(rm > 8) %>%
  count()
## n
## 1 13
Linear regression has been widely used for many years. Even though it is simpler than newer models, the concepts behind linear regression carry over to them. It is therefore really important to understand linear regression before going further into statistical learning.
When we approach a problem, there are many statistical questions that need to be answered before a model is developed:
Often, a linear regression can answer all of these questions.
Simple linear regression is the most straightforward form of linear regression. It is a linear model with a single predictor variable:
\(Y \approx \beta_{0} + \beta_{1}X\)
Here \(\beta_{0}\) and \(\beta_{1}\) are two unknown constants: the former is called the intercept, and the latter the slope. After fitting the model to the training data, we are left with the following equation:
\(\hat{y} = \hat{\beta}_{0} + \hat{\beta}_{1}x\)
In order to fit a linear regression model, it is necessary to estimate the coefficients \(\beta_{0}\) and \(\beta_{1}\). To do so, we take the training observations \((x_{1}, y_{1}), ..., (x_{n}, y_{n})\), where each pair is a different observation of the dataset. Then, we calculate an intercept and a slope that generate the line closest to all data points. The most common method for doing this is least squares.
The least squares method uses the \(residual\) \(sum\) \(of\) \(squares\) (RSS) to determine the intercept and the slope. In essence, it squares the difference between \(y_{i}\) and \(\hat{y}_{i}\), where \(y_{i}\) is the \(i\)th observed response value and \(\hat{y}_{i}\) is the response predicted by the model for the \(i\)th observation. Thus, the squared residual for the \(i\)th observation is \(e_{i}^{2} = (y_{i} - \hat{y}_{i})^{2}\).
Finally, the RSS is the sum of all the squared residuals:
\(RSS = e_{1}^{2} + ... + e_{n}^{2}\)
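As a sketch, least squares fitting and the RSS can be checked directly in R (mpg ~ horsepower from the Auto data loaded earlier is just an illustrative choice):
# fit a simple linear regression by least squares and compute the RSS by hand
fit <- lm(mpg ~ horsepower, data = Auto)
coef(fit)                # the estimated intercept and slope
sum(residuals(fit)^2)    # RSS: the sum of the squared residuals e_i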
To understand the linear model more generally, consider the following formula:
\(Y = \beta_{0} + \beta_{1}X_{1} + ... + \beta_{p}X_{p} + \epsilon\)
Here \(\beta_{0}\) is the intercept, the expected value of \(Y\) when all the predictors equal zero; \(\beta_{1}, ..., \beta_{p}\) are the coefficients of the predictors \(X_{1}, ..., X_{p}\); and \(\epsilon\) is the irreducible error. This formula covers all linear models, and the simple linear model is the special case with only two coefficients, \(\beta_{0}\) and \(\beta_{1}\).
The result of this equation is the \(population\) \(regression\) \(line\), the best linear approximation to the true relationship between \(X\) and \(Y\); like the coefficients, it is unknown in real applications. When we fit a linear model to a given dataset, using the \(RSS\) to estimate the coefficients, our model's regression line is an approximation of the population regression line. If we had access to many datasets, and could thus generate many regression lines, the average of these lines would, on average, coincide with the population regression line.
This concept comes from the idea that when we don't have access to the whole population of \(Y\), but only to a sample, we can still estimate the population mean \(\mu\) of \(Y\). The sample mean \(\hat{\mu}\) is unbiased, in the sense that we expect \(\hat{\mu}\) to be a good approximation of \(\mu\). Because \(\hat{\mu}\) is an estimate of \(\mu\), it can overestimate or underestimate \(\mu\); however, if we had many datasets, the average of all the \(\hat{\mu}\) values found in these samples would be a very close approximation of \(\mu\).
It is possible to assess how far a single \(\hat{\mu}\) typically is from the real \(\mu\) using the \(standard\) \(error\) of \(\hat{\mu}\), \(SE(\hat{\mu})\), given by the following formula:
\(Var(\hat{\mu}) = SE(\hat{\mu})^{2} = \frac{\sigma^{2}}{n}\)
Where \(\sigma\) is the standard deviation of each observation \(y_{i}\) of \(Y\); the formula holds provided the \(n\) observations are uncorrelated. Essentially, it tells us the average amount by which \(\hat{\mu}\) differs from \(\mu\), and this deviation shrinks as the number of observations \(n\) increases.
For the coefficient estimates, similar formulas give the respective standard errors:
\(SE(\hat{\beta_{0}})^{2} = \sigma^{2}[\frac{1}{n} + \frac{\overline{x}^{2}}{\sum_{i = 1}^{n}(x_{i} - \overline{x})^{2}}]\)
\(SE(\hat{\beta_{1}})^{2} = \frac{\sigma^{2}}{\sum_{i = 1}^{n}(x_{i} - \overline{x})^{2}}\)
Where \(\sigma^{2} = Var(\epsilon)\). For these formulas to be strictly valid, the errors \(\epsilon_{i}\) must have a common variance \(\sigma^{2}\) and be uncorrelated. Even when this cannot be guaranteed, the estimates may still be useful. Looking at the formulas, \(SE(\hat{\beta}_{1})^{2}\) is smaller when the values of \(x_{i}\) are more spread out. Additionally, \(SE(\hat{\beta}_{0})^{2}\) would equal \(\sigma^{2}/n\), the same as \(SE(\hat{\mu})^{2}\), if \(\overline{x}\) were zero, in which case \(\hat{\beta}_{0}\) would equal \(\overline{y}\). Finally, \(\sigma^{2}\) is generally unknown, but it can be estimated with the \(residual\) \(standard\) \(error\), \(RSE = \sqrt{RSS/(n - 2)}\).
The standard errors lead to the notion of a confidence interval. A 95% confidence interval is a range of values such that, with 95% probability, the range contains the true unknown value of the parameter. For our coefficients \(\hat{\beta}_{0}\) and \(\hat{\beta}_{1}\), the approximate 95% confidence interval is:
\(\hat{\beta}_{i} \pm 2 \cdot SE(\hat{\beta}_{i})\)
Another important use of the standard error is to perform hypothesis tests. Usually, the null hypothesis (\(H_{0}\)) states that \(X\) has no relationship with \(Y\) (\(\beta_{1} = 0\)), and the alternative hypothesis (\(H_{a}\)) states that \(X\) has a relationship with \(Y\) (\(\beta_{1} \neq 0\)). To test the hypothesis, we try to reject \(H_{0}\): we determine whether our estimate \(\hat{\beta}_{1}\) is sufficiently far from zero for us to be confident that \(\beta_{1}\) is non-zero. How far from zero \(\hat{\beta}_{1}\) has to be depends on \(SE(\hat{\beta}_{1})\): if the standard error is small, even small values of \(\hat{\beta}_{1}\) are relevant; if it is large, \(\hat{\beta}_{1}\) needs to be large too. To measure this, we compute a \(t\)-statistic, given by:
\(t = \frac{\hat{\beta}_{1} - 0}{SE(\hat{\beta_{1}})}\)
This measures how many standard deviations \(\hat{\beta}_{1}\) is away from zero. From this result, we can compute the \(p\)-value. If the \(p\)-value is small enough, we conclude that it is highly unlikely that the observed association is due to chance, and we reject \(H_{0}\), stating that there is a relationship between \(X\) and \(Y\).
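In R, the standard errors, \(t\)-statistics, \(p\)-values, and confidence intervals are all available from the fitted model (a sketch reusing the fit from the earlier example):
# coefficient table: estimate, standard error, t-statistic, and p-value for each coefficient
summary(fit)$coefficients
# 95% confidence intervals, approximately beta hat +/- 2 * SE(beta hat)
confint(fit, level = 0.95)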
Once we are able to reject the null hypothesis, it is ideal to understand how good the model explains the data. For simple linear regressions, two methods stand out: Residual Standard Error (\(RSE\)), and \(R^{2}\).
\(RSE = \sqrt{\frac{1}{n - 2}RSS} = \sqrt{\frac{1}{n - 2}\sum_{i = 1}^{n}(y_{i} - \hat{y_{i}})^{2}}\)
\(R^{2} = \frac{TSS - RSS}{TSS}\), where \(TSS = \sum_{i = 1}^{n}(y_{i} - \overline{y})^{2}\) is the total sum of squares.
In the simple linear regression setting, \(R^{2}\) equals \(r^{2}\), the square of the correlation between \(X\) and \(Y\):
\(r = Cor(X,Y) = \frac{\sum_{i = 1}^{n}(x_{i} - \overline{x})(y_{i} - \overline{y})}{\sqrt{\sum_{i = 1}^{n}(x_{i} - \overline{x})^{2}}\sqrt{\sum_{i = 1}^{n}(y_{i} - \overline{y})^{2}}}\)
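Both quantities are reported by summary() in R (again reusing the earlier fit):
summary(fit)$sigma       # residual standard error (RSE)
summary(fit)$r.squared   # R^2, the proportion of variability explained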
Multiple linear regression takes into account multiple predictors, and it is the more commonly used approach in practice. It uses the following formula:
\(Y = \beta_{0} + \beta_{1}X_{1} + ... + \beta_{p}X_{p} + \epsilon\)
Where each \(\beta_{j}\) is defined as the average effect on \(Y\) of a single unit increase in the predictor \(X_{j}\), given that all the other predictors remain unchanged.
The method for estimating the coefficients is the same as in simple linear regression: the least squares approach. Thus, we choose coefficients \(\hat{\beta}_{0}, \hat{\beta}_{1}, ..., \hat{\beta}_{p}\) that minimize the sum of the squared residuals, using the following formula:
\(RSS = \sum_{i = 1}^{n}(y_{i} - \hat{y}_{i})^{2} = \sum_{i = 1}^{n}(y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{i1} - ... - \hat{\beta}_{p}x_{ip})^{2}\)
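A minimal sketch of a multiple linear regression in R (the choice of predictors from the Auto data is arbitrary here):
# least squares fit with several predictors at once
fit_multi <- lm(mpg ~ horsepower + weight + year, data = Auto)
summary(fit_multi)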
One issue that comes up when comparing simple linear regressions with a multiple linear regression is that a predictor may appear to correlate with the outcome on its own, yet its effect disappears once other predictors are included. The reason is that a simple linear regression may be ignoring a relationship that only shows up when other predictors are considered: the variable that seemed to perform well does not correlate with the outcome directly, but with another predictor that does. To check this, we can inspect a correlation matrix of the outcome and the predictors.
A common example of this situation is the supposed correlation between ice cream consumption and shark attacks. The explanation lies in another predictor, temperature: as temperature rises, ice cream consumption increases and more people go to the beach, which increases the number of shark attacks.
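To check for such hidden correlations in practice, a correlation matrix can be computed directly (a sketch on the Auto data used earlier; the chosen columns are illustrative):
# correlation matrix of the response and a few numeric predictors
round(cor(Auto[, c("mpg", "horsepower", "weight", "year")]), 2)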
Performing a multiple linear regression often prompts four questions. 1: Is there a relationship between the response and the predictors? To answer this, we test the null hypothesis that all the coefficients are zero using the \(F\)-statistic:
\(F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)}\)
If the null hypothesis \(H_{0}\) is correct, the value of \(F\) will be close to one; if the alternative \(H_{a}\) is correct, \(F\) will be greater than one. How much larger than one \(F\) has to be depends on the number of observations \(n\) and the number of predictors \(p\): if \(n\) is large, an \(F\) only slightly larger than one may suffice, but if \(n\) is small, \(F\) has to be much larger than one. In the end, the \(F\)-statistic is used together with its \(p\)-value to decide whether to reject the null hypothesis. This is especially useful when the number of predictors \(p\) is large, since some individual \(p\)-values will be small just by chance, which would wrongly suggest a relationship between a predictor and the response.
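In R, the \(F\)-statistic and its degrees of freedom are part of the model summary (reusing the multiple regression sketched earlier):
summary(fit_multi)$fstatistic   # F value, numerator df (p), and denominator df (n - p - 1)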
2: Deciding on important variables. After checking the \(F\)-statistic and the \(p\)-values, the next step is to select which of the variables are the best to include in the model. The ideal approach would be to test every single possibility, but this becomes impractical as the number of predictors grows, since the total number of model fits equals \(2^{p}\): with \(p = 2\) we would have 4 models, and with \(p = 30\) we would have 1,073,741,824 models. Thus, we turn to what is called model selection, and there are three classical ways to do it, listed next (a rough sketch in R follows the list).
The first is forward selection, where we start with just the intercept and add predictors one by one, starting with whichever has the smallest \(p\)-value; the second is backward selection, where we start with all predictors and remove them one by one, starting with whichever has the highest \(p\)-value; the third is mixed selection, a combination of the two approaches that also starts with just the intercept.
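A rough sketch of forward and backward selection with R's step() function; note that step() ranks models by AIC rather than by raw \(p\)-values, so it only approximates the procedures described above:
# forward selection: start from the intercept-only model and add predictors
null_model <- lm(mpg ~ 1, data = Auto)
full_model <- lm(mpg ~ horsepower + weight + year + acceleration, data = Auto)
step(null_model, scope = formula(full_model), direction = "forward")
# backward selection: start from the full model and drop predictors
step(full_model, direction = "backward")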
3: The model fit. There are two main ways to assess the usefulness of a linear model, \(RSE\) and \(R^{2}\), the same as for the simple linear model. Essentially, we aim for an \(R^{2}\) close to 1, as it measures how much of the variability in the data is explained by the model, and for a low \(RSE\), which estimates the average amount by which the response deviates from the true regression line.
With these measurements, we can evaluate whether a predictor adds to the model. For example, if two predictors together have an \(R^{2}\) of 0.850 and the addition of a third predictor only raises this to 0.851, the new predictor is probably not important for the model. Similar conclusions can be drawn from the \(p\)-value of the added variable and from its effect on the \(RSE\).
Lastly, it is important to plot the data, as some patterns are only spotted visually.
4: Predictions. When predicting unseen data, there is considerable uncertainty about the actual value of \(Y\). First, the coefficient estimates \(\hat{\beta}_{1}, ..., \hat{\beta}_{p}\) are only estimates of the true coefficients \(\beta_{1}, ..., \beta_{p}\); this is the reducible error that has to be accounted for. Additionally, a linear model is only an approximation of reality, so the bias of the model also needs to be accounted for. Lastly, there is the irreducible error, \(\epsilon\). Hence, we report an interval where the true value of \(Y\) is expected to lie, which will be wider than, but centered at the same point as, the confidence interval. This is called the prediction interval.
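Both intervals can be obtained from predict() in R (a sketch reusing the multiple regression above; the new observation's values are made up):
new_obs <- data.frame(horsepower = 100, weight = 3000, year = 76)
predict(fit_multi, new_obs, interval = "confidence")   # interval for the average response
predict(fit_multi, new_obs, interval = "prediction")   # wider interval for an individual response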