Statistical learning is a vast set of tools for understanding data. Supervised learning: build a statistical model relating inputs to an output, e.g. the linear model \(y = ax + b\). Unsupervised learning: understand the structure of data with no output variable.
Wages for a group of males in the U.S. Atlantic region. Examine the association of age, education, and year with wage.
# Wage data (datasets come from the ISLR package; plotting uses the tidyverse)
library(ISLR)
library(tidyverse)
ggplot(Wage, aes(age, wage)) +
geom_jitter(color = "grey50", alpha = 1/2) +
geom_smooth(se = F)
## `geom_smooth()` using method = 'gam'
ggplot(Wage, aes(year, wage)) +
geom_jitter(color = "grey50", alpha = 1/2) +
geom_smooth(method = "lm")
ggplot(Wage, aes(education, wage)) +
geom_boxplot(aes(fill = education), show.legend = F)
### Stock Market Data
# Smarket data
ggplot(Smarket %>% gather(day, change, c(Today, Lag1, Lag2, Lag3)), aes(Direction, change)) +
geom_boxplot(aes(fill = Direction), show.legend = F) +
facet_wrap(~ day)
No obvious relationship between previous stock market movements and today’s stock movement.
No output variable here, so this is an unsupervised (clustering) problem. The NCI60 dataset contains 6,830 gene expression measurements for 64 cancer cell lines. Extracting the principal components summarizes the 6,830 measurements into two dimensions \(Z_1\) and \(Z_2\).
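A minimal sketch of that dimension reduction via `prcomp`, assuming the ISLR package's `NCI60` list (with `$data` and `$labs`); scaling the genes is our choice here, not necessarily the book's:
library(ISLR)
# NCI60$data is a 64 x 6,830 matrix: one row per cell line, one column per gene
pr_out <- prcomp(NCI60$data, scale. = TRUE)
# The first two principal components Z1 and Z2 summarize the 6,830 measurements
plot(pr_out$x[, 1:2], xlab = "Z1", ylab = "Z2",
     col = as.integer(factor(NCI60$labs)), pch = 19)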
\(n\) is the number of distinct observations, or data points. \(p\) is the number of variables available for prediction. Example: Wage contains \(n = 3,000\) observations and \(p = 12\) variables (year, age, sex, and more).
\(x_{ij}\) is the \(j\)th variable for the \(i\)th observation, where \(i = 1, 2, ..., n\) and \(j = 1, 2, ..., p\). \(\mathbf{X}\) denotes the \(n \times p\) matrix whose \((i,j)\)th element is \(x_{ij}\). \(x_i\) denotes the \(i\)th row of \(\mathbf{X}\): a vector of length \(p\) containing the variable measurements for the \(i\)th observation. \(\mathbf{x}_j\) denotes the \(j\)th column of \(\mathbf{X}\): a vector of length \(n\) containing all observations of the \(j\)th variable. We also indicate the dimension of a particular object: \(a \in \Bbb R\) means \(a\) is a real number; \(a \in \Bbb R^n\) means \(a\) is a vector of length \(n\). Writing \(\mathbf{A} \in \Bbb R^{r \times d} \text{ and } \mathbf{B} \in \Bbb R^{d \times s}\) is useful for multiplying matrices: \(\mathbf{AB} \in \Bbb R^{r \times s}\).
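A quick R illustration of those dimension rules; the sizes 3, 4, and 2 are arbitrary stand-ins for \(r\), \(d\), and \(s\):
# Dimensions must conform: (r x d) %*% (d x s) gives an (r x s) matrix
A <- matrix(rnorm(3 * 4), nrow = 3, ncol = 4)  # A in R^{3 x 4}
B <- matrix(rnorm(4 * 2), nrow = 4, ncol = 2)  # B in R^{4 x 2}
dim(A %*% B)                                   # 3 2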
Input variables \(X = (X_1, X_2, ..., X_p)\) (sometimes called predictors, independent variables, or features) are associated with an output variable \(Y\) (often called the response or dependent variable). The part of \(Y\) not explained by that relationship is the error \(\epsilon\). Very generally, this relationship can be described as: \[Y = f(X) + \epsilon\]
\[\hat{Y} = \hat{f}(X)\] \(\hat{f}\) can be treated as a black box, as long as \(\hat{Y}\) provides good predictions for \(Y\). Reducible error: error due to the inaccuracy of \(\hat{f}\) as an estimate of \(f\). We can work on this. Irreducible error: the error term \(\epsilon\), which is part of \(Y\) but cannot be predicted from \(X\). We cannot reduce this. \(\epsilon\) captures the effect of all variables that are not measured, so no choice of \(\hat{f}\) can predict it. Given an estimate \(\hat{f}\) and a set of predictors, the expected squared prediction error decomposes as: \[E(Y - \hat{Y})^2 = \underbrace{[f(X) - \hat{f}(X)]^2}_\text{Reducible} + \underbrace{\text{Var}(\epsilon)}_\text{Irreducible}\] where \(E(Y - \hat{Y})^2\) is the average, or expected value, of the squared difference between the predicted and actual value of \(Y\).
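A small simulation can make the decomposition concrete. The true \(f\), the imperfect \(\hat{f}\), and the 0.5 noise level below are all made up for illustration:
set.seed(1)
f <- function(x) 2 + 3 * x            # the true (usually unknown) f
x <- runif(1e5)
eps_sd <- 0.5                         # irreducible noise: Var(epsilon) = 0.25
y <- f(x) + rnorm(1e5, sd = eps_sd)

f_hat <- function(x) 2.2 + 2.8 * x    # an imperfect estimate of f
mean((y - f_hat(x))^2)                # expected squared prediction error (approx.)
mean((f(x) - f_hat(x))^2) + eps_sd^2  # reducible part + Var(epsilon): nearly identical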
Sometimes we want to understand how changes in \(X\) affect \(Y\) (inference). Now we need to know the actual form of \(f\) in the relationship \(Y = f(X) + \epsilon\), rather than treating it as a black box. Some questions could be:

- Which predictors are associated with the response?
- What is the relationship between the response and each predictor?
- Can the relationship between \(Y\) and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?
Training Data: observations used to create a model. Our goal: apply a statistical learning method to find a function \(\hat{f}\) such that \(Y \approx \hat{f}(X)\) for any observation \((X,Y)\).
Parametric simply means reducing the problem of finding \(f\) to estimating a set of coefficients. The procedure follows two steps:

1. Make an assumption about the functional form of \(f\), e.g. that it is linear: \(f(X) = \beta_0 + \beta_1 X_1 + ... + \beta_p X_p\).
2. Use the training data to fit (train) the model, i.e. estimate the coefficients, for example by least squares.
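A concrete sketch of those two steps on the Wage data; the particular set of predictors here is just one illustrative choice:
library(ISLR)
# Step 1: assume f is linear in the chosen predictors
# Step 2: estimate the coefficients from the training data by least squares
fit <- lm(wage ~ age + year + education, data = Wage)
coef(fit)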
Non-parametric methods do not make assumptions about the functional form of \(f\). They can follow the data more closely, but require many more observations to obtain an accurate estimate of \(f\); a sketch follows below.
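As a non-parametric counterpart, a smoothing spline on the same data lets the data determine the shape of \(f\) rather than assuming it (a sketch, not the book's exact method):
library(ISLR)
# No functional form is assumed; smooth.spline chooses the roughness from the data (GCV)
fit_np <- smooth.spline(Wage$age, Wage$wage)
plot(Wage$age, Wage$wage, col = "grey70", pch = 16, xlab = "age", ylab = "wage")
lines(fit_np, col = "blue", lwd = 2)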
More flexibility :: potentially more accurate predictions. Easier interpretability :: easier inference. There is a trade-off between the two.
Quantitative :: Regression. Qualitative :: Classification.
No free lunch in statistics: no one method dominates all others over all possible data sets.

#### Measuring Quality of Fit

The most common regression measure of fit is the mean squared error (MSE), given by \[MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{f}(x_i))^2\] i.e. add up all the squared errors (squaring keeps them from cancelling), then normalize by the number of observations. We are really interested in the MSE on observations outside the training data set, the test data set. The more flexible a model, or the more degrees of freedom it has, the better it can fit the training set; simple linear regression, with only two degrees of freedom, is among the most restrictive. But a more flexible model can overfit the training set and perform poorly on a test set.
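A sketch of that pattern on simulated data, fitting polynomials of increasing flexibility and comparing training and test MSE (everything here is illustrative, not from the book):
set.seed(2)
n <- 200
x <- runif(n, -2, 2)
y <- sin(2 * x) + rnorm(n, sd = 0.4)   # the true f is nonlinear
train <- sample(n, n / 2)

mse <- function(fit, idx) mean((y[idx] - predict(fit, data.frame(x = x[idx])))^2)

for (d in c(1, 3, 10, 20)) {
  fit <- lm(y ~ poly(x, d), data = data.frame(x, y), subset = train)
  cat("degree", d,
      "train MSE", round(mse(fit, train), 3),    # keeps falling with flexibility
      "test MSE",  round(mse(fit, -train), 3), "\n")  # eventually rises again
}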
It is possible to show that the expected test MSE, for a given value \(x_0\), can always be decomposed into the sum of three fundamental quantities:

1. The variance of \(\hat{f}(x_0)\)
2. The squared bias of \(\hat{f}(x_0)\)
3. The variance of the error \(\epsilon\)

\[E(y_0 - \hat{f}(x_0))^2 = \text{Var}(\hat{f}(x_0)) + [\text{Bias}(\hat{f}(x_0))]^2 + \text{Var}(\epsilon)\] This means we need a model that simultaneously achieves low variance and low bias in \(\hat{f}(x_0)\).
Variance: the amount by which \(\hat{f}\) would change if it were estimated using a different training set. In general, more flexible models have higher variance. Bias: the error introduced by approximating a complicated relationship with a more restrictive model. More flexible models have less bias.
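A rough way to see the trade-off is to refit a rigid and a flexible model on many training sets drawn from the same process and watch how their predictions at a fixed \(x_0\) behave (purely illustrative; the true \(f\) and all settings are made up):
set.seed(3)
f <- function(x) sin(2 * x)
x0 <- 1                                  # the point where we evaluate f_hat
preds <- replicate(500, {
  x <- runif(100, -2, 2)
  y <- f(x) + rnorm(100, sd = 0.4)
  fit_lin  <- lm(y ~ x)                  # rigid model
  fit_poly <- lm(y ~ poly(x, 10))        # flexible model
  c(lin  = predict(fit_lin,  data.frame(x = x0)),
    poly = predict(fit_poly, data.frame(x = x0)))
})
apply(preds, 1, var)       # variance: higher for the flexible fit
rowMeans(preds) - f(x0)    # bias: larger for the rigid fit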
For classification, use the training error rate \[\frac{1}{n} \sum_{i = 1}^n I(y_i \neq \hat{y}_i)\] where \(I(y_i \neq \hat{y}_i)\) is 0 when observation \(i\) is classified correctly and 1 when it is not. Bayes classifier: the test error rate is minimized, on average, by assigning each observation to its most likely class given its predictor values, \[\text{Pr}(Y = j | X = x_0)\] In a binary classifier, \(\text{Pr}(Y = 1 | X = x_0) > 0.5\) corresponds to Class 1, and Class 2 otherwise. K-nearest neighbors (KNN): classifies \(x_0\) by a majority vote of the \(K\) training points nearest to it. The higher \(K\), the higher the bias; the lower \(K\), the higher the variance.
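A minimal KNN sketch using the `class` package on the Smarket data; the two lag predictors, the pre-2005 training split, and \(K = 3\) are just one reasonable choice, not necessarily the book's:
library(ISLR)
library(class)

train <- Smarket$Year < 2005
X <- as.matrix(Smarket[, c("Lag1", "Lag2")])

set.seed(4)
pred <- knn(train = X[train, ], test = X[!train, ],
            cl = Smarket$Direction[train], k = 3)

# Error rate on the held-out observations: the mean of I(y != y_hat)
mean(pred != Smarket$Direction[!train])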
x <- c(1,6,2); y <- c(1,4,3)
ls() # lists all variables in the environment
## [1] "x" "y"
rm(x, y) # remove variable
x <- 1:10; y <- x
f <- outer(x, y, function(x, y) cos(y) / (1 + x^2))  # matrix of function values over the x-y grid
contour(x, y, f)                                     # contour plot of f
fa <- (f - t(f)) / 2                                 # antisymmetric part of f
contour(x, y, fa, nlevels = 45)
image(x, y, fa)                                      # heatmap of fa
persp(x, y, fa)                                      # 3-D perspective plot of fa
summary(Auto)
## mpg cylinders displacement horsepower
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0
##
## weight acceleration year origin
## Min. :1613 Min. : 8.00 Min. :70.00 Min. :1.000
## 1st Qu.:2225 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000
## Median :2804 Median :15.50 Median :76.00 Median :1.000
## Mean :2978 Mean :15.54 Mean :75.98 Mean :1.577
## 3rd Qu.:3615 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000
## Max. :5140 Max. :24.80 Max. :82.00 Max. :3.000
##
## name
## amc matador : 5
## ford pinto : 5
## toyota corolla : 5
## amc gremlin : 4
## amc hornet : 4
## chevrolet chevette: 4
## (Other) :365
?College
summary(College)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
College %>%
ggplot(aes(Private, Outstate)) +
geom_boxplot()
College <- College %>%
mutate(Elite = Top10perc > 50)
ggplot(College, aes(Elite, Outstate)) + geom_boxplot()