II. Statistical Learning

2.1 What is Statistical Learning?

Input variables also referred to as: predictors, independent variables, features. Typically denoted X

Output variable often called the: response or dependent variable. Typically denoted Y

Assuming there is some relationship between our predictors and the response, we can write it as Y = f(X) + E

Where f is some fixed but unknown function of X1, … , Xp

And E is a random error term which is independent of X and has mean zero.
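A small simulation can make the model concrete. This is only a sketch: the choice of f (a sine curve), the noise level, and the sample size are assumptions made for illustration and do not come from the text.

```r
# Simulate Y = f(X) + E with a known f, purely for illustration.
set.seed(1)                        # arbitrary seed for reproducibility
n <- 200
f <- function(x) sin(2 * x)        # assumed "true" f (unknown in practice)
X <- runif(n, 0, 3)
E <- rnorm(n, mean = 0, sd = 0.3)  # error term: mean zero, drawn independently of X
Y <- f(X) + E

# Compare the variation in Y with the variation in its two components.
var_Y <- var(Y)
var_f <- var(f(X))
var_E <- var(E)
c(var_Y = var_Y, var_f = var_f, var_E = var_E)
```

In real data only X and Y are observed; f and E are never seen separately, which is why f has to be estimated.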

In this formula, f represents the systematic information that X provides about Y.

"In essence, statistical learning refers to a set of approaches for estimating f. In this chapter we outline some of the key theoretical concepts that arise in estimating f, as well as tools for evaluating the estimates obtained."

2.1.1 Why Estimate f?

2 main reasons: prediction and inference

Prediction

We are not typically concerned with the exact form of our estimate of f, provided that it yields accurate predictions of Y.

The accuracy of the predicted value of Y depends on two quantities: the reducible error and the irreducible error.

The reducible error can potentially be reduced by improving the accuracy of our estimate of f, that is, by using the most appropriate statistical learning technique to estimate f.

However, there is always some irreducible error because Y is also a function of E, which, by definition, cannot be predicted using X. Thus, no matter how well we estimate f, we cannot reduce the error introduced by E.

The quantity E may contain unmeasured variables that are useful in predicting Y: since we don’t measure them, f cannot use them for its prediction.

The quantity E may also contain unmeasurable variation.

The focus of this book is on techniques for estimating f with the aim of minimizing the reducible error.
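The split into reducible and irreducible error can be written out explicitly. In the sketch below, the estimate of f and the predictors X are treated as fixed, the prediction is \hat{Y} = \hat{f}(X), \mathrm{E} denotes expectation, and \epsilon is the error term that these notes write as E.

```latex
\mathrm{E}\,(Y - \hat{Y})^2
  = \mathrm{E}\,[\,f(X) + \epsilon - \hat{f}(X)\,]^2
  = \underbrace{[\,f(X) - \hat{f}(X)\,]^2}_{\text{reducible}}
  + \underbrace{\mathrm{Var}(\epsilon)}_{\text{irreducible}}
```

The cross term vanishes because \epsilon has mean zero, which is why the irreducible part does not depend on how well f is estimated.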

Inference

The goal is to understand the relationship between X and Y via our approximations for f. More specifically, to understand how Y changes as a function of X1, … , Xp.

Now we are interested in the exact form of our estimate of f.

We would thus be curious about:

  1. Which predictors are associated with the response?
  2. What is the relationship between the response and each predictor?
  3. Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?

Depending on whether our ultimate goal is prediction, inference, or a combination of the two, different methods for estimating f may be appropriate.

2.1.2 How Do We Estimate f?

Training data is our observed set of n different data points we use to train, or teach, our method how to estimate f.

Parametric Methods

Parametric methods involve a two-step model-based approach.

1.) First, we make an assumption about the functional form, or shape, of f. Assuming f(X) is linear:

  • f(X) = B0 + B1X1 + B2X2 + … + BpXp.

2.) After a model has been selected, we need a procedure that uses the training data to fit or train the model. In the case of the linear model, we need to estimate the parameters B0, B1, … , Bp. That is, we want to find values of these parameters such that

  • Y ≈ B0 + B1X1 + B2X2 + … + BpXp.
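The two steps can be sketched in R with least squares, one common way to fit the linear model. The simulated data and the "true" coefficients (1, 2, -3) are assumptions made for illustration.

```r
# Step 1: assume a linear form f(X) = B0 + B1*X1 + B2*X2.
# Step 2: estimate B0, B1, B2 from training data by least squares.
set.seed(2)
n  <- 100
X1 <- rnorm(n)
X2 <- rnorm(n)
Y  <- 1 + 2 * X1 - 3 * X2 + rnorm(n, sd = 0.5)  # assumed true coefficients

fit   <- lm(Y ~ X1 + X2)   # lm() fits the linear model by least squares
B_hat <- coef(fit)         # estimates of B0, B1, B2
B_hat
```

With this much data and little noise, the estimates should land close to the assumed coefficients. If the true f were not linear, the same code would still run but the fit would be poor.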

Parametric modeling thus reduces the problem of estimating f down to one of estimating a set of parameters.

"Assuming a parametric form for f simplifies the problem of estimating f because it is generally much easier to estimate a set of parameters, such as B0, B1, … , Bp in the linear model, than it is to fit an entirely arbitrary function f. The potential disadvantage of a parametric approach is that the model we choose will usually not match the true unknown form of f. If the chosen model is too far from the true f, then our estimate will be poor. We can try to address this problem by choosing flexible models that can fit many different possible functional forms for f. But in general, fitting a more flexible model requires estimating a greater number of parameters. These more complex models can lead to a phenomenon known as overfitting the data, which essentially means they follow the errors, or noise, too closely."

Non-Parametric Methods

“Non-parametric methods do not make explicit assumptions about the functional form of f. Instead they seek an estimate of f that gets as close to the data points as possible without being too rough or wiggly. Such approaches can have a major advantage over parametric approaches: by avoiding the assumption of a particular functional form for f, they have the potential to accurately fit a wider range of possible shapes for f. Any parametric approach brings with it the possibility that the functional form used to estimate f is very different from the true f, in which case the resulting model will not fit the data well. In contrast, non-parametric approaches completely avoid this danger, since essentially no assumption about the form of f is made. But non-parametric approaches do suffer from a major disadvantage: since they do not reduce the problem of estimating f to a small number of parameters, a very large number of observations (far more than is typically needed for a parametric approach) is required in order to obtain an accurate estimate for f.”

Overfitting the data is undesirable because the fit obtained will not yield accurate estimates of the response on new observations that were not part of the original training data set.
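To contrast the two families, the sketch below fits a straight line (parametric) and a local regression via loess() (one non-parametric smoother, chosen here only as an example) to data from an assumed non-linear f. The sine curve, the span of 0.3, and the sample size are illustrative assumptions.

```r
set.seed(3)
n <- 300
X <- runif(n, 0, 3)
Y <- sin(2 * X) + rnorm(n, sd = 0.3)     # assumed true f is sin(2x)
train <- data.frame(X = X, Y = Y)

lin_fit   <- lm(Y ~ X, data = train)                 # parametric: assumes a line
loess_fit <- loess(Y ~ X, data = train, span = 0.3)  # non-parametric: no fixed form

# Compare each estimate with the true f on a grid inside the training range.
grid <- data.frame(X = seq(0.2, 2.8, length.out = 50))
err_lin   <- mean((predict(lin_fit, grid)   - sin(2 * grid$X))^2)
err_loess <- mean((predict(loess_fit, grid) - sin(2 * grid$X))^2)
c(linear = err_lin, loess = err_loess)
```

Here the wrong parametric form costs accuracy. With far fewer observations, the smoother would have much less to work with, which is the disadvantage noted above.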

2.1.3 The Trade-Off Between Prediction Accuracy and Model Interpretability

Why would we ever choose to use a more restrictive method instead of a very flexible approach?

- More restrictive models are more interpretable and are thus better for inference problems.

- In settings where the interpretability of the predictive model is not of interest, as in the case of prediction problems, relatively more flexible techniques are more advantageous. However, more flexible methods often run the risk of overfitting, so the most flexible method does not always give the most accurate predictions.

2.1.4 Supervised Versus Unsupervised Learning

The methods discussed so far are all examples of supervised learning. For each observation of the inputs there is an associated response, and we fit a model either to predict the response from the inputs (prediction) or to better understand the relationship between the response and the inputs (inference).

Unsupervised learning has no response yi: “We lack a response variable that can supervise our analysis.” One example is cluster analysis, which seeks to ascertain, on the basis of x1, … , xn, whether the observations fall into relatively distinct groups.

2.1.5 Regression Versus Classification Problems

Quantitative variables take on numerical values.

Qualitative variables take on values in one of K different classes.

Problems with a quantitative response are typically referred to as regression problems.

Problems with a qualitative response are typically referred to as classification problems.

2.2 Assessing Model Accuracy

Before undertaking any kind of statistical analysis, ask the question, “Which specific method works best for this particular data set?”

2.2.1 Measuring the Quality of Fit

The mean squared error (MSE) quantifies the extent to which the predicted response value for a given observation is close to the true response value for that observation.

“The MSE will be small if the predicted responses are very close to the true responses, and will be large if for some of the observations, the predicted and true responses differ substantially.”
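For reference, the MSE over n observations can be written as below, where \hat{f}(x_i) is the prediction for the i-th observation and y_i is its true response.

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - \hat{f}(x_i) \bigr)^2
```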

Remember: we are not really interested in how well our estimate of f fits the responses in the training data. Rather, we are interested in how well it predicts a previously unseen test observation that was not used to train the statistical learning method.

Thus, we want to select the method with the lowest test MSE.

Degrees of freedom is a quantity that summarizes the flexibility of a curve.

As model flexibility increases, training MSE will decrease, but the test MSE may not.

When a given method yields a small training MSE but a large test MSE, we are said to be overfitting the data. Overfitting refers specifically to the case in which a less flexible model would have yielded a smaller test MSE.
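The pattern can be reproduced on simulated data. This sketch uses smoothing splines, with degrees of freedom as the measure of flexibility. The true f, the noise level, the sample sizes, and the grid of df values are all assumptions made for illustration.

```r
set.seed(4)
f <- function(x) sin(2 * x)                      # assumed true f
x_train <- runif(100, 0, 3)
y_train <- f(x_train) + rnorm(100, sd = 0.3)
x_test  <- runif(1000, 0, 3)
y_test  <- f(x_test) + rnorm(1000, sd = 0.3)

dfs <- c(3, 5, 10, 20, 40)                       # increasing flexibility
mse <- t(sapply(dfs, function(d) {
  fit <- smooth.spline(x_train, y_train, df = d)
  c(df    = d,
    train = mean((y_train - predict(fit, x_train)$y)^2),
    test  = mean((y_test  - predict(fit, x_test)$y)^2))
}))
mse   # one row per df value: training MSE falls steadily, test MSE need not
```

In practice no separate test set with known f is available, so the test column would have to be estimated, for example by cross-validation.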

Cross-validation: a method for estimating test MSE using the training data.
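A hand-rolled k-fold version gives a rough idea of how this works. It is a sketch under assumed choices (5 folds, polynomial fits of degree 1 to 6, simulated data), not the procedure from the book. The helper cv_mse is a hypothetical name introduced here.

```r
set.seed(5)
n <- 100
x <- runif(n, 0, 3)
y <- sin(2 * x) + rnorm(n, sd = 0.3)          # assumed data-generating process
dat <- data.frame(x = x, y = y)

k <- 5
fold <- sample(rep(1:k, length.out = n))      # random fold label for each row

# Hypothetical helper: average held-out MSE for a polynomial of a given degree.
cv_mse <- function(degree) {
  errs <- sapply(1:k, function(j) {
    fit  <- lm(y ~ poly(x, degree), data = dat[fold != j, ])  # fit on k-1 folds
    pred <- predict(fit, newdata = dat[fold == j, ])          # predict held-out fold
    mean((dat$y[fold == j] - pred)^2)
  })
  mean(errs)
}

cv_results <- sapply(1:6, cv_mse)
cv_results
```

The degree with the smallest cross-validated MSE would be the natural choice. Every observation is used for both fitting and validation, but never in the same fold.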

2.2.2 The Bias-Variance Trade-Off

The expected test MSE, for a given value x0, can be decomposed into three fundamental quantities:

  • The variance of our estimate of f at x0
  • The squared bias of our estimate of f at x0
  • The variance of the error terms E

Thus, in order to minimize the expected test error, we need to select a statistical learning method that simultaneously achieves low variance and low bias.

Variance refers to the amount by which our estimate of f would change if we estimated it using a different training data set.

In general, more flexible statistical methods have higher variance.

Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model.

Generally, more flexible methods result in less bias.

As a general rule, as we use more flexible methods, the variance will increase and the bias will decrease.

The relative rate of change of these two quantities determines whether the test MSE increases or decreases.

As we increase the flexibility of a class of methods, the bias tends to initially decrease faster than the variance increases. Consequently, the expected test MSE declines. However, at some point increasing flexibility has little impact on the bias but starts to significantly increase the variance. When this happens the test MSE increases.

The relationship between bias, variance, and test set MSE outlined above is referred to as the bias-variance trade-off. The challenge lies in finding a method for which both the variance and the squared bias are low.
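Written as a formula, the decomposition of the expected test MSE at a point x0 is as follows. Here \hat{f}(x_0) is the estimate of f evaluated at x0, \mathrm{E} denotes expectation over training sets, and \epsilon is the error term that these notes write as E.

```latex
\mathrm{E}\bigl( y_0 - \hat{f}(x_0) \bigr)^2
  = \mathrm{Var}\bigl( \hat{f}(x_0) \bigr)
  + \bigl[ \mathrm{Bias}\bigl( \hat{f}(x_0) \bigr) \bigr]^2
  + \mathrm{Var}(\epsilon)
```

All three terms are non-negative, so the expected test MSE can never fall below \mathrm{Var}(\epsilon), the irreducible error.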

2.2.3 The Classification Setting

The most common approach for quantifying the accuracy of our estimate of f is the training error rate: the proportion of mistakes that are made if we apply our estimate of f to the training observations. As in the regression setting, what we ultimately care about is the test error rate on observations that were not used in training.

Essentially, the rate of incorrect classifications.
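As a formula, with \hat{y}_i the predicted class label for the i-th observation and I(\cdot) an indicator that equals 1 when its argument is true and 0 otherwise:

```latex
\frac{1}{n} \sum_{i=1}^{n} I(y_i \neq \hat{y}_i)
```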

The Bayes Classifier

The test error rate is minimized, on average, by a very simple classifier that assigns each observation to the most likely class, given its predictor values.

In a two-class problem, the Bayes classifier assigns an observation to a class if the conditional probability of that class, given the observation's predictor values, is greater than 0.5.

The Bayes decision boundary is the set of points at which that conditional probability is exactly 0.5.
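The Bayes classifier can only be computed when the true conditional probabilities are known, which is the case in a simulation. In the sketch below, the function p1 is an assumed Pr(Y = 1 | X = x), chosen so that the decision boundary sits at x = 0. Nothing about it comes from the text.

```r
set.seed(6)
p1 <- function(x) plogis(2 * x)        # assumed Pr(Y = 1 | X = x); equals 0.5 at x = 0
x  <- rnorm(5000)
y  <- rbinom(5000, size = 1, prob = p1(x))

bayes_pred <- as.integer(p1(x) > 0.5)  # assign each point to its more likely class
bayes_err  <- mean(bayes_pred != y)    # simulation estimate of the Bayes error rate
bayes_err
```

Even this ideal classifier makes mistakes, because the classes overlap. Its error rate plays the same role as the irreducible error in regression.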

K-Nearest Neighbors

Many approaches attempt to estimate the conditional distribution of Y given X, and then classify a given observation to the class with highest estimated probability.

K-nearest neighbors (KNN) is one such method.

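Thus, you specify K, the number of nearest training observations used to classify each point. With K too large, the classifier is too rigid (high bias); with K too small, it is too flexible (high variance).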
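A bare-bones KNN classifier fits in a few lines of base R. The function knn_predict below is a hypothetical helper written for this sketch (Euclidean distance, majority vote), and the two-cluster training data are made up for illustration.

```r
# Hypothetical helper: classify one point x0 by majority vote of its K nearest neighbors.
knn_predict <- function(train_x, train_y, x0, k) {
  d <- sqrt(rowSums(sweep(train_x, 2, x0)^2))  # Euclidean distance to each training point
  nearest <- order(d)[1:k]                     # indices of the K closest points
  votes <- table(train_y[nearest])             # class counts among those neighbors
  names(votes)[which.max(votes)]               # most common class
}

set.seed(7)
train_x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),   # class A around (0, 0)
                 matrix(rnorm(40, mean = 3), ncol = 2))   # class B around (3, 3)
train_y <- rep(c("A", "B"), each = 20)

pred_a <- knn_predict(train_x, train_y, c(0, 0), k = 5)
pred_b <- knn_predict(train_x, train_y, c(3, 3), k = 5)
c(pred_a, pred_b)
```

Setting k equal to the number of training points would predict the same class everywhere, while k = 1 follows every individual training point. These are the two extremes of rigidity and flexibility.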

2.3 Lab: Introduction to R

2.3.1 Basic Commands

1.) Defines a vector 2.) Views it

x = c(1,3,2,5)
x
[1] 1 3 2 5

Defines, lists, shows the lengths and sums the vectors.

x = c(1,6,2)
x
y=c(1,4,3)
y
length(x)
[1] 3
length(y)
[1] 3
x+y
[1]  2 10  5

Lists all objects currently in the workspace

ls()
[1] "x" "y"

Removes the specified vector

rm(x,y)
ls()
character(0)

Removes all objects in the workspace at once.

rm(list=ls())

Creates a matrix. matrix() takes three main arguments: the data, the number of rows, and the number of columns. Note that it fills in the columns first.

x=matrix(data=c(1,2,3,4), nrow=2, ncol=2)
x
     [,1] [,2]
[1,]    1    3
[2,]    2    4

Again, creates a matrix, but fills in the rows first.

x=matrix(data=c(1,2,3,4), nrow=2, ncol=2,byrow=T)
x
     [,1] [,2]
[1,]    1    2
[2,]    3    4

Returns the square root of each element of a vector or matrix. Likewise, x^2 squares each element.

sqrt(x)
         [,1]     [,2]
[1,] 1.000000 1.414214
[2,] 1.732051 2.000000
x
     [,1] [,2]
[1,]    1    2
[2,]    3    4
x^2
     [,1] [,2]
[1,]    1    4
[2,]    9   16

Generates a vector of random normal variables. First argument is n sample size. Then, derives the correlation between the two vectors.

x=rnorm(50)
y=x+rnorm(50,mean=50,sd=.1)
cor(x,y)
[1] 0.9942175

Sets the seed of the random number generator so that the same random numbers can be reproduced later, then draws 50 random normal values.

set.seed(1303)
rnorm(50)
 [1] -1.1439763145  1.3421293656  2.1853904757
 [4]  0.5363925179  0.0631929665  0.5022344825
 [7] -0.0004167247  0.5658198405 -0.5725226890
[10] -1.1102250073 -0.0486871234 -0.6956562176
[13]  0.8289174803  0.2066528551 -0.2356745091
[16] -0.5563104914 -0.3647543571  0.8623550343
[19] -0.6307715354  0.3136021252 -0.9314953177
[22]  0.8238676185  0.5233707021  0.7069214120
[25]  0.4202043256 -0.2690521547 -1.5103172999
[28] -0.6902124766 -0.1434719524 -1.0135274099
[31]  1.5732737361  0.0127465055  0.8726470499
[34]  0.4220661905 -0.0188157917  2.6157489689
[37] -0.6931401748 -0.2663217810 -0.7206364412
[40]  1.3677342065  0.2640073322  0.6321868074
[43] -1.3306509858  0.0268888182  1.0406363208
[46]  1.3120237985 -0.0300020767 -0.2500257125
[49]  0.0234144857  1.6598706557
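A quick check of what set.seed() buys us: resetting the same seed before each draw reproduces the draw exactly. The variable names a and b are arbitrary.

```r
set.seed(1303)
a <- rnorm(5)
set.seed(1303)       # same seed again
b <- rnorm(5)
identical(a, b)      # TRUE: both draws are the same numbers

c_draw <- rnorm(5)   # no reset, so the generator simply continues
identical(a, c_draw)
```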

Sets the seed for reproducibility, defines a vector of 100 random normal observations (the same as in the book), and then calculates its mean.

set.seed(3)
y=rnorm(100)
mean(y)
[1] 0.01103557

Calculates the variance of y.

var(y)
[1] 0.7328675

Calculates the square root of the variance of y (the standard deviation).

sqrt(var(y))
[1] 0.8560768

A simpler method of calculating the standard deviation of y.

sd(y)
[1] 0.8560768
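As a sanity check, the sample variance can be computed by hand and compared with var() and sd(). Note that R divides by n - 1, not n. The name manual_var is arbitrary.

```r
set.seed(3)
y <- rnorm(100)

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
manual_var <- sum((y - mean(y))^2) / (length(y) - 1)

all.equal(manual_var, var(y))        # agrees with var()
all.equal(sqrt(manual_var), sd(y))   # and its square root agrees with sd()
```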

2.3.2 Graphics

Defines vectors x and y by a series of 100 random normal observations then plots them. Finally, plots them with specific labels on the corresponding axis and a Header.

x=rnorm(100)
y=rnorm(100)
plot(x,y)

plot(x,y,xlab="This is the X axis",ylab="This is the Y axis",main="Plot of X and Y")

Creates a pdf of the graphic. First we specify the type of file we wish to create and name it. Then we create the plot. Finally, dev.off() tells R that we are done creating it. The file is saved in the current working directory.

pdf("Figure.pdf")
plot(x,y,col="green")
dev.off()
null device 
          1 

Creates a sequence of numbers. If you specify two points it creates a vector of numbers between those two points. You can also specify length to give a sequence of equally spaced numbers between the two points by that length.

x=seq(1,10)
x
 [1]  1  2  3  4  5  6  7  8  9 10

More sophisticated form

x=seq(-pi,pi,length=50)
x
 [1] -3.14159265 -3.01336438 -2.88513611
 [4] -2.75690784 -2.62867957 -2.50045130
 [7] -2.37222302 -2.24399475 -2.11576648
[10] -1.98753821 -1.85930994 -1.73108167
[13] -1.60285339 -1.47462512 -1.34639685
[16] -1.21816858 -1.08994031 -0.96171204
[19] -0.83348377 -0.70525549 -0.57702722
[22] -0.44879895 -0.32057068 -0.19234241
[25] -0.06411414  0.06411414  0.19234241
[28]  0.32057068  0.44879895  0.57702722
[31]  0.70525549  0.83348377  0.96171204
[34]  1.08994031  1.21816858  1.34639685
[37]  1.47462512  1.60285339  1.73108167
[40]  1.85930994  1.98753821  2.11576648
[43]  2.24399475  2.37222302  2.50045130
[46]  2.62867957  2.75690784  2.88513611
[49]  3.01336438  3.14159265

2.3.3 Indexing Data

Creates a matrix of values between 1 and 16 with 4 rows and 4 columns.

A=matrix(1:16,4,4)
A
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16

Returns the value corresponding to row 2 column 3.

A[2,3]
[1] 10

Selects multiple rows and columns at a time, by providing vectors as the indices.

Returns the values at the intersection of rows 1 and 3 and columns 2 and 4.

A[c(1,3),c(2,4)]
     [,1] [,2]
[1,]    5   13
[2,]    7   15

Returns all the values that correspond to the intersection of rows 1 through 3 and columns 2 through 4.

A[1:3,2:4]
     [,1] [,2] [,3]
[1,]    5    9   13
[2,]    6   10   14
[3,]    7   11   15

Returns all values in rows 1 and 2.

A[1:2,]
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14

Returns all values in columns 1 and 2.

A[,1:2]
     [,1] [,2]
[1,]    1    5
[2,]    2    6
[3,]    3    7
[4,]    4    8

Returns the vector row 1.

A[1,]
[1]  1  5  9 13

Returns all values except rows 1 and 3.

A[-c(1,3),]
     [,1] [,2] [,3] [,4]
[1,]    2    6   10   14
[2,]    4    8   12   16

Returns only the values not in rows 1 and 3 and not in columns 1, 3, and 4. Basically only the values in rows 2 and 4 and column 2 of the previously specified matrix.

A[-c(1,3),-c(1,3,4)]
[1] 6 8
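Notice that the last result came back as a plain vector rather than a matrix: when a subset has only one row or one column, R drops the extra dimension by default. A short sketch of how to keep the matrix shape with drop = FALSE:

```r
A <- matrix(1:16, 4, 4)

v <- A[-c(1, 3), -c(1, 3, 4)]                 # dimension dropped: a plain vector
m <- A[-c(1, 3), -c(1, 3, 4), drop = FALSE]   # stays a 2 x 1 matrix

dim(v)   # NULL
dim(m)   # 2 1
```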

2.3.4 Loading Data

Loads the file “Auto.data” from the working directory and stores it in the environment as the data frame “Auto”. Because no header is specified, the variable names are read in as the first row of data rather than as column names. fix() opens the data in a pop-out, spreadsheet-style viewer.

Auto=read.table("Auto.data")
fix(Auto)

Reloads the data, but specifies that the values in the first row are the header and that values with a question mark are missing values.

Auto=read.table("Auto.data",header=T,na.strings="?")
fix(Auto)

Opens an external viewer of the data and then reports its dimensions (number of rows by number of columns).

fix(Auto)
dim(Auto)
[1] 398   9

Lists the first four rows of data across the nine variables.

Auto[(1:4),]

Removes the rows with missing observations and then gives the new dimensions

Auto=na.omit(Auto)
dim(Auto)
[1] 392   9
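The same steps can be tried without the Auto file. The sketch below reads a tiny made-up table from a string (the values are hypothetical, not taken from Auto.data), marks "?" as missing, counts the incomplete rows, and drops them.

```r
# Hypothetical three-row table with one missing horsepower value.
raw <- "mpg horsepower
18 130
25 ?
30 95"

toy <- read.table(text = raw, header = TRUE, na.strings = "?")
n_incomplete <- sum(!complete.cases(toy))   # rows containing at least one NA
toy_clean <- na.omit(toy)                   # drop those rows

n_incomplete
dim(toy_clean)
```

Checking the number of incomplete rows before calling na.omit() shows how many observations are about to be discarded.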

Lists the variable names.

names(Auto)
[1] "mpg"          "cylinders"    "displacement"
[4] "horsepower"   "weight"       "acceleration"
[7] "year"         "origin"       "name"        

2.3.5 Additional Graphical and Numerical Summaries

plot(cylinders,mpg)
Error in plot(cylinders, mpg) : object 'cylinders' not found

Note: this cannot plot because the variables are not yet accessible by name. R has no idea what cylinders or mpg are.

Joining the data frame and the variable name with a dollar sign tells R that cylinders and mpg are variables in that data frame.

plot(Auto$cylinders,Auto$mpg)

Alternatively, by “attaching” Auto, we tell R to make the variables in that data frame available by name.

attach(Auto)
plot(cylinders,mpg)
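One caveat with attach(): assigning to an attached name afterwards creates a separate copy in the workspace and leaves the data frame unchanged. A common alternative is with(), which resolves names inside the data frame for a single expression only. The sketch uses mtcars, a data set that ships with R, as a stand-in for Auto.

```r
# mtcars ships with R; it stands in for Auto here.
m1 <- mean(mtcars$mpg)              # explicit $ access
m2 <- with(mtcars, mean(mpg))       # names looked up inside the data frame
r  <- with(mtcars, cor(cyl, mpg))   # works with several variables at once
c(m1 = m1, m2 = m2, r = r)
```

Plotting functions that accept a formula offer the same convenience through a data argument, for example plot(mpg ~ cyl, data = mtcars).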

Converts quantitative data into qualitative. cylinders was previously stored as a number; as.factor() turns it into a factor with 5 levels (3, 4, 5, 6, and 8). There are 5 levels rather than 6 because, as the plot above shows, no car in the data has 7 cylinders.

cylinders=as.factor(cylinders)
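The effect of as.factor() is easy to see on a small vector. The values below are made up for illustration and are not the Auto data.

```r
cyl   <- c(4, 8, 6, 3, 5, 8, 4)   # hypothetical cylinder counts
cyl_f <- as.factor(cyl)

levels(cyl_f)    # "3" "4" "5" "6" "8": only values that occur become levels
nlevels(cyl_f)   # 5
table(cyl_f)     # count of observations per level
```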

Now that the data is categorical, it generates a boxplot. Here are a few boxplots with different options.

plot(cylinders,mpg)

plot(cylinders,mpg,xlab="Cylinders",ylab="mpg",main="Mileage by # of Cylinders",col="red")

plot(cylinders,mpg,xlab="Cylinders",ylab="mpg",main="Mileage by # of Cylinders",col="red",varwidth=T)

Creates a histogram of the data. Here are a few with different options.

hist(mpg)

hist(mpg,col=2)

hist(mpg,col=2,breaks=15)

Creates a scatterplot matrix. The second also creates a scatterplot matrix, but only for the five variables we specified.

pairs(Auto)

pairs(~mpg+displacement+horsepower+weight+acceleration,Auto)

Plots the data and then enables a click tool in the plots console which prints the values of points you select. Note does not work in R notebook, only in the console.

plot(horsepower,mpg)
identify(horsepower,mpg,name)
integer(0)

Produces a numerical summary of each variable in a particular data set.

summary(Auto)
      mpg          cylinders    
 Min.   : 9.00   Min.   :3.000  
 1st Qu.:17.00   1st Qu.:4.000  
 Median :22.75   Median :4.000  
 Mean   :23.45   Mean   :5.472  
 3rd Qu.:29.00   3rd Qu.:8.000  
 Max.   :46.60   Max.   :8.000  
                                
  displacement     horsepower        weight    
 Min.   : 68.0   Min.   : 46.0   Min.   :1613  
 1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
 Median :151.0   Median : 93.5   Median :2804  
 Mean   :194.4   Mean   :104.5   Mean   :2978  
 3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
 Max.   :455.0   Max.   :230.0   Max.   :5140  
                                               
  acceleration        year      
 Min.   : 8.00   Min.   :70.00  
 1st Qu.:13.78   1st Qu.:73.00  
 Median :15.50   Median :76.00  
 Mean   :15.54   Mean   :75.98  
 3rd Qu.:17.02   3rd Qu.:79.00  
 Max.   :24.80   Max.   :82.00  
                                
     origin                      name    
 Min.   :1.000   amc matador       :  5  
 1st Qu.:1.000   ford pinto        :  5  
 Median :1.000   toyota corolla    :  5  
 Mean   :1.577   amc gremlin       :  4  
 3rd Qu.:2.000   amc hornet        :  4  
 Max.   :3.000   chevrolet chevette:  4  
                 (Other)           :365  

2.4 Exercises

Conceptual

  1. For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.
    (a) The sample size n is extremely large, and the number of predictors p is small. Better. With a very large n and few predictors, a flexible method can use the abundant data to fit f closely, reducing bias with little risk of overfitting.
    (b) The number of predictors p is extremely large, and the number of observations n is small. Worse. With a large number of predictors and few observations we run the risk of overfitting the data. A more flexible method only exacerbates this risk: it may fit the training data too closely without working well on new data.
    (c) The relationship between the predictors and response is highly non-linear. Better. If the relationship is highly non-linear, then an inflexible model will not capture the functional form of f well at all.
    (d) The variance of the error terms, i.e. σ2 = Var(E), is extremely high. Worse. Again, we run the risk of overfitting: a model that is too flexible will follow the noise in the training data rather than f.
  2. Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.

(a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary. Regression-Inference. n=500 p=3 (profit, number of employees and industry; CEO salary is the response)

(b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables. Classification-Prediction. n=20 p=13 (price, marketing budget, competition price and the ten other variables; success or failure is the response)

(c) We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market. Regression-Prediction. n=52 p=3

  3. We now revisit the bias-variance decomposition.

(a) Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Make sure to label each one. See attached jpeg.

(b) Explain why each of the five curves has the shape displayed in part (a).

As flexibility increases, the squared bias decreases and the variance increases. At first the bias decreases faster than the variance increases, so the test MSE falls. Past the optimal level of flexibility, the variance increases faster than the bias decreases, so the test MSE rises again, which gives it a U shape. The training MSE keeps decreasing throughout, so beyond the optimum we are overfitting the data: the training and test MSE diverge, the former decreasing and the latter increasing. The irreducible error is unknown in practice and does not depend on the method, so it is a horizontal line. It always lies below the test MSE curve because the expected test MSE contains the irreducible error.

  4. Skipped

  5. What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?

Advantages: a flexible approach fits the data better when the true form of f is not linear, and it reduces the bias of our estimate.

Disadvantages: if the true relationship is linear, a more flexible approach runs the risk of overfitting the data. It also requires estimating more parameters and increases the variance.

If the relationship is non-linear, a more flexible approach is preferred. If the trade-off is such that a more flexible model leads to a significant reduction in the bias and only a small increase in the variance, a more flexible approach is preferable.

If the relationship is close to linear, a more rigid approach is preferred. If the trade-off is such that a less flexible model only increases the bias slightly but greatly reduces the variance, a less flexible approach is preferable.

A more flexible approach is preferred when we are more interested in predictive power than in the interpretability of the results.

A less flexible approach is preferred when we are more interested in interpreting the parameters than in simply predicting a response.

6.) Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric approach)? What are its disadvantages?

A parametric approach assumes a functional form of f and thus reduces regression and classification problems down to approximating a set of parameters and then interpreting the results or predicting a response. It requires fewer observations than do non-parametric approaches.

A non-parametric approach does not make any assumption about the functional form of f. Rather it requires a large n to accurately estimate the functional form of f.

Advantages of parametric approach to regression or classification: does not require as great an n and can be estimated with relatively fewer parameters.

Disadvantages of parametric approach to regression or classification: run the risk of approximating a functional form of f that is far from the true f. Also, run the risk of overfitting the model via the use of more flexible models when not appropriate.

7.) Skipped

Applied

8.) This exercise relates to the College data set, which can be found in the file College.csv. It contains a number of variables for 777 different universities and colleges in the US.

Load the data

College=read.csv("College.csv",header=T)

Use the first column (the college names) as the row names. View the data

rownames(College)=College[,1]
fix(College)

Eliminates the first column in the data, where the names are stored. Note that the college names now live in the row names and are no longer a stored data column.

College=College[,-1]
fix(College)

Summarize the data. Note the college names are not stored as a data column.

summary(College)
                            X       Private  
 Abilene Christian University:  1   No :212  
 Adelphi University          :  1   Yes:565  
 Adrian College              :  1            
 Agnes Scott College         :  1            
 Alaska Pacific University   :  1            
 Albertson College           :  1            
 (Other)                     :771            
      Apps           Accept          Enroll    
 Min.   :   81   Min.   :   72   Min.   :  35  
 1st Qu.:  776   1st Qu.:  604   1st Qu.: 242  
 Median : 1558   Median : 1110   Median : 434  
 Mean   : 3002   Mean   : 2019   Mean   : 780  
 3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902  
 Max.   :48094   Max.   :26330   Max.   :6392  
                                               
   Top10perc       Top25perc    
 Min.   : 1.00   Min.   :  9.0  
 1st Qu.:15.00   1st Qu.: 41.0  
 Median :23.00   Median : 54.0  
 Mean   :27.56   Mean   : 55.8  
 3rd Qu.:35.00   3rd Qu.: 69.0  
 Max.   :96.00   Max.   :100.0  
                                
  F.Undergrad     P.Undergrad     
 Min.   :  139   Min.   :    1.0  
 1st Qu.:  992   1st Qu.:   95.0  
 Median : 1707   Median :  353.0  
 Mean   : 3700   Mean   :  855.3  
 3rd Qu.: 4005   3rd Qu.:  967.0  
 Max.   :31643   Max.   :21836.0  
                                  
    Outstate       Room.Board  
 Min.   : 2340   Min.   :1780  
 1st Qu.: 7320   1st Qu.:3597  
 Median : 9990   Median :4200  
 Mean   :10441   Mean   :4358  
 3rd Qu.:12925   3rd Qu.:5050  
 Max.   :21700   Max.   :8124  
                               
     Books           Personal   
 Min.   :  96.0   Min.   : 250  
 1st Qu.: 470.0   1st Qu.: 850  
 Median : 500.0   Median :1200  
 Mean   : 549.4   Mean   :1341  
 3rd Qu.: 600.0   3rd Qu.:1700  
 Max.   :2340.0   Max.   :6800  
                                
      PhD            Terminal    
 Min.   :  8.00   Min.   : 24.0  
 1st Qu.: 62.00   1st Qu.: 71.0  
 Median : 75.00   Median : 82.0  
 Mean   : 72.66   Mean   : 79.7  
 3rd Qu.: 85.00   3rd Qu.: 92.0  
 Max.   :103.00   Max.   :100.0  
                                 
   S.F.Ratio      perc.alumni   
 Min.   : 2.50   Min.   : 0.00  
 1st Qu.:11.50   1st Qu.:13.00  
 Median :13.60   Median :21.00  
 Mean   :14.09   Mean   :22.74  
 3rd Qu.:16.50   3rd Qu.:31.00  
 Max.   :39.80   Max.   :64.00  
                                
     Expend        Grad.Rate     
 Min.   : 3186   Min.   : 10.00  
 1st Qu.: 6751   1st Qu.: 53.00  
 Median : 8377   Median : 65.00  
 Mean   : 9660   Mean   : 65.46  
 3rd Qu.:10830   3rd Qu.: 78.00  
 Max.   :56233   Max.   :118.00  
                                 

Create a scatterplot matrix of the first ten variables of the data.

pairs(College[,1:10])

Plot Outstate vs. Private as side-by-side boxplots

plot(as.factor(College$Private),College$Outstate,col="green",xlab="Private",ylab="Outstate")

Create a new qualitative variable “Elite”

Elite=rep("No",nrow(College))
Elite[College$Top10perc >50]="Yes"
Elite=as.factor(Elite)
College=data.frame(College,Elite)
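The same binning can be written in one step with ifelse(). The sketch below checks on a small made-up vector (not the College data) that both approaches give the same labels. Note that a value of exactly 50 is labelled "No", since the test is strictly greater than 50.

```r
top10 <- c(23, 60, 51, 50, 12)            # hypothetical Top10perc values

# Two-step version, as above: start with "No", then overwrite where > 50.
elite_a <- rep("No", length(top10))
elite_a[top10 > 50] <- "Yes"

# One-step version with ifelse().
elite_b <- ifelse(top10 > 50, "Yes", "No")

identical(elite_a, elite_b)               # TRUE
table(as.factor(elite_b))
```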

Summary of how many Elite vs non-elite colleges

summary(Elite)
 No Yes 
699  78 

Boxplots of Elite vs Outstate tuition

plot(College$Elite,College$Outstate)

Histograms of different quantitative variables with different numbers of bins

par(mfrow=c(2,2))
hist(College$Apps,col="red",breaks=7)
hist(College$Accept,col="green",breaks=4)
hist(College$Enroll,col="orange",breaks=5)

9.) This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.

Import the data, specify that the top row holds the headers, and treat “?” as a missing value (NA). Then remove the rows with missing values.

Auto=read.table("Auto.data",header=T,na.strings="?")
Auto=na.omit(Auto)
dim(Auto)
[1] 392   9
summary(Auto)
      mpg          cylinders      displacement     horsepower        weight      acceleration        year      
 Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613   Min.   : 8.00   Min.   :70.00  
 1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225   1st Qu.:13.78   1st Qu.:73.00  
 Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804   Median :15.50   Median :76.00  
 Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978   Mean   :15.54   Mean   :75.98  
 3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615   3rd Qu.:17.02   3rd Qu.:79.00  
 Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140   Max.   :24.80   Max.   :82.00  
                                                                                                               
     origin                      name    
 Min.   :1.000   amc matador       :  5  
 1st Qu.:1.000   ford pinto        :  5  
 Median :1.000   toyota corolla    :  5  
 Mean   :1.577   amc gremlin       :  4  
 3rd Qu.:2.000   amc hornet        :  4  
 Max.   :3.000   chevrolet chevette:  4  
                 (Other)           :365  

Quantitative: mpg, displacement, horsepower, weight, acceleration and year

Qualitative: cylinders, origin and name

Range of quantitative variables

Range=apply(Auto[,-c(2,8,9)],2,range)
row.names(Range)=c("Min","Max")
Range
     mpg displacement horsepower weight
Min  9.0           68         46   1613
Max 46.6          455        230   5140
    acceleration year
Min          8.0   70
Max         24.8   82
rm(Range)

Mean and Standard Deviation of Quantitative Variables

SD_Mean=apply(Auto[,-c(2,8,9)],2,function(x){c(mean(x),sd(x))})
row.names(SD_Mean)=c("Mean","Standard Deviation")
SD_Mean
                         mpg displacement
Mean               23.445918      194.412
Standard Deviation  7.805007      104.644
                   horsepower    weight
Mean                104.46939 2977.5842
Standard Deviation   38.49116  849.4026
                   acceleration      year
Mean                  15.541327 75.979592
Standard Deviation     2.758864  3.683737
rm(SD_Mean)

Range excluding rows 10-85

Range2=apply(Auto[-c(10:85),-c(2,8,9)],2,range)
row.names(Range2)=c("Min","Max")
Range2
     mpg displacement horsepower weight
Min 11.0           68         46   1649
Max 46.6          455        230   4997
    acceleration year
Min          8.5   70
Max         24.8   82

Mean excluding rows 10-85

Mean2=apply(Auto[-c(10:85),-c(2,8,9)],2,mean)
Mean2
         mpg displacement   horsepower 
    24.40443    187.24051    100.72152 
      weight acceleration         year 
  2935.97152     15.72690     77.14557 

Standard Deviation excluding rows 10-85

SD2=apply(Auto[-c(10:85),-c(2,8,9)],2,sd)
SD2
         mpg displacement   horsepower 
    7.867283    99.678367    35.708853 
      weight acceleration         year 
  811.300208     2.693721     3.106217 

Now joining mean and SD

MEAN_SD2=apply(Auto[-c(10:85),-c(2,8,9)],2,function(x){c(mean(x),sd(x))})
row.names(MEAN_SD2)=c("Mean","Standard_Deviation")
MEAN_SD2
                         mpg displacement
Mean               24.404430    187.24051
Standard_Deviation  7.867283     99.67837
                   horsepower    weight
Mean                100.72152 2935.9715
Standard_Deviation   35.70885  811.3002
                   acceleration      year
Mean                  15.726899 77.145570
Standard_Deviation     2.693721  3.106217

Graphical depictions of the data

boxplot(Auto$mpg~Auto$cylinders,col="red",xlab="Cylinders",ylab="MPG",main="MPG by # of Cylinders")

scatter.smooth(Auto$year,Auto$mpg,col="green",xlab="Year",ylab="MPG",main="Mpg by Year")

pairs(Auto)

boxplot(Auto$mpg~Auto$origin,col="grey",xlab="Country of Origin",ylab="MPG",main="MPG by Country of Origin",names=c("American","European","Japanese"))

10. This exercise involves the Boston housing data set.

library(MASS)
Boston
?Boston
dim(Boston)
[1] 506  14
fix(Boston)

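Findings: the third plot suggests that towns with higher values of black tend to have lower crime rates. Note that black is the transformed index 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town, so it is not the proportion itself.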

par(mfrow=c(2,2))
scatter.smooth(Boston$ptratio,Boston$lstat,xlab="Pupil-to-teacher Ratio",ylab="% of Lower Class Pop",main="PT Ratio by % of Lower Status",col="blue")
boxplot(Boston$medv~Boston$rad,col="green",xlab="Index of Accessibility to Radial Highways",ylab="Median Home Value",main="Median Home Value")
scatter.smooth(Boston$crim,Boston$black,col="red",xlab="Crime Rate",ylab="Proportion of Blacks by Town",main="Crime Rate by Proportion of Blacks")

Also, it seems that areas with higher percentages of the population classified as “lower status” also have higher pupil-to-teacher ratios. This suggests that teachers in more “impoverished” areas are responsible for more pupils.

I am not sure what the index of accessibility to radial highways is, but it seems clear that in the case of “24” it is linked to lower median home values. However, the dispersion of values is of some note.

plot(Boston$crim)
identify(Boston$crim)
warning: nearest point already identified
warning: nearest point already identified
 [1] 375 376 379 380 381 385 387 388 399 401 404 405 406 407 411 413 414 415 416 418 419 426 428 441

The values above are the row indices of the towns that have extremely high crime rates.

# count the towns that bound the Charles River (chas == 1)
sum(Boston$chas==1)
[1] 35
median(Boston$ptratio)
[1] 19.05
plot(Boston$medv)
identify(Boston$medv)
integer(0)

print(Boston[399,])
Range3=apply(Boston[,],2,range)
row.names(Range3)=c("Min","Max")
Range3
        crim  zn indus chas   nox    rm   age
Min  0.00632   0  0.46    0 0.385 3.561   2.9
Max 88.97620 100 27.74    1 0.871 8.780 100.0
        dis rad tax ptratio  black lstat medv
Min  1.1296   1 187    12.6   0.32  1.73    5
Max 12.1265  24 711    22.0 396.90 37.97   50
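# assumed definitions (the original assignments are not shown): average rooms per dwelling above 7 and above 8
Boston$MoreThanSeven=as.factor(ifelse(Boston$rm>7,"Yes","No"))
summary(Boston$MoreThanSeven)
 No Yes 
442  64 
Boston$morethaneight=as.factor(ifelse(Boston$rm>8,"Yes","No"))
summary(Boston$morethaneight)
 No Yes 
493  13 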
---
title: "R Notebook - Chapter 2: Statistical Learning"
output: html_notebook
---

# __II. Statistical Learning__

## __2.1 What is Statistical Learning?__

__Input variables__ also referred to as: *predictors, independent variables, features*. Typically denoted __X__

__Output variable__ often called the: *response* or *dependent variable*. Typically denoted __Y__

Assuming there is some relationship between our predictors and response
__Y = *f*(X) + *E*__

Where *f* is some fixed but unknown function of X~1~, ... , X~p~

And *E* is a random *error term* which is independent of X and has mean zero.
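To make the model concrete, the sketch below simulates data from Y = *f*(X) + *E*. The particular *f*, the sample size, and the noise level are assumptions chosen only for illustration; in practice *f* is unknown.

```{r}
# Illustrative simulation: f, n and the noise level are assumptions
set.seed(1)
n = 100
f = function(x) 2 + 3 * x           # hypothetical "true" f
x = runif(n)
e = rnorm(n, mean = 0, sd = 0.5)    # error term, independent of x, mean zero
y = f(x) + e
plot(x, y)
curve(f, add = TRUE, col = "blue")  # the systematic part of y
```

The scatter of the points around the blue line is the contribution of the error term.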

In this formula, *f* represents the *systematic* information that X provides about Y.

"In essence, statistical learning refers to a set of approaches for estimating *f*. In this chapter we outline some of the key theoretical concepts that arise in estimating *f*, as well as tools for evaluating the estimates obtained."

### __*2.1.1 Why Estimate f?*__

2 main reasons: __*prediction*__ and __*inference*__

__Prediction__

We are not typically concerned with the exact form of the estimated function; what matters is that it yields accurate predictions of Y.

The accuracy of the predicted value of Y depends on two quantities: the __*reducible error*__ and the __*irreducible error*__.

The __*reducible error*__ can potentially be reduced by improving the accuracy of our estimate of *f*, that is, by using the most appropriate statistical learning technique to estimate *f*.

However, there is always some __*irreducible error*__ because Y is also a function of *E*, which, by definition, cannot be predicted using X. Thus, no matter how well we estimate *f*, we cannot reduce the error introduced by *E*.

The quantity *E* may contain unmeasured variables that are useful in predicting Y: since we don't measure them, *f* cannot use them for its prediction.

The quantity *E* may also contain unmeasurable variation.

The focus of this book is on techniques for estimating *f* with the aim of minimizing the reducible error.
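The split into reducible and irreducible error can be written out as follows, treating the estimate $\hat{f}$ and X as fixed and writing $\epsilon$ for the error term that these notes denote *E* (the leading $E$ below is an expectation):

$$E(Y - \hat{Y})^2 = E[f(X) + \epsilon - \hat{f}(X)]^2 = \underbrace{[f(X) - \hat{f}(X)]^2}_{\text{reducible}} + \underbrace{\mathrm{Var}(\epsilon)}_{\text{irreducible}}$$

The cross term vanishes because the error term has mean zero.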

__Inference__

The goal is to understand the relationship between X and Y via our approximations for *f*. More specifically, to understand how Y changes as a function of X~1~, ... , X~p~.

Now we are interested in the form of our estimate of *f*.

We would thus be curious about:

1. *Which predictors are associated with the response?*
2. *What is the relationship between the response and each predictor?*
3. *Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?*

__Depending on whether our ultimate goal is prediction, inference, or a combination of the two, different methods for estimating *f* may be appropriate.__

### __2.1.2 How Do We Estimate f?__

__*Training data*__ is our observed set of *n* different data points we use to train, or teach, our method how to estimate *f*.

#### __Parametric Methods__

Parametric methods involve a two-step model-based approach.

1.) First, we make an assumption about the functional form, or shape, of *f*. __Assuming *f*(X) is linear:__

+ *f*(X) = *B*~0~ + *B*~1~X~1~ + *B*~2~X~2~ + ... + *B*~p~X~p~.

2.) After a model has been selected, we need a procedure that uses the training data to *fit* or *train* the model. In the case of the linear model, we need to estimate the parameters *B*~0~, *B*~1~, ... , *B*~p~.
That is, we want to find values of these parameters such that

+ Y ≈ *B*~0~ + *B*~1~X~1~ + *B*~2~X~2~ + ... + *B*~p~X~p~.
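As a minimal sketch of the two steps, the chunk below assumes a linear form and then estimates its parameters by least squares with `lm()`. The data are simulated, and the coefficient values 1, 2, and -0.5 are assumptions made up for the example.

```{r}
# Step 1: assume f is linear in x1 and x2.
# Step 2: estimate the parameters from (simulated) training data.
set.seed(1)
n = 200
x1 = rnorm(n)
x2 = rnorm(n)
y = 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 0.3)   # hypothetical true model
fit = lm(y ~ x1 + x2)
coef(fit)   # estimates of B0, B1, B2
```

The estimates should land close to the values used to generate the data, since here the assumed form matches the truth.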

__*Parametric modeling*__ thus reduces the problem of estimating *f* down to one of estimating a set of parameters.

> "Assuming a parametric form for *f* simplifies the problem of estimating *f* because it is generally much easier to estimate a set of parameters, such as *B*~0~, *B*~1~, ... , *B*~p~ in the linear model, than it is to fit an entirely arbitrary function *f*. __The potential disadvantage of a parametric approach is that the model we choose will usually not match the true unknown form of *f*.__ If the chosen model is too far from the true *f*, then our estimate will be poor. We can try to address this problem by choosing *flexible* models that can fit many different possible functional forms for *f*. But in general, fitting a more flexible model requires estimating a greater number of parameters. These more complex models can lead to a phenomenon known as *overfitting* the data, which essentially means they follow the errors, or *noise*, too closely."

#### __Non-Parametric Methods__

> "Non-parametric methods do not make explicit assumptions about the functional form of *f*. Instead they seek an estimate of *f* that gets as close to the data points as possible without being too rough or wiggly. Such approaches can have a major advantage over parametric approaches: by avoiding the assumption of a particular functional form for *f*, they have the potential to accurately fit a wider range of possible shapes for *f*. Any parametric approach brings with it the possibility that the functional form used to estimate *f* is very different from the true *f*, in which case the resulting model will not fit the data well. In contrast, non-parametric approaches completely avoid this danger, since essentially no assumption about the form of *f* is made. __But non-parametric approaches do suffer from a major disadvantage: since they do not reduce the problem of estimating *f* to a small number of parameters, a very large number of observations (far more than is typically needed for a parametric approach) is required in order to obtain an accurate estimate for *f*.__"

*Overfitting* the data is undesirable because the fit obtained will not yield accurate estimates of the response on new observations that were not part of the original training data set.
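The contrast between the two families can be sketched on simulated data whose true *f* is non-linear. The sine-shaped *f*, the noise level, and the `span` value are assumptions for illustration; `loess()` stands in here as one example of a non-parametric smoother.

```{r}
# Linear (parametric) fit versus a local-regression (non-parametric) fit
set.seed(1)
x = runif(200, 0, 2 * pi)
y = sin(x) + rnorm(200, sd = 0.3)    # hypothetical non-linear truth
fit_lin = lm(y ~ x)
fit_np = loess(y ~ x, span = 0.5)    # span controls how wiggly the fit is
mean(residuals(fit_lin)^2)           # training MSE, linear model
mean(residuals(fit_np)^2)            # training MSE, loess
```

A smaller `span` makes the smoother follow the points more closely, which is exactly where the risk of overfitting comes from.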

### __2.1.3 The Trade-Off Between Prediction Accuracy and Model Interpretability__

__*Why would we ever choose to use a more restrictive method instead of a very flexible approach?*__

- More restrictive models are more interpretable and are thus better for inference problems.

- In settings where the interpretability of the predictive model is not of interest, as in the case of prediction problems, relatively more flexible techniques are more advantageous. However, more flexible methods often run the risk of *overfitting*.

### __2.1.4 Supervised Versus Unsupervised Learning__

The methods we have already discussed are all examples of __*Supervised Learning*__. Essentially, for each observation we have inputs and an associated response, and we are trying either to predict the response from the inputs or to better understand the relationship between the response and the inputs (inference).

__*Unsupervised learning*__ is not concerned with the response *y*~i~. "We lack a response variable that can supervise our analysis." One such example is *cluster analysis*, which seeks to ascertain, on the basis of *x*~1~, ... , *x*~n~, whether the observations fall into relatively distinct groups.

### __2.1.5 Regression Versus Classification Problems__

*Quantitative* variables take on numerical values.

*Qualitative* variables take on values in one of K different *classes*.

Problems with a __quantitative__ response are typically referred to as *regression* problems.

Problems with a __qualitative__ response are typically referred to as *classification* problems.

## __2.2 Assessing Model Accuracy__

Before undertaking any kind of statistical analysis, ask the question, "*Which specific method works best for this particular data set?*"

### __2.2.1 Measuring the Quality of Fit__

The *mean squared error* (__MSE__) quantifies the extent to which the predicted response value for a given observation is close to the true response value for that observation.
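Written out for *n* observations, with $\hat{f}(x_i)$ denoting the prediction for the *i*th observation, the definition is:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{f}(x_i)\right)^2$$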

> "The __MSE__ will be small if the predicted responses are very close to the true responses, and will be large if for some of the observations, the predicted and true responses differ substantially."

Remember: we are not really interested in how well our estimate of *f* fits the responses in the training data. Rather, we are interested in how well it predicts a *previously unseen test observation not used to train the statistical learning method*.

Thus, we want to select the method with the lowest *test MSE*.

__*Degrees of freedom*__ is a quantity that summarizes the flexibility of a curve.

*As model flexibility increases, training MSE will decrease, but the test MSE may not.*

*When a given method yields a small training MSE but a large test MSE, we are said to be __overfitting__ the data. Overfitting refers specifically to the case in which a less flexible model would have yielded a smaller test MSE.*
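The behaviour of the two errors can be sketched with a small simulation, using polynomial degree as a stand-in for flexibility. The true *f*, the sample sizes, and the noise level are assumptions made up for this example.

```{r}
# Training versus test MSE as flexibility (polynomial degree) grows
set.seed(1)
f = function(x) sin(2 * x)                      # hypothetical true f
train = data.frame(x = runif(50, 0, 3))
train$y = f(train$x) + rnorm(50, sd = 0.3)
test = data.frame(x = runif(1000, 0, 3))
test$y = f(test$x) + rnorm(1000, sd = 0.3)
degrees = 1:10
mse = sapply(degrees, function(d) {
  fit = lm(y ~ poly(x, d), data = train)
  c(train = mean((train$y - predict(fit, train))^2),
    test = mean((test$y - predict(fit, test))^2))
})
colnames(mse) = degrees
round(mse, 3)
```

The training row can only go down as the degree increases, because each model contains the previous one; the test row carries no such guarantee.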

*Cross-validation*: a method for estimating test MSE using the training data.
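A minimal sketch of the idea, assuming 5 folds and a cubic polynomial fit on simulated data (both choices are arbitrary and only for illustration): each fold is held out once, the model is fit on the rest, and the held-out errors are averaged.

```{r}
# k-fold cross-validation estimate of test MSE, written out by hand
set.seed(1)
n = 100
dat = data.frame(x = runif(n, 0, 3))
dat$y = sin(2 * dat$x) + rnorm(n, sd = 0.3)   # hypothetical data
k = 5
fold = sample(rep(1:k, length.out = n))       # random fold label per row
cv_mse = sapply(1:k, function(j) {
  fit = lm(y ~ poly(x, 3), data = dat[fold != j, ])
  held_out = dat[fold == j, ]
  mean((held_out$y - predict(fit, held_out))^2)
})
mean(cv_mse)                                  # cross-validated MSE
```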

### __2.2.2 The Bias-Variance Trade-Off__

The expected test MSE, for a given value of *x*~0~, can be decomposed into three fundamental quantities:

* __The variance of the estimate of *f*(*x*~0~)__
* __The squared *bias* of the estimate of *f*(*x*~0~)__
* __The variance of the error terms *E*__
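In symbols, writing $\hat{f}$ for the estimate of *f* and $\epsilon$ for the error term that these notes denote *E* (the leading $E$ is an expectation over training sets), the three quantities combine as:

$$E\left(y_0 - \hat{f}(x_0)\right)^2 = \mathrm{Var}\left(\hat{f}(x_0)\right) + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \mathrm{Var}(\epsilon)$$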

##### *Thus, in order to minimize the expected test error, we need to select a statistical learning method that simultaneously achieves __low variance__ and __low bias__.*

__*Variance*__ refers to the amount by which our estimate of *f* would change if we estimated it using a different training data set.

In general, more flexible statistical methods have higher variance.

__*Bias*__ refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model.

Generally, more flexible methods result in less bias.

As a general rule, as we use more flexible methods, the variance will increase and the bias will decrease.

##### __The relative rate of change of these two quantities determines whether the test MSE increases or decreases.__

> As we increase the flexibility of a class of methods, the bias tends to initially decrease faster than the variance increases. Consequently, the expected test MSE declines. However, at some point increasing flexibility has little impact on the bias but starts to significantly increase the variance. When this happens the test MSE increases. 

The relationship between bias, variance, and test set MSE outlined above is referred to as the *bias-variance trade-off*. __The challenge lies in finding a method for which both the variance and the squared bias are low.__

### __2.2.3 The Classification Setting__

"The most common approach for quantifying the accuracy of our estimate of *f* is the training __error rate__, the proportion of mistakes that are made if we apply our estimate of *f* to the training observations."

Essentially, the rate of incorrect classifications.
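As a formula, with $\hat{y}_i$ the predicted class for the *i*th observation and $I(\cdot)$ an indicator that equals 1 when its argument is true and 0 otherwise, the training error rate is:

$$\frac{1}{n}\sum_{i=1}^{n} I(y_i \neq \hat{y}_i)$$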

__*The Bayes Classifier*__

The test error rate is minimized, on average, by a very simple classifier that *assigns each observation to the most likely class, given its predictor values*.

The __*Bayes Classifier*__ assigns an observation with predictor values *x*~0~ to the class *j* for which the conditional probability Pr(Y = *j* | X = *x*~0~) is largest. In a two-class problem, this means assigning the observation to class one if Pr(Y = 1 | X = *x*~0~) > 0.5, and to class two otherwise.

__*Bayes decision boundary*__: the set of points at which the conditional probability of each class is exactly 50%.
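The rule can be sketched for a two-class problem with a single predictor. The conditional probability function `p1` below is invented for the example; with real data it is unknown, which is why the Bayes classifier serves as a benchmark and not as a usable method.

```{r}
# Hypothetical conditional probability Pr(Y = 1 | X = x), made up for illustration
p1 = function(x) 1 / (1 + exp(-2 * x))
# Bayes rule for two classes: predict class 1 when its probability exceeds 0.5
bayes_classify = function(x) ifelse(p1(x) > 0.5, 1, 0)
x0 = c(-1.5, -0.2, 0.3, 2)
bayes_classify(x0)
```

For this particular `p1`, the Bayes decision boundary is the point x = 0, where the probability is exactly 0.5.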

__K-Nearest Neighbors__

Many approaches attempt to estimate the conditional distribution of Y given X, and then classify a given observation to the class with highest *estimated* probability. 

__*K-nearest neighbors* (KNN)__ is one such method.

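KNN identifies the __K__ training observations closest to the observation being classified and assigns it to the most common class among them, so you must specify __K__, the number of neighbors. With too large a K the model is too rigid; with too small a K it is too flexible.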
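A bare-bones version of the KNN rule can be written in a few lines of base R. The helper `knn_predict` and the two simulated classes are hypothetical and only meant to show the mechanics; the `class` package provides a ready-made `knn()` function for real use.

```{r}
# Minimal KNN classifier for illustration; knn_predict is a hypothetical helper
knn_predict = function(train_x, train_y, x0, k) {
  d = sqrt(rowSums(sweep(train_x, 2, x0)^2))   # Euclidean distance to x0
  nearest = order(d)[1:k]                      # indices of the k closest points
  votes = table(train_y[nearest])
  names(votes)[which.max(votes)]               # majority class among neighbors
}
set.seed(1)
train_x = rbind(matrix(rnorm(40, mean = 0), ncol = 2),   # class A near (0, 0)
                matrix(rnorm(40, mean = 3), ncol = 2))   # class B near (3, 3)
train_y = rep(c("A", "B"), each = 20)
knn_predict(train_x, train_y, x0 = c(0, 0), k = 3)
knn_predict(train_x, train_y, x0 = c(3, 3), k = 3)
```

Setting `k` equal to the number of training observations would make every prediction the overall majority class, which is the rigid extreme described above.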

## __2.3 Lab: Introduction to R__

### __*2.3.1 Basic Commands*__

__1.) Defines a vector 2.) Views it__
```{r}
x = c(1,3,2,5)
x
```
__Defines, lists, shows the lengths and sums the vectors.__
```{r}
x = c(1,6,2)
x
y=c(1,4,3)
y
```
```{r}
length(x)
length(y)
x+y
```
__Lists the objects in the current environment__
```{r}
ls()
```
__Removes the specified objects__
```{r}
rm(x,y)
ls()
```
__Removes all objects in the environment at once.__
```{r}
rm(list=ls())
```
__Creates a matrix from 3 arguments: the data, # of rows, and # of columns. *Note* it fills in the columns first.__
```{r}
x=matrix(data=c(1,2,3,4), nrow=2, ncol=2)
x
```
__Again, creates a matrix, but fills in the rows first.__
```{r}
x=matrix(data=c(1,2,3,4), nrow=2, ncol=2,byrow=T)
x
```
__Returns the square root of each element of a vector or matrix__
```{r}
sqrt(x)
x
x^2
```
__Generates a vector of random normal variables. First argument is *n* sample size. Then, derives the correlation between the two vectors.__
```{r}
x=rnorm(50)
y=x+rnorm(50,mean=50,sd=.1)
cor(x,y)
```
__Sets the seed of the random number generator so that the random draws that follow can be reproduced exactly.__
```{r}
set.seed(1303)
rnorm(50)
```
__Sets the seed for replicability, draws a vector of 100 random normal observations (the same as in the book), and then calculates the mean.__
```{r}
set.seed(3)
y=rnorm(100)
mean(y)
```
__Calculates the variance of y.__
```{r}
var(y)
```
__Calculates the square root of the variance of y (the standard deviation).__
```{r}
sqrt(var(y))
```
__A simpler method of calculating the standard deviation of y.__
```{r}
sd(y)
```

### __*2.3.2 Graphics*__

__Defines vectors *x* and *y* by a series of 100 random normal observations then plots them. Finally, plots them with specific labels on the corresponding axis and a Header.__
```{r}
x=rnorm(100)
y=rnorm(100)
plot(x,y)
plot(x,y,xlab="This is the X axis",ylab="This is the Y axis",main="Plot of X and Y")
```
__Creates a pdf of the graphic. First we open a pdf file and name it. Then we create the plot. Finally we call `dev.off()` to indicate that we are done. The file is saved in the current working directory.__
```{r}
pdf("Figure.pdf")
plot(x,y,col="green")
dev.off()
```
__Creates a sequence of numbers. If you specify two points it creates a vector of numbers between those two points. You can also specify length to give a sequence of equally spaced numbers between the two points by that length.__
```{r}
x=seq(1,10)
x
```
__More sophisticated form__
```{r}
x=seq(-pi,pi,length=50)
x
```

### __*2.3.3 Indexing Data*__

__Creates a matrix of values between 1 and 16 with 4 rows and 4 columns.__
```{r}
A=matrix(1:16,4,4)
A
```
Returns the value corresponding to __row 2 column 3__.
```{r}
A[2,3]
```
__Selects multiple rows and columns at a time, by providing vectors as the indices.__

Returns the values at the intersection of *rows* __1__ and __3__ and *columns* __2__ and __4__.
```{r}
A[c(1,3),c(2,4)]
```
Returns all the values that correspond to the intersection of *rows* __1__ through __3__ and *columns* __2__ through __4__.
```{r}
A[1:3,2:4]
```
Returns all values in rows __1__ and __2__.
```{r}
A[1:2,]
```
Returns all values in columns __1__ and __2__.
```{r}
A[,1:2]
```
Returns the vector __row 1__.
```{r}
A[1,]
```
Returns all values except rows __1__ and __3__.
```{r}
A[-c(1,3),]
```
Returns only the values not in rows __1__ and __3__ and not in columns __1__, __3__, and __4__. Basically only the values in rows __2__ and __4__ and column __2__ of the previously specified matrix.
```{r}
A[-c(1,3),-c(1,3,4)]
```

### __*2.3.4 Loading Data*__

Loads the file "Auto.data" from the current working directory and stores it in the environment as "Auto". __Note__ *the column names are not yet recognized, because the header row is read in as data*. `fix()` displays the data in a pop-out viewer.
```{r}
Auto=read.table("Auto.data")
fix(Auto)
```

Reloads the data, but specifies that the values in the first row are the header and that values with a question mark are missing values.
```{r}
Auto=read.table("Auto.data",header=T,na.strings="?")
fix(Auto)
```

__Opens an external viewer of the data and then reports the number of rows and columns.__
```{r}
fix(Auto)
dim(Auto)
```
__Lists the first four rows of data across the nine variables.__
```{r}
Auto[(1:4),]
```
__Removes the rows with missing observations and then gives the new dimensions__
```{r}
Auto=na.omit(Auto)
dim(Auto)
```
__Lists the variable names__.
```{r}
names(Auto)
```

### __*2.3.5 Additional Graphical and Numerical Summaries*__

```{r, error=TRUE}
plot(cylinders,mpg)
```
__*Note:*__ this call fails because R does not yet know where to find the variables *cylinders* and *mpg*; they exist only as columns of the `Auto` data frame.

__By joining the data frame and the variable name with a dollar sign, we tell R that cylinders and mpg are columns of that data frame.__
```{r}
plot(Auto$cylinders,Auto$mpg)
```

__However, by *"attaching"* Auto, we tell R to make the columns of that data frame available by name.__
```{r}
attach(Auto)
```
```{r}
plot(cylinders,mpg)
```

__Converts quantitative data into qualitative. Converts cylinders, which was previously quantitative with values between 3 and 8, into a factor with 5 levels. There are only 5 levels because, as the plot above shows, no car in the data has 7 cylinders.__
```{r}
cylinders=as.factor(cylinders)
```

__Now that the data is categorical, it generates a *boxplot*. Here are a few boxplots with different options.__
```{r}
plot(cylinders,mpg)
plot(cylinders,mpg,xlab="Cylinders",ylab="mpg",main="Mileage by # of Cylinders",col="red")
plot(cylinders,mpg,xlab="Cylinders",ylab="mpg",main="Mileage by # of Cylinders",col="red",varwidth=T)
```
__Creates a histogram of the data. Here are a few with different options.__
```{r}
hist(mpg)
hist(mpg,col=2)
hist(mpg,col=2,breaks=15)
```
__Creates a scatterplot matrix. The second also creates a scatterplot matrix, but only for the five variables we specified.__
```{r}
pairs(Auto)
pairs(~mpg+displacement+horsepower+weight+acceleration,Auto)
```
Plots the data and then enables a click tool in the plots console which prints the values of points you select. __*Note* does not work in R notebook, only in the console.__
```{r}
plot(horsepower,mpg)
identify(horsepower,mpg,name)
```
__Produces a numerical summary of each variable in a particular data set.__
```{r}
summary(Auto)
```

## __2.4 Exercises__

__*Conceptual*__

1. For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.
+ (a) The sample size *n* is extremely large, and the number of predictors *p* is small. __Better. With a very large *n*, a flexible method can fit the data more closely and reduce the bias, while the large sample keeps the variance, and thus the risk of overfitting, in check.__
+ (b) The number of predictors *p* is extremely large, and the number of observations *n* is small. __Worse. With many predictors and few observations we run a high risk of *overfitting* the data. A more flexible method only exacerbates this risk by fitting a function that follows the training data too closely and does not generalize to new observations.__
+ (c) The relationship between the predictors and response is highly non-linear. __Better. If the relationship is highly non-linear, then an inflexible model will not capture the functional form of *f* well.__
+ (d) The variance of the error terms, i.e. σ^2^ = Var(*E*), is extremely high. __Worse. A flexible method would tend to follow the noise in the error terms, so we again run the risk of overfitting.__

2. Explain whether each scenario is a *classification* or *regression* problem, and indicate whether we are most interested in *inference* or *prediction*. Finally, provide *n* and *p*.

+ (a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary. __Regression-Inference. *n*=500 *p*=3 (profit, number of employees, and industry; CEO salary is the response).__

+ (b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables. __Classification-Prediction. *n*=20 *p*=13 (price, marketing budget, competition price, and the ten other variables; success or failure is the response).__

+ (c) We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market. __Regression-Prediction. *n*=52 *p*=3__

3. We now revisit the bias-variance decomposition.

+ (a) Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Make sure to label each one. __See attached jpeg.__

+ (b) Explain why each of the five curves has the shape displayed in part (a).

> __As we move to more flexible models, the bias at first decreases faster than the variance increases. Past the optimal level of flexibility, however, the variance increases faster than the bias decreases. From that point on, although the training MSE keeps decreasing, we are beginning to overfit the data: the fitted model no longer generalizes well to the test set, so the training MSE and test MSE diverge, the former decreasing and the latter increasing. The irreducible error is unknown in practice and does not depend on the flexibility of the method, so it is represented by a horizontal line. It always lies below the test MSE because the test MSE contains the irreducible error.__

4. __Skipped__

5. What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred? 

__Advantages: a flexible approach fits the data better when the true form of *f* is non-linear, and it reduces the bias of our estimate.__

__Disadvantages: if the true relationship is linear, a more flexible approach runs the risk of overfitting the data. It also requires estimating more parameters and increases the variance.__

__If the data is non-linear, a more flexible approach is preferred. If the tradeoff is such that a more flexible model leads to a significant reduction in the bias and only a small increase in the variance, a more flexible approach is preferable.__

__If the data is very linear, a more rigid approach is preferred. If the tradeoff is such that a less flexible model only increases the bias slightly but greatly reduces the variance, a less flexible approach is preferable.__

__A more flexible approach is preferred when we are more interested in predictive power than in the interpretability of the results.__

__A less flexible approach would be preferred if we were more interested in interpreting the parameters as opposed to simply predicting a response.__

6.) Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric approach)? What are its disadvantages?

__A parametric approach assumes a functional form of *f* and thus reduces regression and classification problems down to approximating a set of parameters and then interpreting the results or predicting a response. It requires fewer observations than do non-parametric approaches.__

__A non-parametric approach does not make any assumption about the functional form of *f*. Rather it requires a large *n* to accurately estimate the functional form of *f*.__

__Advantages of parametric approach to regression or classification: does not require as great an *n* and can be estimated with relatively fewer parameters.__

__Disadvantages of parametric approach to regression or classification: run the risk of approximating a functional form of *f* that is far from the true *f*. Also, run the risk of overfitting the model via the use of more flexible models when not appropriate.__

7.) __Skipped__

#### __*Applied*__

8.) This exercise relates to the *College* data set, which can be found in the file *College.csv*. It contains a number of variables for 777 different universities and colleges in the US.

__Load the data__
```{r}
College=read.csv("College.csv",header=T)
```
__Use the first column (the college names) as row names. View the data__
```{r}
rownames(College)=College[,1]
fix(College)
```
__Eliminates the first column in the data where the names are stored. *Note* the college names now appear as row names and are no longer stored as a data column.__
```{r}
College=College[,-1]
fix(College)
```
__Summarize the data. *Note the college names are not stored as a data column.*__
```{r}
summary(College)
```
__Create a scatterplot of the first ten variables of the data.__
```{r}
pairs(College[,1:10])
```
__Plot Outstate vs. Private__
```{r}
# side-by-side boxplots of out-of-state tuition by Private (Yes/No)
boxplot(Outstate~Private,data=College,col="green",xlab="Private",ylab="Outstate")
```
__Create a new qualitative variable "Elite"__
```{r}
Elite=rep("No",nrow(College))
Elite[College$Top10perc >50]="Yes"
Elite=as.factor(Elite)
College=data.frame(College,Elite)
```
__Summary of how many Elite vs non-elite colleges__
```{r}
summary(Elite)
```
__Boxplots of Elite vs Outstate tuition__
```{r}
plot(College$Elite,College$Outstate)
```
__Histograms of different quantitative variables with different bins__
```{r}
par(mfrow=c(2,2))
hist(College$Apps,col="red",breaks=7)
hist(College$Accept,col="green",breaks=4)
hist(College$Enroll,col="orange",breaks=5)
```

#### __9. This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.__
__Import the data, specify headers in the top row and define missing values as "NA"__.
```{r}
Auto=read.table("Auto.data",header=T,na.strings="?")
Auto=na.omit(Auto)
dim(Auto)
summary(Auto)
```
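
To see what `na.strings` and `na.omit()` do without needing the file, here is a minimal sketch that reads a tiny made-up table from a string (the data in `demo_text` are hypothetical):

```{r}
# Three made-up rows; the "?" marks a missing horsepower value
demo_text <- "mpg horsepower\n18 130\n25 ?\n31 65"
demo_auto <- read.table(text = demo_text, header = TRUE, na.strings = "?")
is.na(demo_auto$horsepower)   # the "?" was read as NA, so the column stays numeric
nrow(na.omit(demo_auto))      # rows left after dropping any row containing NA
```

Without `na.strings="?"`, the column would be read as text, because "?" is not a number.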

__*Quantitative*__: mpg, displacement, horsepower, weight, acceleration and year

__*Qualitative*__: cylinders, origin and name

__Range of quantitative variables__
```{r}
Range=apply(Auto[,-c(2,8,9)],2,range)
row.names(Range)=c("Min","Max")
Range
```

__Mean and Standard Deviation of Quantitative Variables__
```{r}
SD_Mean=apply(Auto[,-c(2,8,9)],2,function(x){c(mean(x),sd(x))})
row.names(SD_Mean)=c("Mean","Standard Deviation")
SD_Mean
```
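
If the summary function returns a named vector, the row labels come for free and several statistics can be collected in one pass. A minimal sketch on a made-up data frame (`auto_demo` is hypothetical, not the Auto data):

```{r}
# Hypothetical stand-in for two quantitative columns
auto_demo <- data.frame(mpg = c(18, 15, 36, 27), weight = c(3504, 3693, 2130, 2790))
# The names inside c() become the row names of the resulting matrix
stats_demo <- sapply(auto_demo, function(x) c(Mean = mean(x), SD = sd(x), Min = min(x), Max = max(x)))
stats_demo
```
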
__Range excluding rows 10-85__
```{r}
Range2=apply(Auto[-c(10:85),-c(2,8,9)],2,range)
row.names(Range2)=c("Min","Max")
Range2
```
__Mean excluding rows 10-85__
```{r}
Mean2=apply(Auto[-c(10:85),-c(2,8,9)],2,mean)
Mean2
```
__Standard Deviation excluding rows 10-85__
```{r}
SD2=apply(Auto[-c(10:85),-c(2,8,9)],2,sd)
SD2
```
__Now joining mean and SD__
```{r}
MEAN_SD2=apply(Auto[-c(10:85),-c(2,8,9)],2,function(x){c(mean(x),sd(x))})
row.names(MEAN_SD2)=c("Mean","Standard_Deviation")
MEAN_SD2
```
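
Negative indices drop rows rather than select them, so `-c(10:85)` removes 76 observations. A quick sketch on a made-up 100-row data frame (`demo_rows` is hypothetical) confirms the count:

```{r}
demo_rows <- data.frame(id = 1:100, value = (1:100)^2)
# Drop rows 10 through 85 inclusive
kept <- demo_rows[-(10:85), ]
nrow(kept)       # rows remaining after the drop
range(kept$id)   # rows 1-9 and 86-100 are still present
```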

__Graphical depictions of the data__
```{r}
boxplot(Auto$mpg~Auto$cylinders,col="red",xlab="Cylinders",ylab="MPG",main="MPG by # of Cylinders")
scatter.smooth(Auto$year,Auto$mpg,col="green",xlab="Year",ylab="MPG",main="MPG by Year")
# Drop the text column "name" (column 9); pairs() needs numeric or factor columns
pairs(Auto[,-9])
boxplot(Auto$mpg~Auto$origin,col="grey",xlab="Country of Origin",ylab="MPG",main="MPG by Country of Origin",names=c("American","European","Japanese"))
```

#### __10. This exercise involves the *Boston* housing data set.__

```{r}
library(MASS)
Boston
?Boston
```

```{r}
dim(Boston)
```

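*The plot suggests that towns with high values of `black` tend to have lower crime rates. Note that `black` is the transformed index 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town, so it is not the proportion itself and should be interpreted with care.*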
```{r}
par(mfrow=c(2,2))
scatter.smooth(Boston$ptratio,Boston$lstat,xlab="Pupil-to-teacher Ratio",ylab="% of Lower Class Pop",main="PT Ratio by % of Lower Status",col="blue")
boxplot(Boston$medv~Boston$rad,col="green",xlab="Index of Accessibility to Radial Highways",ylab="Median Home Value",main="Median Home Value")
scatter.smooth(Boston$crim,Boston$black,col="red",xlab="Crime Rate",ylab="black index: 1000(Bk - 0.63)^2",main="Crime Rate by black Index")
```
*Also, it seems that areas with higher percentages of the population classified as "lower status" tend to have higher pupil-to-teacher ratios. This suggests that teachers in poorer areas may be responsible for more pupils each.*

*I am not sure what the index of accessibility to radial highways is, but it seems clear that in the case of "24" it is linked to lower median home values. However, the dispersion of values is of some note.*

```{r}
plot(Boston$crim)
identify(Boston$crim)
```
__The points identified above are the row indices of towns with extremely high crime rates.__

```{r}
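# chas is a 0/1 indicator; compare it to 1 to get a logical row selector
selection = Boston[,"chas"]==1
# Number of towns that bound the Charles River
nrow(Boston[selection,])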
```
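
An indicator column coded 0/1 behaves differently as a numeric index than as a logical one, which is easy to miss because the row count can still come out right. A minimal sketch with made-up values (`demo_chas` is hypothetical, not the Boston data):

```{r}
# Hypothetical 0/1 indicator with a second column to show which rows come back
demo_chas <- data.frame(chas = c(0, 1, 0, 1, 1), medv = c(20, 30, 25, 35, 40))
# Numeric indexing: each 0 is dropped and each 1 selects row 1 again
by_number <- demo_chas[demo_chas$chas, ]
# Logical indexing: keeps exactly the rows where the indicator equals 1
by_logical <- demo_chas[demo_chas$chas == 1, ]
by_number$medv
by_logical$medv
# Counting only needs the logical vector
sum(demo_chas$chas == 1)
```

Both selections have three rows, but only the logical version returns the intended rows.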

```{r}
median(Boston$ptratio)
```

```{r}
plot(Boston$medv)
identify(Boston$medv)
```

```{r}
print(Boston[399,])
```
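
`identify()` needs an interactive graphics device, so it does nothing useful when the document is knitted. A non-interactive way to locate the lowest value is `which.min()`, sketched here on a made-up vector (`demo_medv` is hypothetical, not the Boston data):

```{r}
demo_medv <- c(24.0, 21.6, 5.0, 33.4, 5.0)
which.min(demo_medv)                 # position of the first minimum
which(demo_medv == min(demo_medv))   # every position tied for the minimum
```

`which.min()` reports only the first position, so the second form is the safer check when ties are possible.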

```{r}
Range3=apply(Boston[,],2,range)
row.names(Range3)=c("Min","Max")
Range3
```

```{r}
MoreThanSeven=rep("No",nrow(Boston))
MoreThanSeven[Boston$rm>7]="Yes"
# Assign the result; as.factor() does not modify its argument in place
MoreThanSeven=as.factor(MoreThanSeven)
Boston=data.frame(Boston,MoreThanSeven)
summary(Boston$MoreThanSeven)
```

```{r}
morethaneight=rep("No",nrow(Boston))
morethaneight[Boston$rm>8]="Yes"
# Assign the result; as.factor() does not modify its argument in place
morethaneight=as.factor(morethaneight)
Boston=data.frame(Boston,morethaneight)
summary(Boston$morethaneight)
```
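
If only the counts are needed, summing a logical comparison avoids building a new column. A minimal sketch on made-up room averages (`demo_rm` is hypothetical and stands in for `Boston$rm`):

```{r}
demo_rm <- c(5.9, 7.2, 8.1, 6.4, 7.0, 8.7)
# TRUE counts as 1, so sum() gives the number of matches
n_over7 <- sum(demo_rm > 7)
n_over8 <- sum(demo_rm > 8)
# cut() bins all values at once; intervals are closed on the right by default
rm_bins <- table(cut(demo_rm, breaks = c(-Inf, 7, 8, Inf), labels = c("<=7", "7-8", ">8")))
n_over7
n_over8
rm_bins
```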






