Q1. For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than that of an inflexible method, and explain your answer.
(a) Better. With an extremely large sample size, a flexible method can fit the data more closely and would perform better than an inflexible approach.
(b) Worse. A flexible method would overfit the small number of observations.
(c) Better. When the relationship between the predictors and the response is highly non-linear, the additional degrees of freedom of a flexible method let it capture that relationship better than an inflexible one.
(d) Worse. When the variance of the error terms is extremely high, a flexible method would fit the noise in the error terms and increase variance.
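To illustrate (a) and (b) concretely, here is a small simulation sketch (not part of the original answer) in which the true \(f\) is non-linear: a very flexible fit (a degree-15 polynomial, an arbitrary illustrative choice) is compared with a linear fit on a held-out test set for a small and a large training sample.
set.seed(1)
f <- function(x) sin(2 * x)
x.test <- runif(1000, 0, 5)
y.test <- f(x.test) + rnorm(1000)
test.mse <- function(n) {
  x <- runif(n, 0, 5)
  y <- f(x) + rnorm(n)
  inflexible <- lm(y ~ x)            # linear fit: only 2 parameters
  flexible   <- lm(y ~ poly(x, 15))  # degree-15 polynomial: very flexible
  c(inflexible = mean((y.test - predict(inflexible, data.frame(x = x.test)))^2),
    flexible   = mean((y.test - predict(flexible,   data.frame(x = x.test)))^2))
}
test.mse(20)    # small n: we expect the flexible fit to have the larger test MSE
test.mse(2000)  # large n: we expect the flexible fit to have the smaller test MSE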
Q2. Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide \(n\) and \(p\).
(a) Regression and inference, with \(n = 500\) and \(p = 3\).
(b) Classification and prediction, with \(n = 20\) and \(p = 13\).
(c) Regression and prediction, with \(n = 52\) and \(p = 3\).
Q3. We now revisit the bias-variance decomposition.
The training MSE declines monotonically as flexibility increases, because a more flexible \(\hat{f}\) curve fits the observed data more closely. The test MSE initially declines as flexibility increases, but at some point it levels off and then starts to increase again (a U-shape): when an \(\hat{f}\) curve yields a small training MSE but a large test MSE we are overfitting the data, meaning the procedure tries too hard to find patterns in the training data that may be caused by chance rather than by true properties of the unknown \(f\).
The squared bias decreases monotonically and the variance increases monotonically; as a general rule, as we use more flexible methods, the variance will increase and the bias will decrease. Variance refers to the amount by which \(\hat{f}\) would change if we estimated it using a different training data set; if the curve fits the observations very closely, changing any single point may cause \(\hat{f}\) to change considerably, resulting in higher variance. Bias refers to the error introduced by approximating a real-life problem with a much simpler model; if we use a very simple model (e.g. linear regression), it is unlikely that any real-life problem has such a simple linear relationship, so performing linear regression will result in some bias in the estimate of \(f\).
The irreducible error is a constant, so its curve is a horizontal line. It lies below the test MSE curve because the expected test MSE is always greater than \(Var(\varepsilon)\) (see relation \((2.7)\)).
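For reference, relation \((2.7)\) decomposes the expected test MSE at a point \(x_0\) into these three components:
\[ E\left(y_0 - \hat{f}(x_0)\right)^2 = \mathrm{Var}\big(\hat{f}(x_0)\big) + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2 + \mathrm{Var}(\varepsilon). \]
Since the variance and the squared bias are both non-negative, the expected test MSE can never fall below \(\mathrm{Var}(\varepsilon)\).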
Q4. You will now think of some real-life applications for statistical learning.
Classification 1 – Is this TV series/movie/ad campaign going to be successful or not? (Response: success/failure. Predictors: money spent, talent, running time, producer, TV channel, air time slot, etc. Goal: prediction.)
Classification 2 – Should this applicant be admitted into Harvard University or not? (Response: admit/do not admit. Predictors: SAT scores, GPA, socio-economic stratum, income of parents, essay effectiveness, potential, etc. Goal: prediction.)
Classification 3 – Salk polio vaccine trials: successful or not? (Response: did the child get polio or not. Predictors: age, geography, general health condition, control/test group, etc. Goal: prediction.)
Regression 1 – GDP growth in European economies. (Response: the GDP each country is predicted to reach by 2050. Predictors: population, per capita income, education, average life expectancy, tax revenue, government spending, etc. Goal: inference.)
Regression 2 – Average house sale price in XXX neighborhood over the next 5 years. (Response: the average house in XXX neighborhood will sell for Y next year, Z the year after, and so on. Predictors: proximity to transit, parks, schools, average family size, average family income, crime rate, price fluctuations in surrounding neighborhoods, etc. Goal: inference.)
Regression 3 – Gas mileage that a new car design will achieve. (Response: with certain parameters set, X is the mileage we will get out of this car. Predictors: fuel type, number of cylinders, engine version, etc. Goal: inference.)
Cluster 1 – Division of countries into developed, developing and third world. (Response: by 2050, countries in Asia can be split into these clusters. Predictors: per capita income, purchasing power parity, average birth rate, average years of education received, average death rate, population, etc. Goal: prediction.)
Cluster 2 – Division of the working population into income segments for taxation purposes. (Response: this worker falls under this taxation bracket. Predictors: income, job industry, job segment, size of company, etc. Goal: prediction.)
Cluster 3 – Clustering new movies into ratings such as G/PG/PG-13/R. (Response: this movie is rated R, PG, or PG-13. Predictors: violent content, sexual language, theme, etc. Goal: prediction.)
Q5. What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?
The advantages of a very flexible approach are that it may give a better fit for non-linear models and it decreases the bias.
The disadvantages of a very flexible approach are that it requires estimating a greater number of parameters, it follows the noise too closely (overfit) and it increases the variance.
A more flexible approach would be preferred to a less flexible approach when we are interested in prediction and not the interpretability of the results.
A less flexible approach would be preferred to a more flexible approach when we are interested in inference and the interpretability of the results.
Q6. Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric approach)? What are its disadvantages?
A parametric approach reduces the problem of estimating \(f\) down to one of estimating a set of parameters because it assumes a form for \(f\).
A non-parametric approach does not assume a particular form for \(f\) and so requires a very large sample to accurately estimate \(f\).
The advantages of a parametric approach to regression or classification are the simplifying of modeling \(f\) to a few parameters and not as many observations are required compared to a non-parametric approach.
The disadvantages of a parametric approach to regression or classification are a potentially inaccurate estimate of \(f\) if the assumed form of \(f\) is wrong, and a tendency to overfit the observations if a more flexible parametric model is used.
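As an illustration (not part of the original answer), the following base-R sketch contrasts a parametric fit, which reduces the problem to estimating two coefficients, with a non-parametric local-regression fit, on simulated data where the true \(f\) is non-linear.
set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)   # the true f(x) = sin(x) is non-linear
param.fit <- lm(y ~ x)               # parametric: assumes a linear form, estimates 2 parameters
nonparam.fit <- loess(y ~ x)         # non-parametric: local regression, no assumed form for f
grid <- data.frame(x = seq(min(x), max(x), length.out = 200))
plot(x, y, col = "grey")
lines(grid$x, predict(param.fit, grid), col = "red", lwd = 2)      # biased: misses the curvature
lines(grid$x, predict(nonparam.fit, grid), col = "blue", lwd = 2)  # tracks the non-linear shape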
Q7. The table below provides a training data set containing 6 observations, 3 predictors, and 1 qualitative response variable. Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using K-nearest neighbors.
\[ \begin{array}{|c|c|c|c|c|c|} \hline \mathrm{Obs.} &X_1 &X_2 &X_3 &Y &\mathrm{Distance} \cr \hline 1 &0 &3 &0 &\mathrm{Red} &3 \cr 2 &2 &0 &0 &\mathrm{Red} &2 \cr 3 &0 &1 &3 &\mathrm{Red} &3.16 \cr 4 &0 &1 &2 &\mathrm{Green} &2.24 \cr 5 &-1 &0 &1 &\mathrm{Green} &1.41 \cr 6 &1 &1 &1 &\mathrm{Red} &1.73\cr \hline \end{array} \]
If \(K = 1\) then \(x_5\in\mathcal{N}_0\) and we have \[P(Y = \mathrm{Red} | X = x_0) = \frac{1}{1}\sum_{i\in\mathcal{N}_0}I(y_i = \mathrm{Red}) = I(y_5 = \mathrm{Red}) = 0\] and \[P(Y = \mathrm{Green} | X = x_0) = \frac{1}{1}\sum_{i\in\mathcal{N}_0}I(y_i = \mathrm{Green}) = I(y_5 = \mathrm{Green}) = 1.\] Our prediction is then Green.
If \(K = 3\) then \(x_2,x_5,x_6\in\mathcal{N}_0\) and we have \[P(Y = \mathrm{Red} | X = x_0) = \frac{1}{3}\sum_{i\in\mathcal{N}_0}I(y_i = \mathrm{Red}) = \frac{1}{3}(1 + 0 + 1) = \frac{2}{3}\] and \[P(Y = \mathrm{Green} | X = x_0) = \frac{1}{3}\sum_{i\in\mathcal{N}_0}I(y_i = \mathrm{Green}) = \frac{1}{3}(0 + 1 + 0) = \frac{1}{3}.\] Our prediction is then Red.
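These distances and predictions can be checked with a short R sketch (not part of the original solution); it assumes the class package is available.
library(class)
knn.train <- data.frame(X1 = c(0, 2, 0, 0, -1, 1),
                        X2 = c(3, 0, 1, 1, 0, 1),
                        X3 = c(0, 0, 3, 2, 1, 1))
knn.labels <- factor(c("Red", "Red", "Red", "Green", "Green", "Red"))
x0 <- data.frame(X1 = 0, X2 = 0, X3 = 0)
sqrt(rowSums(knn.train^2))             # Euclidean distances to x0 = (0, 0, 0)
knn(knn.train, x0, knn.labels, k = 1)  # Green, as computed above
knn(knn.train, x0, knn.labels, k = 3)  # Red, as computed above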
As \(K\) becomes larger, the decision boundary becomes less flexible (it approaches a linear boundary). So if the Bayes decision boundary in this problem is highly non-linear, we would expect the best value for \(K\) to be small.
Q8. This exercise relates to the “College” data set, which can be found in the file “College.csv”. It contains a number of variables for 777 different universities and colleges in the US.
library(ISLR)
data(College)  # the ISLR package also provides this data set as College
college <- read.csv("College.csv", stringsAsFactors = TRUE)  # the exercise works with the csv file; keep text columns (e.g. Private) as factors
head(college[, 1:5])
## X Private Apps Accept Enroll
## 1 Abilene Christian University Yes 1660 1232 721
## 2 Adelphi University Yes 2186 1924 512
## 3 Adrian College Yes 1428 1097 336
## 4 Agnes Scott College Yes 417 349 137
## 5 Alaska Pacific University Yes 193 146 55
## 6 Albertson College Yes 587 479 158
college.names <- college[, 1]  # keep the university names in a separate vector
college <- college[, -1]       # drop the name column from the data frame
head(college[, 1:5])
## Private Apps Accept Enroll Top10perc
## 1 Yes 1660 1232 721 23
## 2 Yes 2186 1924 512 16
## 3 Yes 1428 1097 336 22
## 4 Yes 417 349 137 60
## 5 Yes 193 146 55 16
## 6 Yes 587 479 158 38
summary(college)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data.
pairs(college[, 1:10])
Use the plot() function to produce side-by-side boxplots of “Outstate” versus “Private”.
plot(college$Private, college$Outstate, xlab = "Private University", ylab ="Out of State tuition in USD", main = "Outstate Tuition Plot")
Create a new qualitative variable, called “Elite”, by binning the “Top10perc” variable. Use the summary() function to see how many elite universities there are. Now use the plot() function to produce side-by-side boxplots of “Outstate” versus “Elite”.
Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college$Elite <- Elite
summary(college$Elite)
## No Yes
## 699 78
plot(college$Elite, college$Outstate, xlab = "Elite University", ylab ="Out of State tuition in USD", main = "Outstate Tuition Plot")
Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables.
par(mfrow = c(2,2))
hist(college$Books, col = 2, breaks = 10, xlab = "Books", ylab = "Count")
hist(college$PhD, col = 3, breaks = 20, xlab = "PhD", ylab = "Count")
hist(college$Grad.Rate, col = 4, breaks = 30, xlab = "Grad Rate", ylab = "Count")
hist(college$perc.alumni, col = 6, breaks = 40, xlab = "% alumni", ylab = "Count")
Continue exploring the data, and provide a brief summary of what you discover.
summary(college$PhD)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 62.00 75.00 72.66 85.00 103.00
It is a little odd to see a university with \(103\%\) of its faculty holding PhDs; let us see how many universities report this percentage, and what their names are.
weird.phd <- college[college$PhD == 103, ]
nrow(weird.phd)
## [1] 1
college.names[as.numeric(rownames(weird.phd))]
## [1] Texas A&M University at Galveston
## 777 Levels: Abilene Christian University ... York College of Pennsylvania
Q9. This exercise involves the “Auto” data set studied in the lab. Make sure the missing values have been removed from the data.
auto <- read.csv("Auto.csv", na.strings = "?", stringsAsFactors = TRUE)  # "?" marks missing values; keep name as a factor
auto <- na.omit(auto)
str(auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:5] 33 127 331 337 355
## .. ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
All variables except "name" are quantitative (with na.strings = "?", "horsepower" is read as an integer), although "cylinders" and "origin" are integer codes that could also be treated as qualitative. The summaries below exclude "horsepower" and "name" (columns 4 and 9).
summary(auto[, -c(4, 9)])
## mpg cylinders displacement weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :5140
## acceleration year origin
## Min. : 8.00 Min. :70.00 Min. :1.000
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000
## Median :15.50 Median :76.00 Median :1.000
## Mean :15.54 Mean :75.98 Mean :1.577
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000
## Max. :24.80 Max. :82.00 Max. :3.000
sapply(auto[, -c(4, 9)], mean)
## mpg cylinders displacement weight acceleration
## 23.445918 5.471939 194.411990 2977.584184 15.541327
## year origin
## 75.979592 1.576531
sapply(auto[, -c(4, 9)], sd)
## mpg cylinders displacement weight acceleration
## 7.8050075 1.7057832 104.6440039 849.4025600 2.7588641
## year origin
## 3.6837365 0.8055182
auto.sub <- auto[-c(10:85), -c(4, 9)]  # remove the 10th through 85th observations; drop horsepower and name
sapply(auto.sub, range)
## mpg cylinders displacement weight acceleration year origin
## [1,] 11.0 3 68 1649 8.5 70 1
## [2,] 46.6 8 455 4997 24.8 82 3
sapply(auto.sub, mean)
## mpg cylinders displacement weight acceleration
## 24.404430 5.373418 187.240506 2935.971519 15.726899
## year origin
## 77.145570 1.601266
sapply(auto.sub, sd)
## mpg cylinders displacement weight acceleration
## 7.867283 1.654179 99.678367 811.300208 2.693721
## year origin
## 3.106217 0.819910
auto$cylinders <- as.factor(auto$cylinders)
auto$year <- as.factor(auto$year)
auto$origin <- as.factor(auto$origin)
pairs(auto)
We seem to get more miles per gallon on 4-cylinder vehicles than on the others. Weight, displacement and horsepower have an inverse relationship with mpg. We also see an overall increase in mpg over the years; it almost doubled in a decade. Japanese cars have higher mpg than US or European cars.
From the plots above, cylinders, horsepower, year and origin can be used as predictors. Displacement and weight were not used because they are highly correlated with horsepower and with each other.
auto$horsepower <- as.numeric(auto$horsepower)  # horsepower was read as an integer; ensure a numeric type before computing correlations
cor(auto$weight, auto$horsepower)
## [1] 0.8645377
cor(auto$weight, auto$displacement)
## [1] 0.9329944
cor(auto$displacement, auto$horsepower)
## [1] 0.897257
Q10. This exercise involves the “Boston” housing data set.
library(MASS)
Boston$chas <- as.factor(Boston$chas)
nrow(Boston)
## [1] 506
ncol(Boston)
## [1] 14
par(mfrow = c(2, 2))
plot(Boston$nox, Boston$crim)
plot(Boston$rm, Boston$crim)
plot(Boston$age, Boston$crim)
plot(Boston$dis, Boston$crim)
hist(Boston$crim, breaks = 50)
Most suburbs have low crime rates: crim is below 20 for all but 18 of the 506 suburbs (about \(96\%\) of the data).
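As a quick check of that proportion (a line added here, not in the original code):
mean(Boston$crim < 20)  # fraction of suburbs with crim below 20, roughly 0.96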
pairs(Boston[Boston$crim < 20, ])
There may be a relationship between crim and nox, rm, age, dis, lstat and medv.
hist(Boston$crim, breaks = 50)
nrow(Boston[Boston$crim > 20, ])
## [1] 18
hist(Boston$tax, breaks = 50)
nrow(Boston[Boston$tax == 666, ])
## [1] 132
hist(Boston$ptratio, breaks = 50)
nrow(Boston[Boston$ptratio > 20, ])
## [1] 201
nrow(Boston[Boston$chas == 1, ])
## [1] 35
median(Boston$ptratio)
## [1] 19.05
row.names(Boston)[which.min(Boston$medv)]  # suburb with the lowest median home value
## [1] "399"
range(Boston$tax)
## [1] 187 711
Boston[which.min(Boston$medv), ]$tax  # tax rate for that suburb, near the top of the range
## [1] 666
nrow(Boston[Boston$rm > 7, ])
## [1] 64
nrow(Boston[Boston$rm > 8, ])
## [1] 13