Q1. For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than that of an inflexible method, and explain your answer.
(a) Better. With an extremely large sample size, a flexible method can fit the data more closely and would perform better than an inflexible approach.
(b) Worse. A flexible method would overfit the small number of observations.
(c) Better. When the relationship between the predictors and the response is highly non-linear, the additional degrees of freedom of a flexible method let it capture that relationship better than an inflexible one.
(d) Worse. When the variance of the error terms is extremely high, a flexible method would fit the noise in the error terms and increase variance.
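To illustrate (a) and (b) concretely, here is a small simulation sketch (not part of the original answer) in which the true \(f\) is non-linear: a very flexible fit (a degree-15 polynomial, an arbitrary illustrative choice) is compared with a linear fit on a held-out test set for a small and a large training sample.
set.seed(1)
f <- function(x) sin(2 * x)
x.test <- runif(1000, 0, 5)
y.test <- f(x.test) + rnorm(1000)
test.mse <- function(n) {
  x <- runif(n, 0, 5)
  y <- f(x) + rnorm(n)
  inflexible <- lm(y ~ x)            # linear fit: only 2 parameters
  flexible   <- lm(y ~ poly(x, 15))  # degree-15 polynomial: very flexible
  c(inflexible = mean((y.test - predict(inflexible, data.frame(x = x.test)))^2),
    flexible   = mean((y.test - predict(flexible,   data.frame(x = x.test)))^2))
}
test.mse(20)    # small n: we expect the flexible fit to have the larger test MSE
test.mse(2000)  # large n: we expect the flexible fit to have the smaller test MSE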
Q2. Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide \(n\) and \(p\).
(a) Regression and inference, with \(n = 500\) and \(p = 3\).
(b) Classification and prediction, with \(n = 20\) and \(p = 13\).
(c) Regression and prediction, with \(n = 52\) and \(p = 3\).
Q3. We now revisit the bias-variance decomposition.
The training MSE declines monotonically as flexibility increases, because a more flexible \(\hat{f}\) curve fits the observed data more closely. The test MSE initially declines as flexibility increases, but at some point it levels off and then starts to increase again (a U-shape): when an \(\hat{f}\) curve yields a small training MSE but a large test MSE we are overfitting the data, meaning the procedure tries too hard to find patterns in the training data that may be caused by chance rather than by true properties of the unknown \(f\).
The squared bias decreases monotonically and the variance increases monotonically; as a general rule, as we use more flexible methods, the variance will increase and the bias will decrease. Variance refers to the amount by which \(\hat{f}\) would change if we estimated it using a different training data set; if the curve fits the observations very closely, changing any single point may cause \(\hat{f}\) to change considerably, resulting in higher variance. Bias refers to the error introduced by approximating a real-life problem with a much simpler model; if we use a very simple model (e.g. linear regression), it is unlikely that any real-life problem has such a simple linear relationship, so performing linear regression will result in some bias in the estimate of \(f\).
The irreducible error is a constant, so its curve is a horizontal line. It lies below the test MSE curve because the expected test MSE is always greater than \(Var(\varepsilon)\) (see relation \((2.7)\)).
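For reference, relation \((2.7)\) decomposes the expected test MSE at a point \(x_0\) into these three components:
\[ E\left(y_0 - \hat{f}(x_0)\right)^2 = \mathrm{Var}\big(\hat{f}(x_0)\big) + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2 + \mathrm{Var}(\varepsilon). \]
Since the variance and the squared bias are both non-negative, the expected test MSE can never fall below \(\mathrm{Var}(\varepsilon)\).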
Q4. You will now think of some real-life applications for statistical learning.
Classification 1 – Is this TV series/movie/ad campaign going to be successful or not? (Response: success/failure. Predictors: money spent, talent, running time, producer, TV channel, air time slot, etc. Goal: prediction.)
Classification 2 – Should this applicant be admitted into Harvard University or not? (Response: admit/do not admit. Predictors: SAT scores, GPA, socio-economic stratum, income of parents, essay effectiveness, potential, etc. Goal: prediction.)
Classification 3 – Salk polio vaccine trials: successful or not? (Response: did the child get polio or not. Predictors: age, geography, general health condition, control/test group, etc. Goal: prediction.)
Regression 1 – GDP growth in European economies. (Response: the GDP each country is predicted to reach by 2050. Predictors: population, per capita income, education, average life expectancy, tax revenue, government spending, etc. Goal: inference.)
Regression 2 – Average house sale price in XXX neighborhood over the next 5 years. (Response: the average house in XXX neighborhood will sell for Y next year, Z the year after, and so on. Predictors: proximity to transit, parks, schools, average family size, average family income, crime rate, price fluctuations in surrounding neighborhoods, etc. Goal: inference.)
Regression 3 – Gas mileage that a new car design will achieve. (Response: with certain parameters set, X is the mileage we will get out of this car. Predictors: fuel type, number of cylinders, engine version, etc. Goal: inference.)
Cluster 1 – Division of countries into developed, developing and third world. (Response: by 2050, countries in Asia can be split into these clusters. Predictors: per capita income, purchasing power parity, average birth rate, average years of education received, average death rate, population, etc. Goal: prediction.)
Cluster 2 – Division of the working population into income segments for taxation purposes. (Response: this worker falls under this taxation bracket. Predictors: income, job industry, job segment, size of company, etc. Goal: prediction.)
Cluster 3 – Clustering new movies into ratings such as G/PG/PG-13/R. (Response: this movie is rated R, PG, or PG-13. Predictors: violent content, sexual language, theme, etc. Goal: prediction.)
Q5. What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?
The advantages of a very flexible approach are that it may give a better fit for non-linear models and it decreases the bias.
The disadvantages of a very flexible approach are that it requires estimating a greater number of parameters, it follows the noise too closely (overfit) and it increases the variance.
A more flexible approach would be preferred to a less flexible approach when we are interested in prediction and not the interpretability of the results.
A less flexible approach would be preferred to a more flexible approach when we are interested in inference and the interpretability of the results.
Q6. Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric approach)? What are its disadvantages?
A parametric approach reduces the problem of estimating \(f\) down to one of estimating a set of parameters because it assumes a form for \(f\).
A non-parametric approach does not assume a particular form for \(f\) and so requires a very large sample to accurately estimate \(f\).
The advantages of a parametric approach to regression or classification are the simplifying of modeling \(f\) to a few parameters and not as many observations are required compared to a non-parametric approach.
The disadvantages of a parametric approach to regression or classification are a potentially inaccurate estimate of \(f\) if the assumed form of \(f\) is wrong, and a tendency to overfit the observations if a more flexible parametric model is used.
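As an illustration (not part of the original answer), the following base-R sketch contrasts a parametric fit, which reduces the problem to estimating two coefficients, with a non-parametric local-regression fit, on simulated data where the true \(f\) is non-linear.
set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)   # the true f(x) = sin(x) is non-linear
param.fit <- lm(y ~ x)               # parametric: assumes a linear form, estimates 2 parameters
nonparam.fit <- loess(y ~ x)         # non-parametric: local regression, no assumed form for f
grid <- data.frame(x = seq(min(x), max(x), length.out = 200))
plot(x, y, col = "grey")
lines(grid$x, predict(param.fit, grid), col = "red", lwd = 2)      # biased: misses the curvature
lines(grid$x, predict(nonparam.fit, grid), col = "blue", lwd = 2)  # tracks the non-linear shape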
Q7. The table below provides a training data set containing 6 observations, 3 predictors, and 1 qualitative response variable. Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using K-nearest neighbors.
\[ \begin{array}{|c|c|c|c|c|c|} \hline \mathrm{Obs.} &X_1 &X_2 &X_3 &Y &\mathrm{Distance} \cr \hline 1 &0 &3 &0 &\mathrm{Red} &3 \cr 2 &2 &0 &0 &\mathrm{Red} &2 \cr 3 &0 &1 &3 &\mathrm{Red} &3.16 \cr 4 &0 &1 &2 &\mathrm{Green} &2.24 \cr 5 &-1 &0 &1 &\mathrm{Green} &1.41 \cr 6 &1 &1 &1 &\mathrm{Red} &1.73\cr \hline \end{array} \]
If \(K = 1\) then \(x_5\in\mathcal{N}_0\) and we have \[P(Y = \mathrm{Red} | X = x_0) = \frac{1}{1}\sum_{i\in\mathcal{N}_0}I(y_i = \mathrm{Red}) = I(y_5 = \mathrm{Red}) = 0\] and \[P(Y = \mathrm{Green} | X = x_0) = \frac{1}{1}\sum_{i\in\mathcal{N}_0}I(y_i = \mathrm{Green}) = I(y_5 = \mathrm{Green}) = 1.\] Our prediction is then Green.
If \(K = 3\) then \(x_2,x_5,x_6\in\mathcal{N}_0\) and we have \[P(Y = \mathrm{Red} | X = x_0) = \frac{1}{3}\sum_{i\in\mathcal{N}_0}I(y_i = \mathrm{Red}) = \frac{1}{3}(1 + 0 + 1) = \frac{2}{3}\] and \[P(Y = \mathrm{Green} | X = x_0) = \frac{1}{3}\sum_{i\in\mathcal{N}_0}I(y_i = \mathrm{Green}) = \frac{1}{3}(0 + 1 + 0) = \frac{1}{3}.\] Our prediction is then Red.
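These distances and predictions can be checked with a short R sketch (not part of the original solution); it assumes the class package is available.
library(class)
knn.train <- data.frame(X1 = c(0, 2, 0, 0, -1, 1),
                        X2 = c(3, 0, 1, 1, 0, 1),
                        X3 = c(0, 0, 3, 2, 1, 1))
knn.labels <- factor(c("Red", "Red", "Red", "Green", "Green", "Red"))
x0 <- data.frame(X1 = 0, X2 = 0, X3 = 0)
sqrt(rowSums(knn.train^2))             # Euclidean distances to x0 = (0, 0, 0)
knn(knn.train, x0, knn.labels, k = 1)  # Green, as computed above
knn(knn.train, x0, knn.labels, k = 3)  # Red, as computed above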
As \(K\) becomes larger, the decision boundary becomes less flexible (it approaches a linear boundary). So if the Bayes decision boundary in this problem is highly non-linear, we would expect the best value for \(K\) to be small.
Q8. This exercise relates to the “College” data set, which can be found in the file “College.csv”. It contains a number of variables for 777 different universities and colleges in the US.
library(ISLR)
data(College)  # the ISLR package also provides this data set as College
college <- read.csv("College.csv", stringsAsFactors = TRUE)  # the exercise works with the csv file; keep text columns (e.g. Private) as factors
head(college[, 1:5])
## X Private Apps Accept Enroll
## 1 Abilene Christian University Yes 1660 1232 721
## 2 Adelphi University Yes 2186 1924 512
## 3 Adrian College Yes 1428 1097 336
## 4 Agnes Scott College Yes 417 349 137
## 5 Alaska Pacific University Yes 193 146 55
## 6 Albertson College Yes 587 479 158
college.names <- college[, 1]  # keep the university names in a separate vector
college <- college[, -1]       # drop the name column from the data frame
head(college[, 1:5])
## Private Apps Accept Enroll Top10perc
## 1 Yes 1660 1232 721 23
## 2 Yes 2186 1924 512 16
## 3 Yes 1428 1097 336 22
## 4 Yes 417 349 137 60
## 5 Yes 193 146 55 16
## 6 Yes 587 479 158 38
summary(college)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data.
pairs(college[, 1:10])
Use the plot() function to produce side-by-side boxplots of “Outstate” versus “Private”.
plot(college$Private, college$Outstate, xlab = "Private University", ylab ="Out of State tuition in USD", main = "Outstate Tuition Plot")
Create a new qualitative variable, called “Elite”, by binning the “Top10perc” variable. Use the summary() function to see how many elite universities there are. Now use the plot() function to produce side-by-side boxplots of “Outstate” versus “Elite”.
Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college$Elite <- Elite
summary(college$Elite)
## No Yes
## 699 78
plot(college$Elite, college$Outstate, xlab = "Elite University", ylab ="Out of State tuition in USD", main = "Outstate Tuition Plot")
Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables.
par(mfrow = c(2,2))
hist(college$Books, col = 2, breaks = 10, xlab = "Books", ylab = "Count")
hist(college$PhD, col = 3, breaks = 20, xlab = "PhD", ylab = "Count")
hist(college$Grad.Rate, col = 4, breaks = 30, xlab = "Grad Rate", ylab = "Count")
hist(college$perc.alumni, col = 6, breaks = 40, xlab = "% alumni", ylab = "Count")
Continue exploring the data, and provide a brief summary of what you discover.
summary(college$PhD)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 62.00 75.00 72.66 85.00 103.00
It is a little odd to see a university with \(103\%\) of its faculty holding PhDs; let us see how many universities report this percentage, and what their names are.
weird.phd <- college[college$PhD == 103, ]
nrow(weird.phd)
## [1] 1
college.names[as.numeric(rownames(weird.phd))]
## [1] Texas A&M University at Galveston
## 777 Levels: Abilene Christian University ... York College of Pennsylvania
Q9. This exercise involves the “Auto” data set studied in the lab. Make sure the missing values have been removed from the data.
auto <- read.csv("Auto.csv", na.strings = "?", stringsAsFactors = TRUE)  # "?" marks missing values; keep name as a factor
auto <- na.omit(auto)
str(auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:5] 33 127 331 337 355
## .. ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
All variables except "name" are quantitative (with na.strings = "?", "horsepower" is read as an integer), although "cylinders" and "origin" are integer codes that could also be treated as qualitative. The summaries below exclude "horsepower" and "name" (columns 4 and 9).
summary(auto[, -c(4, 9)])
## mpg cylinders displacement weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :5140
## acceleration year origin
## Min. : 8.00 Min. :70.00 Min. :1.000
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000
## Median :15.50 Median :76.00 Median :1.000
## Mean :15.54 Mean :75.98 Mean :1.577
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000
## Max. :24.80 Max. :82.00 Max. :3.000
sapply(auto[, -c(4, 9)], mean)
## mpg cylinders displacement weight acceleration
## 23.445918 5.471939 194.411990 2977.584184 15.541327
## year origin
## 75.979592 1.576531
sapply(auto[, -c(4, 9)], sd)
## mpg cylinders displacement weight acceleration
## 7.8050075 1.7057832 104.6440039 849.4025600 2.7588641
## year origin
## 3.6837365 0.8055182
auto.sub <- auto[-c(10:85), -c(4, 9)]  # remove the 10th through 85th observations; drop horsepower and name
sapply(auto.sub, range)
## mpg cylinders displacement weight acceleration year origin
## [1,] 11.0 3 68 1649 8.5 70 1
## [2,] 46.6 8 455 4997 24.8 82 3
sapply(auto.sub, mean)
## mpg cylinders displacement weight acceleration
## 24.404430 5.373418 187.240506 2935.971519 15.726899
## year origin
## 77.145570 1.601266
sapply(auto.sub, sd)
## mpg cylinders displacement weight acceleration
## 7.867283 1.654179 99.678367 811.300208 2.693721
## year origin
## 3.106217 0.819910
auto$cylinders <- as.factor(auto$cylinders)
auto$year <- as.factor(auto$year)
auto$origin <- as.factor(auto$origin)
pairs(auto)
We seem to get more miles per gallon on 4-cylinder vehicles than on the others. Weight, displacement and horsepower have an inverse relationship with mpg. We also see an overall increase in mpg over the years; it almost doubled in a decade. Japanese cars have higher mpg than US or European cars.
From the plots above, cylinders, horsepower, year and origin can be used as predictors. Displacement and weight were not used because they are highly correlated with horsepower and with each other.
auto$horsepower <- as.numeric(auto$horsepower)  # horsepower was read as an integer; ensure a numeric type before computing correlations
cor(auto$weight, auto$horsepower)
## [1] 0.8645377
cor(auto$weight, auto$displacement)
## [1] 0.9329944
cor(auto$displacement, auto$horsepower)
## [1] 0.897257
Q10. This exercise involves the “Boston” housing data set.
library(MASS)
Boston$chas <- as.factor(Boston$chas)
nrow(Boston)
## [1] 506
ncol(Boston)
## [1] 14
par(mfrow = c(2, 2))
plot(Boston$nox, Boston$crim)
plot(Boston$rm, Boston$crim)
plot(Boston$age, Boston$crim)
plot(Boston$dis, Boston$crim)
hist(Boston$crim, breaks = 50)
Most suburbs have low crime rates: crim is below 20 for all but 18 of the 506 suburbs (about \(96\%\) of the data).
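As a quick check of that proportion (a line added here, not in the original code):
mean(Boston$crim < 20)  # fraction of suburbs with crim below 20, roughly 0.96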
pairs(Boston[Boston$crim < 20, ])
There may be a relationship between crim and nox, rm, age, dis, lstat and medv.
hist(Boston$crim, breaks = 50)
nrow(Boston[Boston$crim > 20, ])
## [1] 18
hist(Boston$tax, breaks = 50)
nrow(Boston[Boston$tax == 666, ])
## [1] 132
hist(Boston$ptratio, breaks = 50)
nrow(Boston[Boston$ptratio > 20, ])
## [1] 201
nrow(Boston[Boston$chas == 1, ])
## [1] 35
median(Boston$ptratio)
## [1] 19.05
row.names(Boston)[which.min(Boston$medv)]  # suburb with the lowest median home value
## [1] "399"
range(Boston$tax)
## [1] 187 711
Boston[which.min(Boston$medv), ]$tax  # tax rate for that suburb, near the top of the range
## [1] 666
nrow(Boston[Boston$rm > 7, ])
## [1] 64
nrow(Boston[Boston$rm > 8, ])
## [1] 13