This notebook was prepared by Everton Lima.

Introduction to Statistical Learning Solutions (ISLR)

Ch 2 Exercises

Table of Contents

Conceptual

Applied

1

1a

For a very large number of observations and a small set of predictors, the variance of any quantity estimated from the data is small. For example, the mean of a predictor p, \(\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} p_i\), has variance \(Var(\hat{\mu}) = \frac{\sigma^{2}}{n}\), which is inversely proportional to n. Thus, I would expect a more flexible model to perform better, since it can achieve low bias without being heavily penalized by the variance of its estimates.
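
As a quick illustration (my own simulated sketch, not part of the exercise), the variance of an estimate such as the sample mean shrinks as n grows:

# Estimate the mean of a standard normal from 500 samples of each size n;
# the variance of the estimates is roughly 1/n.
set.seed(1)
sapply(c(10, 100, 1000, 10000), function(n) {
  var(replicate(500, mean(rnorm(n))))
})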

1b

If the number of predictors is extremely large and the number of observations is small, then the estimates carry a high variance, and I would expect an inflexible model to perform better. Many machine learning methods are not even applicable in this setting: for example, if there are more predictors than observations, the coefficients of a linear regression model cannot all be estimated.
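
As a small illustration with made-up data (my own sketch), fitting a linear model with more predictors than observations leaves some coefficients inestimable:

# p = 10 predictors but only n = 5 observations; lm() returns NA for the
# coefficients it cannot estimate.
set.seed(1)
dat <- as.data.frame(matrix(rnorm(5 * 10), nrow = 5))
dat$y <- rnorm(5)
coef(lm(y ~ ., data = dat))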

1c

A flexible model will perform better, since an inflexible model would not be able to capture a highly non-linear f.

1d

An inflexible model will perform better. When the variance of the error terms is extremely high, a flexible model would fit the noise in the training data rather than the underlying signal.

2

2a

This is an inference (regression) problem because we are interested in which factors affect CEO salary. There are n = 500 observations (firms) and p = 3 predictors (profit, number of employees, and industry).

2b

This is a prediction problem, and a classification one, since the outcome is success or failure. There are n = 20 observations and p = 13 predictors (price charged, marketing budget, competition price, and ten other variables).

2c

We are interested in predicting the target variable (the % change in the USD/Euro exchange rate), so this is a prediction (regression) problem. There are p = 3 predictors, and the number of observations is the number of weeks in 2012, so n = 52.

3

3a

library('ggplot2')
library('reshape2')

# Flexibility is on the x-axis. The exact functional forms below are only
# illustrative; they are chosen so the curves have the shapes discussed in 3b.
xrange <- 0:1000
bias <- 10 - xrange                                # bias decreases with flexibility
variance <- xrange                                 # variance increases with flexibility
testError <- 6 + (xrange - 5)^2                    # U-shaped test error
trainError <- sapply(xrange, function(x){ if(x <= 5) 5.5 + (x - 5)^2 else 11 - x })  # keeps decreasing
bayesError <- rep(5.75, length(xrange))            # irreducible (Bayes) error: horizontal line

data <- data.frame(xrange, variance, bias, trainError, testError, bayesError)
data_long <- melt(data, id = "xrange")

ggplot(data_long, aes(x = xrange, y = value, colour = variable)) + ylim(2, 10) + xlim(0, 10) + geom_line(na.rm = T) + 
theme(axis.line=element_blank(),axis.text.x=element_blank(),
          axis.text.y=element_blank(),axis.ticks=element_blank(),
          axis.title.x=element_blank(),
          axis.title.y=element_blank(),legend.position="none",
          panel.background=element_blank(),panel.border=element_blank(),panel.grid.major=element_blank(),
          panel.grid.minor=element_blank(),plot.background=element_blank())

3b

Variance increases with flexibility. The true relationship between variance and model flexibility is likely non-linear, but it is drawn as linear here; this is still a reasonable sketch, since it makes clear that an increase in flexibility brings an increase in variance.

Bias is inversely related to flexibility: as flexibility increases, bias decreases (drawn in the sketch as a linear decrease).

The test error follows a 'U'-shaped curve. An increase in flexibility typically improves performance at first; performance only worsens later, once the extra variance introduced by a more flexible model is no longer offset by the decrease in bias.

The training error curve closely follows the test error curve initially, but as flexibility increases this quantity continues to decrease. With a high level of flexibility the training error can be driven to zero, but such a model is overfit: it provides no insight into the underlying form of \(f\) and instead models the noise.

The Bayes error is represented by a horizontal line. This is the irreducible error, and it marks the lower bound on the expected test error. A model with zero training error is therefore typically modeling noise, since the test error cannot reasonably fall below the irreducible error (unless there is no noise inherent in the data, which is unlikely).
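
A small simulation (my own sketch with made-up data, not part of the exercise) reproduces these shapes: as the polynomial degree, a stand-in for flexibility, increases, the training error keeps falling while the test error traces a U.

# Fit polynomials of increasing degree to noisy data and track
# the training and test mean squared error.
set.seed(1)
f <- function(x) sin(2 * x)
x_train <- runif(50, 0, 3); y_train <- f(x_train) + rnorm(50, sd = 0.3)
x_test  <- runif(50, 0, 3); y_test  <- f(x_test)  + rnorm(50, sd = 0.3)
errors <- sapply(1:15, function(d) {
  fit <- lm(y_train ~ poly(x_train, d))
  c(train = mean(residuals(fit)^2),
    test  = mean((y_test - predict(fit, data.frame(x_train = x_test)))^2))
})
matplot(1:15, t(errors), type = "l", lty = 1,
        xlab = "Flexibility (polynomial degree)", ylab = "MSE")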

4

4a

In classification problems the target variable is typically nominal. Classifying emails as spam or not-spam is a classic application of classification. Another well-known application is predicting whether a bank customer will default on their loan. However, target variables can also be ordinal; a typical application is rating or score prediction.

4b

Regression has a wide range of applications in machine learning, and its use long predates the field: least squares regression was first used to predict the orbits of celestial bodies. Other applications include predicting percentage changes in the stock market and predicting housing prices.

4c

Clustering is an important technique for knowledge discovery. It can be used to provide suggestions to users by finding similar customers in a web store or on a movie streaming site.

5

Flexible models make less strict assumptions (or none at all) about the form of f, while inflexible models impose a strong bias on the form of f.

Flexible models should be preferred when there is a large amount of data available, because with many observations the variance of the fitted model is smaller. Moreover, flexible models generally make better predictive models.

Inflexible models should be used when the number of observations is small, or when it is known that there is high variance (noise) in the observations. This type of model is also better suited to inference.

6

Parametric methods make explicit assumptions about the form of f, while non-parametric methods do not. Parametric methods therefore suffer from a higher bias and are less flexible than non-parametric methods. This class of model is well suited to inference, since it provides an explicit relationship between the target variable and the set of predictors.

Non-parametric models are much more flexible than parametric models, since no form of f is assumed; the goal is instead to estimate f as closely as possible from the observed data. However, they have disadvantages compared to parametric methods. Typically many more observations are needed (think of K-nearest neighbours), and because this class of model does not provide an explicit relationship between the target variable and the predictors, non-parametric models are harder to interpret.
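
A toy comparison on simulated data (my own sketch): a linear model commits to a parametric form for f, while a non-parametric smoother such as loess lets the data determine the shape.

# Parametric (linear) fit vs. non-parametric (loess) fit.
set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)
fit_param    <- lm(y ~ x)     # assumes f is linear in x
fit_nonparam <- loess(y ~ x)  # no explicit form for f assumed
plot(x, y, col = "grey")
abline(fit_param, col = "red")
ord <- order(x)
lines(x[ord], predict(fit_nonparam)[ord], col = "blue")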

7

7a

train <- data.frame(X1=c(0,2,0,0,-1,1),X2=c(3,0,1,1,0,1),X3=c(0,0,3,2,1,1),Y=c("Red","Red","Red","Green","Green","Red"))
print(train) 
##   X1 X2 X3     Y
## 1  0  3  0   Red
## 2  2  0  0   Red
## 3  0  1  3   Red
## 4  0  1  2 Green
## 5 -1  0  1 Green
## 6  1  1  1   Red
# Euclidean distance from the test point X1 = X2 = X3 = 0 to each training observation.
scores = apply(train[,-4],1,function(x){  sqrt(sum((c(0,0,0)-x)^2)) })
names(scores) = 1:6
print(scores)
##        1        2        3        4        5        6 
## 3.000000 2.000000 3.162278 2.236068 1.414214 1.732051

7b

Green. With K = 1 the prediction is the class of the single nearest neighbour, which is observation 5 (the point with the smallest Euclidean distance from the origin), and observation 5 is Green.

7c

Red. In this case the three nearest points are observations 5, 6, and 2. Since 6 and 2 are both Red, the majority vote, and hence the prediction, is Red.
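
Using the distances already computed in 7a, both predictions can also be read off in code (a small sketch):

# K = 1: class of the single nearest neighbour (observation 5).
train$Y[order(scores)[1]]
# K = 3: majority class among the three nearest neighbours (observations 5, 6, and 2).
names(which.max(table(train$Y[order(scores)[1:3]])))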

7d

If the Bayes decision boundary is highly non-linear, I would expect the best value of K to be small, since small values of K produce a more flexible decision boundary.

8

8a

The data can also be loaded directly from the ISLR package.

library('ISLR')
?College

8b

The fix function allows the supplied argument (either data or a function) to be edited. Here it is used simply for inspection; other functions such as head are typically more practical for this.

Also, since the data has been loaded from the ISLR library directly, the row names are already set correctly.

head(College)
##                              Private Apps Accept Enroll Top10perc
## Abilene Christian University     Yes 1660   1232    721        23
## Adelphi University               Yes 2186   1924    512        16
## Adrian College                   Yes 1428   1097    336        22
## Agnes Scott College              Yes  417    349    137        60
## Alaska Pacific University        Yes  193    146     55        16
## Albertson College                Yes  587    479    158        38
##                              Top25perc F.Undergrad P.Undergrad Outstate
## Abilene Christian University        52        2885         537     7440
## Adelphi University                  29        2683        1227    12280
## Adrian College                      50        1036          99    11250
## Agnes Scott College                 89         510          63    12960
## Alaska Pacific University           44         249         869     7560
## Albertson College                   62         678          41    13500
##                              Room.Board Books Personal PhD Terminal
## Abilene Christian University       3300   450     2200  70       78
## Adelphi University                 6450   750     1500  29       30
## Adrian College                     3750   400     1165  53       66
## Agnes Scott College                5450   450      875  92       97
## Alaska Pacific University          4120   800     1500  76       72
## Albertson College                  3335   500      675  67       73
##                              S.F.Ratio perc.alumni Expend Grad.Rate
## Abilene Christian University      18.1          12   7041        60
## Adelphi University                12.2          16  10527        56
## Adrian College                    12.9          30   8735        54
## Agnes Scott College                7.7          37  19016        59
## Alaska Pacific University         11.9           2  10922        15
## Albertson College                  9.4          11   9727        55

8c

summary(College)
##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00
pairs(College[,1:10])

plot(x=College[,"Outstate"],y=College[,"Private"],xlab="Outsate",ylab="Private")

Elite = rep("No",nrow(College))
Elite[College$Top10perc > 50] = "Yes"
Elite = as.factor(Elite)
College = data.frame(College,Elite)
summary(Elite)
##  No Yes 
## 699  78
plot(College[,"Outstate"],Elite,xlab="Outstate")

par(mfrow = c(2,2))

hist(College[,"PhD"],xlab="Percent of faculty with PhD")
hist(College[,"PhD"],xlab="Percent of faculty with PhD",breaks = 100/5)

hist(College[,"Accept"],xlab="Number of applicants accepted")
hist(College[,"Room.Board"],xlab="Room and board costs")

Subtracting the ratio of accepted students to applications from one produces a quantity that measures how difficult a school is to get into. We can then use this quantity to answer questions such as: Do schools that are difficult to get into spend more per student? Is there a relationship between the student-faculty ratio and this quantity?

Unaccept <- 1 - College[,"Accept"] / College[,"Apps"]  # proportion of applicants not accepted

plot(Unaccept,College[,"Expend"],ylab="Ammount Spend per Student",xlim = c(0,1))
text(Unaccept,College[,"Expend"],labels = ifelse(Unaccept > 0.75,row.names(College),""),pos = 3)

plot(Unaccept,College[,"S.F.Ratio"],ylab="Student Faculty Ratio",xlim = c(0,1),ylim=c(0,25))
text(Unaccept,College[,"S.F.Ratio"],labels = ifelse(Unaccept > 0.78,row.names(College),""),pos = 3)

9

9a

The quantitative variables are mpg, displacement, horsepower, weight, and acceleration. The remaining ones are treated here as qualitative: cylinders, year, origin, and name (cylinders and year are recorded as numbers but take only a few discrete values).
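
A quick way to check how each variable is stored (sketch):

# How R stores each column; some variables recorded as numbers
# (e.g. cylinders, year, origin) can still be treated as qualitative.
sapply(Auto, class)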

9b
dat = apply(Auto[,-c(7,8,9)],2,range)
row.names(dat) = c("Min","Max")
dat
##      mpg cylinders displacement horsepower weight acceleration
## Min  9.0         3           68         46   1613          8.0
## Max 46.6         8          455        230   5140         24.8
9c
dat = apply(Auto[,-c(7,8,9)],2,function(x){ c(mean(x),sd(x)) })
row.names(dat) = c("Mean","SD")
dat
##            mpg cylinders displacement horsepower    weight acceleration
## Mean 23.445918  5.471939      194.412  104.46939 2977.5842    15.541327
## SD    7.805007  1.705783      104.644   38.49116  849.4026     2.758864
9d
dat = apply(Auto[-c(10:85),-c(7,8,9)],2,function(x){ c(range(x),mean(x),sd(x)) })
row.names(dat) = c("Min","Max","Mean","SD")
dat
##            mpg cylinders displacement horsepower    weight acceleration
## Min  11.000000  3.000000     68.00000   46.00000 1649.0000     8.500000
## Max  46.600000  8.000000    455.00000  230.00000 4997.0000    24.800000
## Mean 24.404430  5.373418    187.24051  100.72152 2935.9715    15.726899
## SD    7.867283  1.654179     99.67837   35.70885  811.3002     2.693721
9e
pairs(Auto)

boxplot(mpg~origin,data=Auto,names=c("American","European","Japanese"),main="Miles per Gallon ~ Origin")

boxplot(mpg~cylinders,data=Auto,main="Miles per Gallon ~ Cylinders")

9f

Every variable except the number of cylinders and the name seems to be a useful predictor of mpg.
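
One way to probe this further (a quick sketch) is to look at the correlation of mpg with each of the numeric variables:

# Correlation of mpg with the other numeric variables (name is excluded).
round(cor(Auto[, sapply(Auto, is.numeric)])[, "mpg"], 2)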

10

10a
library('MASS')
nrow(Boston) # number of rows.
## [1] 506
ncol(Boston) # number of columns.
## [1] 14

Inspecting the data shows that each column is a feature, and each row is an observation (Boston suburb).

head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7
10b
par(mfrow = c(2,2))

plot(Boston[,c("dis","crim")])
plot(Boston[,c("ptratio","crim")])
plot(Boston[,c("medv","crim")])
plot(Boston[,c("indus","crim")])

boxplot(crim~chas,Boston)

10c

There seems to be an association between the median value of owner-occupied homes and per-capita crime: a low median home value is associated with a higher crime rate.

There also seems to be a correlation between distance to employment centers and crime: areas close to the employment centers show higher per-capita crime.
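
These associations can be quantified directly (a quick sketch):

# Correlation of every variable with the per-capita crime rate,
# sorted from most positive to most negative.
sort(cor(Boston)[, "crim"], decreasing = TRUE)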

10d
ranges = apply(Boston,2,range)
row.names(ranges) = c("min","max")
ranges
##         crim  zn indus chas   nox    rm   age     dis rad tax ptratio
## min  0.00632   0  0.46    0 0.385 3.561   2.9  1.1296   1 187    12.6
## max 88.97620 100 27.74    1 0.871 8.780 100.0 12.1265  24 711    22.0
##      black lstat medv
## min   0.32  1.73    5
## max 396.90 37.97   50
10e
selection <- Boston[,"chas"] == 1  # logical index: suburbs bounding the Charles River
nrow(Boston[selection,])
## [1] 35
10f
median(Boston[,"ptratio"])
## [1] 19.05
10g
Boston[which.min(Boston[,"medv"]),]
##        crim zn indus chas   nox    rm age    dis rad tax ptratio black
## 399 38.3518  0  18.1    0 0.693 5.453 100 1.4896  24 666    20.2 396.9
##     lstat medv
## 399 30.59    5

This suburb has a notably high value of the black variable, all of its houses were built before 1940, and its pupil-teacher ratio is high compared to the other observations in the data.
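
To make the comparison concrete, the suburb can be lined up against the overall medians (a small sketch):

# Compare the lowest-medv suburb with the median of each variable.
lowest  <- unlist(Boston[which.min(Boston[,"medv"]),])
medians <- apply(Boston, 2, median)
round(rbind(lowest_medv_suburb = lowest, overall_median = medians), 2)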

10h
# Number of suburbs averaging more than x rooms per dwelling, for x = 1, ..., 8.
rooms = lapply(1:8,function(x){ sum(Boston[,"rm"] > x) })
rooms
## [[1]]
## [1] 506
## 
## [[2]]
## [1] 506
## 
## [[3]]
## [1] 506
## 
## [[4]]
## [1] 504
## 
## [[5]]
## [1] 490
## 
## [[6]]
## [1] 333
## 
## [[7]]
## [1] 64
## 
## [[8]]
## [1] 13

Applying the summary function to the observations that average more than 8 rooms per dwelling yields the output below. Straight away one can note that these suburbs have low per-capita crime rates and high median home values.

summary(Boston[Boston[,"rm"] > 8,])
##       crim               zn            indus             chas       
##  Min.   :0.02009   Min.   : 0.00   Min.   : 2.680   Min.   :0.0000  
##  1st Qu.:0.33147   1st Qu.: 0.00   1st Qu.: 3.970   1st Qu.:0.0000  
##  Median :0.52014   Median : 0.00   Median : 6.200   Median :0.0000  
##  Mean   :0.71879   Mean   :13.62   Mean   : 7.078   Mean   :0.1538  
##  3rd Qu.:0.57834   3rd Qu.:20.00   3rd Qu.: 6.200   3rd Qu.:0.0000  
##  Max.   :3.47428   Max.   :95.00   Max.   :19.580   Max.   :1.0000  
##       nox               rm             age             dis       
##  Min.   :0.4161   Min.   :8.034   Min.   : 8.40   Min.   :1.801  
##  1st Qu.:0.5040   1st Qu.:8.247   1st Qu.:70.40   1st Qu.:2.288  
##  Median :0.5070   Median :8.297   Median :78.30   Median :2.894  
##  Mean   :0.5392   Mean   :8.349   Mean   :71.54   Mean   :3.430  
##  3rd Qu.:0.6050   3rd Qu.:8.398   3rd Qu.:86.50   3rd Qu.:3.652  
##  Max.   :0.7180   Max.   :8.780   Max.   :93.90   Max.   :8.907  
##       rad              tax           ptratio          black      
##  Min.   : 2.000   Min.   :224.0   Min.   :13.00   Min.   :354.6  
##  1st Qu.: 5.000   1st Qu.:264.0   1st Qu.:14.70   1st Qu.:384.5  
##  Median : 7.000   Median :307.0   Median :17.40   Median :386.9  
##  Mean   : 7.462   Mean   :325.1   Mean   :16.36   Mean   :385.2  
##  3rd Qu.: 8.000   3rd Qu.:307.0   3rd Qu.:17.40   3rd Qu.:389.7  
##  Max.   :24.000   Max.   :666.0   Max.   :20.20   Max.   :396.9  
##      lstat           medv     
##  Min.   :2.47   Min.   :21.9  
##  1st Qu.:3.32   1st Qu.:41.7  
##  Median :4.14   Median :48.3  
##  Mean   :4.31   Mean   :44.2  
##  3rd Qu.:5.12   3rd Qu.:50.0  
##  Max.   :7.44   Max.   :50.0
# Compare the mean of each variable among suburbs averaging more than eight rooms
# per dwelling with the overall mean. Note that `rooms` above is a list of counts,
# so the rows must be selected with the condition on rm directly.
apply(Boston[Boston[,"rm"] > 8,], 2, mean) < apply(Boston, 2, mean)