Statistical Learning
Question 2. Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.
- We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.
This is a regression problem, and we are most interested in inference. n = 500 (firms) and p = 3 (profit, number of employees, and industry).
- We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.
This is a classification problem, and we are most interested in prediction. n = 20 (20 similar products previously launched) and p = 13 (price, marketing budget, competition price, and ten other variables).
- We are interested in predicting the % change in the US dollar in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the dollar, the % change in the US market, the % change in the British market, and the % change in the German market.
This is a regression problem because the % change in the US dollar is a quantitative response. We are most interested in prediction, since the stated goal is “predicting the % change”. n = 52 (weekly data for all of 2012) and p = 3 (% change in the US, British, and German markets).
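As a quick check on how n and p are counted in these scenarios, here is a minimal sketch. The data frame `firms` is a hypothetical three-row stand-in for the first scenario, with invented values; n is the number of rows and p is the number of columns other than the response.

```r
# Hypothetical toy stand-in for the firm data; all values are invented.
firms <- data.frame(
  profit     = c(12.5, 3.1, 7.8),
  employees  = c(1200, 340, 560),
  industry   = c("tech", "retail", "energy"),
  ceo_salary = c(2.4, 0.9, 1.6)   # the response
)

n <- nrow(firms)       # number of observations
p <- ncol(firms) - 1   # predictors = all columns except the response
c(n = n, p = p)
```

With the real data the same two calls would give n = 500 and p = 3.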
Question 5. What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?
- The advantage of a very flexible approach is that it reduces bias and can fit non-linear relationships more closely.
- The disadvantage is that it requires estimating a larger number of parameters, which increases variance and the risk of overfitting.
- A more flexible approach is preferred when the relationship is highly non-linear, there are many data points from which to find a pattern, and/or the irreducible error is low.
- A less flexible approach is preferred when the relationship is close to linear, there are few data points, and/or the irreducible error is high.
- A less flexible approach is also preferred over a more flexible one when the goal is inference and the interpretability of the output matters.
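The trade-off described in these points can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are synthetic, the true function `sin(2 * x)` and the noise level are arbitrary choices, and a degree-10 polynomial stands in for "a very flexible approach".

```r
set.seed(1)

# Synthetic data: a non-linear truth plus noise (nothing here is real data).
f <- function(x) sin(2 * x)
train <- data.frame(x = runif(100, 0, 3))
train$y <- f(train$x) + rnorm(100, sd = 0.3)
test <- data.frame(x = runif(100, 0, 3))
test$y <- f(test$x) + rnorm(100, sd = 0.3)

# Mean squared error of a fitted model on a given data set.
mse <- function(fit, data) mean((data$y - predict(fit, newdata = data))^2)

fit_rigid    <- lm(y ~ x, data = train)            # less flexible
fit_flexible <- lm(y ~ poly(x, 10), data = train)  # more flexible

results <- rbind(
  rigid    = c(train = mse(fit_rigid, train),    test = mse(fit_rigid, test)),
  flexible = c(train = mse(fit_flexible, train), test = mse(fit_flexible, test))
)
results
```

The training error of the flexible fit cannot exceed that of the straight line, because the linear model is nested inside the polynomial. The test column is what shows whether the extra flexibility paid off for a given sample size and noise level.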
Question 6. Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric approach)? What are its disadvantages?
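A parametric approach assumes a functional form for f (for example, linear) and reduces the problem to estimating a set of parameters, whereas a non-parametric approach makes no explicit assumption about the form of f. Because fewer assumptions are made regarding the data structure, a non-parametric method is better at fitting highly non-linear patterns. However, it needs many more observations and carries a considerable risk of overfitting, which can be harmful. Furthermore, since it lacks parameters to interpret, it is harder to understand how the predictors relate to the response. The advantages of a parametric approach are therefore that it is simpler to fit, needs fewer observations, and is easier to interpret; its disadvantage is that the fit will be poor if the assumed form is far from the true f.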
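To make the contrast concrete, here is a minimal sketch on synthetic data. The choice of `lm()` as the parametric method and `loess()` as the non-parametric one is an assumption made for illustration; other pairs of methods would serve equally well.

```r
set.seed(2)

# Synthetic data with a non-linear truth (values are simulated, not real).
x <- runif(200, 0, 3)
y <- sin(2 * x) + rnorm(200, sd = 0.3)
d <- data.frame(x = x, y = y)

fit_param    <- lm(y ~ x, data = d)      # parametric: assumes y = b0 + b1 * x
fit_nonparam <- loess(y ~ x, data = d)   # non-parametric: local regression

# The parametric fit is summarised by two interpretable numbers.
coef(fit_param)

# Training error of each fit; loess has no comparable coefficients to read.
train_mse <- c(
  parametric     = mean(residuals(fit_param)^2),
  non_parametric = mean(residuals(fit_nonparam)^2)
)
train_mse
```

The linear fit yields an intercept and a slope that can be interpreted directly, while the local regression only yields fitted values.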
Question 8. This exercise relates to the “College” data set, which can be found in the file “College.csv”. It contains a number of variables for 777 different universities and colleges in the US.
(a). Use the read.csv() function to read the data into R. Call the loaded data “college”. Make sure that you have the directory set to the correct location for the data.
# read.csv() is part of base R, so no extra package is needed
College <- read.csv("~/Downloads/College.csv", stringsAsFactors = FALSE)
(b). Look at the data using the View() function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later. Try the following commands:
rownames(College) <- College[, 1]
View(College)
College <- College[, -1]
View(College)
(c). i. Use the summary() function to produce a numerical summary of the variables in the data set.
summary(College)
Private Apps Accept Enroll Top10perc
Length:777 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
Class :character 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
Mode :character Median : 1558 Median : 1110 Median : 434 Median :23.00
Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
Max. :48094 Max. :26330 Max. :6392 Max. :96.00
Top25perc F.Undergrad P.Undergrad Outstate Room.Board
Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340 Min. :1780
1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320 1st Qu.:3597
Median : 54.0 Median : 1707 Median : 353.0 Median : 9990 Median :4200
Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441 Mean :4358
3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925 3rd Qu.:5050
Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700 Max. :8124
Books Personal PhD Terminal S.F.Ratio
Min. : 96.0 Min. : 250 Min. : 8.00 Min. : 24.0 Min. : 2.50
1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00 1st Qu.: 71.0 1st Qu.:11.50
Median : 500.0 Median :1200 Median : 75.00 Median : 82.0 Median :13.60
Mean : 549.4 Mean :1341 Mean : 72.66 Mean : 79.7 Mean :14.09
3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00 3rd Qu.: 92.0 3rd Qu.:16.50
Max. :2340.0 Max. :6800 Max. :103.00 Max. :100.0 Max. :39.80
perc.alumni Expend Grad.Rate
Min. : 0.00 Min. : 3186 Min. : 10.00
1st Qu.:13.00 1st Qu.: 6751 1st Qu.: 53.00
Median :21.00 Median : 8377 Median : 65.00
Mean :22.74 Mean : 9660 Mean : 65.46
3rd Qu.:31.00 3rd Qu.:10830 3rd Qu.: 78.00
Max. :64.00 Max. :56233 Max. :118.00
Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10].
Use the plot() function to produce side-by-side boxplots of Outstate versus Private.
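A minimal sketch of how these two plots might be produced. The data frame `toy` is a hypothetical stand-in for College with invented values, used so that the example runs on its own. It keeps Private as a character column, as happens when the file is read with stringsAsFactors = FALSE, so the column has to be converted to a factor first.

```r
# Toy stand-in for the College data; all values are invented.
toy <- data.frame(
  Private  = c("Yes", "Yes", "No", "No", "Yes", "No"),
  Apps     = c(1660, 2186, 1428, 417, 193, 587),
  Accept   = c(1232, 1924, 1097, 349, 146, 479),
  Outstate = c(7440, 12280, 11250, 12960, 7560, 13500),
  stringsAsFactors = FALSE
)

# pairs() rejects character columns, and plot() draws boxplots when the
# x variable is a factor, so convert Private before plotting.
toy$Private <- as.factor(toy$Private)

# Scatterplot matrix; with the real data: pairs(College[, 1:10])
pairs(toy[, 1:3])

# Side-by-side boxplots of Outstate for each level of Private.
plot(Outstate ~ Private, data = toy,
     xlab = "Private university", ylab = "Out-of-state tuition in $")
```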
Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50 %.
Elite <- rep("No", nrow(College))
Elite[College$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
College <- data.frame(College, Elite)
summary(Elite)
No Yes
699 78
There are 78 Elite schools.
plot(Outstate ~ Elite, data = College,
xlab = "Elite University",
ylab = "Tuition in $")

- Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command par(mfrow = c(2, 2)) useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.
par(mfrow=c(2,2))
hist(College$Top10perc, xlab = "Top 10%", main="")
hist(College$Top25perc, xlab = "Top 25%", main="")
hist(College$Grad.Rate, xlab = "Graduation rate", main="")
hist(College$PhD, xlab = "Proportion of faculty with Ph.D.’s", main="")
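The histograms above all use the default binning. A sketch of how the number of bins could be varied with the `breaks` argument follows; `grad_rate` is a synthetic stand-in for a column such as College$Grad.Rate, simulated here so that the example is self-contained.

```r
set.seed(3)

# Synthetic stand-in for a quantitative column such as College$Grad.Rate.
grad_rate <- round(rnorm(200, mean = 65, sd = 17))

par(mfrow = c(2, 2))
for (b in c(5, 10, 20, 40)) {
  # A single number for breaks is only a suggestion; hist() may adjust it
  # so that the cut points fall on "pretty" values.
  hist(grad_rate, breaks = b, xlab = "Graduation rate",
       main = paste("breaks =", b))
}
par(mfrow = c(1, 1))  # restore the default single-plot layout
```

Comparing the four panels shows how coarse binning hides detail while very fine binning makes the shape noisy.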

- Continue exploring the data, and provide a brief summary of what you discover.
summary(College$Books)
Min. 1st Qu. Median Mean 3rd Qu. Max.
96.0 470.0 500.0 549.4 600.0 2340.0
The average estimated amount that students spend on books is about $549, which is a lot of money!
Question 9. This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.
library(ISLR2)  # Auto ships with ISLR2; the packaged copy already omits rows with missing values
data(Auto)
summary(Auto)
mpg cylinders displacement horsepower weight
Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
acceleration year origin name
Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
(Other) :365
View(Auto)
- Which of the predictors are quantitative, and which are qualitative?
sapply(Auto, class)
mpg cylinders displacement horsepower weight acceleration year
"numeric" "integer" "numeric" "integer" "integer" "numeric" "integer"
origin name
"integer" "factor"
Quantitative: mpg, cylinders, displacement, horsepower, weight, acceleration, year.
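Qualitative: name and origin (origin is stored as an integer code for the region of origin, which is why class() reports it as "integer").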
- What is the range of each quantitative predictor? You can answer this using the range() function.
# Column indices of the qualitative variables, so they can be excluded below
qualitative_columns <- which(names(Auto) %in% c('name', 'origin'))
qualitative_columns
[1] 8 9
# Apply the range function to the columns of auto data that are not qualitative
sapply(Auto[, -qualitative_columns], range)
mpg cylinders displacement horsepower weight acceleration year
[1,] 9.0 3 68 46 1613 8.0 70
[2,] 46.6 8 455 230 5140 24.8 82
- What is the mean and standard deviation of each quantitative predictor?
sapply(Auto[, -qualitative_columns], mean)
mpg cylinders displacement horsepower weight acceleration year
23.445918 5.471939 194.411990 104.469388 2977.584184 15.541327 75.979592
sapply(Auto[, -qualitative_columns], sd)
mpg cylinders displacement horsepower weight acceleration year
7.805007 1.705783 104.644004 38.491160 849.402560 2.758864 3.683737
- Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
sapply(Auto[-seq(10, 85), -qualitative_columns], mean)
mpg cylinders displacement horsepower weight acceleration year
24.404430 5.373418 187.240506 100.721519 2935.971519 15.726899 77.145570
sapply(Auto[-seq(10, 85), -qualitative_columns], sd)
mpg cylinders displacement horsepower weight acceleration year
7.867283 1.654179 99.678367 35.708853 811.300208 2.693721 3.106217
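The question also asks for the range of each predictor in the reduced data. One way to get the range, mean, and standard deviation in a single table is sketched below. `summarise_without()` is a hypothetical helper written for this example, and the built-in mtcars data stands in for Auto so that the code runs without extra packages.

```r
# Summarise every column of a numeric data frame after dropping some rows.
# summarise_without() is a hypothetical helper, not part of any package.
summarise_without <- function(df, drop_rows) {
  kept <- df[-drop_rows, , drop = FALSE]
  rbind(
    min  = sapply(kept, min),
    max  = sapply(kept, max),
    mean = sapply(kept, mean),
    sd   = sapply(kept, sd)
  )
}

# mtcars stands in for Auto here; with the real data the call would be
# summarise_without(Auto[, -qualitative_columns], 10:85)
demo <- summarise_without(mtcars[, c("mpg", "hp", "wt")], 10:20)
round(demo, 2)
```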
- Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
pairs(Auto[, -qualitative_columns])

plot(mpg ~ weight, data = Auto)

# Heavier weight correlates with lower mpg.
plot(mpg ~ year, data = Auto)

# Cars become more efficient over time.
- Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.
pairs(Auto)

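Yes. Consistent with the plots above, weight and year look useful for predicting mpg: mpg falls as weight increases and rises with model year. Acceleration and origin also show an association with mpg in the scatterplot matrix.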
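A numerical complement to the scatterplots is to rank the candidate predictors by their correlation with mpg. The sketch below uses the built-in mtcars data as a stand-in so that it runs on its own; the column selection is an assumption, and with the real data the input would be the quantitative columns of Auto.

```r
# mtcars (built into R) stands in for Auto; with the real data one would use
# the quantitative columns, e.g. Auto[, -qualitative_columns].
quant <- mtcars[, c("mpg", "cyl", "disp", "hp", "wt", "qsec")]

# Correlation of every other variable with mpg, strongest first.
cors <- cor(quant)[-1, "mpg"]
cors[order(abs(cors), decreasing = TRUE)]
```

Correlation only captures linear association, so it supplements the plots and does not replace them.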
Question 10. This exercise involves the Boston housing data set.
- To begin, load in the Boston data set. The Boston data set is part of the ISLR2 library.
library(ISLR2)
Now the data set is contained in the object Boston.
Boston
Read about the data set:
?Boston
How many rows are in this data set? How many columns? What do the rows and columns represent?
There are 506 rows, and 13 columns. The rows represent the suburbs of Boston and the columns represent:
-crim: per capita crime rate by town.
-zn: proportion of residential land zoned for lots over 25,000 sq.ft.
-indus: proportion of non-retail business acres per town.
-chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
-nox: nitrogen oxides concentration (parts per 10 million).
-rm: average number of rooms per dwelling.
-age: proportion of owner-occupied units built prior to 1940.
-dis: weighted mean of distances to five Boston employment centres.
-rad: index of accessibility to radial highways.
-tax: full-value property-tax rate per $10,000.
-ptratio: pupil-teacher ratio by town.
-lstat: lower status of the population (percent).
-medv: median value of owner-occupied homes in $1000s.
- Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.
pairs(Boston)

par(mfrow = c(2, 2))
plot(Boston$nox, Boston$crim)
plot(Boston$rm, Boston$crim)
plot(Boston$age, Boston$crim)
plot(Boston$dis, Boston$crim)

- Are any of the predictors associated with per capita crime rate? If so, explain the relationship.
plot(crim ~ age, data = Boston, log = "xy")

# Older homes, more crime
plot(crim ~ dis, data = Boston, log = "xy")

# Closer to employment centres, more crime
- Do any of the census tracts of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.
par(mfrow=c(1,3))
hist(Boston$crim[Boston$crim > 1], breaks=25)
# most suburbs have low crime rates, but there is a long tail: 18 suburbs appear
# to have a crime rate > 20, reaching to above 80
hist(Boston$tax, breaks=25)
# tax rates are split into two groups: most suburbs have low rates, with a separate peak at 660-680
hist(Boston$ptratio, breaks=25)

- How many of the census tracts in this data set bound the Charles river?
summary(Boston$chas==1)
Mode FALSE TRUE
logical 471 35
- What is the median pupil-teacher ratio among the towns in this data set?
median(Boston$ptratio)
[1] 19.05
- Which census tract of Boston has lowest median value of owner-occupied homes? What are the values of the other predictors for that census tract, and how do those values compare to the overall ranges for those predictors? Comment on your findings.
which.min(Boston$medv)
[1] 399
Suburb #399 has the lowest median value of owner-occupied homes.
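# One histogram per column, with suburb #399 marked by a red vertical line
par(mfrow = c(5, 3), mar = c(2, 2, 1, 0))
for (i in seq_len(ncol(Boston))) {
  hist(Boston[, i], main = colnames(Boston)[i], breaks = "FD")
  abline(v = Boston[399, i], col = "red", lwd = 3)  # lwd, not lw, sets line width
}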
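To compare one tract with the overall distribution numerically as well as graphically, one option is to compute, for each column, the share of rows at or below that tract's value. `percentile_of_row()` is a hypothetical helper written for this sketch, and the built-in mtcars data stands in for Boston so that the example runs without extra packages.

```r
# For row i, the share of rows whose value is <= that row's value, per column.
# percentile_of_row() is a hypothetical helper; mtcars stands in for Boston.
percentile_of_row <- function(df, i) {
  sapply(df, function(col) mean(col <= col[i]))
}

i <- which.min(mtcars$mpg)   # analogous to which.min(Boston$medv)
pct <- percentile_of_row(mtcars, i)
round(pct, 2)
```

With the real data the call would be percentile_of_row(Boston, 399). Values near 0 or 1 flag the predictors on which that tract sits at an extreme of the overall range.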

- In this data set, how many of the census tracts average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the census tracts that average more than eight rooms per dwelling.
summary(Boston$rm > 7)
Mode FALSE TRUE
logical 442 64
There are 64 suburbs with more than 7 rooms per dwelling.
summary(Boston$rm > 8)
Mode FALSE TRUE
logical 493 13
There are 13 suburbs with more than 8 rooms per dwelling.
# Suburbs that average more than eight rooms per dwelling:
summary(subset(Boston, rm > 8))
crim zn indus chas nox
Min. :0.02009 Min. : 0.00 Min. : 2.680 Min. :0.0000 Min. :0.4161
1st Qu.:0.33147 1st Qu.: 0.00 1st Qu.: 3.970 1st Qu.:0.0000 1st Qu.:0.5040
Median :0.52014 Median : 0.00 Median : 6.200 Median :0.0000 Median :0.5070
Mean :0.71879 Mean :13.62 Mean : 7.078 Mean :0.1538 Mean :0.5392
3rd Qu.:0.57834 3rd Qu.:20.00 3rd Qu.: 6.200 3rd Qu.:0.0000 3rd Qu.:0.6050
Max. :3.47428 Max. :95.00 Max. :19.580 Max. :1.0000 Max. :0.7180
rm age dis rad tax
Min. :8.034 Min. : 8.40 Min. :1.801 Min. : 2.000 Min. :224.0
1st Qu.:8.247 1st Qu.:70.40 1st Qu.:2.288 1st Qu.: 5.000 1st Qu.:264.0
Median :8.297 Median :78.30 Median :2.894 Median : 7.000 Median :307.0
Mean :8.349 Mean :71.54 Mean :3.430 Mean : 7.462 Mean :325.1
3rd Qu.:8.398 3rd Qu.:86.50 3rd Qu.:3.652 3rd Qu.: 8.000 3rd Qu.:307.0
Max. :8.780 Max. :93.90 Max. :8.907 Max. :24.000 Max. :666.0
ptratio lstat medv
Min. :13.00 Min. :2.47 Min. :21.9
1st Qu.:14.70 1st Qu.:3.32 1st Qu.:41.7
Median :17.40 Median :4.14 Median :48.3
Mean :16.36 Mean :4.31 Mean :44.2
3rd Qu.:17.40 3rd Qu.:5.12 3rd Qu.:50.0
Max. :20.20 Max. :7.44 Max. :50.0
summary(Boston)
crim zn indus chas nox
Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000 Min. :0.3850
1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000 1st Qu.:0.4490
Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000 Median :0.5380
Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917 Mean :0.5547
3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000 3rd Qu.:0.6240
Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000 Max. :0.8710
rm age dis rad tax
Min. :3.561 Min. : 2.90 Min. : 1.130 Min. : 1.000 Min. :187.0
1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100 1st Qu.: 4.000 1st Qu.:279.0
Median :6.208 Median : 77.50 Median : 3.207 Median : 5.000 Median :330.0
Mean :6.285 Mean : 68.57 Mean : 3.795 Mean : 9.549 Mean :408.2
3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188 3rd Qu.:24.000 3rd Qu.:666.0
Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.000 Max. :711.0
ptratio lstat medv
Min. :12.60 Min. : 1.73 Min. : 5.00
1st Qu.:17.40 1st Qu.: 6.95 1st Qu.:17.02
Median :19.05 Median :11.36 Median :21.20
Mean :18.46 Mean :12.65 Mean :22.53
3rd Qu.:20.20 3rd Qu.:16.95 3rd Qu.:25.00
Max. :22.00 Max. :37.97 Max. :50.00
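Compared with the full data set, the 13 suburbs that average more than eight rooms per dwelling have a much narrower crime range (maximum 3.47 versus 88.98), a lower lstat (median 4.14 versus 11.36), and a much higher medv (median 48.3 versus 21.2).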
