library(tidyverse)
library(openintro)

Exercise 2

Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide \(n\) and \(p\).
\((a)\) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.


This is a regression problem because the response variable (CEO salary) is numeric and continuous. We are interested in inference because we want to understand how the firm’s characteristics relate to CEO salary. Here \(n\) (number of firms) is 500 and \(p\) (number of predictors) is 3.
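As a quick check of how \(n\) and \(p\) are counted in scenario (a), the sketch below builds a placeholder data frame with the same layout. The column names and the simulated values are assumptions made purely for illustration; only the shape of the table matters.

```r
# Placeholder data with the layout of scenario (a): one row per firm.
# All values are simulated and carry no meaning.
set.seed(1)
firms <- data.frame(
  profit     = rnorm(500),
  employees  = rpois(500, lambda = 1000),
  industry   = factor(sample(c("A", "B", "C"), 500, replace = TRUE)),
  ceo_salary = rnorm(500)            # the response
)

n <- nrow(firms)       # number of observations
p <- ncol(firms) - 1   # number of predictors: every column except the response
c(n = n, p = p)
```

The response column is not counted in \(p\), which is why four recorded variables give three predictors.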

\((b)\) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.


This is a classification problem as the response variable is categorical. We are interested in prediction as we wish to know if the product will be a success or a failure. The \(n\) (number of products) is 20 and \(p\) (number of characteristics) is 13.

\((c)\) We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.


This is a regression problem because the response variable is numeric and continuous. We are interested in prediction because we wish to predict the % change in the USD/Euro exchange rate from the weekly changes in the world stock markets. Here \(n\) (number of weeks) is 52 and \(p\) (number of predictors) is 3: the % changes in the US, British and German markets. The % change in the USD/Euro rate is the response, so it is not counted in \(p\).

Exercise 5

What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?


The advantages of a very flexible approach are that it may give a better fit for non-linear models and it decreases bias.

The disadvantages of a very flexible approach are that it requires estimating a greater number of parameters, it is prone to overfitting (follows the noise closely) and it increases the variance.

A more flexible approach would be preferred when the sample size is large and the true relationship is non-linear, or when we are interested in prediction rather than the interpretability of the results.

A less flexible approach would be preferred when the sample size is small or the data are very noisy (so a flexible fit would mostly follow the noise), or when we are interested in inference and the interpretability of the results.
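The trade-off described above can be illustrated with a small simulation. This is only a sketch on synthetic data: the true function f, the noise level and the polynomial degrees are assumptions chosen for illustration and are not part of the exercise.

```r
set.seed(1)
f <- function(x) sin(2 * x)                 # assumed "true" relationship

n <- 50
train <- data.frame(x = runif(n, 0, 3))
train$y <- f(train$x) + rnorm(n, sd = 0.3)

test <- data.frame(x = runif(1000, 0, 3))
test$y <- f(test$x) + rnorm(1000, sd = 0.3)

# Fit polynomials of increasing flexibility and record both errors
degrees <- c(1, 3, 10)
mse <- sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(train = mean((train$y - predict(fit, train))^2),
    test  = mean((test$y  - predict(fit, test))^2))
})
colnames(mse) <- paste0("degree_", degrees)
mse
```

The training error can only go down as the degree grows, because each model contains the previous one. The test error is the quantity that shows whether the extra flexibility helped or whether the fit has started to follow the noise.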

Exercise 6

Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric approach)? What are its disadvantages?


A parametric approach reduces the problem of estimating \(f\) down to one of estimating a set of parameters because it assumes a form for \(f\).

A non-parametric approach does not assume a particular form of \(f\) and so requires a very large sample to accurately estimate \(f\).

The advantages of a parametric approach to regression or classification are that it simplifies the problem of modeling \(f\) to estimating a few parameters, and that it requires fewer observations than a non-parametric approach.

The disadvantages of a parametric approach to regression or classification are that the estimate of \(f\) may be inaccurate if the assumed form of \(f\) is wrong, and that choosing a more flexible parametric model to compensate can overfit the observations.
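The contrast can be sketched by fitting both kinds of model to the same synthetic data. The data-generating function, the noise level and the loess span below are assumptions for illustration; lm() stands in for a parametric method and loess() for a non-parametric one.

```r
set.seed(2)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)          # assumed non-linear truth

param_fit    <- lm(y ~ x)                   # assumes f is linear: 2 parameters
nonparam_fit <- loess(y ~ x, span = 0.3)    # local fits, no global form assumed

length(coef(param_fit))                     # number of estimated parameters

# Training error of each fit
mse <- c(parametric    = mean(residuals(param_fit)^2),
         nonparametric = mean(residuals(nonparam_fit)^2))
mse
```

Here the linear form is wrong by construction, so the parametric fit is poor no matter how much data is available, while the non-parametric fit adapts to the curve at the cost of needing enough observations in every neighbourhood of x.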

Exercise 8

This exercise relates to the College data set, which can be found in the file College.csv on the book website. It contains a number of variables for 777 different universities and colleges in the US. The variables are


• Private : Public/private indicator

• Apps : Number of applications received

• Accept : Number of applicants accepted

• Enroll : Number of new students enrolled

• Top10perc : New students from top 10 % of high school class

• Top25perc : New students from top 25 % of high school class

• F.Undergrad : Number of full-time undergraduates

• P.Undergrad : Number of part-time undergraduates

• Outstate : Out-of-state tuition

• Room.Board : Room and board costs

• Books : Estimated book costs

• Personal : Estimated personal spending

• PhD : Percent of faculty with Ph.D.’s

• Terminal : Percent of faculty with terminal degree

• S.F.Ratio : Student/faculty ratio

• perc.alumni : Percent of alumni who donate

• Expend : Instructional expenditure per student

• Grad.Rate : Graduation rate

Before reading the data into R, it can be viewed in Excel or a text editor.
\((a)\) Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data.
library(ISLR)
data(College)
college <- read.csv("College.csv", stringsAsFactors = TRUE)
\((b)\) Look at the data using the View() function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later. Try the following commands:
rownames(college) <- college[, 1]
#View(college)
head(college)
You should see that there is now a row.names column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on the row names. However, we still need to eliminate the first column in the data where the names are stored. Try


college <- college[, -1]
#View(college)
head(college)
Now you should see that the first data column is Private. Note that another column labeled row.names now appears before the Private column. However, this is not a data column but rather the name that R is giving to each row.
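The two steps above (copying the first column into the row names, then dropping it) can also be done while reading the file, using the row.names argument of read.csv(). The sketch below uses a tiny temporary file with made-up college names so that it runs on its own; it is not the College.csv file.

```r
# Write a small stand-in CSV (hypothetical rows, not the real data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,Private,Apps",
             "College A,Yes,100",
             "College B,No,250"), tmp)

# row.names = 1 uses the first column as row names and drops it from the data
toy <- read.csv(tmp, row.names = 1, stringsAsFactors = TRUE)

rownames(toy)
names(toy)
```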
\((c)\) \(i.\) Use the summary() function to produce a numerical summary of the variables in the data set.
summary(college)
##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00
\(ii.\) Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10].
pairs(college[, 1:10])

\(iii.\) Use the plot() function to produce side-by-side boxplots of Outstate versus Private.
plot(college$Private, college$Outstate, xlab = "Private University", ylab ="Out of State tuition in USD", main = "Outstate Tuition Plot")

\(iv.\) Create a new qualitative variable, called Elite, by binning the Top 10% variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50 %.
Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)
Use the summary() function to see how many elite universities there are. Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.
summary(college$Elite)
##  No Yes 
## 699  78
plot(college$Elite, college$Outstate, xlab = "Elite University", ylab ="Out of State tuition in USD", main = "Outstate Tuition Plot")
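The binning step can also be written in a single expression with ifelse(). The sketch below uses a short vector of hypothetical Top10perc values rather than the college data, and fixes the factor levels explicitly so that "No" always comes first.

```r
top10 <- c(12, 55, 50, 78, 30)   # hypothetical Top10perc values

# "Yes" only when the value strictly exceeds 50, matching the rule above
elite <- factor(ifelse(top10 > 50, "Yes", "No"), levels = c("No", "Yes"))

table(elite)
```

A value of exactly 50 is classified as "No", because the comparison is strict.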

\(v.\) Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command par(mfrow = c(2, 2)) useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.
par(mfrow = c(2,2))
hist(college$Books, col = 2, xlab = "Books", ylab = "Count")
hist(college$PhD, col = 3, xlab = "PhD", ylab = "Count")
hist(college$Grad.Rate, col = 4, xlab = "Grad Rate", ylab = "Count")
hist(college$perc.alumni, col = 6, xlab = "% alumni", ylab = "Count")
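The exercise also asks for differing numbers of bins, which the breaks argument of hist() controls. The sketch below uses simulated values (an assumption, standing in for a column such as Grad.Rate) and draws the same variable three times. Note that a single number passed to breaks is treated by R as a suggestion, so the actual bin count may differ slightly.

```r
set.seed(3)
x <- rnorm(500, mean = 65, sd = 15)   # simulated stand-in for a numeric column

op <- par(mfrow = c(1, 3))            # three plots side by side
for (b in c(5, 20, 50)) {
  hist(x, breaks = b, main = paste("breaks =", b), xlab = "x", ylab = "Count")
}
par(op)                               # restore the previous layout
```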

\(vi.\) Continue exploring the data, and provide a brief summary of what you discover.


The histogram of the percent of faculty with Ph.D.s is strongly left-skewed. The typical estimated cost of books is about $500 (the median). The average graduation rate across colleges is roughly 65%. Most alumni do not donate to their universities: roughly 20% of alumni donate on average (median 21%, mean 22.7%).

summary(college$PhD)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00   62.00   75.00   72.66   85.00  103.00

Upon investigating further, the maximum value of PhD is 103, which would mean that 103% of faculty hold Ph.D.s. This is impossible for a percentage.

highphd <- college[college$PhD == 103, ]
print(highphd)
##                                   Private Apps Accept Enroll Top10perc
## Texas A&M University at Galveston      No  529    481    243        22
##                                   Top25perc F.Undergrad P.Undergrad Outstate
## Texas A&M University at Galveston        47        1206         134     4860
##                                   Room.Board Books Personal PhD Terminal
## Texas A&M University at Galveston       3122   600      650 103       88
##                                   S.F.Ratio perc.alumni Expend Grad.Rate Elite
## Texas A&M University at Galveston      17.4          16   6415        43    No

Texas A&M University at Galveston seems to be the only university to have 103% of faculty that hold Ph.Ds. It is possible that this was a data entry error.
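The same kind of check can be applied to every percentage column at once. The summary() output above also shows a maximum Grad.Rate of 118, so PhD is not the only suspicious column. The helper below is a hypothetical name introduced for illustration, and the toy data frame only mimics the two columns.

```r
# Count values outside [lower, upper] in each of the named columns
pct_out_of_range <- function(df, cols, lower = 0, upper = 100) {
  sapply(df[cols], function(v) sum(v < lower | v > upper, na.rm = TRUE))
}

# Toy rows for illustration; with the real data one would pass `college`
toy <- data.frame(PhD = c(75, 103, 62), Grad.Rate = c(65, 118, 43))
bad <- pct_out_of_range(toy, c("PhD", "Grad.Rate"))
bad
```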

Exercise 9

This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.
\((a)\) Which of the predictors are quantitative, and which are qualitative?


All predictors, except origin and name, are quantitative.

auto <- read.csv("Auto.csv", na.strings = "?", stringsAsFactors = TRUE)
auto <- na.omit(auto)
str(auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
##  - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
##   ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
\((b)\) What is the range of each quantitative predictor? You can answer this using the range() function.
sapply(auto[, -c(8, 9)], range)
##       mpg cylinders displacement horsepower weight acceleration year
## [1,]  9.0         3           68         46   1613          8.0   70
## [2,] 46.6         8          455        230   5140         24.8   82
\((c)\) What is the mean and standard deviation of each quantitative predictor?
sapply(auto[, -c(8, 9)], mean)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##    23.445918     5.471939   194.411990   104.469388  2977.584184    15.541327 
##         year 
##    75.979592
sapply(auto[, -c(8, 9)], sd)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##     7.805007     1.705783   104.644004    38.491160   849.402560     2.758864 
##         year 
##     3.683737
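The range, mean and standard deviation can also be collected into one table with a single sapply() call. The sketch below uses the built-in mtcars data as a stand-in so that it runs without Auto.csv; the same function could be applied to auto[, -c(8, 9)].

```r
# One column per variable, one row per statistic
num_summary <- sapply(mtcars[, c("mpg", "hp", "wt")], function(v) {
  c(min = min(v), max = max(v), mean = mean(v), sd = sd(v))
})
round(num_summary, 2)
```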
\((d)\) Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
subset <- auto[-c(10:85), -c(8,9)]
sapply(subset, range)
##       mpg cylinders displacement horsepower weight acceleration year
## [1,] 11.0         3           68         46   1649          8.5   70
## [2,] 46.6         8          455        230   4997         24.8   82
sapply(subset, mean)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##    24.404430     5.373418   187.240506   100.721519  2935.971519    15.726899 
##         year 
##    77.145570
sapply(subset, sd)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##     7.867283     1.654179    99.678367    35.708853   811.300208     2.693721 
##         year 
##     3.106217
\((e)\) Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.


mpg tends to be higher for 4-cylinder vehicles than for others. Weight, displacement and horsepower are inversely related to mpg. mpg increases overall across model years. Japanese cars (origin = 3) tend to have higher mpg than US or European cars.

auto$cylinders <- as.factor(auto$cylinders)
auto$year <- as.factor(auto$year)
auto$origin <- as.factor(auto$origin)
pairs(auto)

\((f)\) Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.


Cylinders, horsepower, year and origin all appear useful for predicting mpg. Displacement and weight are not included because they are highly correlated with horsepower and with each other, so they would add largely redundant information.

auto$horsepower <- as.numeric(auto$horsepower)
cor(auto$weight, auto$horsepower)
## [1] 0.8645377
cor(auto$weight, auto$displacement)
## [1] 0.9329944
cor(auto$displacement, auto$horsepower)
## [1] 0.897257
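The pairwise calls above can be combined into one correlation matrix by passing several numeric columns to cor(). The sketch below uses the built-in mtcars data as a stand-in so that it is self-contained; for the Auto data the equivalent call would select mpg, displacement, horsepower and weight.

```r
vars <- c("mpg", "disp", "hp", "wt")      # numeric columns of the stand-in data
cor_mat <- cor(mtcars[, vars])

round(cor_mat, 2)

# Pairs of distinct variables whose absolute correlation exceeds 0.8
high <- which(abs(cor_mat) > 0.8 & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(var1 = vars[high[, "row"]], var2 = vars[high[, "col"]])
```

Reading the matrix row by row shows at a glance which predictors carry overlapping information.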

Exercise 10

This exercise involves the Boston housing data set.

\((a)\) To begin, load in the Boston data set. The Boston data set is part of the ISLR2 library.
library(ISLR2)
## 
## Attaching package: 'ISLR2'
## The following objects are masked from 'package:ISLR':
## 
##     Auto, Credit
Now the data set is contained in the object Boston.
Boston$chas <- as.factor(Boston$chas)
Read about the data set:
?Boston
How many rows are in this data set? How many columns? What do the rows and columns represent?
nrow(Boston)
## [1] 506
ncol(Boston)
## [1] 13
\((b)\) Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.


The relationship between crim and nox or rm is hard to discern. Higher values of crim are concentrated in tracts with a high proportion of older homes (high age) and in tracts close to the employment centres (low dis).

par(mfrow = c(2, 2))
plot(Boston$nox, Boston$crim)
plot(Boston$rm, Boston$crim)
plot(Boston$age, Boston$crim)
plot(Boston$dis, Boston$crim)

\((c)\) Are any of the predictors associated with per capita crime rate? If so, explain the relationship.


Most tracts have a very low crime rate: the median of crim is 0.26, and about 96% of tracts (488 of 506) have crim < 20. There may be a relationship between crim and nox, rm, age, dis, lstat and medv.

hist(Boston$crim, breaks = 50)

pairs(Boston[Boston$crim < 20, ])

\((d)\) Do any of the census tracts of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.


The crime rate varies widely across census tracts, from 0.006 to 88.98. The median is only 0.26, but 18 tracts have a rate above 20. This could reflect different socio-economic conditions and levels of urbanization across different parts of the city.

A total of 132 census tracts share a tax rate of exactly 666, which appears as a spike in the histogram. The data set documentation does not describe 666 as a code for missing or censored data, so this may simply reflect tracts that fall under the same tax jurisdiction. Further investigation is needed to determine whether the pattern reflects true tax rates or a data anomaly.

Similarly, the range in pupil-teacher ratios reflects variations in educational resources and class sizes between different areas of Boston. Larger pupil-teacher ratios may suggest overcrowding or resource limitations in schools within particular census tracts.

hist(Boston$crim, breaks = 50)

nrow(Boston[Boston$crim > 20, ])
## [1] 18
hist(Boston$tax, breaks = 50)

nrow(Boston[Boston$tax == 666, ])
## [1] 132
hist(Boston$ptratio, breaks = 50)

nrow(Boston[Boston$ptratio > 20, ])
## [1] 201
\((e)\) How many of the census tracts in this data set bound the Charles river?
nrow(Boston[Boston$chas == 1, ])
## [1] 35
\((f)\) What is the median pupil-teacher ratio among the towns in this data set?
median(Boston$ptratio)
## [1] 19.05
\((g)\) Which census tract of Boston has lowest median value of owner- occupied homes? What are the values of the other predictors for that census tract, and how do those values compare to the overall ranges for those predictors? Comment on your findings.


Census tracts 399 and 406 have the lowest median value of owner-occupied homes. Both have a high crime rate (crim is well above average). Neither has land zoned for large residential plots, but both have a high industrial presence (indus is above average). These tracts are not located near the Charles River (chas = 0), have high air pollution (nox is above average, most likely due to industrial activity), have smaller homes than average (rm is less than 6.28), and consist entirely of older construction (age = 100). They are very close to employment hubs (dis is 1.48 for 399 and 1.42 for 406), have maximum highway accessibility (rad = 24), have high tax rates (tax = 666), and have higher than average pupil-teacher ratios (ptratio = 20.2, close to the upper end of the range), which may indicate limited educational resources. Tract 399 has a high percentage of low-income residents (lstat = 30.59, which is in the upper range); tract 406 also has a high percentage, though lower than tract 399.

lowest <- min(Boston$medv)
min_row <- which(Boston$medv == lowest)
low_tract <- Boston[min_row,]
low_tract
summary(Boston)
##       crim                zn             indus       chas         nox        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   0:471   Min.   :0.3850  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1: 35   1st Qu.:0.4490  
##  Median : 0.25651   Median :  0.00   Median : 9.69           Median :0.5380  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14           Mean   :0.5547  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10           3rd Qu.:0.6240  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74           Max.   :0.8710  
##        rm             age              dis              rad        
##  Min.   :3.561   Min.   :  2.90   Min.   : 1.130   Min.   : 1.000  
##  1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100   1st Qu.: 4.000  
##  Median :6.208   Median : 77.50   Median : 3.207   Median : 5.000  
##  Mean   :6.285   Mean   : 68.57   Mean   : 3.795   Mean   : 9.549  
##  3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188   3rd Qu.:24.000  
##  Max.   :8.780   Max.   :100.00   Max.   :12.127   Max.   :24.000  
##       tax           ptratio          lstat            medv      
##  Min.   :187.0   Min.   :12.60   Min.   : 1.73   Min.   : 5.00  
##  1st Qu.:279.0   1st Qu.:17.40   1st Qu.: 6.95   1st Qu.:17.02  
##  Median :330.0   Median :19.05   Median :11.36   Median :21.20  
##  Mean   :408.2   Mean   :18.46   Mean   :12.65   Mean   :22.53  
##  3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :711.0   Max.   :22.00   Max.   :37.97   Max.   :50.00
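Comparing one row against the overall ranges can be made more systematic by computing, for each numeric column, the share of observations that are less than or equal to that row's value. The helper name below is hypothetical, and the sketch uses the built-in mtcars data as a stand-in; with Boston, only the numeric columns should be passed, since chas was converted to a factor above.

```r
# Share of rows whose value is <= the value in the chosen row, per column
percentile_rank <- function(df, row) {
  sapply(names(df), function(col) mean(df[[col]] <= df[row, col]))
}

num <- mtcars[, c("mpg", "hp", "wt")]   # stand-in numeric data
low_row <- which.min(num$mpg)           # first row with the lowest mpg
pr <- percentile_rank(num, low_row)
round(pr, 2)
```

A value near 0 means the row sits at the bottom of that variable's range, and a value near 1 means it sits at the top.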
\((h)\) In this data set, how many of the census tracts average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the census tracts that average more than eight rooms per dwelling.


64 census tracts average more than seven rooms per dwelling, and 13 average more than eight rooms per dwelling. Tracts 98, 164, 205, 225, 226, 227, 233, 234, 254, 258, 263, 268, and 365 have more than eight rooms per dwelling. Most of these tracts appear to be in wealthier, suburban areas with lower crime rates, lower pollution levels, and more residential zoning. A few of them, like 365, have higher crime rates and are closer to industrial or commercial areas, suggesting that some larger homes are still located in more mixed-use or even urban settings. The relatively low pupil-teacher ratios and the low percentages of lower-status residents further suggest that these areas are more affluent, with better educational resources.

nrow(Boston[Boston$rm > 7, ])
## [1] 64
nrow(Boston[Boston$rm > 8, ])
## [1] 13
Boston[Boston$rm > 8, ]
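To support comments about a subset like this one, the column means of the subset can be placed next to the overall means. The sketch below uses the built-in mtcars data as a stand-in, with a weight threshold chosen only for illustration; for Boston the filter would be rm > 8 applied to the numeric columns.

```r
cols <- c("mpg", "hp", "wt")            # numeric columns of the stand-in data
heavy <- mtcars[mtcars$wt > 5, cols]    # subset defined by a threshold

comparison <- rbind(subset  = colMeans(heavy),
                    overall = colMeans(mtcars[, cols]))
round(comparison, 2)
```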