NOTE: There are no official solutions for these questions. These are my solutions and could be incorrect. If you spot any mistakes/inconsistencies, please contact me at Liam95morgan@gmail.com, or via LinkedIn.

Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani




1. Flexible vs Inflexible Methods

For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.


(a) Large \(n\), Small \(p\)

Q: The sample size \(n\) is extremely large, and the number of predictors \(p\) is small.

A:

Better - the large number of observations (and low dimensionality) means that a flexible method can capture more complex relationships without a high risk of overfitting.


(b) Large \(p\), Small \(n\)

Q: The number of predictors \(p\) is extremely large, and the number of observations \(n\) is small.

A:

Worse - in high-dimensional, low-observation datasets, flexible methods are at a higher risk of finding spurious relationships with the response (i.e. overfitting) and will likely underperform.
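
To make this concrete, here is a small simulation of my own (not part of the question): even ordinary least squares, given many predictors relative to observations, will 'find' strong structure in pure noise - and a genuinely flexible method would overfit even more severely. The sample sizes and dimensions below are arbitrary choices for illustration.

set.seed(1)
n <- 30
p <- 25

X_train <- matrix(rnorm(n * p), n, p)
y_train <- rnorm(n)                      # pure noise: the response has no relationship with the predictors
X_test <- matrix(rnorm(n * p), n, p)
y_test <- rnorm(n)

fit <- lm(y_train ~ X_train)

summary(fit)$r.squared                               # very high in-sample R^2, despite no real signal
mean((y_test - cbind(1, X_test) %*% coef(fit))^2)    # test MSE well above Var(y) = 1: the fit doesn't generalize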


(c) Non-Linear

Q: The relationship between the predictors and response is highly non-linear.

A:

Better - a flexible method would be able to capture these non-linear relationships with the response.


(d) High-Variance Error

Q: The variance of the error terms, i.e. \(\sigma^{2} = Var(\epsilon)\), is extremely high.

A:

Whenever we build a predictive model, we assume there exists some relationship between the response \(Y\) and the predictors \(X = (X_{1}, X_{2}, ..., X_{p})\), which can be written in the form:

\[Y = f(X) + \epsilon\] Where \(f\) is some fixed function of \(X\), and \(\epsilon\) is a random error term (independent of \(X\)), with mean zero. That is, the value we expect \(\epsilon\) to take, on average, is zero: \[\mathbb{E}[\epsilon] = 0\]

Looking at the squared prediction error, \((Y - \hat{Y})^2\), and treating \(X\) and \(\hat{f}\) as fixed, we can show that its expected value can be expressed as:

\[ \begin{align*} \mathbb{E}[(Y - \hat{Y})^2] &= \mathbb{E}[(f(X) + \epsilon - \hat{f}(X))^2] \\ & = (f(X) - \hat{f}(X))^2 + 2(f(X) - \hat{f}(X))\mathbb{E}[\epsilon] + \mathbb{E}[\epsilon^2] \\ & = (f(X) - \hat{f}(X))^2 + Var(\epsilon) \end{align*} \]

The cross-term vanishes because \(\mathbb{E}[\epsilon] = 0\), and \(\mathbb{E}[\epsilon^2] = Var(\epsilon)\) for the same reason.

Where \([f(X) - \hat{f}(X)]^2\) is called the reducible error, and \(Var(\epsilon)\) is the irreducible error.

In the case of the irreducible error being very high, we would expect a flexible method to perform worse: it is more likely to model the noise \(\epsilon\) as if it were real structure in the data (i.e. it overfits far more easily than a less flexible method would).




2. Scenarios - Classification or Regression?

Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide \(n\) and \(p\).


(a) CEO Salary

Q: We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.

A:

  • Regression (CEO salary is continuous)
  • Inference (understand the relationship between the predictors and salary)
  • \(n\) = 500
  • \(p\) = 3


(b) Product Success

Q: We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.

A:

  • Classification (products classed as either a success or a failure)
  • Prediction (not concerned with how each predictor relates to the response)
  • \(n\) = 20
  • \(p\) = 13


(c) Currency % Change

Q: We are interested in predicting the % change in the US Dollar in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.

A:

  • Regression (% change in USD/Euro exchange rate is continuous)
  • Prediction (not concerned with how each predictor relates to the response)
  • \(n\) = 52
  • \(p\) = 3




3. Bias-Variance Decomposition

We now revisit the bias-variance decomposition.


(a) Sketch

Q: Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Make sure to label each one.

A:
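
I don't reproduce the book's figure here; instead, the sketch below uses purely illustrative, hand-picked curves (my own choices, not estimated from any data) whose shapes match the explanations given in part (b).

library(tidyverse)

flexibility <- seq(0, 10, length.out = 200)

curves <- tibble(
  flexibility, 
  `Squared Bias` = 2.5 * exp(-0.6 * flexibility),              # decreases monotonically with flexibility
  Variance = 0.05 * exp(0.4 * flexibility),                    # increases monotonically with flexibility
  `Bayes Error` = 0.5,                                         # constant irreducible error, > 0
  `Test Error` = `Squared Bias` + Variance + `Bayes Error`,    # U-shaped: bias^2 + variance + irreducible
  `Training Error` = 2.8 * exp(-0.45 * flexibility)            # decreases monotonically, eventually dips below the Bayes error
) %>%
  pivot_longer(-flexibility, names_to = "Curve", values_to = "Error")

ggplot(curves, aes(x = flexibility, y = Error, col = Curve)) + 
  geom_line() + 
  labs(title = "Typical error curves vs flexibility (illustrative)", 
       x = "Flexibility", 
       y = "Error")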


(b) Explanation

Q: Explain why each of the five curves has the shape displayed in part (a).

A:

  • Bias is the error introduced when the complexity of a problem is not sufficiently modelled by the simplicity of the chosen method (e.g. linear regression for non-linear relationships). As model flexibility increases (linear -> trees -> boosting, decreasing K in KNN, etc.), bias decreases monotonically, because fewer assumptions are being made about the data structure and its relationship with the response.

  • Variance refers to the amount by which our predictions would change if the training data were changed, and can be thought of as the error introduced when a model is overfit to the training data. As model flexibility increases, variance increases monotonically, because the method becomes more specified (and then overspecified) to the nuances of the training data, to the point where \(\hat{f}\) doesn’t generalize to new data.

  • Training Error decreases monotonically as flexibility increases. More flexible methods are generally higher variance, and can learn more complex relationships more completely, but also run the risk of overfitting, which is seen where the training error and test error diverge. Think of a decision tree, where the number of terminal nodes = the number of training observations (this model will have 0 training error and a high test error).

  • Test Error decreases, levels out, then increases. The minimum is the point of optimal bias-variance tradeoff, where \(\mathbb{E}[(Y - \hat{Y})^2] = [Bias(\hat{f}(X))]^2 + Var(\hat{f}(X)) + Var(\epsilon)\) is minimized. To the right of this minimum, the method is overfitting (\(\hat{f}\) has too much variance to make up for its lack of bias), and to the left the method is underfitting (\(\hat{f}\) has too much bias to make up for its lack of variance).

  • Irreducible Error refers to the error introduced by inherent uncertainty/noise in the system being approximated. It is constant and > 0 regardless of the flexibility of the model, because \(\epsilon\) may contain unmeasured variables not in \(X\) that could be used to predict \(y\), and because \(\epsilon\) may contain unmeasurable variation in \(y\) that could not be accounted for in \(X\) even if we wanted to. This means that it doesn’t matter how closely \(\hat{f}\) models the ‘true’ function \(f\), there will still be an (unknown) minimum error of \(Var(\epsilon) > 0\).




4. Applications of Statistical Learning

You will now think of some real-life applications for statistical learning.


(a) Classification Uses

Q: Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.

A:

(i) Fraud Detection

  • Application: Fraud detection system
  • Response: Classifying transactions into fraudulent (1) or not fraudulent (0)
  • Predictors: Size of transaction, time of day, new transaction (first time to this company/account or not), country receiving transaction, etc.
  • Goal: Prediction (we don’t care for the form that \(\hat{f}\) takes, as long as it predicts & prevents fraud accurately)

(ii) Disease Diagnosis

  • Application: Disease Diagnosis
  • Response: Classifying people into having (1) or not having (0) a particular disease
  • Predictors: Age, BMI, Smoker (yes/no), resting heart rate, whether they display particular symptoms (yes/no), etc.
  • Goal: Prediction (\(\hat{f}\) can be a black-box if it improves the effectiveness of predictions and saves more lives)

(iii) Loan Defaults

  • Application: Understanding contributing factors to loan defaults
  • Response: Classifying loans into default (1) or not (0)
  • Predictors: Credit score, outstanding balance, type of credit, other recent defaults (yes/no), total payments so far, etc.
  • Goal: Inference (determine what predictors are associated with loan defaults, and how they are related, e.g. as outstanding balance increases by 1k…)


(b) Regression Uses

Q: Describe three real-life applications in which regression might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.

A:

(i) Personal Income

  • Application: Understanding the relationship between lifestyle, personality etc. and a person's income
  • Response: Annual Income
  • Predictors: Age, experience in role, combined salary of reportees, weekly hours worked, personality traits, number of children, etc.
  • Goal: Inference (want to understand the nature of relationship between the predictors and the response)

(ii) House Prices

  • Application: Predicting house prices, to identify and bid on houses that may be underpriced
  • Response: House Price
  • Predictors: Total area of property, number of bedrooms, number of bathrooms, basement (yes/no), garden (yes/no), year built, etc.
  • Goal: Prediction (we want to value the property as accurately as possible)

(iii) YouTube Views

  • Application: Predicting the number of views a YouTube video will receive (for manual review/promotion etc.)
  • Response: Views (first month)
  • Predictors: Total views on channel, avg views over last 3 videos, views (after 1 day), number of google search results (after 1 day), number of subscribers, video length, channel theme (political, gaming, etc.), average videos per month, etc.
  • Goal: Prediction (we want accurate predictions, don’t care about the form \(\hat{f}\) takes)


(c) Clustering Uses

Q: Describe three real-life applications in which cluster analysis might be useful.

A:

  • Clustering of a customer-base based on spending habits, age, location etc. to design separate marketing strategies for each
  • Clustering attributes of an animal of a particular species to explore possible sub-species
  • Recommendation systems, where products, movies, etc. are recommended based on what was popular with others in the same cluster




5. Flexible vs Inflexible Methods

Q: What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification?

A:

Flexible method advantages:

  • Lower bias - fewer assumptions are made about the form of \(f\), so complex, non-linear relationships can be captured.
  • With enough observations, this generally translates into more accurate predictions.

Flexible method disadvantages:

  • Higher variance - a greater risk of overfitting, particularly when \(n\) is small or the irreducible error is large.
  • More observations are needed to estimate \(f\) well.
  • The resulting \(\hat{f}\) is usually harder to interpret, making flexible methods less suited to inference.


Q: Under what circumstances might a more flexible approach be preferred to a less flexible approach?

A:

A more flexible approach is preferable when prediction accuracy is the main goal (rather than interpretability/inference) and the data can support it - i.e. when \(n\) is large relative to \(p\), and/or when the true relationship between the predictors and the response is highly non-linear (see Question 1).

Q: When might a less flexible approach be preferred?

A:

A less flexible approach is preferable when inference/interpretability is the goal, when \(n\) is small (or small relative to \(p\)), when the irreducible error is very high, or when the true relationship is well-approximated by a simpler (e.g. linear) form.



6. Parametric & Non-Parametric Approaches

Q: Describe the differences between a parametric and a non-parametric statistical learning approach.

A:

A parametric approach involves making an assumption about the functional form/shape that \(f\) takes (e.g. that \(f\) is linear in \(X\): \(f(X) = \beta_{0} + \sum_{j = 1}^{p}\beta_{j}X_{j}\)), then applying a fitting procedure (e.g. OLS, which minimizes \(\sum_{i = 1}^{n}(y_{i} - \hat{y}_{i})^2\)) to fit the model by estimating the parameters.

Conversely, non-parametric approaches do not make the same assumptions about the functional form of \(f\), giving them the potential to fit a wider range of possible ‘shapes’ with \(\hat{f}\).
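
As a quick illustration of the difference (my own simulated example, not from the book), the sketch below compares a parametric linear fit with a non-parametric loess fit on the same non-linear data:

library(tidyverse)

set.seed(1)
sim <- tibble(x = sort(runif(200, 0, 10)), 
              y = sin(x) + 0.5 * x + rnorm(200, sd = 0.3))

parametric_fit <- lm(y ~ x, data = sim)         # parametric: assumes f is linear in x, estimates 2 parameters
non_parametric_fit <- loess(y ~ x, data = sim)  # non-parametric: no fixed functional form assumed for f

ggplot(sim, aes(x, y)) + 
  geom_point(alpha = 0.4) + 
  geom_line(aes(y = fitted(parametric_fit)), col = "red") + 
  geom_line(aes(y = fitted(non_parametric_fit)), col = "blue") + 
  labs(title = "Parametric (red) vs non-parametric (blue) fit on simulated data")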


Q: What are the advantages of a parametric approach to regression or classification (as opposed to a nonparametric approach)? What are its disadvantages?

A:

Parametric method advantages:

  • The problem of estimating \(f\) is reduced to estimating a small number of parameters, which is much easier and requires fewer observations.
  • The estimated parameters are easy to interpret, so parametric methods are well-suited to inference.

Parametric method disadvantages:

  • The assumed functional form will rarely match the true \(f\) exactly; if it is a poor approximation, the model will have high bias and predict poorly.
  • Choosing a more flexible parametric form to counter this re-introduces the risk of overfitting.




7. K-Nearest Neighbors

The table below provides a training data set containing six observations, three predictors, and one qualitative response variable. Suppose we wish to use this data set to make a prediction for \(Y\) when \(X_{1} = X_{2} = X_{3} = 0\) using K-nearest neighbors.

data <- data.frame(X1 = c(0, 2, 0, 0, -1, 1), 
                   X2 = c(3, 0, 1, 1, 0, 1), 
                   X3 = c(0, 0, 3, 2, 1, 1), 
                   Y = c('Red', 'Red', 'Red', 'Green', 'Green', 'Red'), 
                   stringsAsFactors = F)

colnames(data) <- c('$X_{1}$', '$X_{2}$', '$X_{3}$', '$Y$')

library(knitr)
library(kableExtra)

kable(data, row.names = T) %>%
   kable_styling(bootstrap_options = "striped", full_width = F)
\(X_{1}\) \(X_{2}\) \(X_{3}\) \(Y\)
1 0 3 0 Red
2 2 0 0 Red
3 0 1 3 Red
4 0 1 2 Green
5 -1 0 1 Green
6 1 1 1 Red


(a) Euclidean Distance

Q: Compute the Euclidean distance between each observation and the test point, \(Z = (X_{1}, X_{2}, X_{3}) = (0, 0, 0)\).

A:

\(d(a, b) = d(b, a) = \sqrt{\sum_{i = 1}^{p}(a_{i} - b_{i})^2}\)

Z <- c(0, 0, 0)

data$dist_X_Z <- sqrt((data$`$X_{1}$` - Z[1])^2 + 
                        (data$`$X_{2}$` - Z[2])^2 + 
                        (data$`$X_{3}$` - Z[3])^2) %>%
  round(4)

colnames(data)[5] <- '$d(X, Z)$'

kable(data, row.names = T, escape = F) %>%
  kable_styling(bootstrap_options = "striped", full_width = F) %>%
  column_spec(6, background = "lightgreen")
\(X_{1}\) \(X_{2}\) \(X_{3}\) \(Y\) \(d(X, Z)\)
1 0 3 0 Red 3.0000
2 2 0 0 Red 2.0000
3 0 1 3 Red 3.1623
4 0 1 2 Green 2.2361
5 -1 0 1 Green 1.4142
6 1 1 1 Red 1.7321


(b) \(K\) = 1

Q: What is our prediction with \(K\) = 1? Why?

A:

Green, since the single nearest neighbor to \(Z\) is observation 5.

data[5, c(4, 5)] <- cell_spec(data[5, c(4, 5)], bold = T, background = "lightgreen")


kable(data, row.names = T, escape = F) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
\(X_{1}\) \(X_{2}\) \(X_{3}\) \(Y\) \(d(X, Z)\)
1 0 3 0 Red 3
2 2 0 0 Red 2
3 0 1 3 Red 3.1623
4 0 1 2 Green 2.2361
5 -1 0 1 Green 1.4142
6 1 1 1 Red 1.7321


(c) \(K\) = 3

Q: What is our prediction with \(K\) = 3? Why?

A:

Red, since the three nearest neighbors to \(Z\) are observations 5, 6 and 2. Two of these have \(Y\) = “Red”, so the prediction for \(Z\) would be “Red”.

data[2, c(4, 5)] <- cell_spec(data[2, c(4, 5)], bold = T, background = "lightgreen")
data[6, c(4, 5)] <- cell_spec(data[6, c(4, 5)], bold = T, background = "lightgreen")

kable(data, row.names = T, escape = F) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
\(X_{1}\) \(X_{2}\) \(X_{3}\) \(Y\) \(d(X, Z)\)
1 0 3 0 Red 3
2 2 0 0 Red 2
3 0 1 3 Red 3.1623
4 0 1 2 Green 2.2361
5 -1 0 1 Green 1.4142
6 1 1 1 Red 1.7321


(d) Non-Linear Bayes Decision Boundary

Q: If the Bayes decision boundary in this problem is highly nonlinear, then would we expect the best value for \(K\) to be large or small? Why?

A:

We would expect a small \(K\) to perform better for a non-linear decision boundary. A small \(K\) is more ‘flexible’, allowing the KNN boundary to be pulled in different directions more easily, whereas a large \(K\) has a ‘smoothing’ effect and produces a more linear boundary (hence, a worse classifier here), because it averages over more points.
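
As a rough illustration of this (a simulation of my own, not part of the exercise), the sketch below compares KNN test error rates for a few values of \(K\) on a simulated, highly non-linear boundary; the data and the values of \(K\) are arbitrary choices.

library(class)

set.seed(1)
n <- 500
train <- data.frame(X1 = runif(n, -2, 2), X2 = runif(n, -2, 2))
train$Y <- factor(ifelse(train$X2 > sin(2 * train$X1), "Red", "Green"))   # highly non-linear boundary
test <- data.frame(X1 = runif(n, -2, 2), X2 = runif(n, -2, 2))
test$Y <- factor(ifelse(test$X2 > sin(2 * test$X1), "Red", "Green"))

for (K in c(1, 5, 100)) {
  pred <- knn(train[, c("X1", "X2")], test[, c("X1", "X2")], train$Y, k = K)
  print(paste("K =", K, "- test error rate:", mean(pred != test$Y)))
}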




8. APPLIED: The College Dataset

NOTE: Because some of the code in the book is a little old, I will sometimes use code here that I believe is simpler/more efficient for accomplishing the same tasks.


This exercise relates to the College data set, which can be found in the file College.csv. It contains a number of variables for 777 different universities and colleges in the US.

The variables are:

  • Private - Public/private indicator
  • Apps - Number of applications received
  • Accept - Number of applications accepted
  • Enroll - Number of new students enrolled
  • Top10perc - Percent of new students from the top 10% of their high school class
  • Top25perc - Percent of new students from the top 25% of their high school class
  • F.Undergrad - Number of full-time undergraduates
  • P.Undergrad - Number of part-time undergraduates
  • Outstate - Out-of-state tuition
  • Room.Board - Room and board costs
  • Books - Estimated book costs
  • Personal - Estimated personal spending
  • PhD - Percent of faculty with Ph.D.'s
  • Terminal - Percent of faculty with a terminal degree
  • S.F.Ratio - Student/faculty ratio
  • perc.alumni - Percent of alumni who donate
  • Expend - Instructional expenditure per student
  • Grad.Rate - Graduation rate


(a) Import

Q: Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data.

A:

college <- read.csv("./College.csv", stringsAsFactors = TRUE)


(b) View, Row Names

Q: Basically the title.

A:

library(tidyverse)

rownames(college) <- college$X
college$X <- NULL

glimpse(college)
## Rows: 777
## Columns: 18
## $ Private     <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, ...
## $ Apps        <int> 1660, 2186, 1428, 417, 193, 587, 353, 1899, 1038, 582, ...
## $ Accept      <int> 1232, 1924, 1097, 349, 146, 479, 340, 1720, 839, 498, 1...
## $ Enroll      <int> 721, 512, 336, 137, 55, 158, 103, 489, 227, 172, 472, 4...
## $ Top10perc   <int> 23, 16, 22, 60, 16, 38, 17, 37, 30, 21, 37, 44, 38, 44,...
## $ Top25perc   <int> 52, 29, 50, 89, 44, 62, 45, 68, 63, 44, 75, 77, 64, 73,...
## $ F.Undergrad <int> 2885, 2683, 1036, 510, 249, 678, 416, 1594, 973, 799, 1...
## $ P.Undergrad <int> 537, 1227, 99, 63, 869, 41, 230, 32, 306, 78, 110, 44, ...
## $ Outstate    <int> 7440, 12280, 11250, 12960, 7560, 13500, 13290, 13868, 1...
## $ Room.Board  <int> 3300, 6450, 3750, 5450, 4120, 3335, 5720, 4826, 4400, 3...
## $ Books       <int> 450, 750, 400, 450, 800, 500, 500, 450, 300, 660, 500, ...
## $ Personal    <int> 2200, 1500, 1165, 875, 1500, 675, 1500, 850, 500, 1800,...
## $ PhD         <int> 70, 29, 53, 92, 76, 67, 90, 89, 79, 40, 82, 73, 60, 79,...
## $ Terminal    <int> 78, 30, 66, 97, 72, 73, 93, 100, 84, 41, 88, 91, 84, 87...
## $ S.F.Ratio   <dbl> 18.1, 12.2, 12.9, 7.7, 11.9, 9.4, 11.5, 13.7, 11.3, 11....
## $ perc.alumni <int> 12, 16, 30, 37, 2, 11, 26, 37, 23, 15, 31, 41, 21, 32, ...
## $ Expend      <int> 7041, 10527, 8735, 19016, 10922, 9727, 8861, 11487, 116...
## $ Grad.Rate   <int> 60, 56, 54, 59, 15, 55, 63, 73, 80, 52, 73, 76, 74, 68,...


(c) Exploring

(i) Summary

Q: Use the summary() function to produce a numerical summary of the variables in the data set.

A:

summary(college)
##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00

(ii) Pairwise Scatter Plots

Q: Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data.

A:

pairs(college[ ,1:10])

(iii) Outstate vs Private

Q: Use the plot() function to produce side-by-side boxplots of Outstate versus Private.

A:

# plot(college$Private, college$Outstate)

theme_set(theme_light()) # setting a theme for all graphs (looks cleaner)

ggplot(college, aes(x = Private, y = Outstate, fill = Private)) + 
  geom_boxplot() + 
  scale_y_continuous(labels = scales::comma_format()) + 
  theme(legend.position = "none") + 
  labs(title = "Outstate vs Private - Boxplot")

(iv) New Variable - Elite

Q:

Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.

A:

college$Elite <- factor(ifelse(college$Top10perc <= 50, "No", "Yes"))

# above is a one-liner for the given code below

# Elite2=rep("No",nrow(college))
# Elite2[college$Top10perc >50]="Yes"
# Elite2=as.factor(Elite2)
# college=data.frame(college, Elite2)
# 
# identical(college$Elite, college$Elite2)

Q: Use the summary() function to see how many elite universities there are.

A:

summary(college$Elite)
##  No Yes 
## 699  78

Q: Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.

A:

ggplot(college, aes(x = Elite, y = Outstate, fill = Elite)) + 
  geom_boxplot() + 
  scale_y_continuous(labels = scales::comma_format()) + 
  theme(legend.position = "none") + 
  labs(title = "Outstate vs Elite - Boxplot")

(v) Histograms

Q: Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables.

A:

apps_1 <- ggplot(college, aes(x = Apps)) + 
  geom_histogram(bins = 30, fill = "deepskyblue3") + 
  scale_x_continuous(labels = scales::comma_format()) + 
  labs(title = "Apps - Histogram", 
       subtitle = "Bins = 30", 
       y = "Count")

apps_2 <- ggplot(college, aes(x = Apps)) + 
    geom_histogram(bins = 60, fill = "deepskyblue3") + 
    scale_x_continuous(labels = scales::comma_format(), limits = c(0, 20000)) + 
    labs(title = "Apps - Histogram", 
         subtitle = "Bins = 60, Apps range: 0 - 20,000", 
       y = "Count")



phd_1 <- ggplot(college, aes(x = PhD)) + 
  geom_histogram(bins = 15, fill = "#16A085") + 
  scale_x_continuous(labels = scales::comma_format()) + 
  labs(title = "PhD - Histogram", 
       subtitle = "Bins = 15", 
       y = "Count")

phd_2 <- ggplot(college, aes(x = PhD)) + 
  geom_histogram(bins = 40, fill = "#16A085") + 
  scale_x_continuous(labels = scales::comma_format()) + 
  labs(title = "PhD - Histogram", 
       subtitle = "Bins = 40", 
       y = "Count")



room.board_1 <- ggplot(college, aes(x = Room.Board)) + 
  geom_histogram(bins = 10, fill = "#BB8FCE") + 
  scale_x_continuous(labels = scales::comma_format()) + 
  labs(title = "Room.Board - Histogram", 
       subtitle = "Bins = 10", 
       y = "Count")

room.board_2 <- ggplot(college, aes(x = Room.Board)) + 
  geom_histogram(bins = 30, fill = "#BB8FCE") + 
  scale_x_continuous(labels = scales::comma_format()) + 
  labs(title = "Room.Board - Histogram", 
       subtitle = "Bins = 30", 
       y = "Count")



library(gridExtra)

grid.arrange(apps_1, apps_2, 
             phd_1, phd_2, 
             room.board_1, room.board_2, 
             ncol = 2, nrow = 3)

(vi) Continued Exploration: best_predictor()

Q: Continue exploring the data, and provide a brief summary of what you discover.

A:

I wrote the following best_predictor() function based on some work I’ve done before, and it is really useful for getting an overview of strong relationships between variables (particularly where plotting every combination of variables isn’t practical).

It will take a dataframe and a response variable \(y\) as the two arguments.

It then fits a simple linear or logistic regression model (depending on whether \(y\) is numeric or categorical) of the response on each other variable in turn, also trying logarithmic & quadratic transformations of the predictor where it is numeric.

The output is a table of the \(R^2\) or \(AIC\) values for the response against every other variable in the dataset, ordered from best to worst.

best_predictor <- function(dataframe, response) {
  
  if (sum(sapply(dataframe, function(x) {is.numeric(x) | is.factor(x)})) < ncol(dataframe)) {
    stop("Make sure that all variables are of class numeric/factor!")
  }
  
  # initialize empty vectors (filled inside the loops below)
  varname <- c()
  vartype <- c()
  R2 <- c()
  R2_log <- c()
  R2_quad <- c()
  AIC <- c()
  AIC_log <- c()
  AIC_quad <- c()
  y <- dataframe[ ,response]
  
  # # # # # NUMERIC RESPONSE # # # # #
  if (is.numeric(y)) {
    
    for (i in 1:ncol(dataframe)) {
      
      x <- dataframe[ ,i]
      varname[i] <- names(dataframe)[i]
      
      if (class(x) %in% c("numeric", "integer")) {
        vartype[i] <- "numeric"
      } else {
        vartype[i] <- "categorical"
      }
      
      if (!identical(y, x)) {
        
        # linear: y ~ x
        R2[i] <- summary(lm(y ~ x))$r.squared 
        
        # log-transform: y ~ log(x)
        if (is.numeric(x)) { 
          if (min(x) <= 0) { # if y ~ log(x) for min(x) <= 0, do y ~ log(x + abs(min(x)) + 1)
            R2_log[i] <- summary(lm(y ~ log(x + abs(min(x)) + 1)))$r.squared
          } else {
            R2_log[i] <- summary(lm(y ~ log(x)))$r.squared
          }
        } else {
          R2_log[i] <- NA
        }
        
        # quadratic: y ~ x + x^2
        if (is.numeric(x)) { 
          R2_quad[i] <- summary(lm(y ~ x + I(x^2)))$r.squared
        } else {
          R2_quad[i] <- NA
        }
        
      } else {
        R2[i] <- NA
        R2_log[i] <- NA
        R2_quad[i] <- NA
      }
    }
    
    print(paste("Response variable:", response))
    
    data.frame(varname, 
               vartype, 
               R2 = round(R2, 3), 
               R2_log = round(R2_log, 3), 
               R2_quad = round(R2_quad, 3)) %>%
      mutate(max_R2 = pmax(R2, R2_log, R2_quad, na.rm = T)) %>%
      arrange(desc(max_R2))
    
    
    # # # # # CATEGORICAL RESPONSE # # # # #
  } else {
    
    for (i in 1:ncol(dataframe)) {
      
      x <- dataframe[ ,i]
      varname[i] <- names(dataframe)[i]
      
      if (class(x) %in% c("numeric", "integer")) {
        vartype[i] <- "numeric"
      } else {
        vartype[i] <- "categorical"
      }
      
      if (!identical(y, x)) {
        
        # logistic regression: y ~ x
        AIC[i] <- summary(glm(y ~ x, family = "binomial"))$aic 
        
        # log-transform: y ~ log(x)
        if (is.numeric(x)) { 
          if (min(x) <= 0) { # if y ~ log(x) for min(x) <= 0, do y ~ log(x + abs(min(x)) + 1)
            AIC_log[i] <- summary(glm(y ~ log(x + abs(min(x)) + 1), family = "binomial"))$aic
          } else {
            AIC_log[i] <- summary(glm(y ~ log(x), family = "binomial"))$aic
          }
        } else {
          AIC_log[i] <- NA
        }
        
        # quadratic: y ~ x + x^2
        if (is.numeric(x)) { 
          AIC_quad[i] <- summary(glm(y ~ x + I(x^2), family = "binomial"))$aic
        } else {
          AIC_quad[i] <- NA
        }
        
      } else {
        AIC[i] <- NA
        AIC_log[i] <- NA
        AIC_quad[i] <- NA
      }
    }
    
    print(paste("Response variable:", response))
    
    data.frame(varname, 
               vartype, 
               AIC = round(AIC, 3), 
               AIC_log = round(AIC_log, 3), 
               AIC_quad = round(AIC_quad, 3)) %>%
      mutate(min_AIC = pmin(AIC, AIC_log, AIC_quad, na.rm = T)) %>%
      arrange(min_AIC)
  } 
}

Applying it to a classification situation, I want to know: what is the best single predictor of Private?

best_predictor(college, "Private")
## [1] "Response variable: Private"
##        varname     vartype     AIC AIC_log AIC_quad min_AIC
## 1  F.Undergrad     numeric 573.904 529.782  543.025 529.782
## 2     Outstate     numeric 578.891 587.785  580.690 578.891
## 3       Enroll     numeric 636.124 600.760  614.453 600.760
## 4  P.Undergrad     numeric 697.830 641.608  682.239 641.608
## 5       Accept     numeric 722.970 697.064  712.350 697.064
## 6    S.F.Ratio     numeric 723.261 714.265  698.678 698.678
## 7         Apps     numeric 761.995 722.126  751.608 722.126
## 8  perc.alumni     numeric 750.749 779.767  747.815 747.815
## 9   Room.Board     numeric 812.384 812.717  814.367 812.384
## 10      Expend     numeric 822.691 813.843  820.804 813.843
## 11   Grad.Rate     numeric 822.580 840.692  815.593 815.593
## 12    Personal     numeric 844.325 831.692  821.576 821.576
## 13         PhD     numeric 894.541 890.556  882.192 882.192
## 14   Top10perc     numeric 891.354 883.514  884.677 883.514
## 15    Terminal     numeric 901.062 899.590  897.510 897.510
## 16   Top25perc     numeric 907.533 906.502  908.150 906.502
## 17       Elite categorical 909.356      NA       NA 909.356
## 18       Books     numeric 914.487 914.575  912.451 912.451
## 19     Private categorical      NA      NA       NA      NA

Let’s visualize:

private_predictors <- best_predictor(college, "Private") %>%
  select(-c(vartype, min_AIC)) %>%
  gather(key = "key", value = "AIC", -varname) %>%
  filter(!is.na(AIC))

private_predictors_order <- best_predictor(college, "Private") %>%
  select(varname, min_AIC) %>%
  filter(!is.na(min_AIC)) %>%
  arrange(desc(min_AIC)) %>%
  pull(varname)

private_predictors$varname <- factor(private_predictors$varname, 
                                     ordered = T, 
                                     levels = private_predictors_order)
ggplot(private_predictors, aes(x = AIC, y = varname, col = key, group = varname)) + 
  geom_line(col = "grey15") +
  geom_point(size = 2) + 
  theme_light() + 
  theme(legend.position = "bottom") +
  labs(title = "Best predictors (& transformations) for Private", 
       col = "Predictor Transformation", 
       y = "Predictor") + 
  scale_color_discrete(labels = c("None", "Log-Transformed", "Order 2 Polynomial"))

Below we can see the distributions of F.Undergrad for private and not private colleges. Private colleges tend to have a significantly smaller number of undergraduates.

ggplot(college, aes(x = Private, y = F.Undergrad, fill = Private)) + 
  geom_boxplot() + 
  scale_y_continuous(labels = scales::comma_format()) + 
  theme(legend.position = "none") + 
  labs(title = "F.Undergrad vs Private - Boxplot")

I’m also curious as to what the best predictor of Grad.Rate is.

best_predictor(college, "Grad.Rate")
## [1] "Response variable: Grad.Rate"
##        varname     vartype    R2 R2_log R2_quad max_R2
## 1     Outstate     numeric 0.326  0.311   0.327  0.327
## 2    Top10perc     numeric 0.245  0.226   0.249  0.249
## 3  perc.alumni     numeric 0.241  0.239   0.248  0.248
## 4    Top25perc     numeric 0.228  0.213   0.228  0.228
## 5       Expend     numeric 0.152  0.180   0.183  0.183
## 6   Room.Board     numeric 0.181  0.179   0.182  0.182
## 7        Elite categorical 0.122     NA      NA  0.122
## 8      Private categorical 0.113     NA      NA  0.113
## 9          PhD     numeric 0.093  0.071   0.108  0.108
## 10    Terminal     numeric 0.084  0.072   0.098  0.098
## 11   S.F.Ratio     numeric 0.094  0.093   0.096  0.096
## 12    Personal     numeric 0.073  0.073   0.082  0.082
## 13 P.Undergrad     numeric 0.066  0.074   0.080  0.080
## 14        Apps     numeric 0.022  0.040   0.025  0.040
## 15      Accept     numeric 0.005  0.021   0.006  0.021
## 16 F.Undergrad     numeric 0.006  0.000   0.007  0.007
## 17      Enroll     numeric 0.000  0.004   0.001  0.004
## 18       Books     numeric 0.000  0.000   0.000  0.000
## 19   Grad.Rate     numeric    NA     NA      NA     NA

Visualizing again:

gradrate_predictors <- best_predictor(college, "Grad.Rate") %>%
  select(-c(vartype, max_R2)) %>%
  gather(key = "key", value = "R2", -varname) %>%
  filter(!is.na(R2))

gradrate_predictors_order <- best_predictor(college, "Grad.Rate") %>%
  select(varname, max_R2) %>%
  filter(!is.na(max_R2)) %>%
  arrange(max_R2) %>%
  pull(varname)

gradrate_predictors$varname <- factor(gradrate_predictors$varname, 
                                     ordered = T, 
                                     levels = gradrate_predictors_order)
ggplot(gradrate_predictors, aes(x = R2, y = varname, col = key, group = varname)) + 
  geom_line(col = "grey15") +
  geom_point(size = 2) + 
  theme_light() + 
  theme(legend.position = "bottom") +
  labs(title = "Best predictors (& transformations) for Grad.Rate", 
       col = "Predictor Transformation", 
       y = "Predictor") + 
  scale_color_discrete(labels = c("None", "Log-Transformed", "Order 2 Polynomial"))

The strongest relationship is with Outstate. I visualize this relationship below and add a smoothing line.

ggplot(college, aes(x = Outstate, y = Grad.Rate)) + 
  geom_point() + 
  geom_smooth() + 
  scale_x_continuous(labels = scales::comma_format()) + 
  scale_y_continuous(breaks = seq(0, 120, 20)) +
  labs(title = "Outstate vs Grad.Rate - Scatterplot")

This would suggest that the relationship is best described as linear, and that the marginal increase in \(R^2\) for the second-order transformation is not meaningful. Plain \(R^2\) can never decrease as variables are added; using adjusted \(R^2\), which penalizes the extra term, would make this explicit.
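
As a quick check of this point (my own addition, using the college data loaded above):

linear_fit <- lm(Grad.Rate ~ Outstate, data = college)
quadratic_fit <- lm(Grad.Rate ~ Outstate + I(Outstate^2), data = college)

c(linear = summary(linear_fit)$r.squared, 
  quadratic = summary(quadratic_fit)$r.squared)         # R^2 can only increase as terms are added

c(linear = summary(linear_fit)$adj.r.squared, 
  quadratic = summary(quadratic_fit)$adj.r.squared)     # adjusted R^2 penalizes the extra term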

We can speculate that the true relationship might be logarithmic or quadratic over a larger range of Outstate though, as it wouldn’t make sense for our fitted line to predict a Grad.Rate of 120% for a sufficiently high value of Outstate, which is what the linear fit will eventually do (although this is not a problem over the range of the data).

On that topic, we can see from the scatter plot that one college is reporting a graduation rate of >100% - perhaps there are other errors of this sort?

The variables in a percentage format are:

  • PhD
  • perc.alumni
  • Grad.Rate

Checking for values outside a \([0, 100]\) range:

sum(college$PhD > 100 | college$PhD < 0)
## [1] 1
sum(college$perc.alumni > 100 | college$perc.alumni < 0)
## [1] 0
sum(college$Grad.Rate > 100 | college$Grad.Rate < 0)
## [1] 1
college %>%
  rownames_to_column("college") %>%
  filter(PhD > 100 | PhD < 0 | Grad.Rate > 100 | Grad.Rate < 0) %>%
  select(college, PhD, Grad.Rate)
##                             college PhD Grad.Rate
## 1                 Cazenovia College  22       118
## 2 Texas A&M University at Galveston 103        43

We can see that ‘Texas A&M University at Galveston’ is reporting that 103% of its faculty have PhDs, and ‘Cazenovia College’ has a graduation rate of 118% - impressive stuff!




9. APPLIED: The Auto Dataset

This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.

library(ISLR)

glimpse(Auto)
## Rows: 392
## Columns: 9
## $ mpg          <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14...
## $ cylinders    <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, ...
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383,...
## $ horsepower   <dbl> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170,...
## $ weight       <dbl> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, ...
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8....
## $ year         <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70...
## $ origin       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, ...
## $ name         <fct> chevrolet chevelle malibu, buick skylark 320, plymouth...
sum(is.na(Auto)) # 0
## [1] 0


(a) Classifying the Variables

Q: Which of the predictors are quantitative, and which are qualitative?

A:

Typing ?ISLR::Auto into the console shows the variable definitions:

Quantitative:

  • mpg - Miles per gallon
  • cylinders - Number of cylinders between 4 and 8
  • displacement - Engine displacement (cu. inches)
  • horsepower - Engine horsepower
  • weight - Vehicle weight (lbs.)
  • acceleration - Time to accelerate from 0 to 60 mph (sec.)
  • year - Model year (modulo 100)

Qualitative:

  • origin - Origin of car (1. American, 2. European, 3. Japanese)
  • name - Vehicle name


(b) Variable Ranges

Q: What is the range of each quantitative predictor? You can answer this using the range() function.

A:

range_Auto <- data.frame(sapply(Auto[ ,1:7], range))
rownames(range_Auto) <- c("min:", "max:")
range_Auto
##       mpg cylinders displacement horsepower weight acceleration year
## min:  9.0         3           68         46   1613          8.0   70
## max: 46.6         8          455        230   5140         24.8   82


(c) Mean, Standard Deviation

Q: What is the mean and standard deviation of each quantitative predictor?

A:

Mean:

sapply(Auto[ ,1:7], mean)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##    23.445918     5.471939   194.411990   104.469388  2977.584184    15.541327 
##         year 
##    75.979592

Standard Deviation:

sapply(Auto[ ,1:7], sd)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##     7.805007     1.705783   104.644004    38.491160   849.402560     2.758864 
##         year 
##     3.683737


(d) Subsample - Mean, Standard Deviation

Q: Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?

A:

Auto_2 <- Auto[-c(10:85), ]

Range:

range_Auto_2 <- data.frame(sapply(Auto_2[ ,1:7], range))
rownames(range_Auto_2) <- c("min:", "max:")
range_Auto_2
##       mpg cylinders displacement horsepower weight acceleration year
## min: 11.0         3           68         46   1649          8.5   70
## max: 46.6         8          455        230   4997         24.8   82

Mean:

sapply(Auto_2[ ,1:7], mean)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##    24.404430     5.373418   187.240506   100.721519  2935.971519    15.726899 
##         year 
##    77.145570

Standard Deviation:

sapply(Auto_2[ ,1:7], sd)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##     7.867283     1.654179    99.678367    35.708853   811.300208     2.693721 
##         year 
##     3.106217


(e) Investigate Predictors

Q: Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.

A:

I am curious as to how relationships between the variables might differ across the different levels of origin. Looking at all relationships between the numeric variables is a good place to start here:

pairs(Auto[ ,1:7])

Many of these pairs clearly show a strong positive/negative relationship, and it’s unlikely that splitting by origin will give any additional insights. For this reason, I’m actually looking for weaker relationships.

On its own, the relationship between weight and acceleration looks quite boring - a weak negative correlation.

Auto$origin <- factor(Auto$origin, labels = c("American", "European", "Japanese"))

ggplot(Auto, aes(x = weight, y = acceleration)) + 
  geom_point() + 
  theme(legend.position = "none") + 
  scale_x_continuous(labels = scales::comma_format()) + 
  labs(x = "Weight", 
       y = "Acceleration", 
       title = "Correlation between weight and acceleration")

However, this relationship varies based on the country of origin of the car. European cars actually show a weak positive correlation - heavier cars also tended to take longer to accelerate from 0 to 60 mph. For American and Japanese cars, this relationship was reversed.

ggplot(Auto, aes(x = weight, y = acceleration, col = origin)) + 
  geom_point() + 
  geom_smooth(method = "lm") +
  facet_wrap(~ origin) + 
  theme(legend.position = "none") + 
  scale_x_continuous(labels = scales::comma_format()) + 
  labs(x = "Weight", 
       y = "Acceleration", 
       title = "Correlation between weight and acceleration, by origin")

Note also how the range of weight is significantly smaller for Japanese and European cars - there were clearly quite large differences in the types of cars each country was developing.
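
A quick numerical check of both observations (my own addition) - the sign of the correlation and the weight range, by origin:

Auto %>%
  group_by(origin) %>%
  summarise(cor_weight_accel = cor(weight, acceleration), 
            min_weight = min(weight), 
            max_weight = max(weight))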

Another good example of this is found by looking at trends over time. The aggregate trend of displacement over time looks uninteresting:

ggplot(Auto, aes(x = year + 1900, y = displacement)) + 
  geom_jitter() + 
  theme(legend.position = "none") + 
  labs(x = "Year", 
       y = "Displacement", 
       title = "Engine Displacement (trends over time)")

This appears to be a weak negative relationship - fitting a linear model predicting displacement with year yields an \(R^2\) of 0.1368.

summary(lm(displacement ~ year, data = Auto))
## 
## Call:
## lm(formula = displacement ~ year, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -175.73  -75.58  -13.45   69.69  229.28 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  992.691    101.661   9.765  < 2e-16 ***
## year         -10.506      1.336  -7.862 3.75e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 97.35 on 390 degrees of freedom
## Multiple R-squared:  0.1368, Adjusted R-squared:  0.1346 
## F-statistic:  61.8 on 1 and 390 DF,  p-value: 3.748e-14

However, once we begin to look at this trend across different origins, it becomes more interesting:

ggplot(Auto, aes(x = year + 1900, y = displacement, col = factor(origin))) + 
  geom_jitter() + 
  geom_smooth(method = "lm") +
  theme(legend.position = "none") + 
  labs(x = "Year", 
       y = "Displacement", 
       title = "Engine Displacement (trends over time), by Origin") + 
  facet_wrap(~ origin)

From a linear model standpoint, we can account for the fact that the average engine displacement was significantly higher for American cars than Japanese/European by adding the origin variable to the model.

We can further account for the differing trends over time by adding an interaction term between year and origin, because the relationship between year and displacement is different, depending on the value of origin!

summary(lm(displacement ~ year * origin, data = Auto))
## 
## Call:
## lm(formula = displacement ~ year * origin, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -217.920  -26.784   -4.066   33.700  174.813 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1258.425     91.656  13.730  < 2e-16 ***
## year                  -13.373      1.211 -11.042  < 2e-16 ***
## originEuropean      -1252.625    208.468  -6.009 4.33e-09 ***
## originJapanese      -1215.032    190.072  -6.392 4.71e-10 ***
## year:originEuropean    14.745      2.752   5.357 1.46e-07 ***
## year:originJapanese    14.139      2.466   5.734 1.98e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 69.25 on 386 degrees of freedom
## Multiple R-squared:  0.5677, Adjusted R-squared:  0.5621 
## F-statistic: 101.4 on 5 and 386 DF,  p-value: < 2.2e-16

(year * origin is shorthand for year + origin + year:origin)

This boosts our \(R^2\) significantly, to 0.5677.


(f) Predicting mpg

Q: Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.

A:

Referring back to the scatter matrix, it appears that all quantitative predictors have some relationship with the response.

name (the name of the vehicle) is categorical, and has 301 unique values. This isn’t really practical in its current state as a predictor, particularly on such a small dataset.

However, we may be able to extract useful information. For example, the format of the name variable seems to be that the first word is the brand of the car.

I extract this information, correct some mistakes and collapse the variable down into 10 levels (with the 10th being ‘uncommon’):

Auto$brand <- sapply(strsplit(as.character(Auto$name), split = " "),
                     function(x) x[1]) # extract the first item from each list element

Auto$brand <- factor(ifelse(Auto$brand %in% c("vokswagen", "vw"), "volkswagen", 
                     ifelse(Auto$brand == "toyouta", "toyota", 
                            ifelse(Auto$brand %in% c("chevroelt", "chevy"), "chevrolet", 
                                   ifelse(Auto$brand == "maxda", "mazda", 
                                          Auto$brand))))) # fixing typo's

Auto$brand <- fct_lump(Auto$brand, 
                       n = 9, 
                       other_level = "uncommon") # collapse into 10 categories

table(Auto$brand)
## 
##        amc      buick  chevrolet     datsun      dodge       ford   plymouth 
##         27         17         47         23         28         48         31 
##     toyota volkswagen   uncommon 
##         26         22        123

Visualizing this new variable:

ggplot(Auto, aes(x = brand, y = mpg, fill = brand)) + 
geom_boxplot() + 
  theme(legend.position = "none") + 
  labs(title = "Brand vs Mpg - Boxplot", 
       subtitle = "Engineered feature", 
       x = "Brand", 
       y = "MPG")

The variable is now in a state of usability, and looks like it would be useful in predicting mpg.


origin also looks useful:

ggplot(Auto, aes(x = origin, y = mpg, fill = origin)) + 
  geom_boxplot() + 
  theme(legend.position = "none") + 
  labs(title = "Origin vs Mpg - Boxplot", 
       x = "Origin", 
       y = "MPG")


We can use the function written earlier, best_predictor(), to assess how these predictors do when compared with the quantitative predictors. I drop name for reasons mentioned earlier.

best_predictor(select(Auto, -name), "mpg")
## [1] "Response variable: mpg"
##        varname     vartype    R2 R2_log R2_quad max_R2
## 1       weight     numeric 0.693  0.713   0.715  0.715
## 2 displacement     numeric 0.648  0.686   0.689  0.689
## 3   horsepower     numeric 0.606  0.668   0.688  0.688
## 4    cylinders     numeric 0.605  0.603   0.607  0.607
## 5         year     numeric 0.337  0.332   0.368  0.368
## 6       origin categorical 0.332     NA      NA  0.332
## 7        brand categorical 0.260     NA      NA  0.260
## 8 acceleration     numeric 0.179  0.190   0.194  0.194
## 9          mpg     numeric    NA     NA      NA     NA




10. APPLIED: The Boston Housing Dataset

(a) Dataset Overview

Q: How many rows are in this data set? How many columns? What do the rows and columns represent?

A:

library(MASS)

dim(Boston)
## [1] 506  14
# ?Boston

The Boston housing dataset is 506 rows and 14 columns.

Each row represents a Boston suburb.

The columns represent:

  • crim - Per capita crime rate by town
  • zn - Proportion of residential land zoned for lots over 25,000 sq.ft.
  • indus - Proportion of non-retail business acres per town
  • chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • nox - Nitrogen oxides concentration (parts per 10 million)
  • rm - Average number of rooms per dwelling
  • age - Proportion of owner-occupied units built prior to 1940
  • dis - Weighted mean of distances to five Boston employment centres
  • rad - Index of accessibility to radial highways
  • tax - Full-value property-tax rate per $10,000
  • ptratio - Pupil-teacher ratio by town
  • black - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  • lstat - Lower status of the population (percent)
  • medv - Median value of owner-occupied homes in $1000s


(b) Pairwise Scatter Plots

Q: Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.

A:

pairs(Boston)

There’s a lot going on here, and it’s hard to identify many relationships. I’ll loop through my best_predictor() function again for every variable in the dataframe, identifying what the best predictor for it is (including transformations).

Based on the data definitions, I convert chas to a factor format.

Boston$chas <- factor(Boston$chas)

Below we can see the best predictor for every quantitative variable:

predictor_df <- data.frame(response = character(), 
                             varname = character(), 
                             R2 = numeric(), 
                             R2_log = numeric(), 
                             R2_quad = numeric(), 
                             max_R2 = numeric(), 
                             stringsAsFactors = F)

for (i in setdiff(1:ncol(Boston), 4)) { # excluding 'chas' as it uses different eval metric
  response <- names(Boston)[i]
  predictor_info <- best_predictor(Boston, names(Boston)[i])[1, -2]
  predictor_df <- rbind(predictor_df, cbind(response, predictor_info))
}
predictor_df %>%
  arrange(desc(max_R2)) %>%
  rename(best_predictor = varname)
##    response best_predictor    R2 R2_log R2_quad max_R2
## 1       rad            tax 0.829  0.746   0.890  0.890
## 2       tax            rad 0.829  0.723   0.832  0.832
## 3       dis            nox 0.592  0.653   0.736  0.736
## 4       nox            dis 0.592  0.692   0.700  0.700
## 5     lstat           medv 0.544  0.648   0.679  0.679
## 6      medv          lstat 0.544  0.665   0.641  0.665
## 7       age            nox 0.535  0.586   0.653  0.653
## 8     indus            nox 0.583  0.609   0.628  0.628
## 9        rm           medv 0.484  0.399   0.488  0.488
## 10       zn            dis 0.441  0.349   0.471  0.471
## 11     crim            rad 0.391  0.323   0.399  0.399
## 12  ptratio           medv 0.258  0.252   0.268  0.268
## 13    black           crim 0.148  0.229   0.234  0.234

And the best predictor for the single qualitative variable, chas:

best_predictor(Boston, "chas")[1, ]
## [1] "Response variable: chas"
##   varname vartype     AIC AIC_log AIC_quad min_AIC
## 1    medv numeric 245.312 245.016  247.312 245.016

The strongest relationship identified among the quantitative variables is a quadratic relationship between rad and tax. I plot the fitted line on the graph below, and looking at this shows that the strong \(R^2\) is deceptive.

g1 <- ggplot(Boston, aes(x = tax, y = rad)) + 
  geom_jitter(alpha = 0.1) + 
  geom_smooth(method = "lm", formula = "y ~ x + I(x^2)") + 
  scale_x_continuous(labels = scales::comma_format()) +
  labs(title = "Tax vs Rad - Jitter Plot", 
       x = "Tax", 
       y = "Rad")

g2 <- Boston %>%
  filter(rad < 20) %>%
  ggplot(aes(x = tax, y = rad)) + 
  geom_jitter(alpha = 0.1) + 
  geom_smooth(method = "lm", formula = "y ~ x + I(x^2)") + 
  scale_x_continuous(labels = scales::comma_format()) +
  labs(title = "Tax vs Rad - Jitter Plot", 
       subtitle = "Filtered for rad < 20 (132/506 obs. removed)",
       x = "Tax", 
       y = "Rad")

grid.arrange(g1, g2, ncol = 2)

Note that jitter & reduced alpha has been applied to the plot, due to overlapping observations.

In actuality, the only strong part of the relationship between rad and tax is: “If tax = 666 then rad = 24”.

Fitting a regression line of the form rad ~ tax + tax^2 on the remainder of the data, as done in the right graph, demonstrates this:

Boston_2 <- Boston %>%
  filter(rad < 20)

summary(lm(rad ~ tax + I(tax^2), data = Boston_2))
## 
## Call:
## lm(formula = rad ~ tax + I(tax^2), data = Boston_2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0642 -0.7413 -0.0235  0.7961  3.6424 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -9.778e-02  8.026e-01  -0.122    0.903    
## tax          2.280e-02  4.284e-03   5.323 1.78e-07 ***
## I(tax^2)    -2.504e-05  5.484e-06  -4.567 6.75e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.565 on 371 degrees of freedom
## Multiple R-squared:  0.08678,    Adjusted R-squared:  0.08185 
## F-statistic: 17.63 on 2 and 371 DF,  p-value: 4.863e-08

In essence, the apparent strength of the relationship is driven almost entirely by the rad = 24 / tax = 666 group; for the remaining suburbs (rad < 20), the \(R^2\) is only 0.087!


Looking at the second strongest pair of variables identified shows a ‘true’ relationship of the form dis ~ nox + nox^2:

ggplot(Boston, aes(x = nox, y = dis)) + 
  geom_point(alpha = 0.2) + 
  geom_smooth(method = "lm", formula = "y ~ x + I(x^2)") + 
    labs(title = "Nox vs Dis - Scatter Plot", 
         subtitle = "Quadratic fit",
         x = "Nox",
         y = "Dis")


(c) Predictors of crim

Q: Are any of the predictors associated with per capita crime rate? If so, explain the relationship.

A:

I can call my best_predictor() function again here to give us a better idea of where to look.

best_predictor(Boston, "crim")
## [1] "Response variable: crim"
##    varname     vartype    R2 R2_log R2_quad max_R2
## 1      rad     numeric 0.391  0.323   0.399  0.399
## 2      tax     numeric 0.340  0.304   0.367  0.367
## 3     medv     numeric 0.151  0.279   0.358  0.358
## 4      dis     numeric 0.144  0.216   0.229  0.229
## 5    lstat     numeric 0.208  0.156   0.214  0.214
## 6      nox     numeric 0.177  0.185   0.199  0.199
## 7    indus     numeric 0.165  0.145   0.181  0.181
## 8      age     numeric 0.124  0.084   0.162  0.162
## 9    black     numeric 0.148  0.125   0.149  0.149
## 10 ptratio     numeric 0.084  0.079   0.100  0.100
## 11      rm     numeric 0.048  0.055   0.067  0.067
## 12      zn     numeric 0.040  0.057   0.056  0.057
## 13    chas categorical 0.003     NA      NA  0.003
## 14    crim     numeric    NA     NA      NA     NA

rad and tax are supposedly the best predictors again, but I have already covered the quirks of these variables in some detail.


The next best predictor is medv, which best_predictor() suggests should have a non-linear relationship with crim.

Note the large number of towns with a crim value of close to zero.

ggplot(Boston, aes(x = medv, y = crim)) + 
  geom_point(alpha = 0.5) + 
  geom_smooth(method = "lm", formula = "y ~ x + I(x^2)", col = "green") +
    labs(title = "Medv vs Crim - Scatter Plot", 
       x = "Medv", 
       y = "Crim")

Those with a smaller value of medv (where the median value of owner-occupied homes is lower) tend to have a higher value of crim (higher crime rates). After a certain point, an increase in medv seems to have little impact on the crime rate.

A very similar non-linear relationship can be seen between crim and dis (weighted mean distance to employment centres).

ggplot(Boston, aes(x = dis, y = crim)) + 
  geom_point(alpha = 0.5) + 
    geom_smooth(method = "lm", formula = "y ~ x + I(x^2)", col = "green") +
    labs(title = "Dis vs Crim - Scatter Plot", 
       x = "Dis", 
       y = "Crim")

The relationship between crim and lstat (percent of the population that are lower status) looks like it could be linear, although a quadratic relationship yields a slightly higher \(R^2\).

ggplot(Boston, aes(x = lstat, y = crim)) + 
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = "y ~ x + I(x^2)", col = "green") +
  labs(title = "Lstat vs Crim - Scatter Plot",
       x = "Lstat", 
       y = "Crim")

We are starting to build up a picture of the towns with higher crime rates (lower-value housing, closer to the employment centres, lower status).


(d) High Crime, Tax Rate & Pupil-Teacher Ratio Suburbs

Q: Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.

A:

Crime rates:

We have already seen that most suburbs have a crime rate at or close to zero, but some suburbs have incredibly high crime rates:

ggplot(Boston, aes(x = "", y = crim)) + 
  geom_boxplot() + 
  scale_y_continuous(breaks = seq(0, 100, 10)) + 
  labs(y = "Crim", 
       title = "Crim - Boxplot") + 
  coord_flip() + 
  theme(axis.title.y = element_blank(), 
        axis.ticks.y = element_blank())

It’s interesting to see how much variation there is in the crime rates within Boston, but I’m confused by the huge range:

range(Boston$crim)
## [1]  0.00632 88.97620

The median value of crim is 0.25651, but there are many suburbs with values far higher than this.
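
For reference (my own quick check of the figures above):

median(Boston$crim)
quantile(Boston$crim, probs = seq(0.75, 1, 0.05))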

“Per capita crime rate by town” would suggest to me that a value of 88.98 either means there are 88.98% as many crimes as there are people over some period, or that there were 88.98 crimes per person over some period (using either of these definitions makes it seem exceptionally high).


Tax rates:

tax is weird, as seen already.

range(Boston$tax)
## [1] 187 711

The range doesn’t span many orders of magnitude, but a large number of suburbs take exactly the same tax value, well away from the rest of the distribution, presumably because they all fall under the same tax rate.

ggplot(Boston, aes(x = tax)) + 
  geom_histogram(binwidth = 10, fill = "mediumseagreen") + 
  scale_y_continuous(breaks = seq(0, 200, 20)) +
  labs(title = "Tax - Histogram", 
       x = "Tax", 
       y = "Count")


Pupil-teacher Ratios:

ptratio doesn’t seem to have any huge outliers or gaps, and it spans a smaller range:

range(Boston$ptratio)
## [1] 12.6 22.0
ggplot(Boston, aes(x = ptratio)) + 
  geom_histogram(binwidth = 0.25, fill = "deepskyblue") + 
  scale_x_continuous(breaks = seq(12, 23, 0.5)) + 
  scale_y_continuous(breaks = seq(0, 200, 20)) + 
  labs(title = "Ptratio - Histogram", 
       x = "Ptratio", 
       y = "Count")

Again, there are many observations with the same value:

Boston %>%
  group_by(ptratio) %>%
  count() %>%
  arrange(desc(n))
## # A tibble: 46 x 2
## # Groups:   ptratio [46]
##    ptratio     n
##      <dbl> <int>
##  1    20.2   140
##  2    14.7    34
##  3    21      27
##  4    17.8    23
##  5    19.2    19
##  6    17.4    18
##  7    18.6    17
##  8    19.1    17
##  9    16.6    16
## 10    18.4    16
## # ... with 36 more rows

Checking the data dictionary, I thought this might be because these are aggregate figures (across multiple suburbs, perhaps at a town level), which would explain why so many suburbs share the same value across multiple variables.

See below, where we can see that many (but not all) of these 140 suburbs have the same values for ptratio, rad, tax, zn & indus!

Boston %>%
  group_by(ptratio, rad, tax, zn, indus) %>%
  count() %>%
  arrange(desc(n))
## # A tibble: 78 x 6
## # Groups:   ptratio, rad, tax, zn, indus [78]
##    ptratio   rad   tax    zn indus     n
##      <dbl> <int> <dbl> <dbl> <dbl> <int>
##  1    20.2    24   666     0 18.1    132
##  2    14.7     5   403     0 19.6     30
##  3    21       4   307     0  8.14    22
##  4    17.4     8   307     0  6.2     18
##  5    21.2     4   437     0 21.9     15
##  6    13       5   264    20  3.97    12
##  7    18.4     4   304     0  9.9     12
##  8    18.6     4   277     0 10.6     11
##  9    20.9     5   384     0  8.56    11
## 10    19.1     7   330    22  5.86    10
## # ... with 68 more rows

However, when I add an extra variable, black, which the data dictionary also states is a town-level variable, we see that it varies within the groups formed by these 5 variables, so these groups can’t even be the same towns!

Boston %>%
  group_by(ptratio, rad, tax, zn, indus, black) %>%
  count() %>%
  arrange(desc(n))
## # A tibble: 430 x 7
## # Groups:   ptratio, rad, tax, zn, indus, black [430]
##    ptratio   rad   tax    zn indus black     n
##      <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int>
##  1    20.2    24   666     0 18.1   397.    29
##  2    19.2     6   391     0  9.69  397.     6
##  3    19.6     5   287     0  7.38  397.     6
##  4    21.2     4   437     0 21.9   397.     5
##  5    16       4   289     0 13.9   397.     4
##  6    17.9     3   233     0  6.91  397.     4
##  7    20.2     5   224     0  5.19  397.     4
##  8    14.7     5   403     0 19.6   397.     3
##  9    17.6     4   254    40  6.41  397.     3
## 10    17.8     3   193     0  2.46  397.     3
## # ... with 420 more rows

I’ve found this dataset very awkward to work with. It appears that many variables are at a suburb (row) level, while others are at aggregate levels higher than that (towns & larger). The data dictionary doesn’t tell us which is which, making it far harder to draw insights from the data.


(e) Charles River

Q: How many of the suburbs in this data set bound the Charles river?

A:

table(Boston$chas)
## 
##   0   1 
## 471  35

There are 35 suburbs that bound the Charles river.


(f) Median ptratio

Q: What is the median pupil-teacher ratio among the towns in this data set?

A:

Here I’m going to assume that the authors wanted the median per suburb (row), as it appears that there is no way of grouping these suburbs into towns without getting contradictions.

Since crim, indus, ptratio & black are all supposedly ‘town-level’ variables, the number of distinct combinations of all four should match the number of distinct values of any one of them, but this isn’t the case.
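A quick way of verifying this would be something like the following, using dplyr's n_distinct():

# If crim, indus, ptratio & black were all genuinely town-level, the number of
# distinct combinations of all four should match the distinct values of each one
Boston %>%
  summarise(combinations   = n_distinct(crim, indus, ptratio, black),
            crim_values    = n_distinct(crim),
            indus_values   = n_distinct(indus),
            ptratio_values = n_distinct(ptratio),
            black_values   = n_distinct(black))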

median(Boston$ptratio)
## [1] 19.05


(g) Lowest medv Suburbs

Q: Which suburb of Boston has the lowest median value of owner-occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.

A:

It appears that there are two suburbs tied for the lowest median value of owner-occupied homes (medv = 5).

Boston[Boston$medv == min(Boston$medv), ]
##        crim zn indus chas   nox    rm age    dis rad tax ptratio  black lstat
## 399 38.3518  0  18.1    0 0.693 5.453 100 1.4896  24 666    20.2 396.90 30.59
## 406 67.9208  0  18.1    0 0.693 5.683 100 1.4254  24 666    20.2 384.97 22.98
##     medv
## 399    5
## 406    5

I create a data frame, Boston_percentiles, containing the percentile rank of every observation for each variable, then filter it to these two suburbs to see where they sit within each variable’s range.

# Percentile rank of each observation for every numeric variable
# (column 4, the categorical chas, is excluded)
Boston_percentiles <- sapply(Boston[, -4], function(x) rank(x) / length(x)) %>%
  as.data.frame()

Boston_percentiles[c(399, 406),]
##          crim        zn     indus       nox        rm      age        dis
## 399 0.9881423 0.3685771 0.7579051 0.8448617 0.0770751 0.958498 0.05731225
## 406 0.9960474 0.3685771 0.7579051 0.8448617 0.1363636 0.958498 0.04150198
##           rad       tax   ptratio     black     lstat        medv
## 399 0.8705534 0.8606719 0.7519763 0.8814229 0.9782609 0.002964427
## 406 0.8705534 0.8606719 0.7519763 0.3498024 0.8992095 0.002964427

We can see that both observations with the lowest medv take very similar values across the other variables, and for many of them the values are quite extreme.

  • >= 90th percentile for: crim, age, lstat
  • >= 75th percentile for: indus, nox, rad, tax, ptratio
  • Both suburbs are chas = 0: they are not near the Charles river
  • Lower percentiles for: zn, rm, dis
  • black is the only variable where the two suburbs differ significantly, sitting at the 88th and 35th percentiles respectively


(h) Many Rooms Per Dwelling: rm > 8

Q: In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.

A:

More than seven rooms per dwelling:

sum(Boston$rm > 7)
## [1] 64


More than eight rooms per dwelling:

Boston_gt_8rooms <- Boston[Boston$rm > 8, ]

nrow(Boston_gt_8rooms)
## [1] 13

2 of the 13 suburbs bound the Charles river:

prop.table(table(Boston_gt_8rooms$chas))
## 
##         0         1 
## 0.8461538 0.1538462
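In raw counts rather than proportions, the same table gives the 2-out-of-13 figure directly:

# Counts of suburbs (rm > 8) by Charles river indicator
table(Boston_gt_8rooms$chas)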

We can see the percentile ranks for these 13 suburbs:

Boston_gt_8rooms_perc <- Boston_percentiles[as.numeric(rownames(Boston_gt_8rooms)), ]

glimpse(Boston_gt_8rooms_perc)
## Rows: 13
## Columns: 13
## $ crim    <dbl> 0.34387352, 0.69367589, 0.03557312, 0.52766798, 0.58893281,...
## $ zn      <dbl> 0.3685771, 0.3685771, 0.9950593, 0.3685771, 0.3685771, 0.36...
## $ indus   <dbl> 0.09683794, 0.91798419, 0.08992095, 0.34090909, 0.34090909,...
## $ nox     <dbl> 0.21541502, 0.68873518, 0.08893281, 0.38833992, 0.38833992,...
## $ rm      <dbl> 0.9802372, 0.9920949, 0.9762846, 0.9861660, 0.9980237, 0.97...
## $ age     <dbl> 0.48418972, 0.74604743, 0.14130435, 0.50988142, 0.55830040,...
## $ dis     <dbl> 0.5454545, 0.2766798, 0.7480237, 0.4634387, 0.4634387, 0.50...
## $ rad     <dbl> 0.06422925, 0.49407115, 0.27173913, 0.71640316, 0.71640316,...
## $ tax     <dbl> 0.21541502, 0.63932806, 0.07806324, 0.41600791, 0.41600791,...
## $ ptratio <dbl> 0.37154150, 0.06818182, 0.06818182, 0.26778656, 0.26778656,...
## $ black   <dbl> 0.8814229, 0.4100791, 0.4624506, 0.3537549, 0.3201581, 0.39...
## $ lstat   <dbl> 0.075098814, 0.035573123, 0.011857708, 0.071146245, 0.09881...
## $ medv    <dbl> 0.9367589, 0.9851779, 0.9851779, 0.9565217, 0.9851779, 0.93...

Taking the mean of each column to simplify:

sapply(Boston_gt_8rooms_perc, mean)
##       crim         zn      indus        nox         rm        age        dis 
## 0.53815749 0.54651870 0.33612040 0.47514442 0.98814229 0.48008513 0.47202797 
##        rad        tax    ptratio      black      lstat       medv 
## 0.57236242 0.37473396 0.24407115 0.43987534 0.09007297 0.93227425

It appears that these suburbs with more than eight rooms per dwelling don’t take particularly extreme values; most variables fall somewhere within the middle 50% of their ranges.

The exceptions are lstat, which on average sits below the 10th percentile, and medv, which on average sits above the 90th percentile.

We could infer from this that these suburbs are more affluent areas.
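As a rough sanity check on that inference, we could compare these 13 suburbs against the rest of the dataset on a few of these variables, for example (the gt_8_rooms column is just an illustrative name):

# Compare the suburbs averaging more than 8 rooms per dwelling with the rest
Boston %>%
  mutate(gt_8_rooms = rm > 8) %>%
  group_by(gt_8_rooms) %>%
  summarise(mean_medv  = mean(medv),
            mean_lstat = mean(lstat),
            mean_crim  = mean(crim))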