NOTE: There are no official solutions for these questions. These are my solutions and could be incorrect. If you spot any mistakes/inconsistencies, please contact me on Liam95morgan@gmail.com, or via LinkedIn.
Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani
For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.
Q: The sample size \(n\) is extremely large, and the number of predictors \(p\) is small.
A:
Better - the large number of observations (and low dimensionality) mean that a flexible method could capture more complex relationships without a high chance of overfitting
Q: The number of predictors \(p\) is extremely large, and the number of observations \(n\) is small.
A:
Worse - in high-dimensional, low-sample-size datasets, flexible methods are at a higher risk of finding spurious relationships with the response (i.e. overfitting) and will likely underperform.
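A tiny simulation sketch of this point (my own, not from the book): with \(p\) close to \(n\) and predictors that are pure noise, even ordinary least squares - standing in here for a relatively flexible fit - appears to find a strong relationship in the training data:
set.seed(3)
n <- 30
p <- 25
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)                  # response generated independently of X (no real signal)
summary(lm(y ~ X))$r.squared   # training R^2 is very high, but entirely spurious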
Q: The relationship between the predictors and response is highly non-linear.
A:
Better - a flexible method would be able to capture these non-linear relationships with the response.
Q: The variance of the error terms, i.e. \(\sigma^{2} = Var(\epsilon)\), is extremely high.
A:
Whenever we build a predictive model, we assume there exists some relationship between the response \(Y\) and the predictors \(X = (X_{1}, X_{2}, ..., X_{p})\), which can be written in the form:
\[Y = f(X) + \epsilon\] Where \(f\) is some fixed function of \(X\), and \(\epsilon\) is a random error term (independent of \(X\)), with mean zero. That is, the value we expect \(\epsilon\) to take, on average, is zero: \[\mathbb{E}[\epsilon] = 0\]
Looking at the expected squared prediction error, \(\mathbb{E}[(Y - \hat{Y})^2]\), and treating \(\hat{f}(X)\) as fixed, we can show that it decomposes as:
\[ \begin{align*} \mathbb{E}[(Y - \hat{Y})^2] &= \mathbb{E}[(f(X) + \epsilon - \hat{f}(X))^2] \\ &= (f(X) - \hat{f}(X))^2 + 2(f(X) - \hat{f}(X))\mathbb{E}[\epsilon] + \mathbb{E}[\epsilon^2] \\ &= (f(X) - \hat{f}(X))^2 + Var(\epsilon) \end{align*} \]
Where \([f(X) - \hat{f}(X)]^2\) is called the reducible error, and \(Var(\epsilon)\) is the irreducible error.
In the case of the irreducible error being very high, we would expect a flexible method to perform worse, as it is more likely to fit the noise \(\epsilon\) itself (i.e. to overfit) in a way that a less flexible method would not.
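A small simulation sketch of this (my own, with an assumed linear \(f\) and a large error standard deviation): a very flexible fit (a degree-15 polynomial) chases the noise and usually ends up with a higher test error than a rigid linear fit:
set.seed(1)
n <- 100
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 5)            # linear f, very noisy epsilon
x_new <- runif(n, 0, 10)
y_new <- 2 + 0.5 * x_new + rnorm(n, sd = 5)    # fresh test data

rigid <- lm(y ~ x)                             # inflexible fit
flexible <- lm(y ~ poly(x, 15))                # flexible fit (degree-15 polynomial)

mean((y_new - predict(rigid, data.frame(x = x_new)))^2)     # test MSE, rigid
mean((y_new - predict(flexible, data.frame(x = x_new)))^2)  # test MSE, flexible (usually higher here)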
Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide \(n\) and \(p\).
Q: We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.
A:
Regression and inference - CEO salary is a quantitative response, and the goal is to understand which factors affect it rather than to predict it for new firms. \(n = 500\) (firms), \(p = 3\) (profit, number of employees, industry).
Q: We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.
A:
Classification and prediction - success/failure is a qualitative response, and the goal is to predict the outcome for the new product. \(n = 20\) (similar products), \(p = 13\) (price charged, marketing budget, competition price, plus the ten other variables).
Q: We are interested in predicting the % change in the US Dollar in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.
A:
Regression and prediction - the % change in the USD/Euro rate is a quantitative response, and the goal is to forecast it. \(n = 52\) (weeks in 2012), \(p = 3\) (% change in the US, British and German markets).
We now revisit the bias-variance decomposition.
Q: Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Make sure to label each one.
A:
Q: Explain why each of the five curves has the shape displayed in part (a).
A:
Bias is the error introduced by approximating a potentially complex real-life problem with a much simpler model (e.g. linear regression for non-linear relationships). As model flexibility increases (linear -> trees -> boosting, decreasing K in KNN, etc.), bias decreases monotonically, because fewer assumptions are being made about the data structure and its relationship with the response.
Variance refers to the amount by which our predictions would change if the training data were changed, and can be thought of as the error introduced when a model is overfit to the training data. As model flexibility increases, variance increases monotonically, because the method becomes more specified (and then overspecified) to the nuances of the training data, to the point where \(\hat{f}\) doesn’t generalize to new data.
Training Error decreases monotonically as flexibility increases. More flexible methods are generally higher variance, and can learn more complex relationships more completely, but also run the risk of overfitting, which is seen where the training error and test error diverge. Think of a decision tree, where the number of terminal nodes = the number of training observations (this model will have 0 training error and a high test error).
Test Error decreases, levels out, then increases. The minimum is the point of optimal bias-variance tradeoff, where \(\mathbb{E}[(Y - \hat{Y})^2] = [Bias(\hat{f}(X))]^2 + Var(\hat{f}(X)) + Var(\epsilon)\) is minimized. To the right of this minimum, the method is overfitting (\(\hat{f}\) is too high variance to make up for its lack of bias), and to the left the method is underfitting (\(\hat{f}\) is too high bias to make up for its lack of variance).
Irreducible Error refers to the error introduced by inherent uncertainty/noise in the system being approximated. It is constant and > 0 regardless of the flexibility of the model, because \(\epsilon\) may contain unmeasured variables not in \(X\) that could be used to predict \(y\), and because \(\epsilon\) may contain unmeasurable variation in \(y\) that could not be accounted for in \(X\) even if we wanted to. This means that it doesn’t matter how closely \(\hat{f}\) models the ‘true’ function \(f\), there will still be an (unknown) minimum error of \(Var(\epsilon) > 0\).
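The following simulation sketch (my own illustration, with an assumed non-linear true \(f\) and \(\sigma = 0.5\)) reproduces the typical shapes of the training error, test error and irreducible error curves, using polynomial degree as the measure of flexibility:
set.seed(2)
n <- 200
f <- function(x) sin(2 * x) + 0.5 * x                  # a non-linear 'true' f
x <- runif(n, -3, 3)
y <- f(x) + rnorm(n, sd = 0.5)
x_test <- runif(n, -3, 3)
y_test <- f(x_test) + rnorm(n, sd = 0.5)

degrees <- 1:15                                        # polynomial degree = flexibility
errors <- sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d))
  c(train = mean((y - fitted(fit))^2),
    test = mean((y_test - predict(fit, data.frame(x = x_test)))^2))
})

plot(degrees, errors["train", ], type = "b", col = "blue",
     ylim = range(errors), xlab = "Flexibility (polynomial degree)", ylab = "MSE")
lines(degrees, errors["test", ], type = "b", col = "red")
abline(h = 0.5^2, lty = 2)                             # irreducible error, Var(epsilon)
legend("topright", legend = c("Training MSE", "Test MSE", "Irreducible error"),
       col = c("blue", "red", "black"), lty = c(1, 1, 2))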
You will now think of some real-life applications for statistical learning.
Q: Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.
A:
Fraud detection - classifying a transaction as fraudulent (1) or not fraudulent (0).
Medical diagnosis - classifying a patient as having (1) or not having (0) a particular disease.
A third binary example, with the response coded as the outcome occurring (1) or not (0).
Q: Describe three real-life applications in which regression might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.
A:
Q: Describe three real-life applications in which cluster analysis might be useful.
A:
Q: What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification?
A:
Flexible method advantages:
They can capture more complex, non-linear relationships between the predictors and the response, so they generally achieve lower bias.
They make fewer assumptions about the functional form of \(f\).
Flexible method disadvantages:
They require more observations to fit well, and are at greater risk of overfitting (higher variance).
They are generally harder to interpret, so they are less useful for inference.
Q: Under what circumstances might a more flexible approach be preferred to a less flexible approach?
A:
A more flexible approach is preferred when \(n\) is large relative to \(p\), when the true relationship between the predictors and the response is highly non-linear, and when accurate prediction (rather than interpretable inference) is the goal.
Q: When might a less flexible approach be preferred?
A:
A less flexible approach is preferred when \(n\) is small relative to \(p\), when the irreducible error is high, when the true relationship is approximately linear, or when interpretability and inference are the priority.
Q: Describe the differences between a parametric and a non-parametric statistical learning approach.
A:
A parametric approach involves making an assumption about the functional form/shape that \(f\) takes (e.g. that \(f\) is linear in \(X\): \(f(X) = \beta_{0} + \sum_{i = 1}^{p}\beta_{i}X_{i} + \epsilon\)), then simply applying a fitting procedure (e.g. OLS, minimizing \(\sum_{i = 1}^{n}(y_{i} - \hat{y}_{i})^2\)), fitting the model by estimating the parameters.
Conversely, non-parametric approaches do not make the same assumptions about the functional form of \(f\), giving them the potential to fit a wider range of possible ‘shapes’ with \(\hat{f}\).
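As a small illustrative sketch (my own, not from the book), using the Auto data that appears later in this document: lm() commits to a linear functional form for \(f\), while loess() (local regression) estimates \(f\) from the data without assuming a global shape:
library(ISLR)

param_fit <- lm(mpg ~ horsepower, data = Auto)         # parametric: f assumed linear in horsepower
nonparam_fit <- loess(mpg ~ horsepower, data = Auto)   # non-parametric: no assumed functional form

new_cars <- data.frame(horsepower = c(75, 150))
predict(param_fit, new_cars)     # predictions from the fitted line
predict(nonparam_fit, new_cars)  # predictions from the local fit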
Q: What are the advantages of a parametric approach to regression or classification (as opposed to a nonparametric approach)? What are its disadvantages?
A:
Parametric method advantages:
Estimating a small set of parameters is much easier than fitting an entirely arbitrary \(f\), so fewer observations are needed.
The resulting models are usually simpler to interpret, which makes them well-suited to inference.
Parametric method disadvantages:
The assumed functional form will rarely match the true \(f\) exactly, so a poor assumption leads to a badly biased (underfitting) model.
The table below provides a training data set containing six observations, three predictors, and one qualitative response variable. Suppose we wish to use this data set to make a prediction for \(Y\) when \(X_{1} = X_{2} = X_{3} = 0\) using K-nearest neighbors.
data <- data.frame(X1 = c(0, 2, 0, 0, -1, 1),
X2 = c(3, 0, 1, 1, 0, 1),
X3 = c(0, 0, 3, 2, 1, 1),
Y = c('Red', 'Red', 'Red', 'Green', 'Green', 'Red'),
stringsAsFactors = F)
colnames(data) <- c('$X_{1}$', '$X_{2}$', '$X_{3}$', '$Y$')
library(knitr)
library(kableExtra)
kable(data, row.names = T) %>%
kable_styling(bootstrap_options = "striped", full_width = F)
\(X_{1}\) | \(X_{2}\) | \(X_{3}\) | \(Y\) | |
---|---|---|---|---|
1 | 0 | 3 | 0 | Red |
2 | 2 | 0 | 0 | Red |
3 | 0 | 1 | 3 | Red |
4 | 0 | 1 | 2 | Green |
5 | -1 | 0 | 1 | Green |
6 | 1 | 1 | 1 | Red |
Q: Compute the Euclidean distance between each observation and the test point, \(Z = (X_{1}, X_{2}, X_{3}) = (0, 0, 0)\).
A:
\(d(a, b) = d(b, a) = \sqrt{\sum_{i = 1}^{p}(a_{i} - b_{i})^2}\)
Z <- c(0, 0, 0)
data$dist_X_Z <- sqrt((data$`$X_{1}$` - Z[1])^2 +
(data$`$X_{2}$` - Z[2])^2 +
(data$`$X_{3}$` - Z[3])^2) %>%
round(4)
colnames(data)[5] <- '$d(X, Z)$'
kable(data, row.names = T, escape = F) %>%
kable_styling(bootstrap_options = "striped", full_width = F) %>%
column_spec(6, background = "lightgreen")
\(X_{1}\) | \(X_{2}\) | \(X_{3}\) | \(Y\) | \(d(X, Z)\) | |
---|---|---|---|---|---|
1 | 0 | 3 | 0 | Red | 3.0000 |
2 | 2 | 0 | 0 | Red | 2.0000 |
3 | 0 | 1 | 3 | Red | 3.1623 |
4 | 0 | 1 | 2 | Green | 2.2361 |
5 | -1 | 0 | 1 | Green | 1.4142 |
6 | 1 | 1 | 1 | Red | 1.7321 |
Q: What is our prediction with \(K\) = 1? Why?
A:
Green, since the single nearest neighbor to \(Z\) is observation 5.
data[5, c(4, 5)] <- cell_spec(data[5, c(4, 5)], bold = T, background = "lightgreen")
kable(data, row.names = T, escape = F) %>%
kable_styling(bootstrap_options = "striped", full_width = F)
\(X_{1}\) | \(X_{2}\) | \(X_{3}\) | \(Y\) | \(d(X, Z)\) | |
---|---|---|---|---|---|
1 | 0 | 3 | 0 | Red | 3 |
2 | 2 | 0 | 0 | Red | 2 |
3 | 0 | 1 | 3 | Red | 3.1623 |
4 | 0 | 1 | 2 | Green | 2.2361 |
5 | -1 | 0 | 1 | Green | 1.4142 |
6 | 1 | 1 | 1 | Red | 1.7321 |
Q: What is our prediction with \(K\) = 3? Why?
A:
Red, since the three nearest neighbors to \(Z\) are observations 5, 6 and 2. Two of these have \(Y\) = “Red”, so the prediction for \(Z\) would be “Red”.
data[2, c(4, 5)] <- cell_spec(data[2, c(4, 5)], bold = T, background = "lightgreen")
data[6, c(4, 5)] <- cell_spec(data[6, c(4, 5)], bold = T, background = "lightgreen")
kable(data, row.names = T, escape = F) %>%
kable_styling(bootstrap_options = "striped", full_width = F)
\(X_{1}\) | \(X_{2}\) | \(X_{3}\) | \(Y\) | \(d(X, Z)\) | |
---|---|---|---|---|---|
1 | 0 | 3 | 0 | Red | 3 |
2 | 2 | 0 | 0 | Red | 2 |
3 | 0 | 1 | 3 | Red | 3.1623 |
4 | 0 | 1 | 2 | Green | 2.2361 |
5 | -1 | 0 | 1 | Green | 1.4142 |
6 | 1 | 1 | 1 | Red | 1.7321 |
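As a cross-check (my addition, not part of the question), the knn() function from the class package should reproduce both of these answers. I rebuild the training data here with plain column names, since the columns of data were renamed for display above:
library(class)

train_X <- data.frame(X1 = c(0, 2, 0, 0, -1, 1),
                      X2 = c(3, 0, 1, 1, 0, 1),
                      X3 = c(0, 0, 3, 2, 1, 1))
train_Y <- factor(c("Red", "Red", "Red", "Green", "Green", "Red"))
test_Z <- data.frame(X1 = 0, X2 = 0, X3 = 0)

knn(train = train_X, test = test_Z, cl = train_Y, k = 1)  # expect "Green"
knn(train = train_X, test = test_Z, cl = train_Y, k = 3)  # expect "Red"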
Q: If the Bayes decision boundary in this problem is highly nonlinear, then would we expect the best value for \(K\) to be large or small? Why?
A:
We would expect a small \(K\) to perform better for a non-linear decision boundary. A small \(K\) makes KNN more 'flexible', so the boundary can be pulled in different directions more easily, whereas a large \(K\) has a 'smoothing' effect and produces a more linear boundary (hence, a worse classifier), because it averages over more points.
College Dataset

NOTE: Because some of this code in the book is a little old, here I will sometimes use code that I believe is more efficient/simple to accomplish the same tasks.
This exercise relates to the College data set, which can be found in the file College.csv. It contains a number of variables for 777 different universities and colleges in the US.
The variables are:
Private: Public/private indicator
Apps: Number of applications received
Accept: Number of applicants accepted
Enroll: Number of new students enrolled
Top10perc: New students from top 10% of high school class
Top25perc: New students from top 25% of high school class
F.Undergrad: Number of full-time undergraduates
P.Undergrad: Number of part-time undergraduates
Outstate: Out-of-state tuition
Room.Board: Room and board costs
Books: Estimated book costs
Personal: Estimated personal spending
PhD: Percent of faculty with Ph.D.'s
Terminal: Percent of faculty with terminal degree
S.F.Ratio: Student/faculty ratio
perc.alumni: Percent of alumni who donate
Expend: Instructional expenditure per student
Grad.Rate: Graduation rate
Q: Use the read.csv()
function to read the data into R. Call the loaded data college
. Make sure that you have the directory set to the correct location for the data.
A:
college <- read.csv("./College.csv", stringsAsFactors = TRUE)
Q: Look at the data. The first column is just the name of each university; set it as the row names and then remove it from the data frame.
A:
library(tidyverse)
rownames(college) <- college$X
college$X <- NULL
glimpse(college)
## Rows: 777
## Columns: 18
## $ Private <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, ...
## $ Apps <int> 1660, 2186, 1428, 417, 193, 587, 353, 1899, 1038, 582, ...
## $ Accept <int> 1232, 1924, 1097, 349, 146, 479, 340, 1720, 839, 498, 1...
## $ Enroll <int> 721, 512, 336, 137, 55, 158, 103, 489, 227, 172, 472, 4...
## $ Top10perc <int> 23, 16, 22, 60, 16, 38, 17, 37, 30, 21, 37, 44, 38, 44,...
## $ Top25perc <int> 52, 29, 50, 89, 44, 62, 45, 68, 63, 44, 75, 77, 64, 73,...
## $ F.Undergrad <int> 2885, 2683, 1036, 510, 249, 678, 416, 1594, 973, 799, 1...
## $ P.Undergrad <int> 537, 1227, 99, 63, 869, 41, 230, 32, 306, 78, 110, 44, ...
## $ Outstate <int> 7440, 12280, 11250, 12960, 7560, 13500, 13290, 13868, 1...
## $ Room.Board <int> 3300, 6450, 3750, 5450, 4120, 3335, 5720, 4826, 4400, 3...
## $ Books <int> 450, 750, 400, 450, 800, 500, 500, 450, 300, 660, 500, ...
## $ Personal <int> 2200, 1500, 1165, 875, 1500, 675, 1500, 850, 500, 1800,...
## $ PhD <int> 70, 29, 53, 92, 76, 67, 90, 89, 79, 40, 82, 73, 60, 79,...
## $ Terminal <int> 78, 30, 66, 97, 72, 73, 93, 100, 84, 41, 88, 91, 84, 87...
## $ S.F.Ratio <dbl> 18.1, 12.2, 12.9, 7.7, 11.9, 9.4, 11.5, 13.7, 11.3, 11....
## $ perc.alumni <int> 12, 16, 30, 37, 2, 11, 26, 37, 23, 15, 31, 41, 21, 32, ...
## $ Expend <int> 7041, 10527, 8735, 19016, 10922, 9727, 8861, 11487, 116...
## $ Grad.Rate <int> 60, 56, 54, 59, 15, 55, 63, 73, 80, 52, 73, 76, 74, 68,...
Q: Use the summary()
function to produce a numerical summary of the variables in the data set.
A:
summary(college)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
Q: Use the pairs()
function to produce a scatterplot matrix of the first ten columns or variables of the data.
A:
pairs(college[ ,1:10])
Outstate vs Private
Q: Use the plot()
function to produce side-by-side boxplots of Outstate
versus Private
.
A:
# plot(college$Private, college$Outstate)
theme_set(theme_light()) # setting a theme for all graphs (looks cleaner)
ggplot(college, aes(x = Private, y = Outstate, fill = Private)) +
geom_boxplot() +
scale_y_continuous(labels = scales::comma_format()) +
theme(legend.position = "none") +
labs(title = "Outstate vs Private - Boxplot")
Elite
Q:
Create a new qualitative variable, called Elite
, by binning the Top10perc
variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.
A:
college$Elite <- factor(ifelse(college$Top10perc <= 50, "No", "Yes"))
# above is a one-liner for the given code below
# Elite2=rep("No",nrow(college))
# Elite2[college$Top10perc >50]="Yes"
# Elite2=as.factor(Elite2)
# college=data.frame(college, Elite2)
#
# identical(college$Elite, college$Elite2)
Q: Use the summary()
function to see how many elite universities there are.
A:
summary(college$Elite)
## No Yes
## 699 78
Q: Now use the plot()
function to produce side-by-side boxplots of Outstate
versus Elite
.
A:
ggplot(college, aes(x = Elite, y = Outstate, fill = Elite)) +
geom_boxplot() +
scale_y_continuous(labels = scales::comma_format()) +
theme(legend.position = "none") +
labs(title = "Outstate vs Elite - Boxplot")
Q: Use the hist()
function to produce some histograms with differing numbers of bins for a few of the quantitative variables.
A:
apps_1 <- ggplot(college, aes(x = Apps)) +
geom_histogram(bins = 30, fill = "deepskyblue3") +
scale_x_continuous(labels = scales::comma_format()) +
labs(title = "Apps - Histogram",
subtitle = "Bins = 30",
y = "Count")
apps_2 <- ggplot(college, aes(x = Apps)) +
geom_histogram(bins = 60, fill = "deepskyblue3") +
scale_x_continuous(labels = scales::comma_format(), limits = c(0, 20000)) +
labs(title = "Apps - Histogram",
subtitle = "Bins = 60, Apps range: 0 - 20,000",
y = "Count")
phd_1 <- ggplot(college, aes(x = PhD)) +
geom_histogram(bins = 15, fill = "#16A085") +
scale_x_continuous(labels = scales::comma_format()) +
labs(title = "PhD - Histogram",
subtitle = "Bins = 15",
y = "Count")
phd_2 <- ggplot(college, aes(x = PhD)) +
geom_histogram(bins = 40, fill = "#16A085") +
scale_x_continuous(labels = scales::comma_format()) +
labs(title = "PhD - Histogram",
subtitle = "Bins = 40",
y = "Count")
room.board_1 <- ggplot(college, aes(x = Room.Board)) +
geom_histogram(bins = 10, fill = "#BB8FCE") +
scale_x_continuous(labels = scales::comma_format()) +
labs(title = "Room.Board - Histogram",
subtitle = "Bins = 10",
y = "Count")
room.board_2 <- ggplot(college, aes(x = Room.Board)) +
geom_histogram(bins = 30, fill = "#BB8FCE") +
scale_x_continuous(labels = scales::comma_format()) +
labs(title = "Room.Board - Histogram",
subtitle = "Bins = 30",
y = "Count")
library(gridExtra)
grid.arrange(apps_1, apps_2,
phd_1, phd_2,
room.board_1, room.board_2,
ncol = 2, nrow = 3)
best_predictor()
Q: Continue exploring the data, and provide a brief summary of what you discover.
A:
I wrote the following best_predictor()
function based on some work I’ve done before, and it is really useful for getting an overview of strong relationships between variables (particularly where plotting every combination of variables isn’t practical).
It will take a dataframe and a response variable \(y\) as the two arguments.
It then fits a linear or logistic regression model (depending on whether \(y\) is numeric or categorical) of the response against each other variable in turn, also trying logarithmic & quadratic transformations of the predictor where it is numeric.
The output is a table of the \(R^2\) or AIC values for the response against every other variable in the dataset, ordered from best to worst.
best_predictor <- function(dataframe, response) {
if (sum(sapply(dataframe, function(x) {is.numeric(x) | is.factor(x)})) < ncol(dataframe)) {
stop("Make sure that all variables are of class numeric/factor!")
}
# pre-allocate vectors
varname <- c()
vartype <- c()
R2 <- c()
R2_log <- c()
R2_quad <- c()
AIC <- c()
AIC_log <- c()
AIC_quad <- c()
y <- dataframe[ ,response]
# # # # # NUMERIC RESPONSE # # # # #
if (is.numeric(y)) {
for (i in 1:ncol(dataframe)) {
x <- dataframe[ ,i]
varname[i] <- names(dataframe)[i]
if (class(x) %in% c("numeric", "integer")) {
vartype[i] <- "numeric"
} else {
vartype[i] <- "categorical"
}
if (!identical(y, x)) {
# linear: y ~ x
R2[i] <- summary(lm(y ~ x))$r.squared
# log-transform: y ~ log(x)
if (is.numeric(x)) {
if (min(x) <= 0) { # if y ~ log(x) for min(x) <= 0, do y ~ log(x + abs(min(x)) + 1)
R2_log[i] <- summary(lm(y ~ log(x + abs(min(x)) + 1)))$r.squared
} else {
R2_log[i] <- summary(lm(y ~ log(x)))$r.squared
}
} else {
R2_log[i] <- NA
}
# quadratic: y ~ x + x^2
if (is.numeric(x)) {
R2_quad[i] <- summary(lm(y ~ x + I(x^2)))$r.squared
} else {
R2_quad[i] <- NA
}
} else {
R2[i] <- NA
R2_log[i] <- NA
R2_quad[i] <- NA
}
}
print(paste("Response variable:", response))
data.frame(varname,
vartype,
R2 = round(R2, 3),
R2_log = round(R2_log, 3),
R2_quad = round(R2_quad, 3)) %>%
mutate(max_R2 = pmax(R2, R2_log, R2_quad, na.rm = T)) %>%
arrange(desc(max_R2))
# # # # # CATEGORICAL RESPONSE # # # # #
} else {
for (i in 1:ncol(dataframe)) {
x <- dataframe[ ,i]
varname[i] <- names(dataframe)[i]
if (class(x) %in% c("numeric", "integer")) {
vartype[i] <- "numeric"
} else {
vartype[i] <- "categorical"
}
if (!identical(y, x)) {
# linear: y ~ x
AIC[i] <- summary(glm(y ~ x, family = "binomial"))$aic
# log-transform: y ~ log(x)
if (is.numeric(x)) {
if (min(x) <= 0) { # if y ~ log(x) for min(x) <= 0, do y ~ log(x + abs(min(x)) + 1)
AIC_log[i] <- summary(glm(y ~ log(x + abs(min(x)) + 1), family = "binomial"))$aic
} else {
AIC_log[i] <- summary(glm(y ~ log(x), family = "binomial"))$aic
}
} else {
AIC_log[i] <- NA
}
# quadratic: y ~ x + x^2
if (is.numeric(x)) {
AIC_quad[i] <- summary(glm(y ~ x + I(x^2), family = "binomial"))$aic
} else {
AIC_quad[i] <- NA
}
} else {
AIC[i] <- NA
AIC_log[i] <- NA
AIC_quad[i] <- NA
}
}
print(paste("Response variable:", response))
data.frame(varname,
vartype,
AIC = round(AIC, 3),
AIC_log = round(AIC_log, 3),
AIC_quad = round(AIC_quad, 3)) %>%
mutate(min_AIC = pmin(AIC, AIC_log, AIC_quad, na.rm = T)) %>%
arrange(min_AIC)
}
}
Applying it to a classification situation, I want to know: what is the best single predictor of Private
?
best_predictor(college, "Private")
## [1] "Response variable: Private"
## varname vartype AIC AIC_log AIC_quad min_AIC
## 1 F.Undergrad numeric 573.904 529.782 543.025 529.782
## 2 Outstate numeric 578.891 587.785 580.690 578.891
## 3 Enroll numeric 636.124 600.760 614.453 600.760
## 4 P.Undergrad numeric 697.830 641.608 682.239 641.608
## 5 Accept numeric 722.970 697.064 712.350 697.064
## 6 S.F.Ratio numeric 723.261 714.265 698.678 698.678
## 7 Apps numeric 761.995 722.126 751.608 722.126
## 8 perc.alumni numeric 750.749 779.767 747.815 747.815
## 9 Room.Board numeric 812.384 812.717 814.367 812.384
## 10 Expend numeric 822.691 813.843 820.804 813.843
## 11 Grad.Rate numeric 822.580 840.692 815.593 815.593
## 12 Personal numeric 844.325 831.692 821.576 821.576
## 13 PhD numeric 894.541 890.556 882.192 882.192
## 14 Top10perc numeric 891.354 883.514 884.677 883.514
## 15 Terminal numeric 901.062 899.590 897.510 897.510
## 16 Top25perc numeric 907.533 906.502 908.150 906.502
## 17 Elite categorical 909.356 NA NA 909.356
## 18 Books numeric 914.487 914.575 912.451 912.451
## 19 Private categorical NA NA NA NA
Let's visualize:
private_predictors <- best_predictor(college, "Private") %>%
select(-c(vartype, min_AIC)) %>%
gather(key = "key", value = "AIC", -varname) %>%
filter(!is.na(AIC))
private_predictors_order <- best_predictor(college, "Private") %>%
select(varname, min_AIC) %>%
filter(!is.na(min_AIC)) %>%
arrange(desc(min_AIC)) %>%
pull(varname)
private_predictors$varname <- factor(private_predictors$varname,
ordered = T,
levels = private_predictors_order)
ggplot(private_predictors, aes(x = AIC, y = varname, col = key, group = varname)) +
geom_line(col = "grey15") +
geom_point(size = 2) +
theme_light() +
theme(legend.position = "bottom") +
labs(title = "Best predictors (& transformations) for Private",
col = "Predictor Transformation",
y = "Predictor") +
scale_color_discrete(labels = c("None", "Log-Transformed", "Order 2 Polynomial"))
Below we can see the distributions of F.Undergrad
for private and public colleges. Private colleges tend to have a significantly smaller number of full-time undergraduates.
ggplot(college, aes(x = Private, y = F.Undergrad, fill = Private)) +
geom_boxplot() +
scale_y_continuous(labels = scales::comma_format()) +
theme(legend.position = "none") +
labs(title = "F.Undergrad vs Private - Boxplot")
I’m also curious as to what the best predictor of Grad.Rate
is.
best_predictor(college, "Grad.Rate")
## [1] "Response variable: Grad.Rate"
## varname vartype R2 R2_log R2_quad max_R2
## 1 Outstate numeric 0.326 0.311 0.327 0.327
## 2 Top10perc numeric 0.245 0.226 0.249 0.249
## 3 perc.alumni numeric 0.241 0.239 0.248 0.248
## 4 Top25perc numeric 0.228 0.213 0.228 0.228
## 5 Expend numeric 0.152 0.180 0.183 0.183
## 6 Room.Board numeric 0.181 0.179 0.182 0.182
## 7 Elite categorical 0.122 NA NA 0.122
## 8 Private categorical 0.113 NA NA 0.113
## 9 PhD numeric 0.093 0.071 0.108 0.108
## 10 Terminal numeric 0.084 0.072 0.098 0.098
## 11 S.F.Ratio numeric 0.094 0.093 0.096 0.096
## 12 Personal numeric 0.073 0.073 0.082 0.082
## 13 P.Undergrad numeric 0.066 0.074 0.080 0.080
## 14 Apps numeric 0.022 0.040 0.025 0.040
## 15 Accept numeric 0.005 0.021 0.006 0.021
## 16 F.Undergrad numeric 0.006 0.000 0.007 0.007
## 17 Enroll numeric 0.000 0.004 0.001 0.004
## 18 Books numeric 0.000 0.000 0.000 0.000
## 19 Grad.Rate numeric NA NA NA NA
Visualizing again:
gradrate_predictors <- best_predictor(college, "Grad.Rate") %>%
select(-c(vartype, max_R2)) %>%
gather(key = "key", value = "R2", -varname) %>%
filter(!is.na(R2))
gradrate_predictors_order <- best_predictor(college, "Grad.Rate") %>%
select(varname, max_R2) %>%
filter(!is.na(max_R2)) %>%
arrange(max_R2) %>%
pull(varname)
gradrate_predictors$varname <- factor(gradrate_predictors$varname,
ordered = T,
levels = gradrate_predictors_order)
ggplot(gradrate_predictors, aes(x = R2, y = varname, col = key, group = varname)) +
geom_line(col = "grey15") +
geom_point(size = 2) +
theme_light() +
theme(legend.position = "bottom") +
labs(title = "Best predictors (& transformations) for Grad.Rate",
col = "Predictor Transformation",
y = "Predictor") +
scale_color_discrete(labels = c("None", "Log-Transformed", "Order 2 Polynomial"))
The strongest relationship is with Outstate
. I visualize this relationship below and add a smoothing line.
ggplot(college, aes(x = Outstate, y = Grad.Rate)) +
geom_point() +
geom_smooth() +
scale_x_continuous(labels = scales::comma_format()) +
scale_y_continuous(breaks = seq(0, 120, 20)) +
labs(title = "Outstate vs Grad.Rate - Scatterplot")
This suggests that the relationship is best described as linear - the marginal increase in \(R^2\) for the second-order transformation is not meaningful (plain \(R^2\) can never decrease as variables are added, so comparing adjusted \(R^2\) values would avoid this problem).
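As a quick check of this point (my addition), the adjusted \(R^2\) values for the linear and quadratic fits can be compared directly; they should be very close, supporting the linear description:
summary(lm(Grad.Rate ~ Outstate, data = college))$adj.r.squared                   # linear fit
summary(lm(Grad.Rate ~ Outstate + I(Outstate^2), data = college))$adj.r.squared   # quadratic fit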
We can speculate that the true relationship might be logarithmic or quadratic over a larger range of Outstate
though, as it wouldn’t make sense for our regression line to predict Grad.Rate
as 120% for a sufficiently high value of Outstate
, which is what will currently happen (although this is not a problem for this range of data).
On that topic, we can see from the scatter plot that one college is reporting a graduation rate of over 100% - perhaps there are other errors of this sort?
The variables in a percentage format are:
PhD
perc.alumni
Grad.Rate
Checking for values outside a \([0, 100]\) range:
sum(college$PhD > 100 | college$PhD < 0)
## [1] 1
sum(college$perc.alumni > 100 | college$perc.alumni < 0)
## [1] 0
sum(college$Grad.Rate > 100 | college$Grad.Rate < 0)
## [1] 1
college %>%
rownames_to_column("college") %>%
filter(PhD > 100 | PhD < 0 | Grad.Rate > 100 | Grad.Rate < 0) %>%
select(college, PhD, Grad.Rate)
## college PhD Grad.Rate
## 1 Cazenovia College 22 118
## 2 Texas A&M University at Galveston 103 43
We can see that ‘Texas A&M University at Galveston’ is reporting that 103% of its faculty have PhD’s, and ‘Cazenovia College’ has a graduation rate of 118% - impressive stuff!
Auto Dataset

This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.
library(ISLR)
glimpse(Auto)
## Rows: 392
## Columns: 9
## $ mpg <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14...
## $ cylinders <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, ...
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383,...
## $ horsepower <dbl> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170,...
## $ weight <dbl> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, ...
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8....
## $ year <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70...
## $ origin <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, ...
## $ name <fct> chevrolet chevelle malibu, buick skylark 320, plymouth...
sum(is.na(Auto)) # 0
## [1] 0
Q: Which of the predictors are quantitative, and which are qualitative?
A:
Typing ?ISLR::Auto
into the console shows the variable definitions:
Quantitative:
mpg - Miles per gallon
cylinders - Number of cylinders between 4 and 8
displacement - Engine displacement (cu. inches)
horsepower - Engine horsepower
weight - Vehicle weight (lbs.)
acceleration - Time to accelerate from 0 to 60 mph (sec.)
year - Model year (modulo 100)
Qualitative:
origin - Origin of car (1. American, 2. European, 3. Japanese)
name - Vehicle name
Q: What is the range of each quantitative predictor? You can answer this using the range()
function.
A:
range_Auto <- data.frame(sapply(Auto[ ,1:7], range))
rownames(range_Auto) <- c("min:", "max:")
range_Auto
## mpg cylinders displacement horsepower weight acceleration year
## min: 9.0 3 68 46 1613 8.0 70
## max: 46.6 8 455 230 5140 24.8 82
Q: What is the mean and standard deviation of each quantitative predictor?
A:
Mean:
sapply(Auto[ ,1:7], mean)
## mpg cylinders displacement horsepower weight acceleration
## 23.445918 5.471939 194.411990 104.469388 2977.584184 15.541327
## year
## 75.979592
Standard Deviation:
sapply(Auto[ ,1:7], sd)
## mpg cylinders displacement horsepower weight acceleration
## 7.805007 1.705783 104.644004 38.491160 849.402560 2.758864
## year
## 3.683737
Q: Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
A:
Auto_2 <- Auto[-c(10:85), ]
Range:
range_Auto_2 <- data.frame(sapply(Auto_2[ ,1:7], range))
rownames(range_Auto_2) <- c("min:", "max:")
range_Auto_2
## mpg cylinders displacement horsepower weight acceleration year
## min: 11.0 3 68 46 1649 8.5 70
## max: 46.6 8 455 230 4997 24.8 82
Mean:
sapply(Auto_2[ ,1:7], mean)
## mpg cylinders displacement horsepower weight acceleration
## 24.404430 5.373418 187.240506 100.721519 2935.971519 15.726899
## year
## 77.145570
Standard Deviation:
sapply(Auto_2[ ,1:7], sd)
## mpg cylinders displacement horsepower weight acceleration
## 7.867283 1.654179 99.678367 35.708853 811.300208 2.693721
## year
## 3.106217
Q: Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
A:
I am curious as to how relationships between the variables might differ across the different levels of origin
. Looking at all relationships between the numeric variables is a good place to start here:
pairs(Auto[ ,1:7])
Many of these pairs clearly show a strong positive/negative relationship, and it’s unlikely that splitting by origin
will give any additional insights. For this reason, I’m actually looking for weaker relationships.
On its own, the relationship between weight
and acceleration
looks quite boring - a weak negative correlation.
Auto$origin <- factor(Auto$origin, labels = c("American", "European", "Japanese"))
ggplot(Auto, aes(x = weight, y = acceleration)) +
geom_point() +
theme(legend.position = "none") +
scale_x_continuous(labels = scales::comma_format()) +
labs(x = "Weight",
y = "Acceleration",
title = "Correlation between weight and acceleration")
However, this relationship varies based on the country of origin of the car. European cars actually show a weak positive correlation - heavier European cars tended to take longer to reach 60 mph. For American and Japanese cars, this relationship was reversed.
ggplot(Auto, aes(x = weight, y = acceleration, col = origin)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(~ origin) +
theme(legend.position = "none") +
scale_x_continuous(labels = scales::comma_format()) +
labs(x = "Weight",
y = "Acceleration",
title = "Correlation between weight and acceleration, by origin")
Note also how the range for weight
is significantly smaller for Japanese and European cars - there were clearly quite large differences in the types of cars each country was developing.
Another good example of this is found by looking at trends over time. The aggregate trend of displacement
over time looks uninteresting:
ggplot(Auto, aes(x = year + 1900, y = displacement)) +
geom_jitter() +
theme(legend.position = "none") +
labs(x = "Year",
y = "Displacement",
title = "Engine Displacement (trends over time)")
This appears to be a weak negative relationship - fitting a linear model predicting displacement
with year
yields an \(R^2\) of 0.1368.
summary(lm(displacement ~ year, data = Auto))
##
## Call:
## lm(formula = displacement ~ year, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -175.73 -75.58 -13.45 69.69 229.28
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 992.691 101.661 9.765 < 2e-16 ***
## year -10.506 1.336 -7.862 3.75e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 97.35 on 390 degrees of freedom
## Multiple R-squared: 0.1368, Adjusted R-squared: 0.1346
## F-statistic: 61.8 on 1 and 390 DF, p-value: 3.748e-14
However, once we begin to look at this trend across different origins, it becomes more interesting:
ggplot(Auto, aes(x = year + 1900, y = displacement, col = factor(origin))) +
geom_jitter() +
geom_smooth(method = "lm") +
theme(legend.position = "none") +
labs(x = "Year",
y = "Displacement",
title = "Engine Displacement (trends over time), by Origin") +
facet_wrap(~ origin)
From a linear model standpoint, we can account for the fact that the average engine displacement was significantly higher for American cars than Japanese/European by adding the origin
variable to the model.
We can further account for the differing trends over time by adding an interaction term between year
and origin
, because the relationship between year
and displacement
is different, depending on the value of origin
!
summary(lm(displacement ~ year * origin, data = Auto))
##
## Call:
## lm(formula = displacement ~ year * origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -217.920 -26.784 -4.066 33.700 174.813
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1258.425 91.656 13.730 < 2e-16 ***
## year -13.373 1.211 -11.042 < 2e-16 ***
## originEuropean -1252.625 208.468 -6.009 4.33e-09 ***
## originJapanese -1215.032 190.072 -6.392 4.71e-10 ***
## year:originEuropean 14.745 2.752 5.357 1.46e-07 ***
## year:originJapanese 14.139 2.466 5.734 1.98e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 69.25 on 386 degrees of freedom
## Multiple R-squared: 0.5677, Adjusted R-squared: 0.5621
## F-statistic: 101.4 on 5 and 386 DF, p-value: < 2.2e-16
(year * origin
is shorthand for year + origin + year:origin
)
This boosts our \(R^2\) significantly, to 0.5677.
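To make this comparison explicit (my addition), an F-test between the two nested models should confirm that the origin terms and the interaction add significant explanatory power beyond year alone:
anova(lm(displacement ~ year, data = Auto),
      lm(displacement ~ year * origin, data = Auto))  # compare nested models with an F-test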
mpg
Q: Suppose that we wish to predict gas mileage (mpg
) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg
? Justify your answer.
A:
Referring back to the scatter matrix, it appears that all quantitative predictors have some relationship with the response.
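To back this up numerically (my addition), the correlations between mpg and the other quantitative variables can be read straight off the correlation matrix:
sort(cor(Auto[ ,1:7])[ ,"mpg"], decreasing = TRUE)  # correlation of each quantitative variable with mpg (including mpg itself)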
name
(the name of the vehicle) is categorical, and has 301 unique values. This isn’t really practical in its current state as a predictor, particularly on such a small dataset.
However, we may be able to extract useful information. For example, the format of the name
variable seems to be that the first word is the brand of the car.
I extract this information, correct some mistakes and collapse the variable down into 10 levels (with the 10th being ‘uncommon’):
Auto$brand <- sapply(strsplit(as.character(Auto$name), split = " "),
function(x) x[1]) # extract the first item from each list element
Auto$brand <- factor(ifelse(Auto$brand %in% c("vokswagen", "vw"), "volkswagen",
ifelse(Auto$brand == "toyouta", "toyota",
ifelse(Auto$brand %in% c("chevroelt", "chevy"), "chevrolet",
ifelse(Auto$brand == "maxda", "mazda",
Auto$brand))))) # fixing typo's
Auto$brand <- fct_lump(Auto$brand,
n = 9,
other_level = "uncommon") # collapse into 10 categories
table(Auto$brand)
##
## amc buick chevrolet datsun dodge ford plymouth
## 27 17 47 23 28 48 31
## toyota volkswagen uncommon
## 26 22 123
Visualizing this new variable:
ggplot(Auto, aes(x = brand, y = mpg, fill = brand)) +
geom_boxplot() +
theme(legend.position = "none") +
labs(title = "Brand vs Mpg - Boxplot",
subtitle = "Engineered feature",
x = "Brand",
y = "MPG")
The variable is now in a state of usability, and looks like it would be useful in predicting mpg
.
origin
also looks useful:
ggplot(Auto, aes(x = origin, y = mpg, fill = origin)) +
geom_boxplot() +
theme(legend.position = "none") +
labs(title = "Origin vs Mpg - Boxplot",
x = "Origin",
y = "MPG")
We can use the function written earlier, best_predictor()
, to assess how these predictors do when compared with the quantitative predictors. I drop name
for reasons mentioned earlier.
best_predictor(select(Auto, -name), "mpg")
## [1] "Response variable: mpg"
## varname vartype R2 R2_log R2_quad max_R2
## 1 weight numeric 0.693 0.713 0.715 0.715
## 2 displacement numeric 0.648 0.686 0.689 0.689
## 3 horsepower numeric 0.606 0.668 0.688 0.688
## 4 cylinders numeric 0.605 0.603 0.607 0.607
## 5 year numeric 0.337 0.332 0.368 0.368
## 6 origin categorical 0.332 NA NA 0.332
## 7 brand categorical 0.260 NA NA 0.260
## 8 acceleration numeric 0.179 0.190 0.194 0.194
## 9 mpg numeric NA NA NA NA
Boston Housing Dataset

Q: How many rows are in this data set? How many columns? What do the rows and columns represent?
A:
library(MASS)
dim(Boston)
## [1] 506 14
# ?Boston
The Boston
housing dataset has 506 rows and 14 columns.
Each row represents a Boston suburb.
The columns represent:
crim - Per capita crime rate by town
zn - Proportion of residential land zoned for lots over 25,000 sq.ft.
indus - Proportion of non-retail business acres per town
chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox - Nitrogen oxides concentration (parts per 10 million)
rm - Average number of rooms per dwelling
age - Proportion of owner-occupied units built prior to 1940
dis - Weighted mean of distances to five Boston employment centres
rad - Index of accessibility to radial highways
tax - Full-value property-tax rate per $10,000
ptratio - Pupil-teacher ratio by town
black - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
lstat - Lower status of the population (percent)
medv - Median value of owner-occupied homes in $1000s
Q: Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.
A:
pairs(Boston)
There’s a lot going on here, and it’s hard to identify many relationships. I’ll loop through my best_predictor()
function again for every variable in the dataframe, identifying what the best predictor for it is (including transformations).
Based on the data definitions, I convert chas
to a factor format.
Boston$chas <- factor(Boston$chas)
Below we can see the best predictor for every quantitative variable:
predictor_df <- data.frame(response = character(),
varname = character(),
R2 = numeric(),
R2_log = numeric(),
R2_quad = numeric(),
max_R2 = numeric(),
stringsAsFactors = F)
for (i in setdiff(1:ncol(Boston), 4)) { # excluding 'chas' as it uses different eval metric
response <- names(Boston)[i]
predictor_info <- best_predictor(Boston, names(Boston)[i])[1, -2]
predictor_df <- rbind(predictor_df, cbind(response, predictor_info))
}
predictor_df %>%
arrange(desc(max_R2)) %>%
rename(best_predictor = varname)
## response best_predictor R2 R2_log R2_quad max_R2
## 1 rad tax 0.829 0.746 0.890 0.890
## 2 tax rad 0.829 0.723 0.832 0.832
## 3 dis nox 0.592 0.653 0.736 0.736
## 4 nox dis 0.592 0.692 0.700 0.700
## 5 lstat medv 0.544 0.648 0.679 0.679
## 6 medv lstat 0.544 0.665 0.641 0.665
## 7 age nox 0.535 0.586 0.653 0.653
## 8 indus nox 0.583 0.609 0.628 0.628
## 9 rm medv 0.484 0.399 0.488 0.488
## 10 zn dis 0.441 0.349 0.471 0.471
## 11 crim rad 0.391 0.323 0.399 0.399
## 12 ptratio medv 0.258 0.252 0.268 0.268
## 13 black crim 0.148 0.229 0.234 0.234
And the best predictor for the single qualitative variable, chas
:
best_predictor(Boston, "chas")[1, ]
## [1] "Response variable: chas"
## varname vartype AIC AIC_log AIC_quad min_AIC
## 1 medv numeric 245.312 245.016 247.312 245.016
The strongest relationship identified among the quantitative variables is a quadratic relationship between rad
and tax
. I plot the fitted line on the graph below, and looking at this shows that the strong \(R^2\) is deceptive.
g1 <- ggplot(Boston, aes(x = tax, y = rad)) +
geom_jitter(alpha = 0.1) +
geom_smooth(method = "lm", formula = "y ~ x + I(x^2)") +
scale_x_continuous(labels = scales::comma_format()) +
labs(title = "Tax vs Rad - Jitter Plot",
x = "Tax",
y = "Rad")
g2 <- Boston %>%
filter(rad < 20) %>%
ggplot(aes(x = tax, y = rad)) +
geom_jitter(alpha = 0.1) +
geom_smooth(method = "lm", formula = "y ~ x + I(x^2)") +
scale_x_continuous(labels = scales::comma_format()) +
labs(title = "Tax vs Rad - Jitter Plot",
subtitle = "Filtered for rad < 20 (132/506 obs. removed)",
x = "Tax",
y = "Rad")
grid.arrange(g1, g2, ncol = 2)
Note that jitter & reduced alpha have been applied to the plots, due to overlapping observations.
In actuality, the only strong part of the relationship between rad
and tax
is: “If tax
= 666 then rad
= 24”.
Fitting a regression line of the form rad ~ tax + tax^2
on the remainder of the data, as done in the right graph, demonstrates this:
Boston_2 <- Boston %>%
filter(rad < 20)
summary(lm(rad ~ tax + I(tax^2), data = Boston_2))
##
## Call:
## lm(formula = rad ~ tax + I(tax^2), data = Boston_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0642 -0.7413 -0.0235 0.7961 3.6424
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.778e-02 8.026e-01 -0.122 0.903
## tax 2.280e-02 4.284e-03 5.323 1.78e-07 ***
## I(tax^2) -2.504e-05 5.484e-06 -4.567 6.75e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.565 on 371 degrees of freedom
## Multiple R-squared: 0.08678, Adjusted R-squared: 0.08185
## F-statistic: 17.63 on 2 and 371 DF, p-value: 4.863e-08
In essence, the apparent relationship is driven almost entirely by the rad = 24 / tax = 666 cluster; once those observations are removed (keeping rad < 20), the \(R^2\) drops to 0.087!
Looking at the second strongest pair of variables identified shows a ‘true’ relationship of the form dis ~ nox + nox^2
:
ggplot(Boston, aes(x = nox, y = dis)) +
geom_point(alpha = 0.2) +
geom_smooth(method = "lm", formula = "y ~ x + I(x^2)") +
labs(title = "Nox vs Dis - Scatter Plot",
subtitle = "Quadratic fit",
x = "Nox",
y = "Dis")
crim
Q: Are any of the predictors associated with per capita crime rate? If so, explain the relationship.
A:
I can call my best_predictor()
function again here to give us a better idea of where to look.
best_predictor(Boston, "crim")
## [1] "Response variable: crim"
## varname vartype R2 R2_log R2_quad max_R2
## 1 rad numeric 0.391 0.323 0.399 0.399
## 2 tax numeric 0.340 0.304 0.367 0.367
## 3 medv numeric 0.151 0.279 0.358 0.358
## 4 dis numeric 0.144 0.216 0.229 0.229
## 5 lstat numeric 0.208 0.156 0.214 0.214
## 6 nox numeric 0.177 0.185 0.199 0.199
## 7 indus numeric 0.165 0.145 0.181 0.181
## 8 age numeric 0.124 0.084 0.162 0.162
## 9 black numeric 0.148 0.125 0.149 0.149
## 10 ptratio numeric 0.084 0.079 0.100 0.100
## 11 rm numeric 0.048 0.055 0.067 0.067
## 12 zn numeric 0.040 0.057 0.056 0.057
## 13 chas categorical 0.003 NA NA 0.003
## 14 crim numeric NA NA NA NA
rad
and tax
are supposedly the best predictors again, but I have already covered the quirks of these variables in some detail.
The next best predictor is medv
, which best_predictor()
suggests should have a non-linear relationship with crim
.
Note the large number of towns with a crim
value of close to zero.
ggplot(Boston, aes(x = medv, y = crim)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", formula = "y ~ x + I(x^2)", col = "green") +
labs(title = "Medv vs Crim - Scatter Plot",
x = "Medv",
y = "Crim")
Those with a smaller value of medv
(where the median value of owner-occupied homes is lower) tend to have a higher value of crim
(higher crime rates). After a certain point, an increase in medv
seems to have little impact on the crime rate.
A very similar non-linear relationship can be seen between crim
and dis
(weighted mean distance to employment centres).
ggplot(Boston, aes(x = dis, y = crim)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", formula = "y ~ x + I(x^2)", col = "green") +
labs(title = "Dis vs Crim - Scatter Plot",
x = "Dis",
y = "Crim")
The relationship between crim
and lstat
(percent of the population that are lower status) looks like it could be linear, although a quadratic relationship yields a slightly higher \(R^2\).
ggplot(Boston, aes(x = lstat, y = crim)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", formula = "y ~ x + I(x^2)", col = "green") +
labs(title = "Lstat vs Crim - Scatter Plot",
x = "Lstat",
y = "Crim")
We are starting to build up a picture of the towns with the higher crime rates (lower-value housing, closer to employment centres, a higher proportion of lower-status residents).
Q: Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.
A:
Crime rates:
We have already seen that most suburbs have a crime rate at or close to zero, but some suburbs have incredibly high crime rates:
ggplot(Boston, aes(x = "", y = crim)) +
geom_boxplot() +
scale_y_continuous(breaks = seq(0, 100, 10)) +
labs(y = "Crim",
title = "Crim - Boxplot") +
coord_flip() +
theme(axis.title.y = element_blank(),
axis.ticks.y = element_blank())
It’s interesting to see how much variation there is in the crime rates within Boston, but I’m confused at the huge range:
range(Boston$crim)
## [1] 0.00632 88.97620
The median value of crim
is 0.25651, but there are many suburbs with values far higher than this.
“Per capita crime rate by town” would suggest to me that a value of 88.98 either means there are 88.98% as many crimes as there are people over some period, or that there were 88.98 crimes per person over some period (using either of these definitions makes it seem exceptionally high).
Tax rates:
tax
is weird, as seen already.
range(Boston$tax)
## [1] 187 711
The range may not be over many orders of magnitude, but there is still the case of a huge number of suburbs taking the same tax
values far from the others, presumably because they all fall under the same tax rate.
ggplot(Boston, aes(x = tax)) +
geom_histogram(binwidth = 10, fill = "mediumseagreen") +
scale_y_continuous(breaks = seq(0, 200, 20)) +
labs(title = "Tax - Histogram",
x = "Tax",
y = "Count")
Pupil-teacher Ratios:
ptratio
doesn’t seem to have any huge outliers or gaps and is over a smaller range:
range(Boston$ptratio)
## [1] 12.6 22.0
ggplot(Boston, aes(x = ptratio)) +
geom_histogram(binwidth = 0.25, fill = "deepskyblue") +
scale_x_continuous(breaks = seq(12, 23, 0.5)) +
scale_y_continuous(breaks = seq(0, 200, 20)) +
labs(title = "Ptratio - Histogram",
x = "Ptratio",
y = "Count")
Again, there are many observations with the same value:
Boston %>%
group_by(ptratio) %>%
count() %>%
arrange(desc(n))
## # A tibble: 46 x 2
## # Groups: ptratio [46]
## ptratio n
## <dbl> <int>
## 1 20.2 140
## 2 14.7 34
## 3 21 27
## 4 17.8 23
## 5 19.2 19
## 6 17.4 18
## 7 18.6 17
## 8 19.1 17
## 9 16.6 16
## 10 18.4 16
## # ... with 36 more rows
Checking the data dictionary, I thought this might be because these are aggregate figures (across multiple suburbs, perhaps at a town-level), which would explain the weird cases of many suburbs having the same value across multiple variables.
See below, where we can see that many (but not all) of these 140 suburbs have the same values for ptratio
, rad
, tax
, zn
& indus
!
Boston %>%
group_by(ptratio, rad, tax, zn, indus) %>%
count() %>%
arrange(desc(n))
## # A tibble: 78 x 6
## # Groups: ptratio, rad, tax, zn, indus [78]
## ptratio rad tax zn indus n
## <dbl> <int> <dbl> <dbl> <dbl> <int>
## 1 20.2 24 666 0 18.1 132
## 2 14.7 5 403 0 19.6 30
## 3 21 4 307 0 8.14 22
## 4 17.4 8 307 0 6.2 18
## 5 21.2 4 437 0 21.9 15
## 6 13 5 264 20 3.97 12
## 7 18.4 4 304 0 9.9 12
## 8 18.6 4 277 0 10.6 11
## 9 20.9 5 384 0 8.56 11
## 10 19.1 7 330 22 5.86 10
## # ... with 68 more rows
However, when I add an extra variable black
, which the data dictionary states is at a town level, we see that this varies within the groups of these 5 variables, so it's not even the case that these are the same towns!
Boston %>%
group_by(ptratio, rad, tax, zn, indus, black) %>%
count() %>%
arrange(desc(n))
## # A tibble: 430 x 7
## # Groups: ptratio, rad, tax, zn, indus, black [430]
## ptratio rad tax zn indus black n
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 20.2 24 666 0 18.1 397. 29
## 2 19.2 6 391 0 9.69 397. 6
## 3 19.6 5 287 0 7.38 397. 6
## 4 21.2 4 437 0 21.9 397. 5
## 5 16 4 289 0 13.9 397. 4
## 6 17.9 3 233 0 6.91 397. 4
## 7 20.2 5 224 0 5.19 397. 4
## 8 14.7 5 403 0 19.6 397. 3
## 9 17.6 4 254 40 6.41 397. 3
## 10 17.8 3 193 0 2.46 397. 3
## # ... with 420 more rows
I’ve found this dataset very awkward to work with. It appears that many variables are at a suburb (row) level, and many are at different aggregate levels higher than that (towns & larger). The data dictionary doesn’t tell us which, making drawing insights from it far harder.
Q: How many of the suburbs in this data set bound the Charles river?
A:
table(Boston$chas)
##
## 0 1
## 471 35
There are 35 suburbs that bound the Charles river.
ptratio
Q: What is the median pupil-teacher ratio among the towns in this data set?
A:
Here I’m going to assume that the authors wanted the median per suburb (row), as it appears that there is no way of grouping these suburbs into towns without getting contradictions.
We should get the same number of distinct combinations of crim
, indus
, ptratio
& black
combined as we do from distinct values from just one of these variables, as they are all ‘town-level’ variables, but this isn’t the case.
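As a quick check of this claim (my addition, using dplyr from the tidyverse loaded earlier): ptratio takes 46 distinct values (as in the grouped table above), while the number of distinct combinations of these four supposedly town-level variables is far larger, so they cannot all describe a consistent set of towns:
n_distinct(Boston$ptratio)                           # 46 distinct values
nrow(distinct(Boston, crim, indus, ptratio, black))  # far more distinct combinations than 46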
median(Boston$ptratio)
## [1] 19.05
Lowest medv Suburbs

Q: Which suburb of Boston has the lowest median value of owner-occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.
A:
It appears that there are two suburbs sharing the lowest median value of owner-occupied homes (medv = 5, i.e. $5,000).
Boston[Boston$medv == min(Boston$medv), ]
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 399 38.3518 0 18.1 0 0.693 5.453 100 1.4896 24 666 20.2 396.90 30.59
## 406 67.9208 0 18.1 0 0.693 5.683 100 1.4254 24 666 20.2 384.97 22.98
## medv
## 399 5
## 406 5
I create a data frame Boston_percentiles
, which gives the percentile rank of every observation for each variable, and then filter it to these two observations to see where they sit within each variable's range.
Boston_percentiles <- sapply(Boston[ ,-4], function(x) rank(x)/length(x)) %>%
as.data.frame()
Boston_percentiles[c(399, 406),]
## crim zn indus nox rm age dis
## 399 0.9881423 0.3685771 0.7579051 0.8448617 0.0770751 0.958498 0.05731225
## 406 0.9960474 0.3685771 0.7579051 0.8448617 0.1363636 0.958498 0.04150198
## rad tax ptratio black lstat medv
## 399 0.8705534 0.8606719 0.7519763 0.8814229 0.9782609 0.002964427
## 406 0.8705534 0.8606719 0.7519763 0.3498024 0.8992095 0.002964427
We can see that both observations with the lowest medv take very similar values, and for many variables these values are quite extreme:
crim, age and lstat are all very high (around the 90th percentile or above)
indus, nox, rad, tax and ptratio are also high (roughly the 75th to 87th percentiles)
chas = 0: they are not near the Charles river
zn, rm and dis are all low
black is the only variable that differs significantly, with the two observations at the 88th and 35th percentiles
Suburbs with rm > 8

Q: In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.
A:
More than seven rooms per dwelling:
sum(Boston$rm > 7)
## [1] 64
More than eight rooms per dwelling:
Boston_gt_8rooms <- Boston[Boston$rm > 8, ]
nrow(Boston_gt_8rooms)
## [1] 13
2/13 were located near the Charles river:
prop.table(table(Boston_gt_8rooms$chas))
##
## 0 1
## 0.8461538 0.1538462
We can see the percentile rank for these 13 suburbs:
Boston_gt_8rooms_perc <- Boston_percentiles[as.numeric(rownames(Boston_gt_8rooms)), ]
glimpse(Boston_gt_8rooms_perc)
## Rows: 13
## Columns: 13
## $ crim <dbl> 0.34387352, 0.69367589, 0.03557312, 0.52766798, 0.58893281,...
## $ zn <dbl> 0.3685771, 0.3685771, 0.9950593, 0.3685771, 0.3685771, 0.36...
## $ indus <dbl> 0.09683794, 0.91798419, 0.08992095, 0.34090909, 0.34090909,...
## $ nox <dbl> 0.21541502, 0.68873518, 0.08893281, 0.38833992, 0.38833992,...
## $ rm <dbl> 0.9802372, 0.9920949, 0.9762846, 0.9861660, 0.9980237, 0.97...
## $ age <dbl> 0.48418972, 0.74604743, 0.14130435, 0.50988142, 0.55830040,...
## $ dis <dbl> 0.5454545, 0.2766798, 0.7480237, 0.4634387, 0.4634387, 0.50...
## $ rad <dbl> 0.06422925, 0.49407115, 0.27173913, 0.71640316, 0.71640316,...
## $ tax <dbl> 0.21541502, 0.63932806, 0.07806324, 0.41600791, 0.41600791,...
## $ ptratio <dbl> 0.37154150, 0.06818182, 0.06818182, 0.26778656, 0.26778656,...
## $ black <dbl> 0.8814229, 0.4100791, 0.4624506, 0.3537549, 0.3201581, 0.39...
## $ lstat <dbl> 0.075098814, 0.035573123, 0.011857708, 0.071146245, 0.09881...
## $ medv <dbl> 0.9367589, 0.9851779, 0.9851779, 0.9565217, 0.9851779, 0.93...
Taking the mean of each column to simplify:
sapply(Boston_gt_8rooms_perc, mean)
## crim zn indus nox rm age dis
## 0.53815749 0.54651870 0.33612040 0.47514442 0.98814229 0.48008513 0.47202797
## rad tax ptratio black lstat medv
## 0.57236242 0.37473396 0.24407115 0.43987534 0.09007297 0.93227425
It appears that these suburbs with more than 8 rooms per dwelling don’t take such extreme values, with most variables falling somewhere in the middle 50%.
The exceptions are lstat, which averages around the 9th percentile, and medv, which averages around the 93rd percentile.
We could infer from this that these suburbs are more affluent areas.