#Chapter 2: ex 10
#a. To begin, load in the Boston data set. The Boston data set is part of the ISLR2 library in R.
library(ISLR2)
dim(Boston)
## [1] 506 13
#b. Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.
library(ggplot2)
library(tidyverse)
ggplot(Boston, aes(nox, rm)) +
geom_point()
ggplot(Boston, aes(ptratio, rm)) +
geom_point()
heatmap(cor(Boston, method = "spearman"), cexRow = 1.1, cexCol = 1.1)
#c. Are any of the predictors associated with per capita crime rate? If so, explain the relationship.
#Yes. The Spearman correlation heatmap suggests that per capita crime rate (crim) is positively associated with predictors such as rad, tax, indus, nox, and lstat, and negatively associated with dis and medv: tracts with greater highway access, higher tax rates, and more industry tend to have higher crime rates.
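One quick way to check this against the same Spearman correlations used in the heatmap is to rank each predictor's correlation with crim (a small sketch using base R only):
# Spearman correlation of every variable with crim, strongest positive first
sort(cor(Boston, method = "spearman")[, "crim"], decreasing = TRUE)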
#d. Do any of the census tracts of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.
Boston |>
pivot_longer(cols = 1:13) |>
filter(name %in% c("crim", "tax", "ptratio")) |>
ggplot(aes(value)) +
geom_histogram(bins = 20) +
facet_wrap(~name, scales = "free", ncol = 1)
Yes, particularly crime and tax rates. Crime is heavily right-skewed: most tracts have per capita rates near zero, but a long tail reaches very high values. Tax rates fall into two clusters, with a sizeable group of tracts at a very high rate. Pupil-teacher ratios span a much narrower range, concentrated toward the upper end.
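To put numbers on those ranges, a quick base-R check is:
# Minimum and maximum of each of the three predictors plotted above
sapply(Boston[, c("crim", "tax", "ptratio")], range)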
#e. How many of the census tracts in this data set bound the Charles river?
sum(Boston$chas)
## [1] 35
#f. What is the median pupil-teacher ratio among the towns in this data set?
median(Boston$ptratio)
## [1] 19.05
#Chapter 3: ex 2
#The KNN classifier is used for a categorical response: it assigns the class that is most frequent among the \(K\) nearest neighbors. KNN regression instead predicts a continuous value, the average of the responses of the \(K\) nearest neighbors.
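A minimal sketch of the difference, assuming the class and FNN packages are installed (the simulated data and the choice of k = 5 are arbitrary, for illustration only):
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)                         # two numeric predictors
y_class <- factor(ifelse(x[, 1] + x[, 2] > 0, "A", "B"))  # categorical response
y_num <- x[, 1] + x[, 2] + rnorm(100)                     # continuous response
x_new <- matrix(c(0.5, -0.2), ncol = 2)                   # one new observation

# KNN classification: majority vote among the 5 nearest neighbors
class::knn(train = x, test = x_new, cl = y_class, k = 5)

# KNN regression: average response of the 5 nearest neighbors
FNN::knn.reg(train = x, test = x_new, y = y_num, k = 5)$pred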
#Chapter 3: ex 10
fit <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(fit)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
\[ \textit{Sales} = 13 - 0.054 \times \textit{Price} + \begin{cases} -0.022, & \text{if $\textit{Urban}$ is Yes, $\textit{US}$ is No} \\ 1.20, & \text{if $\textit{Urban}$ is No, $\textit{US}$ is Yes} \\ 1.18, & \text{if both $\textit{Urban}$ and $\textit{US}$ are Yes} \\ 0, & \text{otherwise} \end{cases} \]
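As a sanity check on this equation, predict() reproduces it for a few example store profiles (the Price of 100 used here is arbitrary, chosen only for illustration):
# Hypothetical stores covering each Urban/US combination at a fixed price
new_stores <- data.frame(
  Price = rep(100, 4),
  Urban = c("Yes", "No", "Yes", "No"),
  US    = c("No", "Yes", "Yes", "No")
)
predict(fit, newdata = new_stores)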
fit2 <- lm(Sales ~ Price + US, data = Carseats)
summary(fit2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
anova(fit, fit2)
## Analysis of Variance Table
##
## Model 1: Sales ~ Price + Urban + US
## Model 2: Sales ~ Price + US
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 396 2420.8
## 2 397 2420.9 -1 -0.03979 0.0065 0.9357
confint(fit2)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
par(mfrow = c(2, 2))
plot(fit2, cex = 0.2)
#Chapter 4: ex 12
- What is the log odds of orange versus apple in your model?
The log odds of orange versus apple is just \(\hat\beta_0 + \hat\beta_1x\).
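Written out with the standard logistic form for the probability of orange, the log odds follows directly:
\[ \Pr(Y = \textit{orange} \mid x) = \frac{e^{\hat\beta_0 + \hat\beta_1 x}}{1 + e^{\hat\beta_0 + \hat\beta_1 x}} \quad\Longrightarrow\quad \log\frac{\Pr(Y = \textit{orange} \mid x)}{\Pr(Y = \textit{apple} \mid x)} = \hat\beta_0 + \hat\beta_1 x. \]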
- What is the log odds of orange versus apple in your friend's model?
From (4.14), the log odds of orange versus apple in our friend's model comes from the ratio of the two softmax probabilities; their common normalizing denominator cancels, leaving
\[ \log\frac{\Pr(Y = \textit{orange} \mid x)}{\Pr(Y = \textit{apple} \mid x)} = \log\frac{e^{\hat\alpha_{orange0} + \hat\alpha_{orange1} x}}{e^{\hat\alpha_{apple0} + \hat\alpha_{apple1} x}} = (\hat\alpha_{orange0} - \hat\alpha_{apple0}) + (\hat\alpha_{orange1} - \hat\alpha_{apple1})x \]
- Suppose that in your model, \(\hat\beta_0 = 2\) and \(\hat\beta_1 = -1\). What are the coefficient estimates in your friend's model? Be as specific as possible.
We can say that in our friend's model \(\hat\alpha_{orange0} - \hat\alpha_{apple0} = 2\) and \(\hat\alpha_{orange1} - \hat\alpha_{apple1} = -1\). However, we cannot determine the individual values of the parameters, only these differences.
- Now suppose that you and your friend fit the same two models on a different data set. This time, your friend gets the coefficient estimates \(\hat\alpha_{orange0} = 1.2\), \(\hat\alpha_{orange1} = -2\), \(\hat\alpha_{apple0} = 3\), \(\hat\alpha_{apple1} = 0.6\). What are the coefficient estimates in your model?
The coefficients in our model would be \(\hat\beta_0 = 1.2 - 3 = -1.8\) and \(\hat\beta_1 = -2 - 0.6 = -2.6\).
- Finally, suppose you apply both models from (d) to a data set with 2,000 test observations. What fraction of the time do you expect the predicted class labels from your model to agree with those from your friend’s model? Explain your answer.
The two models are the same model under different parameterizations, so their predicted class labels should agree for all 2,000 test observations.
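A small numeric check of this, plugging the coefficients from part (d) into both parameterizations at a few arbitrary values of x:
# Your logistic model: probability of orange from the log odds -1.8 - 2.6x
x <- seq(-3, 3, length.out = 7)
p_yours <- 1 / (1 + exp(-(-1.8 - 2.6 * x)))

# Friend's softmax model: probability of orange from the two linear scores
score_orange <- 1.2 - 2.0 * x
score_apple  <- 3.0 + 0.6 * x
p_friend <- exp(score_orange) / (exp(score_orange) + exp(score_apple))

# Identical probabilities, hence identical predicted labels
all.equal(p_yours, p_friend)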