#Chapter 2: ex 10
#a. To begin, load in the Boston data set. The Boston data set is part of the ISLR2 library in R.
library(ISLR2)
dim(Boston)
## [1] 506 13
#b. Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.
library(ggplot2)
library(tidyverse)
ggplot(Boston, aes(nox, rm)) +
geom_point()
ggplot(Boston, aes(ptratio, rm)) +
geom_point()
heatmap(cor(Boston, method = "spearman"), cexRow = 1.1, cexCol = 1.1)
#c. Are any of the predictors associated with per capita crime rate? If so, explain the relationship.
#Yes. The Spearman correlation heatmap suggests that per capita crime rate (crim) is positively associated with predictors such as rad, tax, indus, nox, and lstat, and negatively associated with dis and medv: tracts with greater highway access, higher tax rates, and more industry tend to have higher crime rates.
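One quick way to check this against the same Spearman correlations used in the heatmap is to rank each predictor's correlation with crim (a small sketch using base R only):
# Spearman correlation of every variable with crim, strongest positive first
sort(cor(Boston, method = "spearman")[, "crim"], decreasing = TRUE)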
#d. Do any of the census tracts of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.
Boston |>
pivot_longer(cols = 1:13) |>
filter(name %in% c("crim", "tax", "ptratio")) |>
ggplot(aes(value)) +
geom_histogram(bins = 20) +
facet_wrap(~name, scales = "free", ncol = 1)
Yes, particularly crime and tax rates. Crime is heavily right-skewed: most tracts have per capita rates near zero, but a long tail reaches very high values. Tax rates fall into two clusters, with a sizeable group of tracts at a very high rate. Pupil-teacher ratios span a much narrower range, concentrated toward the upper end.
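To put numbers on those ranges, a quick base-R check is:
# Minimum and maximum of each of the three predictors plotted above
sapply(Boston[, c("crim", "tax", "ptratio")], range)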
#e. How many of the census tracts in this data set bound the Charles river?
sum(Boston$chas)
## [1] 35
#f. What is the median pupil-teacher ratio among the towns in this data set?
median(Boston$ptratio)
## [1] 19.05
#Chapter 3: ex 2
#The KNN classifier is used for a categorical response: it assigns the class that is most frequent among the \(K\) nearest neighbors. KNN regression instead predicts a continuous value, the average of the responses of the \(K\) nearest neighbors.
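A minimal sketch of the difference, assuming the class and FNN packages are installed (the simulated data and the choice of k = 5 are arbitrary, for illustration only):
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)                         # two numeric predictors
y_class <- factor(ifelse(x[, 1] + x[, 2] > 0, "A", "B"))  # categorical response
y_num <- x[, 1] + x[, 2] + rnorm(100)                     # continuous response
x_new <- matrix(c(0.5, -0.2), ncol = 2)                   # one new observation

# KNN classification: majority vote among the 5 nearest neighbors
class::knn(train = x, test = x_new, cl = y_class, k = 5)

# KNN regression: average response of the 5 nearest neighbors
FNN::knn.reg(train = x, test = x_new, y = y_num, k = 5)$pred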
#Chapter 3: ex 10
fit <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(fit)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
\[ \textit{Sales} = 13 - 0.054 \times \textit{Price} + \begin{cases} -0.022, & \text{if $\textit{Urban}$ is Yes, $\textit{US}$ is No} \\ 1.20, & \text{if $\textit{Urban}$ is No, $\textit{US}$ is Yes} \\ 1.18, & \text{if both $\textit{Urban}$ and $\textit{US}$ are Yes} \\ 0, & \text{otherwise} \end{cases} \]
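As a sanity check on this equation, predict() reproduces it for a few example store profiles (the Price of 100 used here is arbitrary, chosen only for illustration):
# Hypothetical stores covering each Urban/US combination at a fixed price
new_stores <- data.frame(
  Price = rep(100, 4),
  Urban = c("Yes", "No", "Yes", "No"),
  US    = c("No", "Yes", "Yes", "No")
)
predict(fit, newdata = new_stores)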
fit2 <- lm(Sales ~ Price + US, data = Carseats)
summary(fit2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
anova(fit, fit2)
## Analysis of Variance Table
##
## Model 1: Sales ~ Price + Urban + US
## Model 2: Sales ~ Price + US
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 396 2420.8
## 2 397 2420.9 -1 -0.03979 0.0065 0.9357
confint(fit2)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
par(mfrow = c(2, 2))
plot(fit2, cex = 0.2)
#Chapter 4: ex 12
- What is the log odds of orange versus apple in your model?
The log odds of orange versus apple is just \(\hat\beta_0 + \hat\beta_1x\).
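Written out with the standard logistic form for the probability of orange, the log odds follows directly:
\[ \Pr(Y = \textit{orange} \mid x) = \frac{e^{\hat\beta_0 + \hat\beta_1 x}}{1 + e^{\hat\beta_0 + \hat\beta_1 x}} \quad\Longrightarrow\quad \log\frac{\Pr(Y = \textit{orange} \mid x)}{\Pr(Y = \textit{apple} \mid x)} = \hat\beta_0 + \hat\beta_1 x. \]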
- What is the log odds of orange versus apple in your friend's model?
From (4.14), the log odds of orange versus apple in our friend's model comes from the ratio of the two softmax probabilities; their common normalizing denominator cancels, leaving
\[ \log\frac{\Pr(Y = \textit{orange} \mid x)}{\Pr(Y = \textit{apple} \mid x)} = \log\frac{e^{\hat\alpha_{orange0} + \hat\alpha_{orange1} x}}{e^{\hat\alpha_{apple0} + \hat\alpha_{apple1} x}} = (\hat\alpha_{orange0} - \hat\alpha_{apple0}) + (\hat\alpha_{orange1} - \hat\alpha_{apple1})x \]
- Suppose that in your model, \(\hat\beta_0 = 2\) and \(\hat\beta_1 = -1\). What are the coefficient estimates in your friend's model? Be as specific as possible.
We can say that in our friend's model \(\hat\alpha_{orange0} - \hat\alpha_{apple0} = 2\) and \(\hat\alpha_{orange1} - \hat\alpha_{apple1} = -1\). However, we cannot determine the individual values of the parameters, only these differences.
- Now suppose that you and your friend fit the same two models on a different data set. This time, your friend gets the coefficient estimates \(\hat\alpha_{orange0} = 1.2\), \(\hat\alpha_{orange1} = -2\), \(\hat\alpha_{apple0} = 3\), \(\hat\alpha_{apple1} = 0.6\). What are the coefficient estimates in your model?
The coefficients in our model would be \(\hat\beta_0 = 1.2 - 3 = -1.8\) and \(\hat\beta_1 = -2 - 0.6 = -2.6\).
- Finally, suppose you apply both models from (d) to a data set with 2,000 test observations. What fraction of the time do you expect the predicted class labels from your model to agree with those from your friend’s model? Explain your answer.
The two models are the same model under different parameterizations, so their predicted class labels should agree for all 2,000 test observations.
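A small numeric check of this, plugging the coefficients from part (d) into both parameterizations at a few arbitrary values of x:
# Your logistic model: probability of orange from the log odds -1.8 - 2.6x
x <- seq(-3, 3, length.out = 7)
p_yours <- 1 / (1 + exp(-(-1.8 - 2.6 * x)))

# Friend's softmax model: probability of orange from the two linear scores
score_orange <- 1.2 - 2.0 * x
score_apple  <- 3.0 + 0.6 * x
p_friend <- exp(score_orange) / (exp(score_orange) + exp(score_apple))

# Identical probabilities, hence identical predicted labels
all.equal(p_yours, p_friend)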