Q1: Describe the null hypotheses to which the p-values given in Table 3.4 correspond. Explain what conclusions you can draw based on these p-values. Your explanation should be phrased in terms of sales, TV, radio, and newspaper, rather than in terms of the coefficients of the linear model.

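Each p-value in Table 3.4 tests the null hypothesis that one advertising medium has no association with sales once the budgets for the other two media are held fixed. For example, the null hypothesis for TV is that, given the radio and newspaper budgets, a change in the TV budget is not associated with any change in sales. The null hypotheses for radio and newspaper are defined in the same way.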

Interpretation of p-values:

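  • TV (p < 0.0001): Strong evidence that TV advertising is associated with sales when the radio and newspaper budgets are held fixed.
  • Radio (p < 0.0001): Strong evidence that radio advertising is associated with sales when the TV and newspaper budgets are held fixed.
  • Newspaper (p = 0.8599): No evidence that newspaper advertising is associated with sales once the TV and radio budgets are accounted for.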

Conclusions:

  • Higher TV and radio advertising budgets are associated with higher sales.
  • Given TV and radio spending, newspaper advertising shows no detectable additional association with sales and may not be a cost-effective strategy.
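
For reference, the model behind Table 3.4 and the hypotheses being tested can be written out as follows. The symbols \(\beta_j\) and \(\epsilon\) are the usual regression notation, assumed here rather than taken from the answer above: \[ \text{sales} = \beta_0 + \beta_1 \, \text{TV} + \beta_2 \, \text{radio} + \beta_3 \, \text{newspaper} + \epsilon \] \[ H_0: \beta_j = 0 \quad \text{versus} \quad H_a: \beta_j \neq 0, \qquad j = 1, 2, 3 \]

Each test is conditional on the other two predictors remaining in the model, which is why a medium can appear unimportant here even if it is correlated with sales on its own.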

Q2: Carefully explain the differences between the KNN classifier and KNN regression methods.

KNN Classifier:

  • Used for categorical responses.
  • Assigns the class chosen by majority vote among the K nearest neighbors.
  • Produces decision boundaries that are generally non-linear and depend on K and on how the training data are distributed.

KNN Regression:

  • Used for continuous responses.
  • Predicts a value by averaging the responses of the K nearest neighbors.
  • More sensitive to outlying response values, since a single extreme neighbor shifts the average, whereas one unusual label rarely changes a majority vote.

Key Differences:

| Feature       | KNN Classification   | KNN Regression     |
|---------------|----------------------|--------------------|
| Response Type | Categorical          | Continuous         |
| Decision Rule | Majority Voting      | Averaging          |
| Output        | Class Label          | Numerical Value    |
| Loss Function | Classification Error | Mean Squared Error |
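
The two rules can be written side by side. This is a sketch in notation assumed here, not taken from the answer above: \(x_0\) is the query point, \(\mathcal{N}_0\) is the set of its K nearest training observations, and \(I(\cdot)\) is the indicator function. \[ \widehat{\Pr}(Y = j \mid X = x_0) = \frac{1}{K} \sum_{i \in \mathcal{N}_0} I(y_i = j), \qquad \hat{f}(x_0) = \frac{1}{K} \sum_{i \in \mathcal{N}_0} y_i \]

The classifier predicts the class \(j\) with the largest estimated probability, while the regression returns the average directly. As an illustration with made-up values, K = 3 and neighbor labels A, A, B give a predicted class of A, while neighbor responses 2, 4, 9 give a regression prediction of 5.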

Q3: Suppose we have a data set with five predictors, X1 = GPA, X2 = IQ, X3 = Level (1 for College and 0 for High School), X4 = Interaction between GPA and IQ, and X5 = Interaction between GPA and Level. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model, and get \(\hat{\beta}_0 = 50\), \(\hat{\beta}_1 = 20\), \(\hat{\beta}_2 = 0.07\), \(\hat{\beta}_3 = 35\), \(\hat{\beta}_4 = 0.01\), \(\hat{\beta}_5 = -10\).

(a) Which answer is correct, and why?

Given the model: \[ \hat{Y} = 50 + 20X_1 + 0.07X_2 + 35X_3 + 0.01X_1X_2 - 10X_1X_3 \]

We compare salaries for college and high school graduates while keeping IQ and GPA fixed.

  • For high school graduates (X₃ = 0): \[ \hat{Y}_{HS} = 50 + 20X_1 + 0.07X_2 + 0 + 0.01X_1X_2 - 10(0) \]
  • For college graduates (X₃ = 1): \[ \hat{Y}_{College} = 50 + 20X_1 + 0.07X_2 + 35 + 0.01X_1X_2 - 10X_1 \]
  • Difference: \[ \hat{Y}_{College} - \hat{Y}_{HS} = 35 - 10X_1 \]
  • The difference is negative when \(X_1 > 3.5\), so for a GPA above 3.5 high school graduates earn more on average.
  • For a GPA below 3.5, college graduates earn more on average.

Option (iii) is correct: for fixed values of IQ and GPA, high school graduates earn more on average than college graduates, provided that GPA is high enough.

(b) Predict the salary of a college graduate with IQ of 110 and a GPA of 4.0.

GPA <- 4.0
IQ <- 110
Level <- 1 # College Graduate
predicted_salary <- 50 + 20*GPA + 0.07*IQ + 35*Level + 0.01*GPA*IQ - 10*GPA*Level
predicted_salary
## [1] 137.1
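
As a check on the code output, the same prediction can be expanded term by term: \[ \hat{Y} = 50 + 20(4.0) + 0.07(110) + 35(1) + 0.01(4.0)(110) - 10(4.0)(1) = 50 + 80 + 7.7 + 35 + 4.4 - 40 = 137.1 \]

Since salary is measured in thousands of dollars, this corresponds to a predicted starting salary of about 137,100 dollars.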

(c) True or false: Since the coefficient for the GPA/IQ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer.

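  • False. The size of a coefficient depends on the units of its predictor and says nothing by itself about statistical significance. Evidence for an interaction is judged from the coefficient relative to its standard error, through its t-statistic and p-value.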
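
To make the scale argument concrete, here is a sketch using the standard t-statistic for a single coefficient. The standard error \(\mathrm{SE}(\hat{\beta}_4)\) is not given in the problem, so no numerical test can be carried out here: \[ t = \frac{\hat{\beta}_4}{\mathrm{SE}(\hat{\beta}_4)} \]

A small \(\hat{\beta}_4\) paired with an even smaller standard error still yields a large \(|t|\). The term is also not negligible in practice, because the GPA × IQ product is large: for GPA = 4.0 and IQ = 110 it contributes \[ 0.01 \times 4.0 \times 110 = 4.4 \] thousand dollars to the predicted salary.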

Q5: Consider the fitted values that result from performing linear regression without an intercept. In this setting, the ith fitted value takes the form as shown in the question. What is \(a_{i'}\)?

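Given: \[ \hat{y}_i = x_i \hat{\beta} \] where \[ \hat{\beta} = \frac{\sum_{i'=1}^{n} x_{i'} y_{i'}}{\sum_{k=1}^{n} x_k^2} \]

Substitute \(\hat{\beta}\) into the fitted value and move \(x_i\) inside the sum: \[ \hat{y}_i = x_i \cdot \frac{\sum_{i'=1}^{n} x_{i'} y_{i'}}{\sum_{k=1}^{n} x_k^2} = \sum_{i'=1}^{n} \frac{x_i x_{i'}}{\sum_{k=1}^{n} x_k^2} \, y_{i'} \] This has the form \[ \hat{y}_i = \sum_{i'=1}^{n} a_{i'} y_{i'} \] where \[ a_{i'} = \frac{x_i x_{i'}}{\sum_{k=1}^{n} x_k^2} \] The denominator uses a separate index \(k\) so that it does not clash with the summation index \(i'\).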

Interpretation:

  • The fitted values are linear combinations of the response values.
  • The weights \(a_{i'}\) depend only on the predictor values (including \(x_i\) itself) and are normalized by the sum of the squared predictor values.
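
A small check with made-up numbers (illustrative only, not from the exercise): take \(x = (1, 2, 3)\), \(y = (2, 3, 7)\), and \(i = 2\). Computing the fitted value directly: \[ \sum_{k=1}^{3} x_k^2 = 14, \qquad \hat{\beta} = \frac{1 \cdot 2 + 2 \cdot 3 + 3 \cdot 7}{14} = \frac{29}{14}, \qquad \hat{y}_2 = x_2 \hat{\beta} = \frac{29}{7} \] Computing it through the weights: \[ a_1 = \frac{2 \cdot 1}{14}, \quad a_2 = \frac{2 \cdot 2}{14}, \quad a_3 = \frac{2 \cdot 3}{14}, \qquad \sum_{i'=1}^{3} a_{i'} y_{i'} = \frac{2 \cdot 2 + 4 \cdot 3 + 6 \cdot 7}{14} = \frac{58}{14} = \frac{29}{7} \]

Both routes give the same fitted value, as the derivation predicts.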