Differences between KNN classification and KNN regression

1. Objective
   - The KNN classifier is a supervised learning method for classification tasks.
   - KNN regression is suited for regression tasks.
2. Output
   - The KNN classifier outputs a discrete class label.
   - KNN regression outputs a continuous numerical value.
3. Method of calculation
   - KNN classifier: for a test point, it finds the K nearest neighbors under a chosen distance metric and assigns the most common class label among them. Some variants use weighted voting, where closer neighbors carry more influence.
   - KNN regression: it also identifies the K nearest neighbors, but predicts the average of the numerical responses of those neighbors.
4. Error metrics
   - KNN classifier: performance is usually evaluated with classification metrics such as accuracy, precision, recall, or F1-score.
   - KNN regression: performance is assessed with regression metrics such as Mean Squared Error (MSE), Mean Absolute Error (MAE), or R-squared.
5. Data features
   - KNN classifier: works best when classes are clearly separable and the points within each class are relatively dense.
   - KNN regression: requires meaningful relationships between the input features and the numerical target.
6. Decision boundary
   - KNN classifier: creates distinct decision boundaries separating the classes; these boundaries can be complex and non-linear depending on the data distribution.
   - KNN regression: produces a locally averaged (piecewise) approximation of the target function rather than class boundaries, since each prediction is an average of neighboring response values. A minimal code sketch of both flavors follows.
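As an illustrative sketch (not part of the assignment output), the snippet below contrasts the two KNN flavors on simulated data; it assumes the 'class' and 'FNN' packages are installed.
# Simulated data: two numeric features, one discrete label, one continuous response
set.seed(1)
x_train <- matrix(rnorm(200), ncol = 2)   # 100 training points, 2 features
x_test  <- matrix(rnorm(40),  ncol = 2)   # 20 test points
class_train <- factor(ifelse(x_train[, 1] + x_train[, 2] > 0, "A", "B"))  # discrete label
y_train <- 3 * x_train[, 1] - 2 * x_train[, 2] + rnorm(100, sd = 0.5)     # continuous response
# KNN classification: majority vote among the K = 5 nearest neighbors -> class label
pred_class <- class::knn(train = x_train, test = x_test, cl = class_train, k = 5)
# KNN regression: average response of the K = 5 nearest neighbors -> numeric prediction
pred_num <- FNN::knn.reg(train = x_train, test = x_test, y = y_train, k = 5)$pred
table(pred_class)   # discrete output (class labels)
summary(pred_num)   # continuous output (numeric predictions)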
library(ISLR2) # Load the ISLR2 package
## Warning: package 'ISLR2' was built under R version 4.4.2
data("Auto") # Load the Auto dataset
pairs(Auto, main = "Scatterplot Matrix for Auto Dataset")
# Exclude the 'name' column and select only numeric variables
Auto_numeric <- Auto[, sapply(Auto, is.numeric)]
# Compute the correlation matrix
correlation_matrix <- cor(Auto_numeric)
# Display the result
print(correlation_matrix)
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
# Fit the multiple linear regression model
lm_model <- lm(mpg ~ . - name, data = Auto)
# Print the summary of the model
summary(lm_model)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Comments:
i. Yes, there appears to be a significant relationship between the predictors and the response (mpg). This is evident from:
- F-statistic: 252.4 with a p-value < 2.2e-16, indicating that at least one of the predictors contributes significantly to the model.
- Multiple R-squared: 0.8215 (Adjusted R-squared: 0.8182), suggesting that about 82% of the variance in mpg is explained by the predictors, which is quite high.
The individual predictors with statistically significant relationships to mpg are:
- Displacement: p-value = 0.0084, a significant positive relationship with mpg.
- Weight: p-value < 2e-16, a strong negative relationship with mpg.
- Year: p-value < 2e-16, a strong positive relationship with mpg (newer cars are more fuel-efficient).
- Origin: p-value = 4.67e-07, a significant positive relationship between mpg and the car's region of origin.
These p-values can also be read programmatically, as sketched below.
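As a cross-check (not in the original output), the p-values can be pulled directly from the coefficient table instead of reading them off the printed summary:
# Extract the coefficient table from the fitted model
coef_table <- summary(lm_model)$coefficients
round(coef_table[, "Pr(>|t|)"], 5)                      # p-value for each term
rownames(coef_table)[coef_table[, "Pr(>|t|)"] < 0.05]   # terms significant at the 5% level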
# Generate diagnostic plots for the linear regression model
plot(lm_model)
Comments:
Residuals vs. Fitted plot: there is a clear curved pattern in the residuals, indicating that the linear model may not fully capture the relationship between mpg and the predictors and suggesting non-linearity. Observations such as 323, 327, and 336 stand out with large residuals and do not fit the model well; they may not be extreme enough to discard outright, but they deserve attention during diagnostics.
Residuals vs. Leverage plot: observations 327, 394, and 14 have high leverage, meaning they could disproportionately influence the fitted coefficients. Points such as 327 and 394 lie near or beyond the Cook's distance threshold lines, marking them as potentially influential observations that combine high leverage with moderate residuals. These quantities can also be computed directly, as sketched below.
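For reference (not run in the original), the influence measures underlying the Residuals vs. Leverage plot can be computed directly from the fitted model:
lev <- hatvalues(lm_model)               # leverage of each observation
cook <- cooks.distance(lm_model)         # Cook's distance of each observation
head(sort(lev, decreasing = TRUE), 5)    # observations with the highest leverage
head(sort(cook, decreasing = TRUE), 5)   # observations with the largest Cook's distance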
# Assign the full interaction model to an object
full_interaction_model <- lm(mpg ~ cylinders * displacement +
cylinders * horsepower +
cylinders * weight +
cylinders * acceleration +
cylinders * year +
cylinders * origin +
displacement * horsepower +
displacement * weight +
displacement * acceleration +
displacement * year +
displacement * origin +
horsepower * weight +
horsepower * acceleration +
horsepower * year +
horsepower * origin +
weight * acceleration +
weight * year +
weight * origin +
acceleration * year +
acceleration * origin +
year * origin,
data = Auto)
# Generate a summary of the model
summary(full_interaction_model)
##
## Call:
## lm(formula = mpg ~ cylinders * displacement + cylinders * horsepower +
## cylinders * weight + cylinders * acceleration + cylinders *
## year + cylinders * origin + displacement * horsepower + displacement *
## weight + displacement * acceleration + displacement * year +
## displacement * origin + horsepower * weight + horsepower *
## acceleration + horsepower * year + horsepower * origin +
## weight * acceleration + weight * year + weight * origin +
## acceleration * year + acceleration * origin + year * origin,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6303 -1.4481 0.0596 1.2739 11.1386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.548e+01 5.314e+01 0.668 0.50475
## cylinders 6.989e+00 8.248e+00 0.847 0.39738
## displacement -4.785e-01 1.894e-01 -2.527 0.01192 *
## horsepower 5.034e-01 3.470e-01 1.451 0.14769
## weight 4.133e-03 1.759e-02 0.235 0.81442
## acceleration -5.859e+00 2.174e+00 -2.696 0.00735 **
## year 6.974e-01 6.097e-01 1.144 0.25340
## origin -2.090e+01 7.097e+00 -2.944 0.00345 **
## cylinders:displacement -3.383e-03 6.455e-03 -0.524 0.60051
## cylinders:horsepower 1.161e-02 2.420e-02 0.480 0.63157
## cylinders:weight 3.575e-04 8.955e-04 0.399 0.69000
## cylinders:acceleration 2.779e-01 1.664e-01 1.670 0.09584 .
## cylinders:year -1.741e-01 9.714e-02 -1.793 0.07389 .
## cylinders:origin 4.022e-01 4.926e-01 0.816 0.41482
## displacement:horsepower -8.491e-05 2.885e-04 -0.294 0.76867
## displacement:weight 2.472e-05 1.470e-05 1.682 0.09342 .
## displacement:acceleration -3.479e-03 3.342e-03 -1.041 0.29853
## displacement:year 5.934e-03 2.391e-03 2.482 0.01352 *
## displacement:origin 2.398e-02 1.947e-02 1.232 0.21875
## horsepower:weight -1.968e-05 2.924e-05 -0.673 0.50124
## horsepower:acceleration -7.213e-03 3.719e-03 -1.939 0.05325 .
## horsepower:year -5.838e-03 3.938e-03 -1.482 0.13916
## horsepower:origin 2.233e-03 2.930e-02 0.076 0.93931
## weight:acceleration 2.346e-04 2.289e-04 1.025 0.30596
## weight:year -2.245e-04 2.127e-04 -1.056 0.29182
## weight:origin -5.789e-04 1.591e-03 -0.364 0.71623
## acceleration:year 5.562e-02 2.558e-02 2.174 0.03033 *
## acceleration:origin 4.583e-01 1.567e-01 2.926 0.00365 **
## year:origin 1.393e-01 7.399e-02 1.882 0.06062 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared: 0.8893, Adjusted R-squared: 0.8808
## F-statistic: 104.2 on 28 and 363 DF, p-value: < 2.2e-16
The following interaction terms are statistically significant at the 5% level: displacement:year (p-value = 0.01352), acceleration:year (p-value = 0.03033), and acceleration:origin (p-value = 0.00365). An equivalent, more compact way to request all pairwise interactions is sketched below.
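For reference (not part of the original fit), the same set of pairwise interaction terms can be requested with R's ^2 formula operator, which should expand to the model written out above:
# Compact equivalent of the fully written-out interaction model
compact_interaction_model <- lm(mpg ~ (. - name)^2, data = Auto)
summary(compact_interaction_model)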
# Apply log, square root, and squared transformations to selected predictors
model_transformed <- lm(mpg ~ log(displacement) + sqrt(weight) + I(horsepower^2) +
acceleration + year + origin, data = Auto)
# View the summary of the transformed model
summary(model_transformed)
##
## Call:
## lm(formula = mpg ~ log(displacement) + sqrt(weight) + I(horsepower^2) +
## acceleration + year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.9789 -1.9988 0.0399 1.7505 12.9436
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.300e+00 5.451e+00 0.605 0.54531
## log(displacement) -1.469e+00 1.089e+00 -1.350 0.17795
## sqrt(weight) -6.892e-01 7.007e-02 -9.836 < 2e-16 ***
## I(horsepower^2) 1.078e-04 3.619e-05 2.978 0.00308 **
## acceleration 1.788e-01 8.433e-02 2.121 0.03458 *
## year 7.866e-01 4.845e-02 16.233 < 2e-16 ***
## origin 6.598e-01 2.838e-01 2.325 0.02059 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.187 on 385 degrees of freedom
## Multiple R-squared: 0.8358, Adjusted R-squared: 0.8333
## F-statistic: 326.6 on 6 and 385 DF, p-value: < 2.2e-16
Findings:
Significant predictors:
- sqrt(weight): highly significant (p-value < 2e-16) with a negative coefficient (-0.6892). mpg decreases as the square root of weight increases, indicating a strong inverse relationship between car weight and fuel efficiency.
- I(horsepower^2): significant (p-value = 0.00308) with a positive coefficient (0.0001078); the squared term captures a nonlinear relationship between horsepower and mpg.
- acceleration: significant (p-value = 0.03458) with a positive coefficient (0.1788), indicating that higher acceleration is associated with slightly higher mpg.
- year: extremely significant (p-value < 2e-16) with a positive coefficient (0.7866). Cars manufactured in more recent years are associated with better fuel efficiency, likely reflecting advances in automotive technology.
- origin: significant (p-value = 0.02059) with a positive coefficient (0.6598), implying that mpg varies with the region of manufacture.
Non-significant predictors:
- log(displacement): the p-value (0.17795) indicates this term is not significant, suggesting the logarithmic transformation of displacement adds little explanatory power once the other predictors are included.
Adjusted R-squared = 0.8333: the model explains approximately 83.3% of the variance in mpg, indicating a strong overall fit. A rough comparison with the earlier additive model is sketched below.
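As a supplementary check (not part of the original analysis), the transformed fit can be set against the earlier additive model lm_model using AIC, and its diagnostics re-examined:
# Lower AIC indicates a better trade-off between fit and complexity
AIC(lm_model, model_transformed)
# Re-check the diagnostic plots after the transformations
par(mfrow = c(2, 2))
plot(model_transformed)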
# View the structure of the Carseats data set (optional)
str(Carseats)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
# Fit a multiple regression model to predict Sales using Price, Urban, and US
carseats_sales_model<- lm(Sales ~ Price + Urban + US, data = Carseats)
# Summarize the model to view the results
summary(carseats_sales_model)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Interpretation: Price coefficient (-0.054459): for every one-unit increase in Price, predicted Sales decrease by about 0.0545 units, holding all other variables constant. Higher prices are associated with lower sales, as expected given price sensitivity.
UrbanYes coefficient (-0.021916): the difference in Sales between stores in urban areas (Urban = Yes) and non-urban areas (Urban = No), holding all other variables constant. Urban stores sell about 0.0219 units less on average than non-urban stores, but this effect is not statistically significant (p-value = 0.936).
USYes coefficient (1.200573): the difference in Sales between stores in the US (US = Yes) and stores outside the US (US = No), holding all other variables constant. US stores sell about 1.2006 units more on average than non-US stores; this effect is statistically significant (p-value = 4.86e-06).
Equation form:
\[ \widehat{\text{Sales}} = 13.0435 - 0.0545 \cdot \text{Price} - 0.0219 \cdot \text{UrbanYes} + 1.2006 \cdot \text{USYes} \]
Where: UrbanYes: Urban is a qualitative variable with two levels: “Yes” and “No.” UrbanYes = 1 if the store is in an urban area, and UrbanYes = 0 otherwise.
USYes: US is a qualitative variable with two levels: “Yes” and “No.” USYes = 1 if the store is in the United States, and USYes = 0 otherwise.
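As a supplementary check (not part of the original output), the dummy coding described above can be inspected directly from the factors and the fitted model:
contrasts(Carseats$Urban)                 # UrbanYes = 1 for "Yes", 0 for "No"
contrasts(Carseats$US)                    # USYes = 1 for "Yes", 0 for "No"
head(model.matrix(carseats_sales_model))  # design matrix actually used in the regression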
The predictors for which we can reject the null hypothesis H0: βj = 0 are Price and USYes, because their p-values are below the 0.05 significance level: Price (p-value < 2e-16) and USYes (p-value = 4.86e-06). For UrbanYes (p-value = 0.936) the null hypothesis cannot be rejected.
# Fit the smaller model with significant predictors
reduced_model <- lm(Sales ~ Price + US, data = Carseats)
# Summarize the reduced model
summary(reduced_model)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
Both models explain a very similar proportion of the variance in Sales (Adjusted R-squared ~23.5%), indicating comparable predictive power. The reduced model performs slightly better in terms of Adjusted R-squared and RSE, suggesting that removing the non-significant Urban variable made the model more efficient without losing explanatory power.
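As a supplementary check (not part of the original write-up), the two nested models can also be compared with a partial F-test:
# Compare the reduced model (Price + US) with the fuller model (Price + Urban + US)
anova(reduced_model, carseats_sales_model)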
The critical t-value for a 95% confidence interval with 397 degrees of freedom is approximately 1.966, and each interval is computed as CI = coefficient ± t × (standard error).
Price: CI = −0.05448 ± 1.966 × 0.00523 = −0.05448 ± 0.01028, i.e. CI = (−0.06476, −0.04420).
USYes: CI = 1.19964 ± 1.966 × 0.25846 = 1.19964 ± 0.508, i.e. CI = (0.69164, 1.70764).
The same intervals can be obtained directly in R, as sketched below.
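The sketch below uses base R's confint() to reproduce these intervals from the fitted model:
# 95% confidence intervals for the coefficients of the reduced model
confint(reduced_model, level = 0.95)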
# Generate diagnostic plots
par(mfrow = c(2, 2)) # Arrange plots in a 2x2 grid
plot(reduced_model)
Outliers: There is evidence of potential outliers, as seen in the Residuals vs. Fitted, Q-Q, and Residuals vs. Leverage plots. These are observations with large residuals (standardized residuals beyond roughly ±2.5 to ±3). High-leverage points also exist (e.g., leverage > 0.02), but they do not coincide with large residuals, so they are not necessarily problematic for the model. These thresholds can be checked numerically, as sketched below.
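For reference (not in the original output), the thresholds mentioned above can be checked numerically; the leverage cutoff 2(p + 1)/n is a common rule of thumb and is an assumption here rather than something reported by the plots.
# Standardized residuals and leverage for the reduced Carseats model
std_res_cs <- rstandard(reduced_model)
lev_cs <- hatvalues(reduced_model)
which(abs(std_res_cs) > 2.5)              # candidate outliers (|standardized residual| > 2.5)
which(lev_cs > 2 * 3 / nrow(Carseats))    # candidate high-leverage points (leverage > 2(p+1)/n = 0.015)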
For regression through the origin, the coefficient estimates are
\[ \hat{\beta}_{Y \sim X} = \frac{\sum_{i=1}^n X_i Y_i}{\sum_{i=1}^n X_i^2}, \qquad \hat{\beta}_{X \sim Y} = \frac{\sum_{i=1}^n X_i Y_i}{\sum_{i=1}^n Y_i^2}. \]
Since the two estimates share the same numerator, the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X if and only if the denominators agree, that is, if and only if the sum of the squared values of X equals the sum of the squared values of Y: \[ \sum_{i=1}^n X_i^2 = \sum_{i=1}^n Y_i^2 \]
# Set seed for reproducibility
set.seed(123)
# Generate n = 100 observations
n <- 100
# Generate X as a random normal variable
X <- rnorm(n, mean = 0, sd = 1)
# Generate Y as a linear function of X with noise, scaled to ensure sum(X^2) ≠ sum(Y^2)
Y <- 2 * X + rnorm(n, mean = 0, sd = 0.5)
# Fit regression of Y onto X (no intercept)
model_Y_on_X <- lm(Y ~ X + 0) # + 0 removes the intercept
beta_Y_on_X <- coef(model_Y_on_X) # Coefficient for Y ~ X
# Fit regression of X onto Y (no intercept)
model_X_on_Y <- lm(X ~ Y + 0) # + 0 removes the intercept
beta_X_on_Y <- coef(model_X_on_Y) # Coefficient for X ~ Y
# Calculate sums of squares to verify they are different
sum_X_squared <- sum(X^2)
sum_Y_squared <- sum(Y^2)
# Display results
cat("Sum of squared values of X:", sum_X_squared, "\n")
## Sum of squared values of X: 83.30737
cat("Sum of squared values of Y:", sum_Y_squared, "\n")
## Sum of squared values of Y: 346.0601
cat("Coefficient for Y onto X (no intercept):", beta_Y_on_X, "\n")
## Coefficient for Y onto X (no intercept): 1.968186
cat("Coefficient for X onto Y (no intercept):", beta_X_on_Y, "\n")
## Coefficient for X onto Y (no intercept): 0.4738033
# Set seed for reproducibility
set.seed(123)
# Generate n = 100 observations
n <- 100
# Generate X1 as a random normal variable
X1 <- rnorm(n, mean = 0, sd = 1)
# Generate Y1 as a scaled version of X1 to ensure sum(X1^2) = sum(Y1^2)
# Use Y1 = X1 to make sum(X1^2) = sum(Y1^2)
Y1 <- X1 # No noise, perfect linear relationship through the origin
# Fit regression of Y1 onto X1 (no intercept)
model_Y1_on_X1 <- lm(Y1 ~ X1 + 0) # + 0 removes the intercept
beta_Y1_on_X1 <- coef(model_Y1_on_X1) # Coefficient for Y1 ~ X1
# Fit regression of X1 onto Y1 (no intercept)
model_X1_on_Y1 <- lm(X1 ~ Y1 + 0) # + 0 removes the intercept
beta_X1_on_Y1 <- coef(model_X1_on_Y1) # Coefficient for X1 ~ Y1
# Calculate sums of squares to verify they are equal
sum_X1_squared <- sum(X1^2)
sum_Y1_squared <- sum(Y1^2)
# Display results
cat("Sum of squared values of X1:", sum_X1_squared, "\n")
## Sum of squared values of X1: 83.30737
cat("Sum of squared values of Y1:", sum_Y1_squared, "\n")
## Sum of squared values of Y1: 83.30737
cat("Coefficient for Y1 onto X1 (no intercept):", beta_Y1_on_X1, "\n")
## Coefficient for Y1 onto X1 (no intercept): 1
cat("Coefficient for X1 onto Y1 (no intercept):", beta_X1_on_Y1, "\n")
## Coefficient for X1 onto Y1 (no intercept): 1
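A slightly less trivial construction (not in the original) also satisfies the condition: permuting X1 preserves its sum of squares, so the two no-intercept coefficient estimates again coincide. The names X2 and Y2 below are new and purely illustrative.
# Permutation keeps the sums of squares equal without making the variables identical
set.seed(123)
X2 <- rnorm(100)
Y2 <- sample(X2)                                  # same values, different order
c(sum(X2^2), sum(Y2^2))                           # equal sums of squares
c(coef(lm(Y2 ~ X2 + 0)), coef(lm(X2 ~ Y2 + 0)))   # the two coefficient estimates agree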