Problem # 2: Carefully explain the differences between the KNN classifier and KNN regression methods.
The K-Nearest Neighbors (KNN) algorithm is a simple yet powerful method used for both classification and regression tasks. It’s a non-parametric approach, meaning it doesn’t make strong assumptions about the underlying data, and it relies on the idea that similar data points tend to be close to each other.
• KNN Classifier: Imagine you’re trying to categorize something, like deciding whether an animal in a photo is a cat or a dog. KNN looks at the “k” closest examples (neighbors) it has seen before and assigns the new example to the most common category among those neighbors. It’s like asking your closest friends for their opinion and going with the majority vote.
• KNN Regression: Now, think of predicting something numerical, like the price of a house. Instead of categories, KNN looks at the “k” nearest houses it knows about and predicts the price by averaging their values. It’s like estimating the cost of your dream home by checking what similar houses nearby sold for.
The two methods differ in what they return:
• KNN Classifier: It gives you a clear-cut answer, like “cat” or “dog.”
• KNN Regression: It gives you a number, like $350,000 for a house or 72°F for the temperature.
They also differ in how the prediction is made:
• KNN Classifier: It uses a majority vote. If most of the “k” neighbors are cats, the new example is labeled as a cat.
• KNN Regression: It calculates the average (or a weighted average) of the neighbors’ values. For example, if three nearby houses are priced at $300k, $320k, and $340k, it might predict $320k.
Both versions use distance measures, like Euclidean distance (think of it as the straight-line distance between two points on a map).
• KNN Classifier: The distance helps determine which neighbors are closest, so it can pick the most common class.
• KNN Regression: The distance helps calculate an average value, giving more weight to closer neighbors.
Their sensitivity to outliers differs as well.
• KNN Classifier: It’s pretty robust. If most neighbors agree on a category, a single oddball neighbor won’t sway the decision much.
• KNN Regression: It’s more sensitive. Outliers can skew the average, leading to less accurate predictions. For example, if one house is priced way higher than the others, it could throw off the predicted price.
The two are also evaluated with different metrics.
• KNN Classifier: You’d use metrics like accuracy (how often it’s right), precision (how many of its positive predictions are correct), recall (how many actual positives it catches), and F1-score (a balance of precision and recall).
• KNN Regression: You’d use metrics like Mean Squared Error (MSE), which measures how far off the predictions are on average, or R-squared, which tells you how well the model explains the data.
Finally, the typical applications differ.
• KNN Classifier: It’s great for tasks like recognizing images (e.g., is this a cat or a dog?), filtering spam emails, or helping doctors diagnose diseases based on symptoms.
• KNN Regression: It’s handy for predicting house prices, forecasting temperatures, or estimating stock prices based on historical trends.
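To make the contrast concrete, here is a minimal base-R sketch (not part of the original problem) that runs both flavors of KNN by hand; the toy data, variable names, and numbers are purely illustrative.
# Toy training data: two numeric features, a class label, and a numeric target
set.seed(1)
train_x <- matrix(rnorm(40), ncol = 2)                       # 20 points, 2 features
train_class <- factor(ifelse(train_x[, 1] > 0, "cat", "dog"))
train_value <- 300 + 20 * train_x[, 1] + rnorm(20, sd = 5)   # e.g., price in $1,000s
new_x <- c(0.5, -0.2)   # a new observation to predict
k <- 3
# Euclidean (straight-line) distance from the new point to every training point
dists <- sqrt((train_x[, 1] - new_x[1])^2 + (train_x[, 2] - new_x[2])^2)
nn <- order(dists)[1:k]   # indices of the k nearest neighbors
# KNN classification: majority vote among the neighbors' labels
pred_class <- names(which.max(table(train_class[nn])))
# KNN regression: average of the neighbors' numeric values
pred_value <- mean(train_value[nn])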
Problem # 9
# Load necessary libraries
library(GGally)
## Warning: package 'GGally' was built under R version 4.4.2
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.4.2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
# Load dataset
auto <- read.csv("Auto.csv", stringsAsFactors = TRUE)
# Convert horsepower to numeric (it might have missing values as '?')
auto$horsepower <- as.numeric(as.character(auto$horsepower))
## Warning: NAs introduced by coercion
# Remove 'name' column as it is qualitative
auto <- auto[ , !(names(auto) %in% c("name"))]
# Generate a scatterplot matrix
ggpairs(auto)
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 5 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 5 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 5 rows containing missing values
## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 5 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 5 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 5 rows containing non-finite outside the scale range
## (`stat_density()`).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 5 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 5 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 5 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 5 rows containing missing values
## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 5 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 5 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 5 rows containing missing values or values outside the scale range
## (`geom_point()`).
# Compute correlation matrix
cor_matrix <- cor(auto, use = "complete.obs")
print(cor_matrix)
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
# Fit multiple linear regression model
model <- lm(mpg ~ ., data = auto)
# Display summary of the model
summary(model)
##
## Call:
## lm(formula = mpg ~ ., data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Yes, there is a strong relationship between the predictors and the response variable (mpg). The F-statistic is 252.4 with a p-value < 2.2e-16, which is extremely small. This indicates that at least one of the predictors has a significant effect on mpg. The Multiple R-squared value is 0.8215, meaning that approximately 82.15% of the variation in mpg is explained by the predictor variables. The Adjusted R-squared is 0.8182, which suggests that even after adjusting for the number of predictors, the model still explains a high proportion of variance.
A predictor is statistically significant if its p-value < 0.05.
-> The significant predictors are:
Displacement (p-value = 0.00844): Positively related to mpg.
Weight (p-value < 2e-16): Negatively related to mpg (most significant).
Year (p-value < 2e-16): Positively related to mpg (highly significant).
Origin (p-value = 4.67e-07): Positively related to mpg.
-> On the other hand, the following predictors do not show strong statistical significance:
Cylinders (p-value = 0.12780): Not statistically significant.
Horsepower (p-value = 0.21963): Not statistically significant.
Acceleration (p-value = 0.41548): Not statistically significant.
Thus, Weight, Year, Displacement, and Origin have a strong statistical relationship with mpg, while Cylinders, Horsepower, and Acceleration do not.
The coefficient for year is 0.750773, meaning that for each additional year, the mpg increases by approximately 0.75. This suggests that newer cars are more fuel-efficient, likely due to advancements in technology, better fuel economy, and efficiency improvements over time.
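As a quick sanity check on this interpretation (a small sketch reusing the model object fitted above), the coefficient can be pulled out directly and scaled to, say, a ten-year gap:
# Estimated change in mpg per model year, holding the other predictors fixed
coef(model)["year"]
# Expected mpg difference between cars built ten model years apart
coef(model)["year"] * 10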
# Generate diagnostic plots
par(mfrow = c(2,2)) # Set 2x2 layout for plots
plot(model)
1. Residuals vs Fitted Plot:
This plot checks for non-linearity and homoscedasticity (constant variance of residuals). Ideally, residuals should be randomly scattered around zero. In this plot, a slight curve is visible, suggesting possible non-linearity in the model. Some large residuals (outliers) are present, seen as points far from the center.
🔹 Issue: Possible non-linearity in the data, indicating that a transformation might improve the model.
2. Normal Q-Q Plot:
This plot checks whether residuals follow a normal distribution. If points follow the diagonal line, the normality assumption holds. In this plot, deviation at the ends (tails) suggests the presence of outliers.
🔹 Issue: The model might not fully satisfy the normality assumption due to outliers.
3. Scale-Location Plot:
This plot checks for homoscedasticity (constant variance of residuals). Ideally, residuals should be spread evenly across fitted values. Here the residuals appear somewhat evenly spread, though there are a few large residuals on the right.
🔹 Issue: Potential heteroscedasticity (unequal variance), meaning a transformation such as log(mpg) might be needed.
4. Residuals vs Leverage Plot:
This plot helps identify high-leverage points, which can have a strong influence on the model. Points beyond the Cook’s distance lines are highly influential. Observations 327, 394, and 14 have high leverage and may significantly affect the regression.
🔹 Issue: A few high-leverage points exist, particularly point 14, which may distort the model’s fit.
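Since the diagnostics point to mild non-linearity and unequal variance, one possible follow-up (shown only as a sketch on the same cleaned auto data frame, not something the problem requires) is to refit with a log-transformed response and re-examine the plots:
# Refit with log(mpg) as the response to address curvature and heteroscedasticity
model_log_mpg <- lm(log(mpg) ~ ., data = auto)
summary(model_log_mpg)
# Re-check the same four diagnostic plots for the transformed model
par(mfrow = c(2, 2))
plot(model_log_mpg)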
# Fit model with interaction terms
model_interact <- lm(mpg ~ cylinders * displacement + weight * acceleration, data = auto)
# Summary of the model with interactions
summary(model_interact)
##
## Call:
## lm(formula = mpg ~ cylinders * displacement + weight * acceleration,
## data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.7750 -2.4764 -0.1929 1.9220 17.4032
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.9244021 7.8048993 5.884 8.63e-09 ***
## cylinders -1.8504497 0.6119157 -3.024 0.002659 **
## displacement -0.0804117 0.0184063 -4.369 1.60e-05 ***
## weight -0.0045984 0.0022941 -2.004 0.045711 *
## acceleration 0.4822565 0.3722482 1.296 0.195906
## cylinders:displacement 0.0100242 0.0025982 3.858 0.000134 ***
## weight:acceleration -0.0000634 0.0001306 -0.485 0.627758
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.163 on 390 degrees of freedom
## Multiple R-squared: 0.7213, Adjusted R-squared: 0.717
## F-statistic: 168.2 on 6 and 390 DF, p-value: < 2.2e-16
This model includes two interaction terms: cylinders × displacement and weight × acceleration.
-> Significant interaction term:
cylinders:displacement → p-value = 0.000134 (significant at 0.1% level) This suggests that the effect of displacement on mpg depends on the number of cylinders.
-> Non-significant interaction term:
weight:acceleration → p-value = 0.627758 (not significant) This suggests that weight and acceleration do not have a meaningful combined effect on mpg.
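To test whether the two interaction terms jointly add explanatory power, one option (a sketch, not required by the problem) is a nested-model F-test against a main-effects-only version of the same four predictors:
# Main-effects-only counterpart of the interaction model
model_main <- lm(mpg ~ cylinders + displacement + weight + acceleration, data = auto)
# Nested F-test: do cylinders:displacement and weight:acceleration improve the fit?
anova(model_main, model_interact)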
# Apply transformations and fit models
model_log <- lm(mpg ~ log(weight), data = auto)
model_sqrt <- lm(mpg ~ sqrt(weight), data = auto)
model_square <- lm(mpg ~ I(weight^2), data = auto)
# Compare summaries
summary(model_log)
##
## Call:
## lm(formula = mpg ~ log(weight), data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.4658 -2.6579 -0.2947 1.9395 15.9787
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 210.5391 5.9837 35.19 <2e-16 ***
## log(weight) -23.5050 0.7516 -31.27 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.203 on 395 degrees of freedom
## Multiple R-squared: 0.7123, Adjusted R-squared: 0.7116
## F-statistic: 978.1 on 1 and 395 DF, p-value: < 2.2e-16
summary(model_sqrt)
##
## Call:
## lm(formula = mpg ~ sqrt(weight), data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.2769 -2.8948 -0.3705 2.0839 16.1925
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.84709 1.52239 45.88 <2e-16 ***
## sqrt(weight) -0.85860 0.02793 -30.74 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.255 on 395 degrees of freedom
## Multiple R-squared: 0.7052, Adjusted R-squared: 0.7044
## F-statistic: 944.8 on 1 and 395 DF, p-value: < 2.2e-16
summary(model_square)
##
## Call:
## lm(formula = mpg ~ I(weight^2), data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.3230 -3.1839 -0.4874 2.4089 17.2085
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.453e+01 4.692e-01 73.61 <2e-16 ***
## I(weight^2) -1.155e-06 4.270e-08 -27.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.639 on 395 degrees of freedom
## Multiple R-squared: 0.6494, Adjusted R-squared: 0.6485
## F-statistic: 731.7 on 1 and 395 DF, p-value: < 2.2e-16
Findings:
-> log(weight) model: Good fit (R² = 0.7123) and strong statistical significance (p < 2e-16). The log transformation captures diminishing returns, meaning that a given increase in weight reduces mpg more for lighter cars than for heavier ones.
-> sqrt(weight) model: Slightly lower fit than the log model (R² = 0.7052). The square root transformation is useful but less effective than log(weight).
-> weight^2 model: Worst fit among the three (R² = 0.6494). The squared transformation exaggerates the effect of large weights, making the model less predictive.
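The three transformed fits can also be lined up side by side (a small sketch reusing the model objects above):
# Compare R-squared across the three weight transformations
sapply(list(log = model_log, sqrt = model_sqrt, square = model_square),
       function(m) summary(m)$r.squared)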
Problem # 10
This question should be answered using the Carseats data set.
# Load necessary libraries
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.4.2
# Load dataset
carseats <- read.csv("Carseats.csv", stringsAsFactors = TRUE)
# Convert categorical variables to factors
carseats$Urban <- as.factor(carseats$Urban)
carseats$US <- as.factor(carseats$US)
# Fit multiple regression model
model_a <- lm(Sales ~ Price + Urban + US, data = carseats)
# Display summary
summary(model_a)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Regression Model: Sales = 13.04 - 0.0545(Price) - 0.0219(UrbanYes) + 1.2057(USYes) + ε
Each coefficient represents the expected change in Sales given a one-unit change in the predictor while holding other variables constant.
Interpretation of Each Coefficient:
-> Intercept (β₀ = 13.04)
When Price = 0, Urban = No, and US = No, the expected Sales would be 13.04 units. Since a Price of 0 is unrealistic, this is just a theoretical baseline.
-> Price (β₁ = -0.0545, p < 2e-16 → Significant)
For each $1 increase in Price, Sales decrease by 0.0545 units. Since the p-value is very small, Price has a strong negative effect on Sales.
-> UrbanYes (β₂ = -0.0219, p = 0.936 → Not Significant)
Whether the store is in an urban area (Yes) does not significantly impact Sales. The high p-value (0.936) means we cannot conclude that being in an urban area affects Sales.
-> USYes (β₃ = 1.2057, p = 4.86e-06 → Significant)
Stores in the US tend to have 1.2057 more units in Sales than non-US stores. Since the p-value is very small, this effect is statistically significant.
Sales = β₀ + β₁(Price) + β₂(UrbanYes) + β₃(USYes) + ε
where: UrbanYes = 1 if the store is urban, else 0. USYes = 1 if the store is in the US, else 0.
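To confirm how R encodes the two qualitative predictors as 0/1 dummies (a sketch using the carseats data and the fitted model_a), the contrasts and the design matrix can be inspected directly:
# "Yes" is coded as 1; "No" is the baseline level (0)
contrasts(carseats$Urban)
contrasts(carseats$US)
# First few rows of the design matrix actually used in the regression
head(model.matrix(model_a))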
# Significant predictors
summary(model_a)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
-> Reject H0 (Significant Predictors):
Price (p < 2e-16) → Strong negative impact on Sales
US (p = 4.86e-06) → Positive impact on Sales
-> Fail to Reject H0 (Not Significant Predictors):
Urban (p = 0.936) → No significant impact on Sales
model_e <- lm(Sales ~ Price, data = carseats)
summary(model_e)
##
## Call:
## lm(formula = Sales ~ Price, data = carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5224 -1.8442 -0.1459 1.6503 7.5108
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.641915 0.632812 21.558 <2e-16 ***
## Price -0.053073 0.005354 -9.912 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.532 on 398 degrees of freedom
## Multiple R-squared: 0.198, Adjusted R-squared: 0.196
## F-statistic: 98.25 on 1 and 398 DF, p-value: < 2.2e-16
Reduced Model Equation: Sales=13.64 − 0.0531×Price + ε
-> Price remains significant (p < 2e-16), confirming its strong negative relationship with Sales.
-> Intercept (13.64): The expected Sales when Price = 0.
-> R-squared decreased: Multiple R² = 0.198 (previously 0.239 in the full model). Adjusted R² = 0.196 (previously 0.233). This suggests that the US variable contributed to explaining some variance in Sales.
-> F-statistic = 98.25, with a very low p-value (< 2.2e-16), confirming that Price is strongly associated with Sales.
# Compare models
summary(model_a)$adj.r.squared
## [1] 0.2335123
summary(model_e)$adj.r.squared
## [1] 0.195966
-> Full Model (model_a) fits the data better:
Adjusted R² = 0.2335 means 23.35% of the variance in Sales is explained by Price, Urban, and US. The inclusion of US improved the model, but Urban did not contribute much (as seen in Part d).
-> Reduced Model (model_e) explains less variance
Adjusted R² = 0.1960 means 19.6% of the variance in Sales is explained by Price alone. This confirms that removing US led to some loss in explanatory power.
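A more formal comparison (a sketch beyond what the problem asks for) is a nested-model F-test, since model_e is nested within model_a:
# Does adding Urban and US significantly improve on the Price-only model?
anova(model_e, model_a)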
confint(model_e, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 12.3978438 14.88598655
## Price -0.0635995 -0.04254653
-> Intercept (12.40 to 14.89):
We are 95% confident that the true intercept lies between 12.40 and 14.89. This means that when Price = 0, the expected Sales would be within this range (though Price = 0 is unrealistic).
-> Price (-0.0636 to -0.0425):
We are 95% confident that the effect of Price on Sales is between -0.0636 and -0.0425. Since the interval is entirely negative, this confirms that Price has a significant negative impact on Sales. A $1 increase in Price is expected to reduce Sales by approximately 0.0425 to 0.0636 units.
par(mfrow=c(2,2))
plot(model_e)
1. Residuals vs Fitted Plot:
This plot checks for non-linearity and heteroscedasticity. Residuals should be randomly scattered around 0. The plot looks fairly random, suggesting the model is reasonable. However, a few extreme residuals (e.g., point 377) might indicate outliers.
2. Normal Q-Q Plot:
This plot checks normality of residuals. Points should align along the diagonal line. The tails deviate, suggesting some non-normality. This might indicate outliers affecting the distribution.
3. Scale-Location Plot:
This plot checks for homoscedasticity (constant variance of residuals). The residuals seem evenly spread, meaning variance is mostly constant. A few points (e.g., 377) show higher residual variance, possibly reinforcing the presence of outliers.
4. Residuals vs Leverage Plot:
This plot identifies high-leverage points (points with extreme influence on the model). Points with high leverage appear on the right. Some points (e.g., 311, 377, 175) are close to the Cook’s distance lines, suggesting influential observations.
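To back up the visual read of the leverage plot (a sketch using base R's influence measures on model_e; the cutoffs below are common rules of thumb rather than values given in the problem), the flagged observations can be listed explicitly:
# Leverage and influence diagnostics for the Price-only model
lev <- hatvalues(model_e)        # leverage of each observation
cd  <- cooks.distance(model_e)   # Cook's distance of each observation
n <- nrow(model.matrix(model_e))
p <- length(coef(model_e)) - 1
# Flag observations exceeding the rule-of-thumb cutoffs
which(lev > 2 * (p + 1) / n)
which(cd > 4 / n)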
Problem # 12
This problem involves simple linear regression without an intercept.
For simple linear regression without an intercept, the coefficient from regressing Y onto X is β̂ = ∑xiyi / ∑xi^2, while the coefficient from regressing X onto Y is β̂′ = ∑xiyi / ∑yi^2. For the two coefficients to be equal, the following must hold: ∑xiyi / ∑xi^2 = ∑xiyi / ∑yi^2
Rearranging (assuming ∑xiyi ≠ 0): ∑xi^2 = ∑yi^2
This means that the sum of squares of X must be equal to the sum of squares of Y.
-> This occurs when the sum of squared values of X equals that of Y (for mean-centered data, this is the same as X and Y having equal variance).
-> If all points lie on a line through the origin (e.g., Y = ±X), the sums of squares of X and Y are identical, making the two regression coefficients equal.
# Set seed for reproducibility
set.seed(123)
# Generate X and Y with different variances
n <- 100
X <- rnorm(n, mean = 10, sd = 5)
Y <- 2 * X + rnorm(n, mean = 0, sd = 10) # Different variance
# Regression of Y on X (without intercept)
beta_y_on_x <- sum(X * Y) / sum(X^2)
# Regression of X on Y (without intercept)
beta_x_on_y <- sum(X * Y) / sum(Y^2)
# Print coefficients
cat("β (Y ~ X):", beta_y_on_x, "\n")
## β (Y ~ X): 1.896779
cat("β (X ~ Y):", beta_x_on_y, "\n")
## β (X ~ Y): 0.4402068
# Set seed for reproducibility
set.seed(123)
# Generate X
n <- 100
X <- rnorm(n, mean = 0, sd = 5)
# Set Y equal to X (ensuring equal variance)
Y <- X
# Regression of Y on X (without intercept)
beta_y_on_x <- sum(X * Y) / sum(X^2)
# Regression of X on Y (without intercept)
beta_x_on_y <- sum(X * Y) / sum(Y^2)
# Print coefficients
cat("β (Y ~ X):", beta_y_on_x, "\n")
## β (Y ~ X): 1
cat("β (X ~ Y):", beta_x_on_y, "\n")
## β (X ~ Y): 1
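As a cross-check (a sketch, not part of the problem), the same coefficients can be recovered from lm() with the intercept suppressed:
# Verify the hand-computed coefficients against lm() without an intercept
coef(lm(Y ~ X + 0))   # regression of Y on X through the origin
coef(lm(X ~ Y + 0))   # regression of X on Y through the origin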