This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
Answer:
KNN (k-nearest neighbors) is a non-parametric supervised learning algorithm used for both classification and regression. It learns the mapping between input features and target values from labeled training data, and it predicts a new observation's target from the targets of its nearest training points.
KNN Classification: assigns a class label to a data point by majority vote among its k nearest neighbors, with proximity usually measured by Euclidean distance. It is effective when the response is a discrete class label, e.g., image recognition or text classification.
KNN Regression: predicts a continuous value for a data point by averaging (optionally distance-weighting) the target values of its k nearest neighbors. It is suitable when the outcome is a continuous variable, e.g., stock prices, temperature, or demand forecasting.
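A minimal sketch of both variants in R (assuming the class and FNN packages are installed; the iris split below is purely illustrative):
set.seed(1)
train_idx <- sample(nrow(iris), 100)
train_x <- iris[train_idx, 1:4]
test_x <- iris[-train_idx, 1:4]
# Classification: majority vote among the k = 5 nearest neighbors
pred_class <- class::knn(train_x, test_x, cl = iris$Species[train_idx], k = 5)
table(pred_class, iris$Species[-train_idx])
# Regression: average the neighbors' responses (predicting Sepal.Length from the rest)
pred_reg <- FNN::knn.reg(train = train_x[, -1], test = test_x[, -1], y = train_x$Sepal.Length, k = 5)
head(pred_reg$pred)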
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.3.2
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(ggplot2)
data("Auto", package = "ISLR") # auto dataset in the 'ISLR' package.
pairs(Auto[,1:9])
Interpretation: From the scatterplot matrix, horsepower and weight, and displacement and weight, appear highly correlated.
cor(Auto[, !colnames(Auto) %in% c("name")]) # drop the categorical variable 'name'.
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
Auto.lm = lm(mpg ~ . -name, data=Auto)
summary(Auto.lm)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Answer: There exists a significant relationship between the predictors (all variables except 'name') and the response (mpg), as shown by testing the null hypothesis that all regression coefficients are zero. The F-statistic (252.4), with a very small p-value (< 2.2e-16), provides strong evidence against the null hypothesis, indicating the model has substantial predictive power.
Answer: The response (mpg) has a statistically significant relationship with the predictors weight, displacement, origin, and year.
Answer: The coefficient for the year variable is 0.750773 (≈ 0.751), suggesting that, holding the other predictors fixed, mpg improves by about 0.75 per model year.
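As a quick check, the estimate and its 95% confidence interval can be read directly off the fitted model (a minimal sketch using Auto.lm from above):
coef(Auto.lm)["year"] # point estimate for year
confint(Auto.lm, "year") # 95% confidence interval for the year coefficient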
par(mfrow=c(2,2))
plot(Auto.lm)
Interpretation: The Q-Q plot suggests the residuals are approximately normally distributed. The Residuals vs Fitted plot shows mild non-linearity, the standardized residuals flag a few potential outliers (beyond ±2), and there is one high-leverage point (observation 14).
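These claims can be checked numerically with base R diagnostics (a minimal sketch on Auto.lm; |standardized residual| > 2 flags potential outliers, and a large hat value flags high leverage):
std_res <- rstandard(Auto.lm)
which(abs(std_res) > 2) # observations with large standardized residuals
lev <- hatvalues(Auto.lm)
which.max(lev) # the highest-leverage observation (the plot above flags observation 14)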
Auto.lm2 <- lm(mpg ~ origin * displacement + weight * displacement, data = Auto[, 1:8])
summary(Auto.lm2)
##
## Call:
## lm(formula = mpg ~ origin * displacement + weight * displacement,
## data = Auto[, 1:8])
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.2593 -2.5453 -0.2765 1.8728 17.9900
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.543e+01 3.564e+00 15.553 < 2e-16 ***
## origin -1.095e+00 1.248e+00 -0.878 0.381
## displacement -9.094e-02 1.925e-02 -4.724 3.25e-06 ***
## weight -9.328e-03 1.012e-03 -9.219 < 2e-16 ***
## origin:displacement 1.231e-02 1.065e-02 1.156 0.249
## displacement:weight 1.823e-05 3.323e-06 5.487 7.43e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.097 on 386 degrees of freedom
## Multiple R-squared: 0.7279, Adjusted R-squared: 0.7244
## F-statistic: 206.6 on 5 and 386 DF, p-value: < 2.2e-16
Interpretation: From the summary above, the displacement:weight interaction is statistically significant, whereas origin:displacement is not.
Auto.lm3 = lm(mpg ~ . + displacement:origin + displacement:weight,
              data = Auto[, 1:8])
summary(Auto.lm3)
##
## Call:
## lm(formula = mpg ~ . + displacement:origin + displacement:weight,
##     data = Auto[, 1:8])
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8777 -1.8268 -0.0751 1.6038 12.5867
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.552e+00 4.783e+00 -0.325 0.74570
## cylinders 1.636e-01 2.945e-01 0.556 0.57878
## displacement -8.858e-02 1.568e-02 -5.650 3.15e-08 ***
## horsepower -3.544e-02 1.243e-02 -2.851 0.00459 **
## weight -1.139e-02 8.226e-04 -13.846 < 2e-16 ***
## acceleration 8.832e-02 8.856e-02 0.997 0.31922
## year 7.773e-01 4.560e-02 17.045 < 2e-16 ***
## origin -1.056e+00 9.310e-01 -1.134 0.25742
## displacement:origin 1.454e-02 8.038e-03 1.810 0.07114 .
## displacement:weight 2.489e-05 2.558e-06 9.733 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.955 on 382 degrees of freedom
## Multiple R-squared: 0.86, Adjusted R-squared: 0.8567
## F-statistic: 260.6 on 9 and 382 DF, p-value: < 2.2e-16
Interpretation: From the summary above, displacement:weight is statistically significant, while displacement:origin is not significant at the 5% level (p = 0.071).
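Because Auto.lm (no interactions) is nested within Auto.lm3, a partial F-test can confirm whether the interaction terms jointly improve the fit (a sketch using the two fits above):
anova(Auto.lm, Auto.lm3) # joint F-test for the two interaction terms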
Auto.lm4 <- lm(mpg ~ log(displacement) + sqrt(acceleration) + I(horsepower^2), data = Auto[, 1:8])
summary(Auto.lm4)
##
## Call:
## lm(formula = mpg ~ log(displacement) + sqrt(acceleration) + I(horsepower^2),
## data = Auto[, 1:8])
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5830 -2.5057 -0.6266 2.1606 18.6956
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.383e+01 4.632e+00 18.100 < 2e-16 ***
## log(displacement) -1.049e+01 6.915e-01 -15.169 < 2e-16 ***
## sqrt(acceleration) -1.239e+00 8.566e-01 -1.446 0.14897
## I(horsepower^2) -1.398e-04 4.409e-05 -3.170 0.00164 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.332 on 388 degrees of freedom
## Multiple R-squared: 0.6943, Adjusted R-squared: 0.6919
## F-statistic: 293.8 on 3 and 388 DF, p-value: < 2.2e-16
Interpretation: sqrt(acceleration) is not significant, just as acceleration was not in the earlier model; I(horsepower^2) is significant even though horsepower was not; and log(displacement) remains significant, as displacement was.
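One rough way to weigh the transformed model against the original full fit is an information criterion such as AIC, where lower is better (a minimal sketch; both models are fit to the same 392 observations with mpg as the response):
AIC(Auto.lm, Auto.lm4) # compare the full linear model and the transformed model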
data("Carseats", package = "ISLR")
Carseats.lm <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(Carseats.lm)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Interpretation: There exists a significant relationship between the predictors (Price, Urban, and US) and the response (Sales), as shown by testing the null hypothesis that all regression coefficients are zero. The F-statistic (41.52), with a very small p-value (< 2.2e-16), provides strong evidence against the null hypothesis, indicating the model has substantial predictive power.
Answer:
From the above summary, the coefficient for Price is -0.054459, indicating that for every one-unit increase in price, sales are expected to decrease by 0.054459 units on average.
For the Urban variable, the coefficient is -0.021916, suggesting that stores in urban areas have average sales 0.021916 units lower than stores in rural areas.
For the US variable, the coefficient is 1.200573, indicating that stores in the US have average sales 1.200573 units higher than stores outside the US.
Answer:
Sales = 13.043469 − 0.054459 × Price − 0.021916 × UrbanYes + 1.200573 × USYes
where UrbanYes and USYes are indicator variables equal to 1 when the store is in an urban area / in the US, and 0 otherwise.
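The equation can be sanity-checked against predict(); for example, for a hypothetical store with Price = 120 in an urban US location (values chosen purely for illustration):
newstore <- data.frame(Price = 120, Urban = "Yes", US = "Yes")
predict(Carseats.lm, newdata = newstore)
# by hand: 13.043469 - 0.054459*120 - 0.021916*1 + 1.200573*1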
Answer: Based on the p-values, the null hypothesis H0: βj = 0 can be rejected for Price and US.
Carseats.lm2 <- lm(Sales ~ Price + US, data = Carseats)
summary(Carseats.lm2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
Interpretation: There exists a significant relationship between the predictors (Price and US) and the response (Sales), as shown by testing the null hypothesis that all regression coefficients are zero. The F-statistic (62.43), with a very small p-value (< 2.2e-16), provides strong evidence against the null hypothesis.
Answer: Models (a) and (e) each explain only about 23% of the variance in the response (Sales); with such a low R-squared, neither model fits the data particularly well.
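Since model (e) is model (a) with Urban removed, the two can also be compared directly with a partial F-test (a sketch using the fits above; a large p-value would confirm that Urban adds nothing):
anova(Carseats.lm2, Carseats.lm) # does adding Urban improve the fit?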
confint(Carseats.lm2)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
par(mfrow=c(2,2))
plot(Carseats.lm2)
Interpretation: There is no evidence of outliers or high leverage observations in the model from (e).
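This can be verified numerically; a common rule of thumb flags hat values above 2(p + 1)/n and standardized residuals beyond ±3 (a minimal sketch for the model from (e)):
lev2 <- hatvalues(Carseats.lm2)
p <- length(coef(Carseats.lm2)) - 1 # number of predictors
n <- nrow(Carseats)
sum(lev2 > 2 * (p + 1) / n) # observations above the leverage threshold
sum(abs(rstandard(Carseats.lm2)) > 3) # extreme standardized residuals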
Answer: The coefficient estimate is the same in both regressions only when the sums of squares of X and Y are equal. For regression through the origin, the slope of Y onto X is ∑ x_i y_i / ∑ x_i^2 and the slope of X onto Y is ∑ x_i y_i / ∑ y_i^2; the numerators are identical, so the two estimates coincide exactly when
∑ x^2 = ∑ y^2
A perfect one-to-one relationship (Y = X) is one special case of this condition.
set.seed(123)
n <- 100
X <- rnorm(n)
Y <- 2*X + rnorm(n)
lm_XY <- lm(X ~ Y)
coef_XY <- coef(lm_XY)[2]
coef_XY
## Y
## 0.3964576
lm_YX <- lm(Y ~ X)
coef_YX <- coef(lm_YX)[2]
coef_YX
## X
## 1.947528
Interpretation:
For the regression of X onto Y (lm(X ~ Y)), the coefficient estimate for Y is approximately 0.3965: on average, a one-unit increase in Y is associated with an increase of about 0.3965 in X.
For the regression of Y onto X (lm(Y ~ X)), the coefficient estimate for X is approximately 1.9475: on average, a one-unit increase in X is associated with an increase of about 1.9475 in Y. The two estimates differ because the sums of squares of X and Y differ.
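The two slopes are linked: with an intercept, slope(Y ~ X) = r·sd(Y)/sd(X) and slope(X ~ Y) = r·sd(X)/sd(Y), so their product equals the squared correlation. A quick check with the simulated data above:
coef_YX * coef_XY # product of the two slope estimates ...
cor(X, Y)^2 # ... equals the squared correlation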
set.seed(100)
n2 <- 100
X2 <- rnorm(n2)
Y2 <- X2
lm_XY2 <- lm(X2 ~ Y2)
coef_XY2 <- coef(lm_XY2)[2]
coef_XY2
## Y2
## 1
lm_YX2 <- lm(Y2 ~ X2)
coef_YX2 <- coef(lm_YX2)[2]
coef_YX2
## X2
## 1
Interpretation:
Both slope estimates equal 1. Since Y2 = X2, the sums of squares are equal (∑ x^2 = ∑ y^2), so the regression of Y2 onto X2 and the regression of X2 onto Y2 return the same coefficient.
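A one-to-one relationship is not actually required: any Y whose sum of squares matches X's gives equal slope estimates. A sketch using a random permutation of X, which preserves ∑ y^2 = ∑ x^2 while breaking the pairing (X3 and Y3 are illustrative names):
set.seed(42)
X3 <- rnorm(100)
Y3 <- sample(X3) # same values, reshuffled: sum(Y3^2) equals sum(X3^2)
coef(lm(Y3 ~ X3))[2]
coef(lm(X3 ~ Y3))[2] # the two slope estimates are identical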