Q(1)

The following questions are about the KNN classifier and KNN regression methods.

(a)

Explain the differences between the KNN classifier and KNN regression methods.

Solution: The KNN classifier is used when the response variable is categorical, whereas KNN regression is used when the response variable is continuous. For example, to predict “cat” or “dog” we use the KNN classifier, but to predict house prices we use KNN regression. The KNN classifier predicts the class by majority vote among the k nearest neighbors, whereas the KNN regressor predicts a continuous value, such as a house price, by averaging the responses of the k nearest neighbors.


(b)

Suppose that the observed x is (0.1, 0.5, 0.55, 0.2, 0.65), and the observed \(y_1\) is (1, 5.5, 6.1, 2, 7.5). What is the predicted KNN value for \(x_0\) = 0.4 with k=1, k=3, and k=5 respectively?

Solution:

#For interactive data table
library(DT)

x <- c(0.1, 0.5, 0.55, 0.2, 0.65)
y1 <- c(1,5.5, 6.1, 2,7.5)

data_frame <- data.frame(x, y1)
datatable(data_frame)

Calculating distance between \(x_0\) and x:

distance <- function(x0,x){
  d <- abs(x0-x)
  return(d)
}
#Calculate the distance between x0 and x
x0<- 0.4
x<-c(0.1, 0.5, 0.55, 0.2, 0.65)
Distance_of_x_from_x0 <- distance(x0,x)

#Combine the distance with the existing data frame
new_data_frame <- cbind(Distance_of_x_from_x0,data_frame)

#Sort the data frame by Distance_of_x_from_x0
df_sorted <-new_data_frame[order(new_data_frame$Distance_of_x_from_x0),]

#Data table
library(DT)
datatable(df_sorted)

For k = 1

The closest neighbor of \(x_0\) = 0.4 is x = 0.5, which corresponds to y = 5.5. Therefore, the predicted value of y is 5.5.

For k = 3

The three closest neighbors of \(x_0\) are x = 0.5, 0.55, and 0.2, which correspond to y = 5.5, 6.1, and 2. The predicted y value is their mean:

# Predicted y value
y_predicted <- mean(c(5.5, 6.1, 2))

# Print the result
cat("Predicted y value for x0 = 0.4 when k = 3 is:", y_predicted, "\n")
## Predicted y value for x0 = 0.4 when k = 3 is: 4.533333

For k = 5

The five closest neighbors are all of the x values, 0.5, 0.55, 0.2, 0.65, and 0.1, which correspond to the y values 5.5, 6.1, 2, 7.5, and 1. The predicted y value is their mean:

y_predicted <- mean(c(5.5,6.1,2,7.5,1))
cat("Predicted y value for x0 = 0.4 when k = 5 is:", y_predicted, "\n")
## Predicted y value for x0 = 0.4 when k = 5 is: 4.42
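
The three hand calculations above follow one pattern: rank the observations by their distance to \(x_0\) and average the responses of the first k. A minimal sketch of that pattern, assuming a single numeric predictor; the helper name `knn_reg_1d` is hypothetical and not part of any package:

```r
# Hypothetical helper: KNN regression for one numeric predictor
knn_reg_1d <- function(x, y, x0, k) {
  nearest <- order(abs(x - x0))[seq_len(k)]  # indices of the k closest x values
  mean(y[nearest])                           # average their responses
}

x  <- c(0.1, 0.5, 0.55, 0.2, 0.65)
y1 <- c(1, 5.5, 6.1, 2, 7.5)

# Predictions at x0 = 0.4 for k = 1, 3, 5
preds <- sapply(c(1, 3, 5), function(k) knn_reg_1d(x, y1, x0 = 0.4, k = k))
print(preds)
```

The three values should agree with the predictions 5.5, 4.533333, and 4.42 obtained above.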

(c)

Suppose that the observed x is (0.1, 0.5, 0.55, 0.2, 0.65), and the observed \(y_2\) is (Disease, No-Disease, Disease, No-Disease, Disease). What is the predicted KNN class for \(x_0\) = 0.4 with k=1, k=3, and k=5 respectively?

Solution:

x <- c(0.1, 0.5, 0.55, 0.2, 0.65)
y2 <- c("Disease", "No-Disease", "Disease", "No-Disease", "Disease")
Distance_of_x_from_x0 <- distance(x0,x)

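#Build the data frame directly so that the distances stay numeric
#(cbind() would coerce every column to character because y2 is character)
Data_Frame <- data.frame(Distance_of_x_from_x0, x, y2)

#Sort the data frame by its own Distance_of_x_from_x0 column
df_sorted <- Data_Frame[order(Data_Frame$Distance_of_x_from_x0),]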

#data table
library(DT)
datatable(df_sorted)

For k = 1

The nearest neighbor of \(x_0\) = 0.4 is x=0.5 which corresponds to \(y_2\) = No-Disease. So, the predicted response is No-Disease.

For k=3

The three nearest neighbors of \(x_0\) = 0.4 are x = 0.5, 0.55, and 0.2, which correspond to the responses No-Disease, Disease, and No-Disease respectively. Here, the majority is No-Disease. So, the predicted response in this case is No-Disease.

For k=5

The five nearest neighbors of \(x_0\) are x = 0.5, 0.55, 0.2, 0.65, and 0.1, which correspond to the responses No-Disease, Disease, No-Disease, Disease, and Disease. Here the majority response is Disease (3 votes to 2). So, the predicted response is Disease.
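
The majority-vote rule used above can be sketched in the same way as the regression case. This is an illustration only; the helper name `knn_class_1d` is hypothetical, and its tie-breaking rule (the alphabetically first class wins) is an assumption that does not matter here because k is odd and there are two classes:

```r
# Hypothetical helper: KNN classification for one numeric predictor
knn_class_1d <- function(x, y, x0, k) {
  nearest <- order(abs(x - x0))[seq_len(k)]  # indices of the k closest x values
  votes   <- table(y[nearest])               # count each class among them
  names(votes)[which.max(votes)]             # most frequent class; ties go to the
                                             # alphabetically first class
}

x  <- c(0.1, 0.5, 0.55, 0.2, 0.65)
y2 <- c("Disease", "No-Disease", "Disease", "No-Disease", "Disease")

# Predicted classes at x0 = 0.4 for k = 1, 3, 5
classes <- sapply(c(1, 3, 5), function(k) knn_class_1d(x, y2, x0 = 0.4, k = k))
print(classes)
```

The result should match the three answers above: No-Disease, No-Disease, Disease.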


Q(2)

The MASS library contains the Boston data set, which records medv (median house value) for 506 neighborhoods around Boston. In this problem, we seek to predict medv using the variable lstat (percent of households with low economic status).

(a) Create a training and test data set using approximately 75% of the Boston data set for training and the remaining data for testing.

Solution:

library(MASS)
library(DT)
data <- Boston
datatable(head(data))
#Data Cleaning
#Check for missing values
sum(is.na(data))
## [1] 0
#Check the duplicate rows
sum(duplicated(data))
## [1] 0

There are no missing values and no duplicate rows. So, the data is clean and ready for analysis.

#Extract only two columns lstat and medv
data_subset <- data[,c("lstat","medv")]
datatable(head(data_subset))

Train Test split

Training data:

set.seed(1)

#First sample 75% indices
#0.75*506 = 379.5 is not a whole number, so round down explicitly to 379 rows
training_indices <- sample(1:nrow(data),size = floor(0.75*nrow(data)))

#Get the training data
training_data <- data_subset[training_indices,]
datatable(head(training_data))

Testing data:

#Test data
#We exclude the training rows by using minus sign which extracts rest 25% rows for testing
test_data <- data_subset[-training_indices,]
datatable(head(test_data))

(b) Fit a simple linear regression model to predict medv using lstat using the training data. With the test data, calculate the MSE and RMSE of this model.

model <- lm(medv ~ lstat,data = training_data)
prediction <-predict(model,newdata=test_data)
residual <- test_data$medv - prediction
residual_square <- residual**2
MSE <- mean(residual_square)
RMSE <-sqrt(MSE)

print(c(MSE = MSE,RMSE = RMSE))
##       MSE      RMSE 
## 31.080750  5.575011
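
The two error measures computed above can be written out explicitly. With \(n\) test observations, observed values \(y_i\), and model predictions \(\hat{y}_i\) (notation introduced here for illustration):

\[
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}
\]

With the values reported above, \(\sqrt{31.080750} \approx 5.575\). The RMSE is on the same scale as medv, which makes it easier to interpret than the MSE.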

(c)

Use the knn.reg() function from the FNN library to fit a KNN regression model to predict medv using lstat. Consider possible values of K by sequencing values from 1 to 45 by 2. Calculate the MSE and RMSE of these model candidates for both the testing and training data. What is the most optimal value of K?

Solution:

First Part:

library(FNN)
#Extract lstat column of training data and change into matrix
x_train <- as.matrix(training_data$lstat)

#Extract lstat column from test data set and change into matrix
x_test <- as.matrix(test_data$lstat)

#Extract medv column from training data,but do not convert into a matrix
y_train <- training_data$medv

#Extract medv column from test_data,but do not convert to a matrix
y_test <- test_data$medv

#Fit a KNN regression model
knn_model <- knn.reg(train = x_train,test = x_test,y = y_train, k = 1)
knn_model$pred
##   [1] 29.4 25.0 17.1 12.7 23.7 15.1 13.5 21.7 35.2 19.5 20.0 20.9 18.2 19.9 27.5
##  [16] 19.9 16.6 25.3 24.8 27.9 18.2 21.4 28.4 19.8 19.6 19.9 23.0 18.5 32.0 23.0
##  [31] 19.6 24.5 21.5 15.6 13.5 13.5 17.2 13.2 18.3 17.5 35.4 23.0 16.0 23.9 36.1
##  [46] 27.5 28.7 18.7 27.9 29.8 43.5  5.0 25.0 19.3 43.5 23.7 25.0 22.6 23.3 50.0
##  [61] 17.1 19.6 43.5 36.1 23.9 23.9 22.2 18.5 50.0 44.8 16.5 23.4 19.3 22.6 22.9
##  [76] 19.6 14.3 22.4 29.1 23.3 24.4 28.7 24.8 25.0 19.6 22.6 22.9 22.8 23.1 36.0
##  [91] 22.8 28.0 50.0 11.3 21.7 16.2 10.2 14.4 17.8 14.4 11.7 23.7  7.2 13.4  8.3
## [106] 23.2 13.4 12.3 21.7 17.9 12.7 16.4 15.6 15.6 23.1 24.4 13.3 14.1 24.3  9.6
## [121] 18.5 23.7 16.1 10.9 20.1 18.9 21.6

Remaining Part:

K_values <- seq(1, 45, 2)

MSE <- sapply(K_values, function(k) {
  mean((y_test - knn.reg(train = x_train, test = x_test, y = y_train, k=k)$pred)^2)
})

print(c(MSE = MSE))
##     MSE1     MSE2     MSE3     MSE4     MSE5     MSE6     MSE7     MSE8 
## 47.13850 33.19976 29.76716 27.43274 27.21124 25.06878 24.81589 24.27465 
##     MSE9    MSE10    MSE11    MSE12    MSE13    MSE14    MSE15    MSE16 
## 24.02081 24.31930 24.34094 24.28029 24.23051 24.31243 24.17224 24.10699 
##    MSE17    MSE18    MSE19    MSE20    MSE21    MSE22    MSE23 
## 23.95873 24.11006 23.98409 23.96527 24.12082 24.50094 24.60231
print(c(RMSE = sqrt(MSE)))
##    RMSE1    RMSE2    RMSE3    RMSE4    RMSE5    RMSE6    RMSE7    RMSE8 
## 6.865749 5.761923 5.455929 5.237628 5.216439 5.006873 4.981555 4.926931 
##    RMSE9   RMSE10   RMSE11   RMSE12   RMSE13   RMSE14   RMSE15   RMSE16 
## 4.901103 4.931461 4.933654 4.927503 4.922450 4.930763 4.916528 4.909886 
##   RMSE17   RMSE18   RMSE19   RMSE20   RMSE21   RMSE22   RMSE23 
## 4.894765 4.910200 4.897356 4.895434 4.911295 4.949843 4.960072
optimal_K <- K_values[which.min(MSE)]

print(c(optimal_K=optimal_K))
## optimal_K 
##        33
# Fit KNN with optimal K and compute MSE & RMSE
knn_model <- knn.reg(train=x_train, test=x_test,y = y_train, k=optimal_K)
MSE_KNN <- mean((y_test - knn_model$pred)^2)

# Print results
print(c(Optimal_K = optimal_K, MSE = MSE_KNN, RMSE = sqrt(MSE_KNN)))
## Optimal_K       MSE      RMSE 
## 33.000000 23.958727  4.894765
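
The question also asks for the training error, which the code above does not report. The following is a rough sketch of how training and test MSE/RMSE could be tabulated side by side. To stay self-contained it uses a brute-force base-R helper (`knn_predict`, a hypothetical name) instead of `knn.reg()`, so tie handling may differ and the numbers need not match the output above exactly. It also assumes the same seed and a 379-row training set.

```r
library(MASS)  # provides the Boston data set

# Hypothetical helper (not part of FNN): brute-force KNN regression, one predictor
knn_predict <- function(x_train, y_train, x_new, k) {
  sapply(x_new, function(x0) {
    nearest <- order(abs(x_train - x0))[seq_len(k)]  # k closest training points
    mean(y_train[nearest])                           # average their responses
  })
}

# Recreate a 75/25 split (assumed to mirror the split used above)
set.seed(1)
idx   <- sample(seq_len(nrow(Boston)), size = floor(0.75 * nrow(Boston)))
train <- Boston[idx,  c("lstat", "medv")]
test  <- Boston[-idx, c("lstat", "medv")]

K_values <- seq(1, 45, 2)

# One row per K: error on the training data and on the test data
errors <- t(sapply(K_values, function(k) {
  train_mse <- mean((train$medv - knn_predict(train$lstat, train$medv, train$lstat, k))^2)
  test_mse  <- mean((test$medv  - knn_predict(train$lstat, train$medv, test$lstat,  k))^2)
  c(K = k, train_MSE = train_mse, test_MSE = test_mse,
    train_RMSE = sqrt(train_mse), test_RMSE = sqrt(test_mse))
}))
errors <- as.data.frame(errors)
print(errors)
```

Training error is typically smallest at K = 1, because each training point is then its own nearest neighbor, which is why the test error, not the training error, is used to choose K.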

(d)

Do you think the simple linear regression model or the KNN regression model (with the most optimal value of K) is a better choice for predicting medv using lstat? Explain your reasoning.

Solution:

For KNN Regression:

MSE = 23.958727, RMSE = 4.894765

For simple linear regression:

MSE = 31.080750, RMSE = 5.575011

By comparison, the MSE and RMSE of the KNN regression model with the optimal value of K are smaller than those of the simple linear regression. So, the KNN regression model is the better choice for the given data. One caveat: K was chosen using the same test set, so the KNN test error is likely somewhat optimistic.


Q(3)

This question involves the use of multiple linear regression on the Auto data set from the ISLR library.

(a)

Produce a scatterplot matrix which includes the following variables: mpg, displacement, horsepower, weight, acceleration.

library(ISLR)
library(plotly)
library(GGally)
library(DT)

# Load data
data <- Auto

Data Cleaning:

#Data cleaning:Check missing values and duplicate rows
sum(is.na(data))
## [1] 0
sum(duplicated(data))
## [1] 0

There are no missing values and no duplicate rows. Data is clean and ready for analysis.

# Display first few rows in an interactive table
datatable(head(data))
# Create an interactive scatterplot matrix
plot_matrix <- ggpairs(data, columns = c("mpg", "displacement", "horsepower", "weight", "acceleration"),aes(col=I("blue")))

# Convert to plotly interactive plot
ggplotly(plot_matrix)

Association with the response variable: displacement, horsepower, and weight are highly correlated with mpg.

Multicollinearity: The predictors are also strongly correlated with one another, which indicates multicollinearity.


(b)

Compute the matrix of correlations between the variables in part (a) using the function cor().

#Create correlation matrix excluding the categorical variable
correlation_matrix <-cor(data[,c("mpg","displacement","horsepower","weight","acceleration")])
datatable(round(correlation_matrix,1))

(c)

Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables from part (a) as predictors. Use the summary() function to print the results and comment on the output. For instance:

model<- lm(mpg ~ displacement+horsepower+ weight + acceleration, data=data)
summary(model)
## 
## Call:
## lm(formula = mpg ~ displacement + horsepower + weight + acceleration, 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.378  -2.793  -0.333   2.193  16.256 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  45.2511397  2.4560447  18.424  < 2e-16 ***
## displacement -0.0060009  0.0067093  -0.894  0.37166    
## horsepower   -0.0436077  0.0165735  -2.631  0.00885 ** 
## weight       -0.0052805  0.0008109  -6.512  2.3e-10 ***
## acceleration -0.0231480  0.1256012  -0.184  0.85388    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.247 on 387 degrees of freedom
## Multiple R-squared:  0.707,  Adjusted R-squared:  0.704 
## F-statistic: 233.4 on 4 and 387 DF,  p-value: < 2.2e-16

(i)

Is there a relationship between the predictors and the response?

Solution:

The F-statistic of 233.4 on 4 and 387 degrees of freedom has a p-value below 2.2e-16, so at least one of the predictors is related to mpg. Among the individual predictors, horsepower and weight are significant: as horsepower and weight increase, mileage (mpg) decreases. Displacement and acceleration are not significant in this model.
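
Whether any predictor at all is related to the response is usually judged with the overall F-test on the last line of the summary. A small sketch that recomputes its p-value from the printed statistic (233.4 on 4 and 387 degrees of freedom) using base R's `pf()`; the variable names are illustrative:

```r
# Values copied from the summary() output above
f_stat <- 233.4
df_num <- 4      # number of predictors
df_den <- 387    # residual degrees of freedom

# Upper-tail probability of the F distribution = p-value of the overall F-test
p_value <- pf(f_stat, df_num, df_den, lower.tail = FALSE)
print(p_value)
```

The result is far below the `< 2.2e-16` that `summary()` prints, which is simply the smallest value R displays there.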

(ii)

Which predictors appear to be statistically significant?

Solution: “weight” and “horsepower” are statistically significant (p = 2.3e-10 and p = 0.00885, respectively).


(d)

Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

par(mfrow=c(2,2))
plot(model,lwd=3,col="green")

The residual plot (top left) shows the residuals against the fitted values. In a well-fitted model, the residuals are scattered randomly around zero with no clear pattern. Here they are not: the red line is curved rather than flat, which indicates that the relationship is not linear. In addition, the residuals spread out more toward the right of the plot, which suggests that the assumption of constant variance is violated. The curvature also means the errors do not average to zero across the range of fitted values, contrary to \(E(\epsilon) = 0\). Overall, the plot indicates that the model does not fully capture the relationship between the predictors and the response.

In the residual plot (top left), a few points (323, 387, and 327) are labeled and may be worth a closer look, but there are no distinct outliers. Likewise, in the Residuals vs Leverage plot (bottom right), no point lies outside the Cook’s distance threshold, i.e. outside the dashed curve. So, there is no evidence of a high-leverage point that unduly influences the fit.


(e)

Fit a smaller model using the predictors that appear to have a statistically significant relationship to mpg and include an interaction between those predictors. Do any interactions appear to be statistically significant? Use the summary() function to print the results and comment on the output.

Solution:

new_model <- lm(mpg ~ horsepower + weight + horsepower : weight, data = data)

The following is equivalent to the above model.

new_model2 <- lm(mpg ~ horsepower * weight,data = data)
summary(new_model)
## 
## Call:
## lm(formula = mpg ~ horsepower + weight + horsepower:weight, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.7725  -2.2074  -0.2708   1.9973  14.7314 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        6.356e+01  2.343e+00  27.127  < 2e-16 ***
## horsepower        -2.508e-01  2.728e-02  -9.195  < 2e-16 ***
## weight            -1.077e-02  7.738e-04 -13.921  < 2e-16 ***
## horsepower:weight  5.355e-05  6.649e-06   8.054 9.93e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.93 on 388 degrees of freedom
## Multiple R-squared:  0.7484, Adjusted R-squared:  0.7465 
## F-statistic: 384.8 on 3 and 388 DF,  p-value: < 2.2e-16

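From the summary of the model, we see that horsepower, weight, and their interaction are all significant. The interaction term has a p-value of 9.93e-15, and the adjusted \(R^2\) rises from 0.704 in the model of part (c) to 0.7465.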
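
With an interaction in the model, the effect of horsepower on mpg is no longer a single number. Using the (rounded) estimates printed above, the fitted change in mpg per unit of horsepower at a given weight is:

\[
\frac{\partial\,\widehat{\text{mpg}}}{\partial\,\text{horsepower}} = -0.2508 + 0.00005355 \times \text{weight}
\]

For example, at an illustrative weight of 3000 (a value chosen here, not taken from the analysis), this gives \(-0.2508 + 0.16065 = -0.09015\). The positive interaction coefficient means the negative effect of horsepower becomes weaker as weight increases.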


Page 120 (Textbook)

Q(3)

Suppose we have a data set with five predictors, \(X_1\) = GPA, \(X_2\) = IQ, \(X_3\) = Gender (1 for Female and 0 for Male), \(X_4\) = Interaction between GPA and IQ, and \(X_5\) = Interaction between GPA and Gender. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model, and get \(\hat{\beta}_0\) = 50, \(\hat{\beta}_1\) = 20, \(\hat{\beta}_2\) = 0.07, \(\hat{\beta}_3\) = 35, \(\hat{\beta}_4\) = 0.01, \(\hat{\beta}_5\) = −10.

(a)

Which answer is correct, and why?

Solution:

Please see the handwritten solution for this question.
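
As a typed complement, the fitted model implied by the coefficients above can be written out and the two genders compared at fixed GPA and IQ. This is a sketch derived only from the stated estimates:

\[
\widehat{\text{Salary}} = 50 + 20\,\text{GPA} + 0.07\,\text{IQ} + 35\,\text{Gender} + 0.01\,(\text{GPA}\times\text{IQ}) - 10\,(\text{GPA}\times\text{Gender})
\]

\[
\widehat{\text{Salary}}_{\text{Female}} - \widehat{\text{Salary}}_{\text{Male}} = 35 - 10\,\text{GPA}
\]

The difference is positive when GPA is below 3.5 and negative when GPA is above 3.5. So, for fixed IQ and GPA, males earn more on average than females once GPA is high enough.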


(b) and (c)

Please see the handwritten solution.


Q(8)

This question involves the use of simple linear regression on the Auto data set.

(a)

Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results. Comment on the output. For example:

(i)

Is there a relationship between the predictor and the response?

Solution:

Model <- lm(mpg ~ horsepower,data = data)
summary(Model)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

Result:

The summary of the model shows that horsepower is a significant predictor, so there is a relationship between horsepower and mpg. The negative sign of the coefficient indicates a negative relationship, i.e. vehicles with high horsepower have lower fuel efficiency. More precisely, for every additional unit of horsepower, mpg is expected to decrease by approximately 0.157845 units, on average.


(ii)

How strong is the relationship between the predictor and the response?

Solution:

The relationship is moderately strong. The \(R^2\) of the model is 0.6059, so about 60.6% of the variability in mpg is explained by horsepower (the adjusted \(R^2\) is 0.6049). The residual standard error is 4.906 mpg.


(iii)

Is the relationship between the predictor and the response positive or negative?

Solution:

The negative sign of the coefficient indicates that the relationship is negative. The summary of the model suggests that for every unit increase in horsepower, mpg decreases by 0.157845 units.


(iv)

What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals?

Solution:

Predicted mpg associated with a horsepower of 98:

new_data <- data.frame(horsepower = 98)
predicted_mpg <- predict(Model,newdata = new_data)
cat("Therefore, the predicted mpg is:", predicted_mpg, "\n")
## Therefore, the predicted mpg is: 24.46708
#confidence interval
C.I <- predict(Model,newdata = new_data,interval="confidence")
C.I<- c(C.I[2],C.I[3])
C.I
## [1] 23.97308 24.96108
#prediction interval
P.I <- predict(Model,newdata=new_data,interval="prediction")
P.I <- c(P.I[2],P.I[3])
P.I
## [1] 14.80940 34.12476
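
Both intervals are centered on the same prediction; they differ only in the standard error. In the standard simple linear regression formulas (notation introduced here: \(s\) is the residual standard error, \(n\) the sample size, \(\bar{x}\) the mean horsepower, and \(x_0 = 98\)):

\[
\text{CI: } \hat{y}_0 \pm t_{0.975,\,n-2}\; s\sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_{i}(x_i-\bar{x})^2}}, \qquad
\text{PI: } \hat{y}_0 \pm t_{0.975,\,n-2}\; s\sqrt{1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_{i}(x_i-\bar{x})^2}}
\]

The extra 1 under the square root accounts for the irreducible error of a single new car, which is why the prediction interval (14.81, 34.12) is much wider than the confidence interval (23.97, 24.96).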

(b)

Plot the response and the predictor. Use the abline() function to display the least squares regression line.

plot(data$horsepower,data$mpg,xlab="horsepower",ylab = "mpg",main="Relation between mpg and horsepower")
abline(Model,col="red",lwd=3)


(c)

Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.

par(mfrow=c(2,2))
plot(Model,lwd = 3,col="green")

The red line in the residual plot (top left) is not flat; it is curved, which indicates that the relationship is not linear. Further to the right, the variability of the residuals increases, which indicates that the model violates the assumption of constant variance. Since the residuals are scattered unevenly around zero, the model possibly also violates the assumption that the expected error is zero.

From the Q-Q plot (top right), the residuals deviate from the diagonal line toward the end, which indicates that the residuals are not normally distributed. Together, these are indications that the model has some problems.


Q(10)

This question should be answered using the Carseats data set.

(a)

Fit a multiple regression model to predict Sales using Price, Urban, and US.

my_data <- Carseats
datatable(head(my_data))

Data cleaning

#Check for missing values
sum(is.na(my_data))
## [1] 0
#Check for duplicate rows
sum(duplicated(my_data))
## [1] 0

The result shows that there are no missing values and no duplicate rows. So, the data is clean and ready for analysis.

# Check the structure and data types
str(my_data)
## 'data.frame':    400 obs. of  11 variables:
##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
##  $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
##  $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
##  $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...

Price is numeric, while Urban and US are factors. So, R automatically treats Urban and US as categorical variables when fitting a linear model.

my_model1 <- lm(Sales ~ Price + Urban + US, data = my_data)

Equivalently:

my_model2 <- lm(Sales ~ Price + as.factor(Urban) + as.factor(US),data = my_data)
summary(my_model1)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = my_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

(b)

Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!

Solution:

For the interpretation of the coefficients, please see the handwritten answer.


(c)

Write out the model in equation form, being careful to handle the qualitative variables properly.

Solution:

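\(\text{Sales} = \beta_0 + \beta_1 \cdot \text{Price} + \beta_2 \cdot \text{Urban}_{\text{Yes}} + \beta_3 \cdot \text{US}_{\text{Yes}} + \varepsilon\), where \(\text{Urban}_{\text{Yes}}\) is 1 if the store is in an urban location and 0 otherwise, and \(\text{US}_{\text{Yes}}\) is 1 if the store is in the US and 0 otherwise.

With the estimates from part (a): \(\widehat{\text{Sales}} = 13.043 - 0.054 \cdot \text{Price} - 0.022 \cdot \text{Urban}_{\text{Yes}} + 1.201 \cdot \text{US}_{\text{Yes}}\)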
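
Because Urban and US enter the model as 0/1 indicators, the fitted equation amounts to four parallel lines in Price, one for each combination of the two factors. A brief sketch that evaluates the fitted equation with the coefficients printed in part (a); the function name `predict_sales` and the price of 100 are illustrative choices, not part of the original analysis:

```r
# Hypothetical helper using the estimates from summary(my_model1)
predict_sales <- function(price, urban_yes, us_yes) {
  13.043469 - 0.054459 * price -
    0.021916 * urban_yes +   # indicator: 1 if the store is urban, else 0
    1.200573 * us_yes        # indicator: 1 if the store is in the US, else 0
}

# Same price, all four combinations of the two indicators
combos <- expand.grid(urban_yes = c(0, 1), us_yes = c(0, 1))
combos$sales <- predict_sales(100, combos$urban_yes, combos$us_yes)
print(combos)
```

The rows differ only by the constants -0.021916 and 1.200573, which is what the UrbanYes and USYes coefficients represent: a shift in the intercept, not in the slope on Price.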


(d)

For which of the predictors can you reject the null hypothesis \(H_0 : \beta_j = 0\)?

Solution:

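For Price and US, we can reject the null hypothesis \(H_0 : \beta_j = 0\) because their p-values are very small (below 2e-16 and 4.86e-06, respectively). For Urban, the p-value is 0.936, so we cannot reject the null hypothesis.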


(e)

On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

smaller_model <-lm(Sales ~ Price + US, data = my_data)
summary(smaller_model)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = my_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

(f)

How well do the models in (a) and (e) fit the data?

Solution:

In part (a), the adjusted \(R^2\) is 23.35%, and in part (e) it is 23.54%. The multiple \(R^2\) is 0.2393 in both models, so only about 24% of the variability in Sales is explained by the predictors. Neither model captures a substantial share of the variability in car seat sales.

Removing the predictor “Urban” did not change the fit noticeably: the adjusted \(R^2\) improved only slightly, and the residual standard error moved from 2.472 to 2.469.


(g)

Using the model from (e), obtain 95% confidence intervals for the coefficient(s).

confint(smaller_model,level = 0.95)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

(h)

Is there evidence of outliers or high leverage observations in the model from (e)?

Solution:

par(mfrow=c(2,2))
plot(smaller_model,lwd = 3,col="green")

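In the residual plot (top left), no point stands out as a clear outlier. A few labeled points, such as 377 and 51, may be worth a closer look, but they are not extreme. In the Residuals vs Leverage plot (bottom right), no Cook’s distance contour appears within the plotted range, so no observation is highly influential. Since Cook’s distance measures influence rather than leverage alone, points that sit far to the right of the rest in that plot should still be checked for high leverage.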
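
The visual reading can be backed up with a numerical check. The sketch below assumes the ISLR package is installed, as it is elsewhere in this document, and refits the model from part (e). The cutoffs (twice the average leverage, and studentized residuals beyond 3 in absolute value) are common rules of thumb chosen here for illustration, not values from the original analysis.

```r
library(ISLR)  # provides the Carseats data set, as used above

fit <- lm(Sales ~ Price + US, data = Carseats)

# Leverage: diagonal of the hat matrix. The average leverage is (p + 1) / n,
# and values above twice that average are flagged here (a convention).
lev       <- hatvalues(fit)
threshold <- 2 * length(coef(fit)) / nrow(Carseats)
high_lev  <- which(lev > threshold)

# Outliers: studentized residuals beyond +/- 3 are often treated as unusual
stud_res <- rstudent(fit)
outliers <- which(abs(stud_res) > 3)

print(high_lev)
print(outliers)
```

Any observations listed in `high_lev` are candidates for high leverage even if their Cook’s distance is small, which is the distinction the diagnostic plot alone can hide.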