Question 1

Question 1a: Load up data mtcars and generate its descriptive statistics as shown below.

#summary data
sum <- apply(mtcars,2,summary)

#calculate standard deviation
sds <- apply(mtcars,2,sd)

#calculate standard error
se <- sds/sqrt(2)

#combine to have the descriptive statistics
descriptive <- rbind(sum,sds,se)
descriptive <- t(descriptive)
descriptive
##        Min.   1st Qu.  Median       Mean 3rd Qu.    Max.         sds         se
## mpg  10.400  15.42500  19.200  20.090625   22.80  33.900   6.0269481  4.2616958
## cyl   4.000   4.00000   6.000   6.187500    8.00   8.000   1.7859216  1.2628373
## disp 71.100 120.82500 196.300 230.721875  326.00 472.000 123.9386938 87.6378909
## hp   52.000  96.50000 123.000 146.687500  180.00 335.000  68.5628685 48.4812692
## drat  2.760   3.08000   3.695   3.596563    3.92   4.930   0.5346787  0.3780750
## wt    1.513   2.58125   3.325   3.217250    3.61   5.424   0.9784574  0.6918739
## qsec 14.500  16.89250  17.710  17.848750   18.90  22.900   1.7869432  1.2635597
## vs    0.000   0.00000   0.000   0.437500    1.00   1.000   0.5040161  0.3563932
## am    0.000   0.00000   0.000   0.406250    1.00   1.000   0.4989909  0.3528399
## gear  3.000   3.00000   4.000   3.687500    4.00   5.000   0.7378041  0.5217063
## carb  1.000   2.00000   2.000   2.812500    4.00   8.000   1.6152000  1.1421189

Question 1b: As we need to assess the relationship between mpg and other predictors, the visualization check will be implemented. Based on the plot, which predictors may have transformations like 𝑙𝑜𝑔(𝑥), √𝑥 or 𝑥2.

Because we use the linear model for our data, to identify whether we have to transform our data or not, we check whether the linearity between mpg with predictors is violated or not.

To check the violation of linearity, we use the diagnostic plots including:

1.Residuals vs. Fitted Plot: This plot shows the relationship between the fitted values (predicted values) and the residuals (difference between observed and predicted values). It is used to check if the assumption of linearity is valid. If there is a clear pattern or curve in the plot, it suggests that the linearity assumption has been violated.

2.Normal Q-Q Plot: This plot shows if the residuals are normally distributed. If the residuals fall close to the diagonal line, then the assumption of normality is met.

3.Scale-Location Plot: This plot checks if the residuals are spread out equally along the range of predictors. If the plot shows a horizontal line, then the assumption of equal variance is met.

4.Residuals vs. Leverage Plot: This plot shows the leverage of each observation and the corresponding residuals. High leverage points are those that have extreme values of predictors, and can have a large impact on the regression line. This plot is used to identify any influential points, which are those that have both high leverage and high residuals.

mtcars %>% ggplot(aes(mpg,cyl)) + geom_point() + geom_smooth(method= "lm")
## `geom_smooth()` using formula 'y ~ x'

par(mfrow=c(2,2))
cyl_lm <- lm(mpg~cyl,data= mtcars)
plot(cyl_lm)

Because cyl is categorical data, there is only weak linear relation between cyl and mpg. The transformation of cyl into other kinds cannot help to fit the linearity relationship, even make it more complicated for the model. Hence, we do not need to transfer the data. Other model or dummy variable may better fit to consider the relation between cyl and mpg.

par(mfrow=c(2,1))
hist(mtcars$disp)
hist(log(mtcars$disp))

par(mfrow=c(2,2))
disp_lm <- lm(mpg~log(disp),data= mtcars)
plot(disp_lm)

Because disp is continuous data, I check the distribution of the data. I see that disp does not present a normal distribution. Hence, I transform data disp into log(disp). After transformation, I check the linearity validation between mpg and log(disp). The diagnostic plots show a linear relation between mpg and log(disp)

par(mfrow=c(2,1))
hist(mtcars$hp)
hist(log(mtcars$hp))

par(mfrow=c(2,2))
hp_lm <- lm(mpg~log(hp),data= mtcars)
plot(hp_lm)

Same with disp, after we check the distribution of hp data, we can see a huge improvement after transforming hp to log(hp). The linearity validation between mpg and log(hp) is also confirmed through 4 diagnostic charts when observations are followed the straight line in Normal Q-Q and the residuals are spread out equally along the range of predictors in other 3 charts.

par(mfrow=c(2,2))
drat_lm <- lm(mpg~drat,data= mtcars)
plot(drat_lm)

After seeing four figures, we can see in Normal Q-Q, the observations follow the straight line and do not go too much far from the straight line, indicating a linear relation. Additionally, three other charts also show a random distribution of data around red lines with no clear pattern. Hence, we do not need to transform drat.

par(mfrow=c(2,2))
wt_lm <- lm(mpg~wt,data= mtcars)
plot(wt_lm)

From the 4 charts we can see a linear relation between wt and mpg; hence, we do not need to transform wt data.

par(mfrow=c(2,2))
qsec_lm <- lm(mpg~qsec,data= mtcars)
plot(qsec_lm)

From the 4 charts we can see a linear relation between qsec and mpg; hence, we do not need to transform qsec data.

mtcars %>% ggplot(aes(mpg,vs)) + geom_point() + geom_smooth(method= "lm")
## `geom_smooth()` using formula 'y ~ x'

par(mfrow=c(2,2))
vs_lm <- lm(mpg~vs,data= mtcars)
plot(vs_lm)

Same with cyl, vs is categorical data type so it does not present a perfect linearity. However, looking at the diagnostic plots, it is acceptable to use linear relation between these two variables.

mtcars %>% ggplot(aes(mpg,am)) + geom_point() + geom_smooth(method= "lm")
## `geom_smooth()` using formula 'y ~ x'

par(mfrow=c(2,2))
am_lm <- lm(mpg~am,data= mtcars)
plot(am_lm)

Same analysis with vs is pointed out for am predictor.

par(mfrow=c(2,2))
gear_lm <- lm(mpg~gear,data= mtcars)
plot(gear_lm)

Same analysis with vs and am is pointed out for gear predictor.

par(mfrow=c(1,2))
hist(mtcars$carb)
hist(log(mtcars$carb))

par(mfrow=c(2,2))
carb_lm <- lm(mpg~log(carb),data= mtcars)
plot(carb_lm)

Because the distribution of carb is not normal, I take the logarit to normalize the data. Then, we check the linearity between carb and mpg through diagnostic plots. The linearity is confirmed after data transformation.

To conclude:

Among predictors, I take logarit for disp, hp, and carb to transform them into more predictable. I keep other variables as their original forms.

Question 1c: Run the multiple regression on mpg across all predictors and show the estimated results

#Run the linear regression model:
model_c <- lm(mpg~.,data=mtcars)

#lm_c model result:
summary(model_c)
## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## am           2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

As can be seen from the result, there are no significant variables in the model, indicating there are some problems in the model.

Question 1d: Using car::vif (variance inflation factor) to estimate if there is any multicollinearity among predictors. Find predictors with vif higher than 10

Variance inflation factor (VIF) is a measure of multicollinearity in a linear regression model. It measures how much the variance of the estimated regression coefficient is increased due to collinearity among the independent variables. Specifically, VIF measures the ratio of the variance of the coefficient estimate in the presence of multicollinearity to the variance of the coefficient estimate when the predictor variables are uncorrelated.

A VIF value of 1 indicates no multicollinearity, while values greater than 1 suggest increasing levels of multicollinearity. Generally, a VIF value greater than 5 or 10 is considered high and may indicate that the corresponding predictor variable should be removed from the model.

#Using vif to detect multicollinearity problem:
vif(model_c)
##       cyl      disp        hp      drat        wt      qsec        vs        am 
## 15.373833 21.620241  9.832037  3.374620 15.164887  7.527958  4.965873  4.648487 
##      gear      carb 
##  5.357452  7.908747

From the result, we can see that there are multicollinearity problems among predictors in our model: 3 variables having vif > 10 and 7 variables having vif > 5.

3 variables cyl, disp, and wt having vif larger than 10.

Question 1e: Rerun the multiple regressions by (1) excluding disp, and (2) excluding disp and cyl from predictors. Are there any improvements observed from the regression results?

#Run the multiple regression by excluding disp
model_e_1 <- lm(mpg~. - disp,data=mtcars)
summary(model_e_1)
## 
## Call:
## lm(formula = mpg ~ . - disp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7863 -1.4055 -0.2635  1.2029  4.4753 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.55052   18.52585   0.677   0.5052  
## cyl          0.09627    0.99715   0.097   0.9240  
## hp          -0.01295    0.01834  -0.706   0.4876  
## drat         0.92864    1.60794   0.578   0.5694  
## wt          -2.62694    1.19800  -2.193   0.0392 *
## qsec         0.66523    0.69335   0.959   0.3478  
## vs           0.16035    2.07277   0.077   0.9390  
## am           2.47882    2.03513   1.218   0.2361  
## gear         0.74300    1.47360   0.504   0.6191  
## carb        -0.61686    0.60566  -1.018   0.3195  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.623 on 22 degrees of freedom
## Multiple R-squared:  0.8655, Adjusted R-squared:  0.8105 
## F-statistic: 15.73 on 9 and 22 DF,  p-value: 1.183e-07
#Run the multiple regression by excluding disp and cyl
model_e_2 <- lm(mpg~.,data=mtcars)
summary(model_e_2)
## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## am           2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

I call the full model is model_c, the model without disp is model_e_1, and the model without disp and cyl is model_e_2.

From the result, I have two comments:

First, according to significant variable, model_c and model_e_2 both show no significant variables in the model, while model_e_1 shows wt is a significant variable (with 5% significant level). Hence, model_e_1 is considered to be best model among 3 models.

Second, according to adjusted R-squared, we have the value of model_c, model_e_1, and model_e_2 are 0.8066, 0.8105, and 0.8066, respectively. Hence, the model_e_1 can explain more the variation of mpg, meaning that model_e_1 is the best model among 3 models.

In conclusion, excluding disp can help to improve the regression result. cyl is more important to explain the variation of mpg; thus; it is recommended not to exclude cyl from the model.

Question 2

Separate the data

Before running the model, we first seperate the data into two parts with the proportion of 20:80. In which, 20% of data is called test data and 80% of data is called trained data. We will use train data to train the model and then fit the model into test data to see how well is its performance.

set.seed(123)
Carseats_split <- initial_split(Carseats,prop=0.8, strata= Sales)
Carseats_train <- training(Carseats_split)
Carseats_test <- testing(Carseats_split)

Question 2a: Fit a multiple regression model to predict Sales using Price, Urban and US

#build model 
lm_model <- linear_reg() %>% set_engine("lm")
lm_a <- lm_model %>% fit(Sales ~ Price+ Urban+ US, data = Carseats_train)
summary(lm_a$fit)
## 
## Call:
## stats::lm(formula = Sales ~ Price + Urban + US, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.7784 -1.6018 -0.0234  1.6380  7.2425 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.938978   0.736088  17.578  < 2e-16 ***
## Price       -0.051814   0.005931  -8.736  < 2e-16 ***
## UrbanYes    -0.252232   0.316369  -0.797 0.425894    
## USYes        1.107562   0.293321   3.776 0.000191 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.497 on 315 degrees of freedom
## Multiple R-squared:  0.2194, Adjusted R-squared:  0.212 
## F-statistic: 29.52 on 3 and 315 DF,  p-value: < 2.2e-16
#predict by fitting the test data into the model
predict_a <- Carseats_test %>% select(Sales) %>% bind_cols(predict(lm_a,new_data=Carseats_test))

#predict result format
colnames(predict_a) <- c("Actual_Sales", "Prediction_Sales")
predict_result_a <- data.frame(predict_a) %>% 
         kbl(format = "html", caption = "Prediction Sales Value of model lm_a ") %>%
         row_spec(row = 0, bold = TRUE, color = "black", background = "#F9EBEA") %>%
         kable_styling(bootstrap_options = "striped", full_width = F, position = "center")
predict_result_a
Prediction Sales Value of model lm_a
Actual_Sales Prediction_Sales
1 9.50 7.576673
5 4.15 6.054602
6 10.81 10.315959
7 6.63 7.090875
10 4.69 7.621650
14 10.96 9.338336
19 13.91 10.523213
22 12.13 8.398855
31 13.55 8.075333
33 6.20 6.948073
34 8.77 7.162164
41 2.07 6.410461
44 4.12 6.851282
61 8.32 7.421232
64 8.47 8.561132
67 8.85 7.971706
69 13.39 6.851282
76 8.55 9.279686
78 7.70 9.435127
82 7.52 6.054602
84 4.42 8.923827
89 6.56 8.042995
93 4.53 6.210043
102 6.20 6.572738
108 8.55 7.090875
109 3.47 7.349943
119 7.57 8.664759
124 8.19 6.015428
129 4.96 7.265791
134 7.62 8.768386
136 6.44 7.828905
139 10.27 8.146623
145 9.09 6.565902
154 5.93 6.274496
162 2.93 5.756360
163 3.63 4.966516
168 6.71 7.868079
170 11.48 9.804659
177 5.61 6.067242
179 10.66 9.849636
181 4.94 6.074078
189 8.07 8.023520
190 12.11 8.657923
195 7.23 7.162164
197 4.10 6.903096
203 4.10 7.310769
204 2.05 4.552007
208 8.19 7.913056
211 4.36 7.673464
223 7.49 6.281332
226 6.68 8.438029
233 13.14 8.353877
242 12.01 7.816265
248 5.04 4.862889
251 9.16 5.711383
260 5.12 8.865177
262 5.71 7.680300
263 6.37 6.954909
267 9.10 8.243414
269 6.53 7.246315
276 6.67 6.954909
285 6.97 7.964870
287 7.53 8.191600
293 11.82 9.960100
295 12.66 8.664759
296 4.21 6.948073
300 9.40 9.072432
305 11.93 6.851282
318 6.41 5.892325
330 11.27 9.182895
331 4.99 7.135852
341 7.50 7.971706
348 6.88 7.135852
350 9.32 9.072432
367 5.98 7.103514
384 9.35 9.163419
387 5.32 4.396566
390 8.44 8.250250
396 12.57 7.162164
397 6.14 7.828905
398 7.41 5.555942
#Evaluate the prediction
rmse_a <- rmse(predict_a, 
     truth = Actual_Sales, 
     estimate = Prediction_Sales)
rsq_a <- rsq(predict_a, 
             truth = Actual_Sales, 
             estimate = Prediction_Sales)

#Evaluation result
Evaluation_a <- rbind(rmse_a, rsq_a)
Evaluation_a
## # A tibble: 2 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       2.39 
## 2 rsq     standard       0.314

Question 2b: Provide an interpretation of each coefficient in the model.

To interpret the coefficient of independent variables. First, we define two dummy variables:

UrbanYes is the dummy variable where they have Urban data Yes =1 , No = 0

USYes is the dummy variable where they have US data Yes=1, No =0

From the model we have:

  • Coefficient of Price is -0.0518. It means that the price have inverse relation with the Sales. Specifically, when we increase 1 unit of Price, the Sales will decrease 0.0518 unit.

  • Coefficient of UrbanYes is -0.2522. It means that the Sales of car in Urban (Urban= Yes) is -0.2522 unit lower than the Sales of car not in Urban (Urban= No). However, the value is not significant; thus, we cannot compare the sales of two areas

  • Coefficient of USYes is 1.1076. It means that the Sales of car in US (US= Yes) is 1.1076 unit higher than the Sales of car not in US (US= No).

Question 2c: Write out the model in equation form

The model in equation form: Sales= 12.9390 + -0.0518 * Price+ -0.2522 * UrbanYes+ 1.1076 * USYes

In which: data of UrbanYes is transformed from the data Urban: Yes=1, No =0; data of USYes is transformed from the data US: Yes=1, No =0

Question 2d: For which of the predictors can you reject the null hypothesis 𝐻0: 𝛽𝑗 = 0?

#We run the model with all predictors:
lm_d <- lm_model %>% fit(Sales ~ ., data = Carseats_train)
summary(lm_d$fit)
## 
## Call:
## stats::lm(formula = Sales ~ ., data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6936 -0.7194  0.0321  0.7002  3.3218 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      5.7465910  0.6717671   8.554 5.72e-16 ***
## CompPrice        0.0958758  0.0046215  20.745  < 2e-16 ***
## Income           0.0138907  0.0021414   6.487 3.51e-10 ***
## Advertising      0.1222171  0.0122055  10.013  < 2e-16 ***
## Population      -0.0002724  0.0004241  -0.642    0.521    
## Price           -0.0973130  0.0030369 -32.044  < 2e-16 ***
## ShelveLocGood    4.9762295  0.1741229  28.579  < 2e-16 ***
## ShelveLocMedium  2.1242422  0.1416622  14.995  < 2e-16 ***
## Age             -0.0482196  0.0035436 -13.608  < 2e-16 ***
## Education       -0.0198925  0.0218967  -0.908    0.364    
## UrbanYes         0.1224443  0.1292138   0.948    0.344    
## USYes           -0.1913763  0.1659768  -1.153    0.250    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.009 on 307 degrees of freedom
## Multiple R-squared:  0.8758, Adjusted R-squared:  0.8714 
## F-statistic: 196.8 on 11 and 307 DF,  p-value: < 2.2e-16

After running the model, we see the p-value column. If p-value <0.05 (meaning 95% significance), we can reject H0.

Hence, in model lm_d, the predictors can reject H0 are: CompPrice, Income, Advertising, Price, ShelveLoc, Age.

Question 2e: On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

#Build model:
lm_e <- lm_model %>% fit(Sales~ CompPrice+Income+Advertising+Price+ShelveLoc+Age, data=Carseats_train)
summary(lm_e$fit)
## 
## Call:
## stats::lm(formula = Sales ~ CompPrice + Income + Advertising + 
##     Price + ShelveLoc + Age, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5656 -0.7168 -0.0098  0.7051  3.2387 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      5.371596   0.552951   9.714  < 2e-16 ***
## CompPrice        0.096378   0.004579  21.047  < 2e-16 ***
## Income           0.013815   0.002124   6.504 3.12e-10 ***
## Advertising      0.112813   0.008533  13.220  < 2e-16 ***
## Price           -0.097468   0.003023 -32.242  < 2e-16 ***
## ShelveLocGood    4.964289   0.172590  28.764  < 2e-16 ***
## ShelveLocMedium  2.118822   0.139967  15.138  < 2e-16 ***
## Age             -0.047803   0.003518 -13.587  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.007 on 311 degrees of freedom
## Multiple R-squared:  0.8747, Adjusted R-squared:  0.8718 
## F-statistic:   310 on 7 and 311 DF,  p-value: < 2.2e-16
#predict by fitting the test data into the model:
predict_e <- Carseats_test %>% select(Sales) %>% bind_cols(predict(lm_e,new_data=Carseats_test))

#predict result format
colnames(predict_e) <- c("Actual_Sales", "Prediction_Sales")
predict_result_e <- data.frame(predict_e) %>% 
         kbl(format = "html", caption = "Prediction Sales Value of model lm_e ") %>%
         row_spec(row = 0, bold = TRUE, color = "black", background = "#F9EBEA") %>%
         kable_styling(bootstrap_options = "striped", full_width = F, position = "center")
predict_result_e
Prediction Sales Value of model lm_e
Actual_Sales Prediction_Sales
1 9.50 7.217363
5 4.15 5.891126
6 10.81 9.603865
7 6.63 6.103979
10 4.69 6.054428
14 10.96 12.131338
19 13.91 13.630415
22 12.13 11.417162
31 13.55 13.573059
33 6.20 6.031703
34 8.77 8.448503
41 2.07 2.423936
44 4.12 5.285041
61 8.32 7.307499
64 8.47 8.543057
67 8.85 9.454907
69 13.39 11.976861
76 8.55 7.293121
78 7.70 9.320238
82 7.52 6.688363
84 4.42 5.324904
89 6.56 6.353872
93 4.53 6.468892
102 6.20 7.985133
108 8.55 8.488621
109 3.47 3.854714
119 7.57 7.113043
124 8.19 7.505085
129 4.96 4.999746
134 7.62 7.254869
136 6.44 6.203692
139 10.27 9.587087
145 9.09 10.383388
154 5.93 7.418913
162 2.93 3.329014
163 3.63 3.310224
168 6.71 6.782331
170 11.48 11.623185
177 5.61 6.027367
179 10.66 10.984033
181 4.94 4.137205
189 8.07 6.776312
190 12.11 11.130524
195 7.23 7.042283
197 4.10 2.559416
203 4.10 3.692439
204 2.05 2.632459
208 8.19 5.149819
211 4.36 3.975625
223 7.49 7.112802
226 6.68 6.316159
233 13.14 12.622932
242 12.01 10.489708
248 5.04 2.458111
251 9.16 7.471632
260 5.12 5.567375
262 5.71 6.101094
263 6.37 6.651465
267 9.10 10.904849
269 6.53 8.033955
276 6.67 5.288536
285 6.97 4.330774
287 7.53 7.421147
293 11.82 13.097780
295 12.66 13.470748
296 4.21 3.796421
300 9.40 11.033609
305 11.93 10.451071
318 6.41 7.356169
330 11.27 10.909265
331 4.99 5.498732
341 7.50 8.340062
348 6.88 7.919901
350 9.32 11.109457
367 5.98 5.861652
384 9.35 8.912473
387 5.32 6.283298
390 8.44 9.207389
396 12.57 12.992552
397 6.14 7.217852
398 7.41 7.407104
#Evaluate the prediction
rmse_e <- rmse(predict_e, 
     truth = Actual_Sales, 
     estimate = Prediction_Sales)
rsq_e <- rsq(predict_e, 
             truth = Actual_Sales, 
             estimate = Prediction_Sales)

#Evaluation result
Evaluation_e <- rbind(rmse_e, rsq_e)
Evaluation_e 
## # A tibble: 2 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       1.09 
## 2 rsq     standard       0.859

Question 2f: How well do the models in (a) and (e) fit the data?

Evaluation_compare <- rbind(Evaluation_a, Evaluation_e)
Evaluation_compare
## # A tibble: 4 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       2.39 
## 2 rsq     standard       0.314
## 3 rmse    standard       1.09 
## 4 rsq     standard       0.859

To evalue how well do the models in (a) and (e) fit the data, we use these metrics following:

RMSE: measure the average squared difference between the predicted values and the actual values. Hence, the lower the metric, the better the model.

RMSE in model e is 1.09 while in model a is 2.39. It means that model e is better than model a because the difference between actual values and predicted values in model e is smaller than in model a.

Rsquare: measures the proportion of variation in the dependent variable that is explained by the independent variables. A higher R^2 indicates a better fit. Because we already separate the data into 2 parts to test the model, it will prevent overfitting problem.

In our models: Rsquared of model a is 0.314 while Rsquared of model e is 0.859. These means that indepedant variables in model a can predict 31.4% the variant of Sales, while that number for model e is 85.9%. Thus, model e is better than model a

Question 2g: Using the model from (e), obtain 95% confidence intervals for the coefficient(s)

#Combine confidence interval:
predict_e_int <- predict_e%>% bind_cols(predict(lm_e,Carseats_test,type="pred_int"))

#Format the table 
colnames(predict_e_int) <- c("Actual_Sales","Prediction_Sales","Lower_Bound","Upper_Bound")

predict_e_int <- data.frame(predict_e_int) %>% 
         kbl(format = "html", caption = "Prediction Sales Value of model lm_e with confidence interval 95% ") %>%
         row_spec(row = 0, bold = TRUE, color = "black", background = "#F9EBEA") %>%
         kable_styling(bootstrap_options = "striped", full_width = F, position = "center")

predict_e_int
Prediction Sales Value of model lm_e with confidence interval 95%
Actual_Sales Prediction_Sales Lower_Bound Upper_Bound
1 9.50 7.217363 5.2173018 9.217424
5 4.15 5.891126 3.8903264 7.891925
6 10.81 9.603865 7.5768546 11.630876
7 6.63 6.103979 4.1031975 8.104760
10 4.69 6.054428 4.0476544 8.061202
14 10.96 12.131338 10.1180327 14.144643
19 13.91 13.630415 11.6082762 15.652554
22 12.13 11.417162 9.4058121 13.428512
31 13.55 13.573059 11.5565614 15.589557
33 6.20 6.031703 4.0108559 8.052549
34 8.77 8.448503 6.4397830 10.457222
41 2.07 2.423936 0.4147083 4.433163
44 4.12 5.285041 3.2904295 7.279651
61 8.32 7.307499 5.2913129 9.323685
64 8.47 8.543057 6.5520334 10.534080
67 8.85 9.454907 7.4568043 11.453010
69 13.39 11.976861 9.9596755 13.994047
76 8.55 7.293121 5.2534078 9.332833
78 7.70 9.320238 7.3251296 11.315346
82 7.52 6.688363 4.6797174 8.697008
84 4.42 5.324904 3.3201012 7.329706
89 6.56 6.353872 4.3627539 8.344991
93 4.53 6.468892 4.4527559 8.485028
102 6.20 7.985133 5.9857888 9.984478
108 8.55 8.488621 6.4877828 10.489459
109 3.47 3.854714 1.8528079 5.856619
119 7.57 7.113043 5.1199997 9.106087
124 8.19 7.505085 5.4787382 9.531432
129 4.96 4.999746 2.9988382 7.000654
134 7.62 7.254869 5.2491435 9.260595
136 6.44 6.203692 4.1855039 8.221880
139 10.27 9.587087 7.5906487 11.583526
145 9.09 10.383388 8.3787832 12.387992
154 5.93 7.418913 5.4119178 9.425908
162 2.93 3.329014 1.3210643 5.336964
163 3.63 3.310224 1.3063470 5.314101
168 6.71 6.782331 4.7873380 8.777324
170 11.48 11.623185 9.6014000 13.644971
177 5.61 6.027367 4.0224632 8.032270
179 10.66 10.984033 8.9726194 12.995447
181 4.94 4.137205 2.1211869 6.153223
189 8.07 6.776312 4.7746122 8.778012
190 12.11 11.130524 9.1129754 13.148072
195 7.23 7.042283 5.0343011 9.050265
197 4.10 2.559416 0.5493750 4.569456
203 4.10 3.692439 1.6929930 5.691884
204 2.05 2.632459 0.6121440 4.652774
208 8.19 5.149819 3.1444366 7.155202
211 4.36 3.975625 1.9750898 5.976160
223 7.49 7.112802 5.1022210 9.123383
226 6.68 6.316159 4.2990073 8.333311
233 13.14 12.622932 10.6171251 14.628739
242 12.01 10.489708 8.4876222 12.491794
248 5.04 2.458111 0.4325565 4.483665
251 9.16 7.471632 5.4547515 9.488512
260 5.12 5.567375 3.5572899 7.577459
262 5.71 6.101094 4.1108002 8.091389
263 6.37 6.651465 4.6546712 8.648258
267 9.10 10.904849 8.8981617 12.911536
269 6.53 8.033955 6.0402738 10.027635
276 6.67 5.288536 3.2743117 7.302760
285 6.97 4.330774 2.3183658 6.343183
287 7.53 7.421147 5.4194976 9.422796
293 11.82 13.097780 11.0760862 15.119474
295 12.66 13.470748 11.4520393 15.489457
296 4.21 3.796421 1.7869752 5.805867
300 9.40 11.033609 9.0251680 13.042050
305 11.93 10.451071 8.4392111 12.462930
318 6.41 7.356169 5.3370397 9.375298
330 11.27 10.909265 8.9002311 12.918299
331 4.99 5.498732 3.4956498 7.501813
341 7.50 8.340062 6.3197782 10.360346
348 6.88 7.919901 5.8904723 9.949329
350 9.32 11.109457 9.0955681 13.123346
367 5.98 5.861652 3.8696936 7.853610
384 9.35 8.912473 6.8991056 10.925840
387 5.32 6.283298 4.2635511 8.303046
390 8.44 9.207389 7.2120575 11.202721
396 12.57 12.992552 10.9766500 15.008454
397 6.14 7.217852 5.2200291 9.215675
398 7.41 7.407104 5.3933592 9.420849

Question 2h: Is there evidence of ouitliers or high leverage observations in the model from (e)?

par(mfrow=c(2,2))
plot(lm_e$fit)

First we look at Normal Q-Q, we can see that all observations approximately follow the straight line and no observation are too much far from straight line. It means that there are no outliers. However, in terms of high leverage observations,we can see some are far from other observations for example the points 298, 358.

Looking at three other plots, we also can see the similar results that the points 298 and 358 are detected as high leverage observations.