#summary data
sum <- apply(mtcars,2,summary)
#calculate standard deviation
sds <- apply(mtcars,2,sd)
#calculate standard error of the mean: sd / sqrt(n)
se <- sds/sqrt(nrow(mtcars))
#combine to have the descriptive statistics
descriptive <- rbind(sum,sds,se)
descriptive <- t(descriptive)
descriptive
## Min. 1st Qu. Median Mean 3rd Qu. Max. sds se
## mpg 10.400 15.42500 19.200 20.090625 22.80 33.900 6.0269481 1.0654240
## cyl 4.000 4.00000 6.000 6.187500 8.00 8.000 1.7859216 0.3157093
## disp 71.100 120.82500 196.300 230.721875 326.00 472.000 123.9386938 21.9094729
## hp 52.000 96.50000 123.000 146.687500 180.00 335.000 68.5628685 12.1203174
## drat 2.760 3.08000 3.695 3.596563 3.92 4.930 0.5346787 0.0945187
## wt 1.513 2.58125 3.325 3.217250 3.61 5.424 0.9784574 0.1729685
## qsec 14.500 16.89250 17.710 17.848750 18.90 22.900 1.7869432 0.3158899
## vs 0.000 0.00000 0.000 0.437500 1.00 1.000 0.5040161 0.0890983
## am 0.000 0.00000 0.000 0.406250 1.00 1.000 0.4989909 0.0882100
## gear 3.000 3.00000 4.000 3.687500 4.00 5.000 0.7378041 0.1304266
## carb 1.000 2.00000 2.000 2.812500 4.00 8.000 1.6152000 0.2855297
Because we fit a linear model to these data, we need to decide whether any variables should be transformed. To do so, we check whether the linearity assumption between mpg and each predictor is violated.
To check for violations of linearity, we use the following diagnostic plots:
1. Residuals vs. Fitted plot: shows the relationship between the fitted (predicted) values and the residuals (the differences between observed and predicted values). It is used to check the linearity assumption: a clear pattern or curve in the plot suggests that the assumption has been violated.
2. Normal Q-Q plot: shows whether the residuals are normally distributed. If the residuals fall close to the diagonal line, the normality assumption is met.
3. Scale-Location plot: checks whether the residuals are spread equally across the range of fitted values. A roughly horizontal line with evenly scattered points indicates that the equal-variance assumption is met.
4. Residuals vs. Leverage plot: shows the leverage of each observation together with its residual. High-leverage points have extreme predictor values and can strongly affect the regression line; this plot is used to identify influential points, i.e. observations with both high leverage and large residuals.
mtcars %>% ggplot(aes(x = cyl, y = mpg)) + geom_point() + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
par(mfrow=c(2,2))
cyl_lm <- lm(mpg~cyl,data= mtcars)
plot(cyl_lm)
Because cyl is essentially categorical (it takes only the values 4, 6, and 8), the linear relation between cyl and mpg is limited. Transforming cyl would not make the relationship more linear and would only complicate the model, so we leave it untransformed. A different model, or treating cyl as a dummy (factor) variable, may describe the relation between cyl and mpg better.
par(mfrow=c(2,1))
hist(mtcars$disp)
hist(log(mtcars$disp))
par(mfrow=c(2,2))
disp_lm <- lm(mpg~log(disp),data= mtcars)
plot(disp_lm)
Because disp is continuous, I first check its distribution. The histogram shows that disp is clearly not normally distributed, so I transform it to log(disp). After the transformation, I check the linearity between mpg and log(disp); the diagnostic plots are consistent with a linear relation between mpg and log(disp).
par(mfrow=c(2,1))
hist(mtcars$hp)
hist(log(mtcars$hp))
par(mfrow=c(2,2))
hp_lm <- lm(mpg~log(hp),data= mtcars)
plot(hp_lm)
As with disp, checking the distribution of hp shows a clear improvement after transforming hp to log(hp). The linearity between mpg and log(hp) is also supported by the four diagnostic plots: the observations follow the straight line in the Normal Q-Q plot, and the residuals are spread fairly evenly across the fitted values in the other three plots.
par(mfrow=c(2,2))
drat_lm <- lm(mpg~drat,data= mtcars)
plot(drat_lm)
In the Normal Q-Q plot, the observations follow the straight line and do not deviate far from it, which is consistent with a linear relation. The other three plots show residuals scattered randomly around the red reference lines with no clear pattern. Hence, we do not need to transform drat.
par(mfrow=c(2,2))
wt_lm <- lm(mpg~wt,data= mtcars)
plot(wt_lm)
From the 4 charts we can see a linear relation between wt and mpg; hence, we do not need to transform wt data.
par(mfrow=c(2,2))
qsec_lm <- lm(mpg~qsec,data= mtcars)
plot(qsec_lm)
From the 4 charts we can see a linear relation between qsec and mpg; hence, we do not need to transform qsec data.
mtcars %>% ggplot(aes(x = vs, y = mpg)) + geom_point() + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
par(mfrow=c(2,2))
vs_lm <- lm(mpg~vs,data= mtcars)
plot(vs_lm)
As with cyl, vs is a categorical (binary) variable, so the relationship cannot be perfectly linear. However, based on the diagnostic plots, it is acceptable to model the relation between these two variables as linear.
mtcars %>% ggplot(aes(x = am, y = mpg)) + geom_point() + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
par(mfrow=c(2,2))
am_lm <- lm(mpg~am,data= mtcars)
plot(am_lm)
The same reasoning as for vs applies to the am predictor.
par(mfrow=c(2,2))
gear_lm <- lm(mpg~gear,data= mtcars)
plot(gear_lm)
The same reasoning as for vs and am applies to the gear predictor.
par(mfrow=c(1,2))
hist(mtcars$carb)
hist(log(mtcars$carb))
par(mfrow=c(2,2))
carb_lm <- lm(mpg~log(carb),data= mtcars)
plot(carb_lm)
Because the distribution of carb is not normal, I take the logarithm to bring the data closer to normality. We then check the linearity between log(carb) and mpg through the diagnostic plots; the linearity looks acceptable after the transformation.
Among the predictors, I take the logarithm of disp, hp, and carb to make them better behaved; the other variables are kept in their original form.
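For reference, here is a minimal sketch of how such a transformed data set could be constructed (the name mtcars_log is mine; note that the full model reported below, model_c, is fitted on the original mtcars columns):
#hypothetical copy of mtcars with the skewed predictors log-transformed
mtcars_log <- mtcars
mtcars_log$disp <- log(mtcars_log$disp)
mtcars_log$hp <- log(mtcars_log$hp)
mtcars_log$carb <- log(mtcars_log$carb)
#a full model on the transformed data would then be lm(mpg ~ ., data = mtcars_log)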
#Run the linear regression model:
model_c <- lm(mpg~.,data=mtcars)
#model_c result:
summary(model_c)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4506 -1.6044 -0.1196 1.2193 4.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337 18.71788 0.657 0.5181
## cyl -0.11144 1.04502 -0.107 0.9161
## disp 0.01334 0.01786 0.747 0.4635
## hp -0.02148 0.02177 -0.987 0.3350
## drat 0.78711 1.63537 0.481 0.6353
## wt -3.71530 1.89441 -1.961 0.0633 .
## qsec 0.82104 0.73084 1.123 0.2739
## vs 0.31776 2.10451 0.151 0.8814
## am 2.52023 2.05665 1.225 0.2340
## gear 0.65541 1.49326 0.439 0.6652
## carb -0.19942 0.82875 -0.241 0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
As can be seen from the result, not a single predictor is significant at the 5% level, even though the overall fit (R-squared of 0.869) is high. This mismatch suggests a problem with the model, such as multicollinearity.
Variance inflation factor (VIF) is a measure of multicollinearity in a linear regression model. It measures how much the variance of the estimated regression coefficient is increased due to collinearity among the independent variables. Specifically, VIF measures the ratio of the variance of the coefficient estimate in the presence of multicollinearity to the variance of the coefficient estimate when the predictor variables are uncorrelated.
A VIF value of 1 indicates no multicollinearity, while values greater than 1 suggest increasing levels of multicollinearity. Generally, a VIF value greater than 5 or 10 is considered high and may indicate that the corresponding predictor variable should be removed from the model.
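To illustrate this definition, the VIF of a single predictor such as wt can be computed by hand from the R-squared of regressing that predictor on all the others (a sketch; assuming the vif() call below comes from the car package, the two numbers should agree for a model with only numeric predictors):
#VIF for wt: 1 / (1 - R^2), where R^2 is from regressing wt on the other predictors
r2_wt <- summary(lm(wt ~ . - mpg, data = mtcars))$r.squared
1/(1 - r2_wt) #should be close to the vif() value reported for wt below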
#Using vif to detect multicollinearity problem:
vif(model_c)
## cyl disp hp drat wt qsec vs am
## 15.373833 21.620241 9.832037 3.374620 15.164887 7.527958 4.965873 4.648487
## gear carb
## 5.357452 7.908747
From the result, we can see multicollinearity problems among the predictors in our model: 3 variables have a VIF greater than 10 and 7 variables have a VIF greater than 5.
The 3 variables with VIF above 10 are cyl, disp, and wt.
#Run the multiple regression by excluding disp
model_e_1 <- lm(mpg~. - disp,data=mtcars)
summary(model_e_1)
##
## Call:
## lm(formula = mpg ~ . - disp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7863 -1.4055 -0.2635 1.2029 4.4753
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.55052 18.52585 0.677 0.5052
## cyl 0.09627 0.99715 0.097 0.9240
## hp -0.01295 0.01834 -0.706 0.4876
## drat 0.92864 1.60794 0.578 0.5694
## wt -2.62694 1.19800 -2.193 0.0392 *
## qsec 0.66523 0.69335 0.959 0.3478
## vs 0.16035 2.07277 0.077 0.9390
## am 2.47882 2.03513 1.218 0.2361
## gear 0.74300 1.47360 0.504 0.6191
## carb -0.61686 0.60566 -1.018 0.3195
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.623 on 22 degrees of freedom
## Multiple R-squared: 0.8655, Adjusted R-squared: 0.8105
## F-statistic: 15.73 on 9 and 22 DF, p-value: 1.183e-07
#Run the multiple regression by excluding disp and cyl
model_e_2 <- lm(mpg ~ . - disp - cyl, data = mtcars)
summary(model_e_2)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4506 -1.6044 -0.1196 1.2193 4.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337 18.71788 0.657 0.5181
## cyl -0.11144 1.04502 -0.107 0.9161
## disp 0.01334 0.01786 0.747 0.4635
## hp -0.02148 0.02177 -0.987 0.3350
## drat 0.78711 1.63537 0.481 0.6353
## wt -3.71530 1.89441 -1.961 0.0633 .
## qsec 0.82104 0.73084 1.123 0.2739
## vs 0.31776 2.10451 0.151 0.8814
## am 2.52023 2.05665 1.225 0.2340
## gear 0.65541 1.49326 0.439 0.6652
## carb -0.19942 0.82875 -0.241 0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
I call the full model model_c, the model without disp model_e_1, and the model without disp and cyl model_e_2.
From the results, I have two comments:
First, in terms of significant variables, model_c and model_e_2 show no significant predictors, while model_e_1 shows wt as significant (at the 5% level). By this criterion, model_e_1 is the best of the three models.
Second, in terms of adjusted R-squared, the values for model_c, model_e_1, and model_e_2 are 0.8066, 0.8105, and 0.8066, respectively. Model_e_1 therefore explains more of the variation in mpg, again making it the best of the three models.
In conclusion, excluding disp improves the regression result, while cyl is more important for explaining the variation in mpg; it is therefore recommended not to exclude cyl from the model.
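As a small sketch, the adjusted R-squared values used in this comparison can also be pulled out programmatically from the three fitted models defined above:
#collect adjusted R-squared from each candidate model
sapply(list(model_c = model_c, model_e_1 = model_e_1, model_e_2 = model_e_2),
       function(m) summary(m)$adj.r.squared)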
Before running the model, we first split the data into two parts in an 80:20 proportion: 80% of the observations form the training set and the remaining 20% form the test set. We use the training data to fit the model and then apply the fitted model to the test data to see how well it performs.
set.seed(123)
Carseats_split <- initial_split(Carseats,prop=0.8, strata= Sales)
Carseats_train <- training(Carseats_split)
Carseats_test <- testing(Carseats_split)
#build model
lm_model <- linear_reg() %>% set_engine("lm")
lm_a <- lm_model %>% fit(Sales ~ Price+ Urban+ US, data = Carseats_train)
summary(lm_a$fit)
##
## Call:
## stats::lm(formula = Sales ~ Price + Urban + US, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.7784 -1.6018 -0.0234 1.6380 7.2425
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.938978 0.736088 17.578 < 2e-16 ***
## Price -0.051814 0.005931 -8.736 < 2e-16 ***
## UrbanYes -0.252232 0.316369 -0.797 0.425894
## USYes 1.107562 0.293321 3.776 0.000191 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.497 on 315 degrees of freedom
## Multiple R-squared: 0.2194, Adjusted R-squared: 0.212
## F-statistic: 29.52 on 3 and 315 DF, p-value: < 2.2e-16
#predict by fitting the test data into the model
predict_a <- Carseats_test %>% select(Sales) %>% bind_cols(predict(lm_a,new_data=Carseats_test))
#predict result format
colnames(predict_a) <- c("Actual_Sales", "Prediction_Sales")
predict_result_a <- data.frame(predict_a) %>%
kbl(format = "html", caption = "Prediction Sales Value of model lm_a ") %>%
row_spec(row = 0, bold = TRUE, color = "black", background = "#F9EBEA") %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "center")
predict_result_a
|   | Actual_Sales | Prediction_Sales |
|---|---|---|
| 1 | 9.50 | 7.576673 |
| 5 | 4.15 | 6.054602 |
| 6 | 10.81 | 10.315959 |
| 7 | 6.63 | 7.090875 |
| 10 | 4.69 | 7.621650 |
| 14 | 10.96 | 9.338336 |
| 19 | 13.91 | 10.523213 |
| 22 | 12.13 | 8.398855 |
| 31 | 13.55 | 8.075333 |
| 33 | 6.20 | 6.948073 |
| 34 | 8.77 | 7.162164 |
| 41 | 2.07 | 6.410461 |
| 44 | 4.12 | 6.851282 |
| 61 | 8.32 | 7.421232 |
| 64 | 8.47 | 8.561132 |
| 67 | 8.85 | 7.971706 |
| 69 | 13.39 | 6.851282 |
| 76 | 8.55 | 9.279686 |
| 78 | 7.70 | 9.435127 |
| 82 | 7.52 | 6.054602 |
| 84 | 4.42 | 8.923827 |
| 89 | 6.56 | 8.042995 |
| 93 | 4.53 | 6.210043 |
| 102 | 6.20 | 6.572738 |
| 108 | 8.55 | 7.090875 |
| 109 | 3.47 | 7.349943 |
| 119 | 7.57 | 8.664759 |
| 124 | 8.19 | 6.015428 |
| 129 | 4.96 | 7.265791 |
| 134 | 7.62 | 8.768386 |
| 136 | 6.44 | 7.828905 |
| 139 | 10.27 | 8.146623 |
| 145 | 9.09 | 6.565902 |
| 154 | 5.93 | 6.274496 |
| 162 | 2.93 | 5.756360 |
| 163 | 3.63 | 4.966516 |
| 168 | 6.71 | 7.868079 |
| 170 | 11.48 | 9.804659 |
| 177 | 5.61 | 6.067242 |
| 179 | 10.66 | 9.849636 |
| 181 | 4.94 | 6.074078 |
| 189 | 8.07 | 8.023520 |
| 190 | 12.11 | 8.657923 |
| 195 | 7.23 | 7.162164 |
| 197 | 4.10 | 6.903096 |
| 203 | 4.10 | 7.310769 |
| 204 | 2.05 | 4.552007 |
| 208 | 8.19 | 7.913056 |
| 211 | 4.36 | 7.673464 |
| 223 | 7.49 | 6.281332 |
| 226 | 6.68 | 8.438029 |
| 233 | 13.14 | 8.353877 |
| 242 | 12.01 | 7.816265 |
| 248 | 5.04 | 4.862889 |
| 251 | 9.16 | 5.711383 |
| 260 | 5.12 | 8.865177 |
| 262 | 5.71 | 7.680300 |
| 263 | 6.37 | 6.954909 |
| 267 | 9.10 | 8.243414 |
| 269 | 6.53 | 7.246315 |
| 276 | 6.67 | 6.954909 |
| 285 | 6.97 | 7.964870 |
| 287 | 7.53 | 8.191600 |
| 293 | 11.82 | 9.960100 |
| 295 | 12.66 | 8.664759 |
| 296 | 4.21 | 6.948073 |
| 300 | 9.40 | 9.072432 |
| 305 | 11.93 | 6.851282 |
| 318 | 6.41 | 5.892325 |
| 330 | 11.27 | 9.182895 |
| 331 | 4.99 | 7.135852 |
| 341 | 7.50 | 7.971706 |
| 348 | 6.88 | 7.135852 |
| 350 | 9.32 | 9.072432 |
| 367 | 5.98 | 7.103514 |
| 384 | 9.35 | 9.163419 |
| 387 | 5.32 | 4.396566 |
| 390 | 8.44 | 8.250250 |
| 396 | 12.57 | 7.162164 |
| 397 | 6.14 | 7.828905 |
| 398 | 7.41 | 5.555942 |
#Evaluate the prediction
rmse_a <- rmse(predict_a,
truth = Actual_Sales,
estimate = Prediction_Sales)
rsq_a <- rsq(predict_a,
truth = Actual_Sales,
estimate = Prediction_Sales)
#Evaluation result
Evaluation_a <- rbind(rmse_a, rsq_a)
Evaluation_a
## # A tibble: 2 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 2.39
## 2 rsq standard 0.314
To interpret the coefficients of the independent variables, we first define two dummy variables:
UrbanYes is the dummy variable for Urban: Yes = 1, No = 0.
USYes is the dummy variable for US: Yes = 1, No = 0.
From the model we have:
The coefficient of Price is -0.0518. Price therefore has an inverse relation with Sales: holding the other predictors constant, a 1-unit increase in Price is associated with a 0.0518-unit decrease in Sales.
The coefficient of UrbanYes is -0.2522, meaning that Sales in urban locations (Urban = Yes) are estimated to be 0.2522 units lower than in non-urban locations (Urban = No). However, this coefficient is not significant, so we cannot conclude that sales differ between the two kinds of location.
The coefficient of USYes is 1.1076, meaning that Sales in US stores (US = Yes) are estimated to be 1.1076 units higher than in non-US stores (US = No).
The model in equation form: Sales = 12.9390 - 0.0518 * Price - 0.2522 * UrbanYes + 1.1076 * USYes
where UrbanYes is coded from Urban (Yes = 1, No = 0) and USYes is coded from US (Yes = 1, No = 0).
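To make the dummy coding concrete, here is a minimal worked example (a sketch: the store characteristics are hypothetical and the coefficients are the rounded values above):
#hypothetical store: Price = 120, urban location (UrbanYes = 1), inside the US (USYes = 1)
price <- 120
urban_yes <- 1
us_yes <- 1
12.9390 - 0.0518*price - 0.2522*urban_yes + 1.1076*us_yes #roughly 7.58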
#We run the model with all predictors:
lm_d <- lm_model %>% fit(Sales ~ ., data = Carseats_train)
summary(lm_d$fit)
##
## Call:
## stats::lm(formula = Sales ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6936 -0.7194 0.0321 0.7002 3.3218
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.7465910 0.6717671 8.554 5.72e-16 ***
## CompPrice 0.0958758 0.0046215 20.745 < 2e-16 ***
## Income 0.0138907 0.0021414 6.487 3.51e-10 ***
## Advertising 0.1222171 0.0122055 10.013 < 2e-16 ***
## Population -0.0002724 0.0004241 -0.642 0.521
## Price -0.0973130 0.0030369 -32.044 < 2e-16 ***
## ShelveLocGood 4.9762295 0.1741229 28.579 < 2e-16 ***
## ShelveLocMedium 2.1242422 0.1416622 14.995 < 2e-16 ***
## Age -0.0482196 0.0035436 -13.608 < 2e-16 ***
## Education -0.0198925 0.0218967 -0.908 0.364
## UrbanYes 0.1224443 0.1292138 0.948 0.344
## USYes -0.1913763 0.1659768 -1.153 0.250
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.009 on 307 degrees of freedom
## Multiple R-squared: 0.8758, Adjusted R-squared: 0.8714
## F-statistic: 196.8 on 11 and 307 DF, p-value: < 2.2e-16
After running the model, we look at the p-value column. If a p-value is below 0.05 (i.e. significant at the 5% level), we can reject H0 that the corresponding coefficient is zero.
Hence, in model lm_d, the predictors for which we can reject H0 are: CompPrice, Income, Advertising, Price, ShelveLoc, and Age.
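This selection can also be read off programmatically (a sketch; "Pr(>|t|)" is the p-value column of the summary.lm coefficient table, and the names returned are coefficient names, so the intercept and the individual ShelveLoc dummy levels appear rather than ShelveLoc itself):
#coefficients of lm_d with p-values below 0.05
coefs_d <- summary(lm_d$fit)$coefficients
rownames(coefs_d)[coefs_d[, "Pr(>|t|)"] < 0.05]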
#Build model:
lm_e <- lm_model %>% fit(Sales~ CompPrice+Income+Advertising+Price+ShelveLoc+Age, data=Carseats_train)
summary(lm_e$fit)
##
## Call:
## stats::lm(formula = Sales ~ CompPrice + Income + Advertising +
## Price + ShelveLoc + Age, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5656 -0.7168 -0.0098 0.7051 3.2387
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.371596 0.552951 9.714 < 2e-16 ***
## CompPrice 0.096378 0.004579 21.047 < 2e-16 ***
## Income 0.013815 0.002124 6.504 3.12e-10 ***
## Advertising 0.112813 0.008533 13.220 < 2e-16 ***
## Price -0.097468 0.003023 -32.242 < 2e-16 ***
## ShelveLocGood 4.964289 0.172590 28.764 < 2e-16 ***
## ShelveLocMedium 2.118822 0.139967 15.138 < 2e-16 ***
## Age -0.047803 0.003518 -13.587 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.007 on 311 degrees of freedom
## Multiple R-squared: 0.8747, Adjusted R-squared: 0.8718
## F-statistic: 310 on 7 and 311 DF, p-value: < 2.2e-16
#predict by fitting the test data into the model:
predict_e <- Carseats_test %>% select(Sales) %>% bind_cols(predict(lm_e,new_data=Carseats_test))
#predict result format
colnames(predict_e) <- c("Actual_Sales", "Prediction_Sales")
predict_result_e <- data.frame(predict_e) %>%
kbl(format = "html", caption = "Prediction Sales Value of model lm_e ") %>%
row_spec(row = 0, bold = TRUE, color = "black", background = "#F9EBEA") %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "center")
predict_result_e
|   | Actual_Sales | Prediction_Sales |
|---|---|---|
| 1 | 9.50 | 7.217363 |
| 5 | 4.15 | 5.891126 |
| 6 | 10.81 | 9.603865 |
| 7 | 6.63 | 6.103979 |
| 10 | 4.69 | 6.054428 |
| 14 | 10.96 | 12.131338 |
| 19 | 13.91 | 13.630415 |
| 22 | 12.13 | 11.417162 |
| 31 | 13.55 | 13.573059 |
| 33 | 6.20 | 6.031703 |
| 34 | 8.77 | 8.448503 |
| 41 | 2.07 | 2.423936 |
| 44 | 4.12 | 5.285041 |
| 61 | 8.32 | 7.307499 |
| 64 | 8.47 | 8.543057 |
| 67 | 8.85 | 9.454907 |
| 69 | 13.39 | 11.976861 |
| 76 | 8.55 | 7.293121 |
| 78 | 7.70 | 9.320238 |
| 82 | 7.52 | 6.688363 |
| 84 | 4.42 | 5.324904 |
| 89 | 6.56 | 6.353872 |
| 93 | 4.53 | 6.468892 |
| 102 | 6.20 | 7.985133 |
| 108 | 8.55 | 8.488621 |
| 109 | 3.47 | 3.854714 |
| 119 | 7.57 | 7.113043 |
| 124 | 8.19 | 7.505085 |
| 129 | 4.96 | 4.999746 |
| 134 | 7.62 | 7.254869 |
| 136 | 6.44 | 6.203692 |
| 139 | 10.27 | 9.587087 |
| 145 | 9.09 | 10.383388 |
| 154 | 5.93 | 7.418913 |
| 162 | 2.93 | 3.329014 |
| 163 | 3.63 | 3.310224 |
| 168 | 6.71 | 6.782331 |
| 170 | 11.48 | 11.623185 |
| 177 | 5.61 | 6.027367 |
| 179 | 10.66 | 10.984033 |
| 181 | 4.94 | 4.137205 |
| 189 | 8.07 | 6.776312 |
| 190 | 12.11 | 11.130524 |
| 195 | 7.23 | 7.042283 |
| 197 | 4.10 | 2.559416 |
| 203 | 4.10 | 3.692439 |
| 204 | 2.05 | 2.632459 |
| 208 | 8.19 | 5.149819 |
| 211 | 4.36 | 3.975625 |
| 223 | 7.49 | 7.112802 |
| 226 | 6.68 | 6.316159 |
| 233 | 13.14 | 12.622932 |
| 242 | 12.01 | 10.489708 |
| 248 | 5.04 | 2.458111 |
| 251 | 9.16 | 7.471632 |
| 260 | 5.12 | 5.567375 |
| 262 | 5.71 | 6.101094 |
| 263 | 6.37 | 6.651465 |
| 267 | 9.10 | 10.904849 |
| 269 | 6.53 | 8.033955 |
| 276 | 6.67 | 5.288536 |
| 285 | 6.97 | 4.330774 |
| 287 | 7.53 | 7.421147 |
| 293 | 11.82 | 13.097780 |
| 295 | 12.66 | 13.470748 |
| 296 | 4.21 | 3.796421 |
| 300 | 9.40 | 11.033609 |
| 305 | 11.93 | 10.451071 |
| 318 | 6.41 | 7.356169 |
| 330 | 11.27 | 10.909265 |
| 331 | 4.99 | 5.498732 |
| 341 | 7.50 | 8.340062 |
| 348 | 6.88 | 7.919901 |
| 350 | 9.32 | 11.109457 |
| 367 | 5.98 | 5.861652 |
| 384 | 9.35 | 8.912473 |
| 387 | 5.32 | 6.283298 |
| 390 | 8.44 | 9.207389 |
| 396 | 12.57 | 12.992552 |
| 397 | 6.14 | 7.217852 |
| 398 | 7.41 | 7.407104 |
#Evaluate the prediction
rmse_e <- rmse(predict_e,
truth = Actual_Sales,
estimate = Prediction_Sales)
rsq_e <- rsq(predict_e,
truth = Actual_Sales,
estimate = Prediction_Sales)
#Evaluation result
Evaluation_e <- rbind(rmse_e, rsq_e)
Evaluation_e
## # A tibble: 2 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 1.09
## 2 rsq standard 0.859
Evaluation_compare <- rbind(Evaluation_a, Evaluation_e)
Evaluation_compare
## # A tibble: 4 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 2.39
## 2 rsq standard 0.314
## 3 rmse standard 1.09
## 4 rsq standard 0.859
To evaluate how well the models in (a) and (e) fit the data, we use the following metrics:
RMSE: the square root of the average squared difference between predicted and actual values. The lower the RMSE, the better the model.
The RMSE of model (e) is 1.09 while that of model (a) is 2.39, so model (e) is better: its predictions are, on average, closer to the actual values.
R-squared: the proportion of the variation in the dependent variable that is explained by the model; a higher value indicates a better fit. Because both metrics are computed on the held-out test set rather than on the training data, they also guard against rewarding an overfitted model.
In our models, the R-squared of model (a) is 0.314 while that of model (e) is 0.859, meaning the predictors in model (a) explain about 31.4% of the variation in Sales in the test set, versus about 85.9% for model (e). Thus, model (e) is better than model (a).
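For clarity, a quick sketch of what these yardstick metrics compute for model (a), using the predict_a columns defined above (yardstick's rsq is the squared correlation between actual and predicted values):
#RMSE: square root of the mean squared prediction error on the test set
sqrt(mean((predict_a$Actual_Sales - predict_a$Prediction_Sales)^2))
#R-squared as reported by rsq(): squared Pearson correlation
cor(predict_a$Actual_Sales, predict_a$Prediction_Sales)^2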
#Add 95% prediction intervals to the predictions:
predict_e_int <- predict_e %>% bind_cols(predict(lm_e, new_data = Carseats_test, type = "pred_int"))
#Format the table
colnames(predict_e_int) <- c("Actual_Sales","Prediction_Sales","Lower_Bound","Upper_Bound")
predict_e_int <- data.frame(predict_e_int) %>%
kbl(format = "html", caption = "Prediction Sales Value of model lm_e with 95% prediction intervals") %>%
row_spec(row = 0, bold = TRUE, color = "black", background = "#F9EBEA") %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "center")
predict_e_int
|   | Actual_Sales | Prediction_Sales | Lower_Bound | Upper_Bound |
|---|---|---|---|---|
| 1 | 9.50 | 7.217363 | 5.2173018 | 9.217424 |
| 5 | 4.15 | 5.891126 | 3.8903264 | 7.891925 |
| 6 | 10.81 | 9.603865 | 7.5768546 | 11.630876 |
| 7 | 6.63 | 6.103979 | 4.1031975 | 8.104760 |
| 10 | 4.69 | 6.054428 | 4.0476544 | 8.061202 |
| 14 | 10.96 | 12.131338 | 10.1180327 | 14.144643 |
| 19 | 13.91 | 13.630415 | 11.6082762 | 15.652554 |
| 22 | 12.13 | 11.417162 | 9.4058121 | 13.428512 |
| 31 | 13.55 | 13.573059 | 11.5565614 | 15.589557 |
| 33 | 6.20 | 6.031703 | 4.0108559 | 8.052549 |
| 34 | 8.77 | 8.448503 | 6.4397830 | 10.457222 |
| 41 | 2.07 | 2.423936 | 0.4147083 | 4.433163 |
| 44 | 4.12 | 5.285041 | 3.2904295 | 7.279651 |
| 61 | 8.32 | 7.307499 | 5.2913129 | 9.323685 |
| 64 | 8.47 | 8.543057 | 6.5520334 | 10.534080 |
| 67 | 8.85 | 9.454907 | 7.4568043 | 11.453010 |
| 69 | 13.39 | 11.976861 | 9.9596755 | 13.994047 |
| 76 | 8.55 | 7.293121 | 5.2534078 | 9.332833 |
| 78 | 7.70 | 9.320238 | 7.3251296 | 11.315346 |
| 82 | 7.52 | 6.688363 | 4.6797174 | 8.697008 |
| 84 | 4.42 | 5.324904 | 3.3201012 | 7.329706 |
| 89 | 6.56 | 6.353872 | 4.3627539 | 8.344991 |
| 93 | 4.53 | 6.468892 | 4.4527559 | 8.485028 |
| 102 | 6.20 | 7.985133 | 5.9857888 | 9.984478 |
| 108 | 8.55 | 8.488621 | 6.4877828 | 10.489459 |
| 109 | 3.47 | 3.854714 | 1.8528079 | 5.856619 |
| 119 | 7.57 | 7.113043 | 5.1199997 | 9.106087 |
| 124 | 8.19 | 7.505085 | 5.4787382 | 9.531432 |
| 129 | 4.96 | 4.999746 | 2.9988382 | 7.000654 |
| 134 | 7.62 | 7.254869 | 5.2491435 | 9.260595 |
| 136 | 6.44 | 6.203692 | 4.1855039 | 8.221880 |
| 139 | 10.27 | 9.587087 | 7.5906487 | 11.583526 |
| 145 | 9.09 | 10.383388 | 8.3787832 | 12.387992 |
| 154 | 5.93 | 7.418913 | 5.4119178 | 9.425908 |
| 162 | 2.93 | 3.329014 | 1.3210643 | 5.336964 |
| 163 | 3.63 | 3.310224 | 1.3063470 | 5.314101 |
| 168 | 6.71 | 6.782331 | 4.7873380 | 8.777324 |
| 170 | 11.48 | 11.623185 | 9.6014000 | 13.644971 |
| 177 | 5.61 | 6.027367 | 4.0224632 | 8.032270 |
| 179 | 10.66 | 10.984033 | 8.9726194 | 12.995447 |
| 181 | 4.94 | 4.137205 | 2.1211869 | 6.153223 |
| 189 | 8.07 | 6.776312 | 4.7746122 | 8.778012 |
| 190 | 12.11 | 11.130524 | 9.1129754 | 13.148072 |
| 195 | 7.23 | 7.042283 | 5.0343011 | 9.050265 |
| 197 | 4.10 | 2.559416 | 0.5493750 | 4.569456 |
| 203 | 4.10 | 3.692439 | 1.6929930 | 5.691884 |
| 204 | 2.05 | 2.632459 | 0.6121440 | 4.652774 |
| 208 | 8.19 | 5.149819 | 3.1444366 | 7.155202 |
| 211 | 4.36 | 3.975625 | 1.9750898 | 5.976160 |
| 223 | 7.49 | 7.112802 | 5.1022210 | 9.123383 |
| 226 | 6.68 | 6.316159 | 4.2990073 | 8.333311 |
| 233 | 13.14 | 12.622932 | 10.6171251 | 14.628739 |
| 242 | 12.01 | 10.489708 | 8.4876222 | 12.491794 |
| 248 | 5.04 | 2.458111 | 0.4325565 | 4.483665 |
| 251 | 9.16 | 7.471632 | 5.4547515 | 9.488512 |
| 260 | 5.12 | 5.567375 | 3.5572899 | 7.577459 |
| 262 | 5.71 | 6.101094 | 4.1108002 | 8.091389 |
| 263 | 6.37 | 6.651465 | 4.6546712 | 8.648258 |
| 267 | 9.10 | 10.904849 | 8.8981617 | 12.911536 |
| 269 | 6.53 | 8.033955 | 6.0402738 | 10.027635 |
| 276 | 6.67 | 5.288536 | 3.2743117 | 7.302760 |
| 285 | 6.97 | 4.330774 | 2.3183658 | 6.343183 |
| 287 | 7.53 | 7.421147 | 5.4194976 | 9.422796 |
| 293 | 11.82 | 13.097780 | 11.0760862 | 15.119474 |
| 295 | 12.66 | 13.470748 | 11.4520393 | 15.489457 |
| 296 | 4.21 | 3.796421 | 1.7869752 | 5.805867 |
| 300 | 9.40 | 11.033609 | 9.0251680 | 13.042050 |
| 305 | 11.93 | 10.451071 | 8.4392111 | 12.462930 |
| 318 | 6.41 | 7.356169 | 5.3370397 | 9.375298 |
| 330 | 11.27 | 10.909265 | 8.9002311 | 12.918299 |
| 331 | 4.99 | 5.498732 | 3.4956498 | 7.501813 |
| 341 | 7.50 | 8.340062 | 6.3197782 | 10.360346 |
| 348 | 6.88 | 7.919901 | 5.8904723 | 9.949329 |
| 350 | 9.32 | 11.109457 | 9.0955681 | 13.123346 |
| 367 | 5.98 | 5.861652 | 3.8696936 | 7.853610 |
| 384 | 9.35 | 8.912473 | 6.8991056 | 10.925840 |
| 387 | 5.32 | 6.283298 | 4.2635511 | 8.303046 |
| 390 | 8.44 | 9.207389 | 7.2120575 | 11.202721 |
| 396 | 12.57 | 12.992552 | 10.9766500 | 15.008454 |
| 397 | 6.14 | 7.217852 | 5.2200291 | 9.215675 |
| 398 | 7.41 | 7.407104 | 5.3933592 | 9.420849 |
par(mfrow=c(2,2))
plot(lm_e$fit)
Looking first at the Normal Q-Q plot, the observations approximately follow the straight line and none deviate far from it, so there is no sign of outliers. In terms of leverage, however, a few observations lie far from the rest, for example points 298 and 358.
The other three plots show a similar picture: points 298 and 358 stand out as high-leverage observations.
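As a follow-up check, the leverage values can also be listed directly (a sketch; lm_e$fit is the underlying lm object, so the row labels should match the points flagged in the plots):
#largest hat values (leverage) among the training observations
lev <- hatvalues(lm_e$fit)
head(sort(lev, decreasing = TRUE))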