Use the car data attached to this document (regression.csv) and run a multiple regression to determine which vehicle characteristics significantly predict fuel efficiency (MPG). Your report should include the following:
library(stargazer)
library(ggplot2)
library(tidyverse)
library(sjPlot)
library(corrplot)
data <- read.csv("regression.csv")
head(data,5)
str(data)
'data.frame': 392 obs. of 8 variables:
$ GallonsPer100Miles: num 5.6 6.7 5.6 6.3 5.9 6.7 7.1 7.1 7.1 6.7 ...
$ MPG : num 18 15 18 16 17 15 14 14 14 15 ...
$ Cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
$ Displacement100ci : num 3.07 3.5 3.18 3.04 3.02 4.29 4.54 4.4 4.55 3.9 ...
$ Horsepower100 : num 1.3 1.65 1.5 1.5 1.4 1.98 2.2 2.15 2.25 1.9 ...
$ Weight1000lb : num 3.5 3.69 3.44 3.43 3.45 ...
$ Seconds0to60 : num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
$ Name : chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
In this paper, MPG (miles per gallon) is going to be our dependent variable. MPG represents the fuel efficiency of the vehicles, which is a common dependent variable in automotive analysis. It’s typically used to measure how efficiently a vehicle uses fuel to travel a certain distance. In this dataset, other variables like GallonsPer100Miles or Seconds0to60 can be considered independent variables that affect MPG.
set.seed(16)
ind <- sample(2, nrow(data), replace = T, prob = c(0.8, 0.2))
train_data <- data[ind == 1, ]
test_data <- data[ind == 2, ]
model <- lm(MPG~GallonsPer100Miles+Cylinders + Displacement100ci + Horsepower100 + Weight1000lb + Seconds0to60, data = train_data)
stargazer(model, type = "text")
===============================================
Dependent variable:
---------------------------
MPG
-----------------------------------------------
GallonsPer100Miles -4.496***
(0.217)
Cylinders -0.037
(0.288)
Displacement100ci 0.376
(0.645)
Horsepower100 3.316**
(1.288)
Weight1000lb -1.180*
(0.603)
Seconds0to60 0.205**
(0.088)
Constant 41.183***
(1.911)
-----------------------------------------------
Observations 316
R2 0.879
Adjusted R2 0.876
Residual Std. Error 2.679 (df = 309)
F Statistic 372.488*** (df = 6; 309)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
tab_model(model,
show.se = TRUE,
show.stat = TRUE)
Dependent variable: MPG

| Predictors | Estimates | std. Error | CI | Statistic | p |
|---|---|---|---|---|---|
| (Intercept) | 41.18 | 1.91 | 37.42 – 44.94 | 21.55 | <0.001 |
| GallonsPer100Miles | -4.50 | 0.22 | -4.92 – -4.07 | -20.74 | <0.001 |
| Cylinders | -0.04 | 0.29 | -0.60 – 0.53 | -0.13 | 0.897 |
| Displacement100ci | 0.38 | 0.64 | -0.89 – 1.65 | 0.58 | 0.560 |
| Horsepower100 | 3.32 | 1.29 | 0.78 – 5.85 | 2.57 | 0.011 |
| Weight1000lb | -1.18 | 0.60 | -2.37 – 0.01 | -1.96 | 0.051 |
| Seconds0to60 | 0.20 | 0.09 | 0.03 – 0.38 | 2.32 | 0.021 |
| Observations | 316 | ||||
| R2 / R2 adjusted | 0.879 / 0.876 | ||||
# Collect key fit statistics for the regression model and print them as a bulleted list
fit_stats <- c(
  paste("Root Mean Squared Error (RMSE):", sqrt(mean(model$residuals^2))),
  paste("R-squared:", summary(model)$r.squared),
  paste("Adjusted R-squared:", summary(model)$adj.r.squared),
  paste("AIC:", AIC(model)),
  paste("BIC:", BIC(model))
)
cat(paste0("* ", fit_stats), sep = "\n")
* Root Mean Squared Error (RMSE): 2.64920166828244
* R-squared: 0.878534379489978
* Adjusted R-squared: 0.8761758
* AIC: 1528.50042219633
* BIC: 1558.54635990502
library(report)
report_table(model)
The regression analysis reveals that the dependent variable, MPG (miles per gallon), is significantly influenced by several of the predictors. GallonsPer100Miles has a strong negative relationship with MPG (coef. = -4.496, p < 0.001), indicating that as fuel consumption per 100 miles increases, MPG decreases. Horsepower100 has a positive coefficient (coef. = 3.316, p = 0.011), suggesting that, holding the other predictors constant, higher horsepower is associated with slightly higher MPG. Seconds0to60 is also positive and significant (coef. = 0.205, p = 0.021).
Weight1000lb has a negative coefficient (coef. = -1.180) that is only marginally significant (p = 0.051), implying that heavier vehicles tend to have lower fuel efficiency. Cylinders and Displacement100ci show no statistically significant relationship with MPG. The overall model is highly significant (F statistic = 372.488, p < 0.01) and explains approximately 87.9% of the variance in MPG. Therefore, these variables collectively provide a good understanding of the factors affecting fuel efficiency in vehicles.
In the model above, about 87.6% of the variation in MPG is explained by the predictors, as indicated by an adjusted R-squared of 0.876. The following is the individual contribution of each predictor to the explained variance.
library(relaimpo)
rel_imp <- calc.relimp(model, type = "lmg")
rel_imp
Response variable: MPG
Total response variance: 57.96331
Analysis based on 316 observations
6 Regressors:
GallonsPer100Miles Cylinders Displacement100ci Horsepower100 Weight1000lb Seconds0to60
Proportion of variance explained by model: 87.85%
Metrics are not normalized (rela=FALSE).
Relative importance metrics:
lmg
GallonsPer100Miles 0.32364246
Cylinders 0.11665614
Displacement100ci 0.12898432
Horsepower100 0.12464373
Weight1000lb 0.15154632
Seconds0to60 0.03306141
Average coefficients for different model sizes:

| Predictor | 1X | 2Xs | 3Xs | 4Xs | 5Xs | 6Xs |
|---|---|---|---|---|---|---|
| GallonsPer100Miles | -4.362938 | -4.41356882 | -4.445032945 | -4.46237963 | -4.4796383 | -4.49606126 |
| Cylinders | -3.436396 | -1.21592893 | -0.356786989 | -0.17496077 | -0.1038768 | -0.03741527 |
| Displacement100ci | -5.856599 | -3.25581147 | -1.308349081 | -0.16541115 | 0.2844714 | 0.37604254 |
| Horsepower100 | -15.855329 | -7.57203024 | -3.485475510 | -0.98585009 | 1.2567639 | 3.31639690 |
| Weight1000lb | -7.530221 | -4.99897323 | -3.608579159 | -2.58023926 | -1.7784843 | -1.18041458 |
| Seconds0to60 | 1.187825 | -0.01180123 | -0.003588237 | 0.04921163 | 0.1231131 | 0.20453804 |
The relative importance metrics indicate how much each predictor contributes to the variance explained by the model.
GallonsPer100Miles has the highest importance, with an LMG (Lindeman, Merenda, Gold) value of 0.324 (32.4%), suggesting it contributes the most to explaining the variance in MPG.
Weight1000lb follows with a relative importance of 0.152 (15.2%), indicating a substantial contribution.
Displacement100ci also plays a considerable role, with a relative importance of 0.129 (12.9%).
Horsepower100 (0.125, or 12.5%) and Cylinders (0.117, or 11.7%) make moderate contributions of similar size.
Seconds0to60 has the lowest relative importance at 0.033 (3.3%), suggesting it has the least impact on explaining MPG variance.
Even though Cylinders has no significant coefficient in the full model, the variable still accounts for about 11.7% of the explained variation in MPG.
The average coefficients for different model sizes show the estimated impact of each variable on MPG at various stages of model complexity, with coefficients becoming less extreme as more variables are added to the model. These results can be visualized as shown below.
library(ggplot2)
# Convert rel_imp$lmg to a data frame
df <- data.frame(variable = names(rel_imp$lmg), importance = rel_imp$lmg)
# Arrange the data frame in descending order of relative importance
df <- df[order(df$importance, decreasing = TRUE), ]
# Create the bar plot using ggplot2
ggplot(df, aes(x = reorder(variable, -importance), y = importance)) +
geom_bar(stat = "identity", fill = "skyblue") +
geom_text(aes(label = paste0(round(importance * 100, 3), "%")), vjust = -0.5, size = 3) + # Format labels as percentages
labs(x = "Independent variables", y = "Relative Importance",
title = "A Bar plot of Relative Importance") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The overall significance of the model is determined by looking at the F statistic and its associated p-value. From the results above, the F statistic of 372.488 (df = 6; 309) is statistically significant (p < 0.01), which indicates that the model is significant overall and can be used for prediction.
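For completeness, the overall F statistic and its p-value can be pulled directly from the fitted lm object; a minimal sketch using the model object fitted above:
# Extract the overall F statistic from the model summary and compute its p-value
fstat <- summary(model)$fstatistic
f_pvalue <- pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)
c(F = unname(fstat["value"]), p_value = unname(f_pvalue))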
Model diagnostics include checking the normality of the regression residuals, the constancy (homoscedasticity) of the error variance, and correlation among the predictors. Consider the following plot of the regression residuals.
library(forecast)
checkresiduals(model)
Breusch-Godfrey test for serial correlation of order up to 10
data: Residuals
LM test = 74.69, df = 10, p-value = 5.466e-12
The histogram of the residuals shows a fairly normal distribution with some positive skew, which is not a major concern. We can further assess normality using a Q-Q plot together with a histogram, as shown below; both suggest approximately normally distributed residuals.
library(car)
par(mfrow = c(1, 2))
# Q-Q plot
qqPlot(model$residuals, main = "Q-Q Plot of the regression residuals")
29 328
21 263
# Histogram
hist(model$residuals, col = "lightblue",breaks = 20, main = "Histogram of Residuals")
The plots show a fairly normal distribution of the regression residuals with a mild positive skew, which in this case is not of serious concern. Consider the following inferential results to check the normality of the residuals.
# Shapiro-Wilk test (a formal test of normality)
shapiro.test(model$residuals) # Check p-value for normality
Shapiro-Wilk normality test
data: model$residuals
W = 0.84066, p-value < 2.2e-16
The Shapiro-Wilk test, with its very small p-value, indicates that the regression residuals do not follow a normal distribution. As noted earlier, the residuals appear to have a positively skewed distribution; however, this is not of great concern.
This assumption requires that the variance of the regression residuals be constant across the fitted values. If it is violated, the model suffers from heteroscedasticity. Consider the results below.
library(car)
library(tseries)
ncvTest(model)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 0.001997865, Df = 1, p = 0.96435
Since the p-value (0.964) is greater than the conventional significance level of 0.05, we fail to reject the null hypothesis of constant variance. There is not enough evidence to conclude that the variance of the residuals changes across the range of the fitted values; in other words, the model does not appear to suffer from heteroscedasticity.
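As a hedged cross-check (not part of the original output), the Breusch-Pagan test from the lmtest package tests the same null hypothesis of constant error variance and should point to the same conclusion:
# Breusch-Pagan test: H0 is homoscedastic (constant-variance) residuals
library(lmtest)
bptest(model)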
vif(model)
GallonsPer100Miles Cylinders Displacement100ci Horsepower100
5.497065 10.711602 20.007357 10.299737
Weight1000lb Seconds0to60
11.301805 2.578910
A VIF of 1 indicates no correlation and hence no multicollinearity. A VIF between 1 and 5 indicates moderate correlation; some multicollinearity is present but may not be a major concern. A VIF greater than 5 indicates high correlation and substantial multicollinearity, which can cause issues with the model. Based on the output above, our model suffers from multicollinearity: several of the predictors are likely correlated with one another. Let's look at the correlation plot below.
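Before turning to the correlation plot, here is a hedged illustration (not part of the original analysis) of one common remedy: drop the predictor with the largest VIF, Displacement100ci, refit the model, and recheck the VIFs.
# Refit without the highest-VIF predictor and recompute the VIFs
model_reduced <- lm(MPG ~ GallonsPer100Miles + Cylinders + Horsepower100 + Weight1000lb + Seconds0to60,
data = train_data)
vif(model_reduced)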
# Select the numeric columns from train_data
numeric_data <- train_data[, sapply(train_data, is.numeric)]
# Calculate correlation matrix
correlation_matrix <- cor(numeric_data)
# Create a correlation heatmap using corrplot
corrplot(correlation_matrix,
method = "color", # Display correlations using color
type = "upper", # Show upper triangle of the correlation matrix
tl.col = "black", # Color of text labels
tl.srt = 45, # Rotate text labels by 45 degrees
diag = FALSE, # Exclude diagonal elements
addCoef.col = "black", # Color of correlation coefficients
col = colorRampPalette(c("blue", "white", "red"))(100), # Custom color palette
main = "Correlation Heatmap", # Main title
pch = 16, # Use solid circles for data points
cex.lab = 1.2, # Adjust label size
cex.main = 0.5 , # Adjust main title size
cex.axis = 0.5 , # Adjust axis label size
number.cex = 0.5) # Adjust number size in the color legend
Some variables above are highly correlated. For instance, Displacement and Weight are highly correlated, and Cylinders and Displacement are also highly correlated, with a correlation coefficient of 0.95 as shown in the graph. These are likely reasons why our model suffers from multicollinearity.

#### IV. AIC and BIC
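As a short, hedged sketch for this subsection (the reduced model below is illustrative, not part of the original analysis), AIC and BIC can also be used to compare competing specifications; lower values indicate a better trade-off between fit and complexity.
# Compare the full model against a smaller candidate that drops the non-significant terms
model_small <- lm(MPG ~ GallonsPer100Miles + Horsepower100 + Weight1000lb + Seconds0to60,
data = train_data)
AIC(model, model_small)
BIC(model, model_small)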
library(report)
report_table(model)
# Predict values on the test set
pred_reg <- predict(model, newdata = test_data)
# Add predictions as a new column to the test set
test_data$pred_reg <- pred_reg
head(test_data,10)
library(ggplot2)
library(ggthemes)
# Combine data into a data frame
data1 <- data.frame(Actual_MPG = test_data$MPG, Predicted_MPG = test_data$pred_reg)
# Create line plot
ggplot(data1, aes(x = 1:nrow(data1))) +
geom_line(aes(y = Actual_MPG, color = "Actual MPG")) +
geom_line(aes(y = Predicted_MPG, color = "Predicted MPG")) +
scale_color_manual(name = "Variable", values = c("Actual MPG" = "blue", "Predicted MPG" = "red")) +
labs(x = "Time Axis", y = "MPG", title = "Actual vs Predicted MPG") +
theme_economist()
From the plot above, the predicted MPG tracks the actual MPG fairly closely, with the predictions lying slightly below the actual values on average. This suggests the model captures most of the variation in MPG explained by the predictors, although individual predictions can still miss by a few MPG.
plot(data1$Actual_MPG, data1$Predicted_MPG, xlab = "Observed MPG", ylab = "Predicted MPG", main = "Scatter plot of Observed and Predicted MPG")
abline(lm(data1$Predicted_MPG ~ data1$Actual_MPG), col = "red")
library(report)
head(data1,5)
T_Test <- t.test(data1$Actual_MPG, data1$Predicted_MPG, var.equal = TRUE) # Assuming equal variances
T_Test
Two Sample t-test
data: data1$Actual_MPG and data1$Predicted_MPG
t = 0.46148, df = 150, p-value = 0.6451
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.028528 3.264819
sample estimates:
mean of x mean of y
24.09079 23.47264
The results above indicate that there is no statistically significant difference between the observed MPG and the predicted MPG, since the p-value (0.645) is greater than 0.05. This means that, on average, the model's predictions are in line with the observed values.
Based on the results in (VI) and (VII), I would not revise my answer for (V), since the estimated model appears adequate for prediction.
# Libraries for the machine-learning models below
library(recipes)
library(lava)
library(sjmisc)
library(igraph)
library(e1071)
library(hardhat)
library(ipred)
library(caret)
library(sjPlot)
library(nnet)
library(wakefield)
library(kknn)
library(dplyr)
library(caTools)
library(ROCR)
library(stargazer)
library(ISLR)
library(ISLR2)
library(MASS)
library(splines)
library(splines2)
library(pROC)
library(randomForest)
library(rpart)
library(rpart.plot)
library(rattle)
library(party)
library(partykit)
library(ggplot2)
library(tune)
library(TunePareto)
set.seed(333)
trControl <- trainControl(method = 'repeatedcv',
number = 10,
repeats = 3)
attach(data)
FIT <- train(MPG~GallonsPer100Miles+Cylinders + Displacement100ci + Horsepower100 + Weight1000lb + Seconds0to60,
data = train_data,
tuneGrid = expand.grid(k=1:70),
method = 'knn',
trControl = trControl,
preProc = c('center', 'scale'))
FIT
k-Nearest Neighbors
316 samples
6 predictor
Pre-processing: centered (6), scaled (6)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 286, 285, 283, 284, 284, 284, ...
Resampling results across tuning parameters:
k RMSE Rsquared MAE
1 1.783395 0.9465330 1.267884
2 1.578264 0.9588961 1.083716
3 1.640619 0.9555809 1.118581
4 1.771783 0.9496549 1.240077
5 1.886566 0.9432727 1.294011
6 1.993973 0.9362281 1.360566
7 2.092084 0.9292225 1.444971
8 2.168548 0.9234858 1.479690
9 2.197972 0.9211316 1.507198
10 2.233672 0.9181431 1.525158
11 2.289286 0.9137305 1.558220
12 2.324264 0.9112160 1.585555
13 2.328740 0.9112009 1.597873
14 2.369729 0.9077910 1.627473
15 2.399360 0.9053807 1.657008
16 2.427778 0.9033158 1.689657
17 2.456775 0.9012417 1.718322
18 2.475448 0.8999278 1.733581
19 2.492300 0.8987327 1.750486
20 2.505375 0.8979640 1.765105
21 2.528126 0.8962883 1.789002
22 2.545789 0.8949811 1.809281
23 2.568144 0.8929809 1.830354
24 2.593261 0.8907865 1.848483
25 2.615129 0.8890966 1.863240
26 2.644319 0.8865205 1.884953
27 2.668191 0.8844273 1.902565
28 2.695351 0.8819705 1.923651
29 2.711865 0.8806072 1.939511
30 2.730406 0.8789415 1.953984
31 2.747350 0.8774241 1.966602
32 2.760929 0.8765021 1.980827
33 2.768751 0.8760121 1.994704
34 2.783753 0.8745695 2.008538
35 2.800665 0.8731480 2.022487
36 2.817350 0.8715688 2.038481
37 2.846050 0.8690780 2.060828
38 2.871581 0.8667788 2.080590
39 2.890841 0.8651714 2.092504
40 2.915780 0.8627445 2.112087
41 2.934438 0.8609307 2.128435
42 2.958642 0.8585409 2.152531
43 2.975195 0.8570089 2.169251
44 2.991043 0.8555035 2.184206
45 3.006775 0.8540882 2.199963
46 3.015617 0.8534752 2.209508
47 3.034689 0.8517103 2.223085
48 3.047292 0.8507089 2.233995
49 3.063336 0.8492785 2.249210
50 3.079286 0.8479238 2.260989
51 3.092237 0.8468482 2.276027
52 3.111173 0.8452303 2.287992
53 3.129631 0.8434392 2.302984
54 3.144063 0.8421629 2.316153
55 3.161100 0.8404277 2.329542
56 3.178659 0.8388016 2.345005
57 3.196145 0.8369504 2.359777
58 3.211336 0.8355144 2.368638
59 3.229410 0.8336502 2.379884
60 3.244792 0.8319950 2.390782
61 3.256934 0.8310370 2.402198
62 3.269184 0.8298438 2.411623
63 3.278817 0.8290036 2.418033
64 3.290029 0.8281306 2.428655
65 3.305502 0.8266414 2.439386
66 3.313860 0.8261223 2.447875
67 3.325855 0.8252619 2.456941
68 3.340524 0.8239564 2.469927
69 3.353495 0.8227274 2.483024
70 3.364866 0.8217960 2.494411
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 2.
FIT$modelType
[1] "Regression"
FIT$coefnames
[1] "GallonsPer100Miles" "Cylinders" "Displacement100ci"
[4] "Horsepower100" "Weight1000lb" "Seconds0to60"
The results above show that we have run k-Nearest Neighbors with 316 samples and 6 predictors. We used 10-fold cross-validation repeated three times: the data are divided into ten folds, nine folds are used for model estimation, and the remaining fold is used for model assessment and validation. RMSE was used to select the optimal model (smallest value), and the final value used for the model was k = 2.
plot(FIT, xlab = "Nearest Neighbors", main = "A Plot of Root Mean Square Error (RMSE) for the K-Nearest Neighbors")
The graph above shows that the lowest value of RMSE is found at k = 2. After that point, the value of RMSE increases steadily.
varImp(FIT)
loess r-squared variable importance
Overall
GallonsPer100Miles 100.00
Weight1000lb 76.94
Horsepower100 74.09
Displacement100ci 71.72
Cylinders 58.38
Seconds0to60 0.00
plot(varImp(FIT), xlab = "Percentage Importance",
ylab ="Variables", main = "Variable Importance for the KNN Model")
The output above ranks the variables in our model from the most important to the least important. GallonsPer100Miles is by far the most important variable in the kNN model, while Seconds0to60 is the least important.
pred <- predict(FIT, newdata = test_data)
Because the response variable is numeric and continuous, we will calculate the root mean square error (RMSE) for assessment.
RMSE(pred, test_data$MPG)
[1] 2.614515
plot(pred, test_data$MPG)
FIT <- train(MPG~GallonsPer100Miles+Cylinders + Displacement100ci + Horsepower100 + Weight1000lb + Seconds0to60,
data = test_data,
tuneGrid = expand.grid(k=1:70),
method = 'knn',
metric = 'Rsquared',
trControl = trControl,
preProc = c('center', 'scale'))
FIT
k-Nearest Neighbors
76 samples
6 predictor
Pre-processing: centered (6), scaled (6)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 68, 68, 68, 69, 68, 68, ...
Resampling results across tuning parameters:
k RMSE Rsquared MAE
1 3.642680 0.8285006 2.631189
2 3.381405 0.8510568 2.313700
3 3.286426 0.8518784 2.296541
4 3.308465 0.8533886 2.350619
5 3.292164 0.8564694 2.379178
6 3.364698 0.8492442 2.467220
7 3.376008 0.8517287 2.499675
8 3.369542 0.8497168 2.528384
9 3.394021 0.8517873 2.591897
10 3.388890 0.8519823 2.569699
11 3.435969 0.8496311 2.593895
12 3.496551 0.8458750 2.645470
13 3.541179 0.8414177 2.678483
14 3.605789 0.8358056 2.723712
15 3.645515 0.8336030 2.779853
16 3.707619 0.8300868 2.856962
17 3.798858 0.8231895 2.934223
18 3.874223 0.8162662 2.998468
19 3.906228 0.8168928 3.053208
20 3.965166 0.8143006 3.141060
21 4.012164 0.8128860 3.213832
22 4.075422 0.8092140 3.272805
23 4.130741 0.8052150 3.334578
24 4.130002 0.8084642 3.342645
25 4.173709 0.8070970 3.411567
26 4.253053 0.8014468 3.490420
27 4.353496 0.7950761 3.608816
28 4.456611 0.7879175 3.707909
29 4.551243 0.7848085 3.827875
30 4.646857 0.7821657 3.935700
31 4.753717 0.7764822 4.058242
32 4.831508 0.7750315 4.157505
33 4.913362 0.7711757 4.244205
34 4.986697 0.7685962 4.328727
35 5.061106 0.7677898 4.404852
36 5.111894 0.7718253 4.467672
37 5.162866 0.7758691 4.511663
38 5.228514 0.7772398 4.576750
39 5.308262 0.7750717 4.656066
40 5.401852 0.7710140 4.740367
41 5.508070 0.7640209 4.834506
42 5.602335 0.7580722 4.917074
43 5.696394 0.7514668 4.995238
44 5.782067 0.7428961 5.072626
45 5.856428 0.7405677 5.136728
46 5.950933 0.7354029 5.219848
47 6.052314 0.7305431 5.308826
48 6.150310 0.7206875 5.389123
49 6.241525 0.7176369 5.472246
50 6.356709 0.7112680 5.568984
51 6.467270 0.7004606 5.668242
52 6.584435 0.6930337 5.770255
53 6.717015 0.6840864 5.881993
54 6.847536 0.6727100 5.994121
55 6.983855 0.6559024 6.119296
56 7.116343 0.6399409 6.238169
57 7.232595 0.6234157 6.343109
58 7.352577 0.6112472 6.451494
59 7.466454 0.6000516 6.556106
60 7.578958 0.5908322 6.651006
61 7.689233 0.5768923 6.744090
62 7.794351 0.5739977 6.828971
63 7.899706 0.5751545 6.913336
64 8.019408 0.5657804 7.013417
65 8.145226 0.5589981 7.116412
66 8.256982 0.5323849 7.206872
67 8.354644 0.5100389 7.283514
68 8.436378 0.4836368 7.346707
69 8.462588 0.4507053 7.364396
70 8.465873 0.4424406 7.366180
Rsquared was used to select the optimal model using the largest value.
The final value used for the model was k = 5.
plot(FIT)
varImp(FIT)
loess r-squared variable importance
Overall
GallonsPer100Miles 100.00
Weight1000lb 58.38
Displacement100ci 57.55
Horsepower100 54.10
Cylinders 47.17
Seconds0to60 0.00
plot(varImp(FIT), xlab = "Percentage Importance",
ylab ="Variables", main = "Variable Importance for the KNN Model")
pred_knn <- predict(FIT, newdata = test_data)
test_data <- data.frame(test_data, pred_knn)
head(test_data,5)
str(test_data)
'data.frame': 76 obs. of 10 variables:
$ GallonsPer100Miles: num 5.9 7.1 7.1 5.6 4 4.2 10 9.1 4 6.3 ...
$ MPG : num 17 14 14 18 25 24 10 11 25 16 ...
$ Cylinders : int 8 8 8 6 4 4 8 8 4 6 ...
$ Displacement100ci : num 3.02 4.4 4.55 1.99 1.1 1.07 3.07 3.18 1.13 2.25 ...
$ Horsepower100 : num 1.4 2.15 2.25 0.97 0.87 0.9 2 2.1 0.95 1.05 ...
$ Weight1000lb : num 3.45 4.31 4.42 2.77 2.67 ...
$ Seconds0to60 : num 10.5 8.5 10 15.5 17.5 14.5 15 13.5 14 15.5 ...
$ Name : chr "ford torino" "plymouth fury iii" "pontiac catalina" "amc hornet" ...
$ pred_reg : num 18.2 14.4 15 19.6 26.8 ...
$ pred_knn : num 14.6 14.4 14 19.3 25.3 ...
library(ggplot2)
library(ggthemes)
# Combine data into a data frame
data2 <- data.frame(Actual_MPG = test_data$MPG, Predicted_MPG = test_data$pred_knn)
# Create line plot
ggplot(data2, aes(x = 1:nrow(data2))) +
geom_line(aes(y = Actual_MPG, color = "Actual MPG")) +
geom_line(aes(y = Predicted_MPG, color = "Predicted MPG")) +
scale_color_manual(name = "Variable", values = c("Actual MPG" = "blue", "Predicted MPG" = "red")) +
labs(x = "Time Axis", y = "MPG", title = "Actual vs Predicted MPG") +
theme_economist()
Because the response variable is numeric and continuous, we will calculate the root mean square error (RMSE) for assessment.
RMSE(pred_knn, data2$Actual_MPG)
[1] 2.879546
plot(pred_knn, data2$Actual_MPG, xlab = "Prediction", ylab = "mpg", main = "The scatter plot of predicted and actual mpg")
Interpreting a k-Nearest Neighbors (k-NN) regression model is somewhat different from interpreting a traditional linear regression model. In k-NN regression, instead of fitting a mathematical equation to your data, the model makes predictions based on the values of the k nearest data points in the training dataset. Here’s how you can interpret a k-NN regression model:
For each data point you want to predict, the k-NN algorithm identifies the k nearest data points in the training dataset based on some distance metric (e.g., Euclidean distance). It then calculates the average (or weighted average) of the target values (the values you're trying to predict) for those k nearest data points. This average is used as the prediction for the new data point.
Tuning Parameter (k):
The most important parameter in k-NN regression is “k,” which represents the number of nearest neighbors to consider. A small k (e.g., 1 or 3) may result in a model that closely follows the training data but is sensitive to noise. A large k (e.g., 10 or more) may result in a smoother prediction surface but might not capture local variations. The choice of k should be based on cross-validation and the characteristics of your data.
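To make this mechanism concrete, here is a toy sketch (an assumed helper, not the caret implementation used above) of how a single k-NN regression prediction is formed from the scaled training predictors:
# Scale the training predictors; k-NN distances are computed on this scaled matrix
predictors <- c("GallonsPer100Miles", "Cylinders", "Displacement100ci",
"Horsepower100", "Weight1000lb", "Seconds0to60")
train_x <- scale(as.matrix(train_data[, predictors]))
# Predict one new point: average the MPG of the k nearest training rows (Euclidean distance)
knn_predict_one <- function(new_x, train_x, train_y, k = 5) {
  d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  mean(train_y[order(d)[1:k]])
}
# Example: scale the first test row with the training means/SDs, then predict its MPG
new_x <- (unlist(test_data[1, predictors]) - attr(train_x, "scaled:center")) /
attr(train_x, "scaled:scale")
knn_predict_one(new_x, train_x, train_data$MPG, k = 5)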
The choice of distance metric (e.g., Euclidean, Manhattan, etc.) can impact the model’s performance. Different distance metrics can lead to different interpretations of “closeness” among data points.
Unlike linear regression, k-NN regression can capture non-linear relationships in the data. Interpretation may not involve coefficients or slope values as in linear regression.
k-NN regression captures local patterns in the data. Interpretation involves understanding the local behavior around a prediction point. The model does not provide a global equation that describes the entire dataset.
Visualizations can be helpful for interpretation. Plotting the k nearest neighbors for specific data points can provide insights into why the model made a particular prediction.
k-NN does not provide feature importance scores like some other models (e.g., Random Forest or Gradient Boosting). However, you can still analyze feature importance indirectly by examining which features are most influential in determining the nearest neighbors.
Use appropriate regression evaluation metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), or R-squared to assess model performance. These metrics can give you a sense of how well the k-NN regression model fits the data.
Scaling and normalization of features can significantly affect the results, as k-NN is sensitive to the scale of input variables. Interpretation may involve considering the effects of feature scaling. In summary, interpreting a k-NN regression model involves understanding its prediction process, the choice of hyperparameters (especially k), the impact of distance metrics, and the local nature of the model’s predictions. Visualizations and performance metrics play a crucial role in assessing and explaining the model’s behavior on your specific dataset.
In k-Nearest Neighbors (k-NN) regression, accuracy is not typically used as an evaluation metric because k-NN regression is a type of regression, not classification. Accuracy is more suitable for classification problems where you’re predicting discrete class labels.
For regression tasks, you typically use different evaluation metrics to assess the performance of your model. Common metrics for regression tasks include:
Mean Absolute Error (MAE): measures the average absolute difference between the predicted and actual values. It is less sensitive to outliers than the Mean Squared Error.
Mean Squared Error (MSE): measures the average of the squared differences between predicted and actual values. It gives higher weight to larger errors.
Root Mean Squared Error (RMSE): the square root of the MSE; it provides an interpretable measure of the average error in the same units as the target variable.
R-squared: measures the proportion of the variance in the dependent variable that is predictable from the independent variables. A higher R-squared indicates a better fit. These metrics can be computed directly for the kNN predictions on the test set, as sketched below.
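A brief hedged sketch (the objects pred_knn and test_data come from the earlier kNN step) computing these metrics for the kNN test-set predictions:
# Compute MAE, MSE, RMSE and R-squared for the kNN predictions on the test set
err <- test_data$MPG - pred_knn
mae <- mean(abs(err))
mse <- mean(err^2)
rmse <- sqrt(mse)
r2 <- cor(test_data$MPG, pred_knn)^2
c(MAE = mae, MSE = mse, RMSE = rmse, R2 = r2)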
Support Vector Regressor (SVR) is a supervised machine learning algorithm used for regression tasks. It’s based on the concept of Support Vector Machines (SVMs) typically used for classification problems. Here’s how SVR works:
The goal of SVR is to find a hyperplane (in high dimensional space for non-linear cases) that best fits the training data while minimizing the prediction error. This hyperplane minimizes the distance to the closest data points, called support vectors. Unlike linear regression, SVR allows for a certain margin of error around the hyperplane.
SVR can handle non-linear relationships between features and target variables using kernel functions. These functions project the data into a higher-dimensional space where a linear relationship might exist.
SVR is less sensitive to outliers compared to some other regression methods. This is because it focuses on the support vectors rather than being influenced by all data points equally.
SVR inherently performs regularization by controlling the margin of error around the hyperplane. This helps to prevent overfitting.
R code for Support Vector Regression with the kernlab package:
# Install kernlab package if not already installed
if(!require(kernlab)) install.packages("kernlab")
library(kernlab)
Split your data into training and testing sets.
Ensure your target variable is numeric.
set.seed(16)
ind <- sample(2, nrow(data), replace = T, prob = c(0.8, 0.2))
train_data <- data[ind == 1, ]
test_data <- data[ind == 2, ]
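Before the caret workflow below, a hedged kernlab sketch (the C and epsilon values are illustrative defaults, not tuned): ksvm() fits an epsilon-SVR with a radial basis kernel directly.
# Fit a radial-kernel support vector regression directly with kernlab
svr_fit <- ksvm(MPG ~ GallonsPer100Miles + Cylinders + Displacement100ci +
Horsepower100 + Weight1000lb + Seconds0to60,
data = train_data, kernel = "rbfdot", C = 1, epsilon = 0.1)
# Predict on the held-out test set
head(predict(svr_fit, test_data))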
library(caret)
# caret's default bootstrap resampling (25 reps) is used, as shown in the output below
model_svm <- train(
MPG ~ GallonsPer100Miles + Cylinders + Displacement100ci + Horsepower100 + Weight1000lb + Seconds0to60,
data = train_data,
method = 'svmRadial',
preProcess = c("center", "scale")
)
model_svm
Support Vector Machines with Radial Basis Function Kernel
316 samples
6 predictor
Pre-processing: centered (6), scaled (6)
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 316, 316, 316, 316, 316, 316, ...
Resampling results across tuning parameters:
C RMSE Rsquared MAE
0.25 2.181329 0.9215166 1.0832816
0.50 1.855165 0.9412598 0.9165128
1.00 1.592084 0.9555932 0.8222594
Tuning parameter 'sigma' was held constant at a value of 0.4651685
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.4651685 and C = 1.
pred_tr_svm <- predict(model_svm, train_data)
pred_tst_svm <- predict(model_svm, test_data)
fit_ind_tr_svm <- data.frame(
R2 = R2(pred_tr_svm, train_data$MPG),
RMSE = RMSE(pred_tr_svm, train_data$MPG)
)
fit_ind_tst_svm <- data.frame(
R2 = R2(pred_tst_svm, test_data$MPG),
RMSE = RMSE(pred_tst_svm, test_data$MPG)
)
data.frame(
Model = c("SVM Train", "SVM Test"),
R2 = c(fit_ind_tr_svm$R2, fit_ind_tst_svm$R2),
RMSE = c(fit_ind_tr_svm$RMSE, fit_ind_tst_svm$RMSE)
)
The R-squared for the test data of the SVM model is 0.9309281, which means that the independent variables explain 93.09% of the variance in the dependent variable on the test data. The difference between the training and test fit indices (R-squared and RMSE) is less than 5 percent, so we can say there is no over-fitting or under-fitting in the model.
Leave-One-Out Cross Validation (LOOCV) is a technique used to estimate the performance of a machine learning model. Here’s a breakdown of how it works:
LOOCV iterates through your entire dataset one sample at a time. For each iteration, it removes a single sample from the dataset. This removed sample becomes the “test set” for that particular iteration.
The remaining data points (all except the removed sample) become the “training set”. The model is then trained using this training set.
Once the model is trained, it’s used to predict the value of the target variable for the removed “test set” sample. This prediction is compared to the actual value of the target variable in the removed sample. The error between the predicted and actual value is calculated (e.g., squared error for regression).
This process of removing a sample, training the model, predicting, and calculating error is repeated for every single sample in the dataset. Each sample gets a chance to be the “test set” once. Finally, the errors from all iterations are averaged to obtain a single estimate of the model’s performance on unseen data.
tcr_loocv <- trainControl(method = "LOOCV")
model_loocv <- train(
MPG~GallonsPer100Miles+Cylinders + Displacement100ci + Horsepower100 + Weight1000lb + Seconds0to60,
data = train_data,
method="svmRadial",
trControl = tcr_loocv)
pred_tst_loocv <- predict(model_loocv, test_data)
fit_ind_tst_loocv <- data.frame(
R2 = R2(pred_tst_loocv, test_data$MPG),
RMSE = RMSE(pred_tst_loocv, test_data$MPG)
)
print(model_loocv)
Support Vector Machines with Radial Basis Function Kernel
316 samples
6 predictor
No pre-processing
Resampling: Leave-One-Out Cross-Validation
Summary of sample sizes: 315, 315, 315, 315, 315, 315, ...
Resampling results across tuning parameters:
C RMSE Rsquared MAE
0.25 2.249977 0.9214763 1.0396851
0.50 1.848800 0.9458065 0.8437484
1.00 1.509427 0.9640092 0.7274306
Tuning parameter 'sigma' was held constant at a value of 0.4126495
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.4126495 and C = 1.
fit_ind_tst_loocv
K-Fold Cross Validation (K-Fold CV) and Repeated K-Fold Cross Validation (Repeated K-Fold CV) are both techniques used to evaluate the performance of machine learning models. They share some similarities but also have key differences:
K-Fold CV splits the data into k folds (groups) of (almost) equal size. A common choice for k is 10.
The following steps are repeated k times: one fold is chosen as the test set for evaluation; the remaining k-1 folds are combined to form the training set; the model is trained on the training set; the trained model is used to make predictions on the test set; and the performance of the model is evaluated using a metric such as accuracy, error rate, or AUC (for classification) or R-squared and RMSE (for regression).
Performance estimation: the performance metrics from all k iterations are averaged to obtain a final estimate of the model's generalizability (performance on unseen data).
| Feature | LOOCV | K-Fold CV |
|---|---|---|
| Data Splitting | n folds (n samples) | k folds (equal size) |
| Iterations | n | k |
| Computational Cost | High (n model trainings) | Lower (k model trainings) |
| Performance Estimate | Can be pessimistic | Less variance |
| Common Use Cases | Small datasets, understanding concepts | Most practical scenarios |
tcr_cv <- trainControl(method = "cv", number = 10)
model_cv <- train(
MPG~GallonsPer100Miles+Cylinders + Displacement100ci + Horsepower100 + Weight1000lb + Seconds0to60,
data = train_data,
method="svmRadial",
trControl = tcr_cv)
pred_tst_cv <- predict(model_cv, test_data)
fit_ind_tst_cv <- data.frame(
R2 = R2(pred_tst_cv, test_data$MPG),
RMSE = RMSE(pred_tst_cv, test_data$MPG)
)
model_cv
Support Vector Machines with Radial Basis Function Kernel
316 samples
6 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 284, 284, 284, 286, 284, 285, ...
Resampling results across tuning parameters:
C RMSE Rsquared MAE
0.25 2.218443 0.9227143 1.1399095
0.50 1.810435 0.9461030 0.9165103
1.00 1.484589 0.9642400 0.7832534
Tuning parameter 'sigma' was held constant at a value of 0.5265069
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.5265069 and C = 1.
fit_ind_tst_cv
Repeated K-Fold Cross Validation:
Outer loop: Repeated K-Fold CV introduces an additional outer loop that repeats the entire K-Fold CV process r times (e.g., r = 3).
Inner K-Fold CV: within each outer-loop iteration, the regular K-Fold CV procedure (described above) is performed with the chosen value of k.
Performance aggregation: after completing all r x k iterations, the performance metrics are collected and averaged across all folds and repetitions.
Key differences:
Number of iterations: K-Fold CV iterates k times, while Repeated K-Fold CV iterates k times within each of the r outer-loop repetitions.
Variance reduction: Repeated K-Fold CV aims to reduce the variance of the performance estimate obtained from a single run of K-Fold CV. Different data splits in K-Fold CV can lead to slightly different performance estimates; by repeating the process multiple times, Repeated K-Fold CV provides a more stable estimate.
Computational cost: Repeated K-Fold CV is computationally more expensive than K-Fold CV due to the additional outer-loop repetitions.
Choosing between them:
K-Fold CV is a good starting point for most cases; it is simpler to implement and less computationally expensive. Repeated K-Fold CV is worth considering if you suspect high variance in the performance estimates from K-Fold CV and have sufficient computational resources. In summary, K-Fold CV provides a basic and efficient way to evaluate model performance, while Repeated K-Fold CV adds an extra layer of stability by averaging performance estimates from multiple K-Fold CV runs.
| Feature | K-Fold Cross Validation | Repeated K-Fold Cross Validation |
|---|---|---|
| Definition | Divides the dataset into k subsets | Similar to k-fold, but the process is repeated multiple times with different random splits of the data |
| Number of Iterations | One iteration | Multiple iterations (repeats) |
| Randomization | Typically, data is shuffled once and divided into k folds | Randomization occurs multiple times, creating new random splits for each iteration |
| Variability Reduction | Provides a single estimate of model performance | Reduces variability by averaging performance metrics over multiple iterations |
| Performance Estimation | Estimates model performance based on one split of the data | Provides more robust estimates of model performance by averaging over multiple splits |
| Computationally Expensive | Less computationally expensive compared to repeated k-fold | More computationally expensive due to multiple iterations, especially with large datasets |
| Use Cases | Suitable for initial model evaluation or when computational resources are limited | Recommended when a more reliable estimate of model performance is required, or when dealing with small datasets |
| Implementation | Can be easily implemented using built-in functions in most machine learning libraries | Implementation may require custom code or specialized functions that support repeated cross-validation |
tcr_cv <- trainControl(method = "repeatedcv", number = 10, repeats=3)
model_rep_cv <- train(
MPG~GallonsPer100Miles+Cylinders + Displacement100ci + Horsepower100 + Weight1000lb + Seconds0to60,
data = train_data,
method="svmRadial",
trControl = tcr_cv)
pred_tst_rep_cv <- predict(model_rep_cv, test_data)
fit_ind_tst_rep_cv <- data.frame(
R2 = R2(pred_tst_rep_cv, test_data$MPG),
RMSE = RMSE(pred_tst_rep_cv, test_data$MPG)
)
model_rep_cv
Support Vector Machines with Radial Basis Function Kernel
316 samples
6 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 284, 283, 284, 284, 286, 284, ...
Resampling results across tuning parameters:
C RMSE Rsquared MAE
0.25 2.369185 0.9118999 1.2007868
0.50 1.984428 0.9363234 0.9875723
1.00 1.668257 0.9555694 0.8582554
Tuning parameter 'sigma' was held constant at a value of 0.6495909
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.6495909 and C = 1.
fit_ind_tst_rep_cv
fit_indices_table <- data.frame(
Model = c("SVM", "LOOCV", "k-fold CV", "Repeated k-fold CV"),
R2 = c(fit_ind_tst_svm$R2, fit_ind_tst_loocv$R2, fit_ind_tst_cv$R2, fit_ind_tst_rep_cv$R2),
RMSE = c(fit_ind_tst_svm$RMSE, fit_ind_tst_loocv$RMSE, fit_ind_tst_cv$RMSE, fit_ind_tst_rep_cv$RMSE)
)
fit_indices_table
best_model <- fit_indices_table[which.max(fit_indices_table$R2), ]
print(best_model)
Model R2 RMSE
2 LOOCV 0.9630812 1.770673
Hence, the best model is the one tuned with Leave-One-Out Cross-Validation (LOOCV), and it can be used for better prediction results.
pred_tst_loocv <- predict(model_loocv, test_data)
# Combine data into a data frame
data3 <- data.frame(Actual_MPG = test_data$MPG, Predicted_MPG = pred_tst_loocv)
head(data3,15)
# Create line plot
ggplot(data3, aes(x = 1:nrow(data3))) +
geom_line(aes(y = Actual_MPG, color = "Actual MPG")) +
geom_line(aes(y = Predicted_MPG, color = "Predicted MPG")) +
scale_color_manual(name = "Variable", values = c("Actual MPG" = "blue", "Predicted MPG" = "red")) +
labs(x = "Time Axis", y = "MPG", title = "Actual vs Predicted MPG") +
theme_economist()
The plot above compares the actual MPG with the values predicted for the testing data (test_data) by the LOOCV-tuned model.