Use the vehicle data attached to this document and run a multiple regression to determine which vehicle characteristics significantly predict fuel efficiency (MPG). Your report should include the following:
library(stargazer)
library(ggplot2)
library(tidyverse)
library(sjPlot)
library(corrplot)
data <- read.csv("regression.csv")
head(data,5)
str(data)
'data.frame': 392 obs. of 8 variables:
$ GallonsPer100Miles: num 5.6 6.7 5.6 6.3 5.9 6.7 7.1 7.1 7.1 6.7 ...
$ MPG : num 18 15 18 16 17 15 14 14 14 15 ...
$ Cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
$ Displacement100ci : num 3.07 3.5 3.18 3.04 3.02 4.29 4.54 4.4 4.55 3.9 ...
$ Horsepower100 : num 1.3 1.65 1.5 1.5 1.4 1.98 2.2 2.15 2.25 1.9 ...
$ Weight1000lb : num 3.5 3.69 3.44 3.43 3.45 ...
$ Seconds0to60 : num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
$ Name : chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
In this paper, MPG (miles per gallon) is the dependent variable. MPG represents the fuel efficiency of the vehicles, i.e., how far a vehicle travels on a gallon of fuel, and is a common response variable in automotive analysis. The other variables in this dataset, such as GallonsPer100Miles or Seconds0to60, are treated as independent variables that may affect MPG. Note that GallonsPer100Miles is, by construction, essentially 100/MPG, so it is very strongly related to the dependent variable; this should be kept in mind when interpreting the model fit below.
# Select the first 300 rows as the estimation sample
sample_data <- data[1:300, ]
head(sample_data,5)
attach(sample_data)
model <- lm(MPG~GallonsPer100Miles+Cylinders + Displacement100ci + Horsepower100 + Weight1000lb + Seconds0to60, data = sample_data)
stargazer(model, type = "text")
===============================================
Dependent variable:
---------------------------
MPG
-----------------------------------------------
GallonsPer100Miles -3.336***
(0.175)
Cylinders 0.195
(0.220)
Displacement100ci -0.541
(0.468)
Horsepower100 3.115***
(0.896)
Weight1000lb -1.650***
(0.441)
Seconds0to60 0.143**
(0.070)
Constant 37.912***
(1.481)
-----------------------------------------------
Observations 300
R2 0.902
Adjusted R2 0.900
Residual Std. Error 1.999 (df = 293)
F Statistic 450.982*** (df = 6; 293)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
tab_model(model,
show.se = TRUE,
show.stat = TRUE)
Dependent variable: MPG

| Predictors | Estimates | std. Error | CI | Statistic | p |
|---|---|---|---|---|---|
| (Intercept) | 37.91 | 1.48 | 35.00 – 40.83 | 25.60 | <0.001 |
| GallonsPer100Miles | -3.34 | 0.17 | -3.68 – -2.99 | -19.08 | <0.001 |
| Cylinders | 0.20 | 0.22 | -0.24 – 0.63 | 0.89 | 0.375 |
| Displacement100ci | -0.54 | 0.47 | -1.46 – 0.38 | -1.16 | 0.249 |
| Horsepower100 | 3.12 | 0.90 | 1.35 – 4.88 | 3.48 | 0.001 |
| Weight1000lb | -1.65 | 0.44 | -2.52 – -0.78 | -3.74 | <0.001 |
| Seconds0to60 | 0.14 | 0.07 | 0.01 – 0.28 | 2.04 | 0.042 |
| Observations | 300 | | | | |
| R2 / R2 adjusted | 0.902 / 0.900 | | | | |
# Collect the fit indices and print them as a bulleted list
s <- summary(model)
c(paste("Root Mean Squared Error (RMSE):", sqrt(mean(model$residuals^2))),
  paste("R-squared:", s$r.squared),
  paste("Adjusted R-squared:", s$adj.r.squared),
  paste("AIC:", AIC(model)),
  paste("BIC:", BIC(model))) %>%
  paste0("* ", .) %>%
  cat(sep = "\n")
* Root Mean Squared Error (RMSE): 1.97551731955362
* R-squared: 0.902297279731393
* Adjusted R-squared: 0.900296541433
* AIC: 1275.86129872615
* BIC: 1305.4915585234
library(report)
report_table(model)
The regression analysis reveals that the dependent variable, MPG (miles per gallon), is significantly influenced by several independent variables. GallonsPer100Miles has a strong negative relationship with MPG (coef. = -3.336, p < 0.001), indicating that as gallons per 100 miles increase, MPG decreases. Horsepower100 has a positive coefficient (coef. = 3.115, p = 0.001), suggesting that, holding the other predictors constant, higher horsepower is associated with higher MPG.
Weight1000lb also negatively impacts MPG (coef. = -1.650, p < 0.001), implying that heavier vehicles tend to have lower fuel efficiency. Cylinders and Displacement100ci show no statistically significant relationship with MPG, while Seconds0to60 is significant at the 5% level (coef. = 0.143, p = 0.042). The overall model is highly significant (F statistic = 450.982, p < 0.001) and explains approximately 90.2% of the variance in MPG. Therefore, these variables collectively provide a good understanding of the factors affecting fuel efficiency in vehicles.
In the model above, 90% of the variation in MPG is explained by the predictors, as indicated by an adjusted R-squared of 0.900. The following is the individual contribution of each predictor to the predicted variable.
library(relaimpo)
rel_imp <- calc.relimp(model, type = "lmg")
rel_imp
Response variable: MPG
Total response variance: 40.07791
Analysis based on 300 observations
6 Regressors:
GallonsPer100Miles Cylinders Displacement100ci Horsepower100 Weight1000lb Seconds0to60
Proportion of variance explained by model: 90.23%
Metrics are not normalized (rela=FALSE).
Relative importance metrics:
lmg
GallonsPer100Miles 0.29081991
Cylinders 0.12976856
Displacement100ci 0.14284670
Horsepower100 0.12872477
Weight1000lb 0.17336539
Seconds0to60 0.03677194
Average coefficients for different model sizes:
                            1X        2Xs         3Xs         4Xs         5Xs        6Xs
GallonsPer100Miles   -3.738533 -3.4945755 -3.37725750 -3.33128765 -3.32524451 -3.3364268
Cylinders            -2.935299 -1.1715011 -0.39684515 -0.13716122  0.03160863  0.1954997
Displacement100ci    -4.891259 -2.9386899 -1.52718236 -0.75064198 -0.50623353 -0.5408844
Horsepower100       -12.582389 -5.5819036 -2.23306878 -0.29850580  1.45099326  3.1154685
Weight1000lb         -6.290453 -4.6067756 -3.59593620 -2.77444613 -2.12232776 -1.6502999
Seconds0to60          1.037865 -0.1049534 -0.09138239 -0.03052353  0.05194459  0.1430547
The relative importance metrics indicate how much each predictor contributes to the variance explained by the model.
GallonsPer100Miles has the highest importance with an LMG (Lindeman, Merenda, Gold) value of 0.291 (29.1%), suggesting it contributes the most to explaining the variance in MPG.
Weight1000lb follows with a relative importance of 0.173 (17.3%), indicating its significant contribution to explaining MPG variance.
Displacement100ci also plays a considerable role with a relative importance of 0.143 (14.3%).
Horsepower100 and Cylinders have similar importance values of around 0.129 (12.9%), indicating their moderate contributions.
Seconds0to60 has the lowest relative importance at 0.037 (3.7%), suggesting it has the least impact on explaining MPG variance.
Even though Cylinders had no statistically significant coefficient, the variable still explains 0.1298 (12.98%) of the variation in MPG.
The average coefficients for different model sizes show the estimated impact of each variable on MPG at various stages of model complexity, with coefficients becoming less extreme as more variables are added to the model. These results can be visualized as shown below.
library(ggplot2)
# Convert rel_imp$lmg to a data frame
df <- data.frame(variable = names(rel_imp$lmg), importance = rel_imp$lmg)
# Arrange the data frame in descending order of relative importance
df <- df[order(df$importance, decreasing = TRUE), ]
# Create the bar plot using ggplot2
ggplot(df, aes(x = reorder(variable, -importance), y = importance)) +
geom_bar(stat = "identity", fill = "skyblue") +
geom_text(aes(label = paste0(round(importance * 100, 3), "%")), vjust = -0.5, size = 3) + # Format labels as percentages
labs(x = "Independent variables", y = "Relative Importance",
title = "A Bar plot of Relative Importance") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The overall significance of the model is determined from the F-statistic and its associated p-value. The F-statistic of 450.982 (df = 6; 293) is statistically significant, indicating that the model as a whole is significant and can be used for prediction.
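As a quick check, the F-statistic and its p-value can be extracted directly from the fitted object; a minimal sketch:
# Extract the overall F-statistic from the lm summary and compute its p-value
fstat <- summary(model)$fstatistic
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)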
The model diagnostics include the normality of the regression residuals, homoscedasticity of the error term, and the correlation among the predictors. Consider the following plot of the regression residuals.
library(forecast)
checkresiduals(model)
Breusch-Godfrey test for serial correlation of order up to 10
data: Residuals
LM test = 62.869, df = 10, p-value = 1.034e-09
The histogram of the residuals shows a fairly normal distribution with some positive skew, which is not of much concern. We can further assess normality using a Q-Q plot together with a histogram, as shown below.
library(car)
par(mfrow = c(1, 2))
# Q-Q plot
qqPlot(model$residuals, main = "Q-Q Plot of the regression residuals")
[1] 243 246
# Histogram
hist(model$residuals, col = "lightblue",breaks = 20, main = "Histogram of Residuals")
The plots show a fairly normal distribution of the regression residuals with minor positive skew, which in this case is not of serious concern. Consider the following inferential results to check the normality of the residuals.
# Shapiro-Wilk test (a formal test of normality)
shapiro.test(model$residuals) # Check p-value for normality
Shapiro-Wilk normality test
data: model$residuals
W = 0.86732, p-value = 2.179e-15
The results above, with the accompanying p-value, show that the regression residuals do not follow a normal distribution. As noted earlier, the residuals appear to be positively skewed; however, this is not of great concern here.
The homoscedasticity assumption requires that the variance of the regression residuals be constant. If it is violated, the model suffers from heteroscedasticity. Consider the results below.
library(car)
library(tseries)
ncvTest(model)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 14.1872, Df = 1, p = 0.00016549
Since the p-value is less than the conventional significance level of 0.05, we reject the null hypothesis of constant variance. There is enough evidence to conclude that the variance of the residuals changes across the range of the fitted values; in other words, our model suffers from heteroscedasticity.
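One common remedy, shown here only as a sketch using the lmtest and sandwich packages, is to re-test the coefficients with heteroscedasticity-consistent (HC) standard errors:
library(lmtest)
library(sandwich)
# Coefficient tests with HC1 robust standard errors
coeftest(model, vcov = vcovHC(model, type = "HC1"))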
vif(model)
GallonsPer100Miles Cylinders Displacement100ci Horsepower100
5.849554 11.193871 19.488669 9.754632
Weight1000lb Seconds0to60
11.374802 2.807855
As a rule of thumb, a VIF of 1 indicates no correlation with the other predictors (no multicollinearity); a VIF between 1 and 5 indicates moderate correlation, some multicollinearity that may not be a major concern; and a VIF greater than 5 indicates high correlation and significant multicollinearity, which makes the individual coefficient estimates unstable. Based on the output above, our model suffers from multicollinearity, so the individual coefficients should be interpreted with caution even though the model may still predict well. Some variables are likely correlated; one option is to refit a reduced model, as sketched below, and the correlation plot that follows shows which pairs are responsible.
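A minimal sketch of that idea, assuming we drop Cylinders and Displacement100ci (two of the predictors flagged by the VIFs) and recheck the remaining VIFs:
# Hypothetical reduced model: drop the most collinear predictors and recheck VIF
model_reduced <- lm(MPG ~ GallonsPer100Miles + Horsepower100 + Weight1000lb + Seconds0to60,
                    data = sample_data)
vif(model_reduced)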
# Select the numeric columns from the car data
numeric_data <- data[, sapply(data, is.numeric)]
# Calculate correlation matrix
correlation_matrix <- cor(numeric_data)
# Create a correlation heatmap using corrplot
corrplot(correlation_matrix,
method = "color", # Display correlations using color
type = "upper", # Show upper triangle of the correlation matrix
tl.col = "black", # Color of text labels
tl.srt = 45, # Rotate text labels by 45 degrees
diag = FALSE, # Exclude diagonal elements
addCoef.col = "black", # Color of correlation coefficients
col = colorRampPalette(c("blue", "white", "red"))(100), # Custom color palette
main = "Correlation Heatmap", # Main title
pch = 16, # Use solid circles for data points
cex.lab = 1.2, # Adjust label size
cex.main = 0.5 , # Adjust main title size
cex.axis = 0.5 , # Adjust axis label size
number.cex = 0.5) # Adjust number size in the color legend
Some variables above are highly correlated. For instance, Displacement100ci and Weight1000lb are highly correlated, and Cylinders and Displacement100ci have a correlation coefficient of 0.95, as shown in the graph. These are likely reasons why our model suffers from multicollinearity.
#### IV. AIC and BIC
library(report)
report_table(model)
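report_table() reports AIC and BIC alongside the other fit indices. As a complementary sketch, AIC-based backward selection with step() can be compared against the full model; lower AIC/BIC values indicate a better trade-off between fit and complexity:
# AIC-based backward selection (step() is in base R's stats package)
model_step <- step(model, direction = "backward", trace = FALSE)
AIC(model, model_step)  # compare information criteria; lower is better
BIC(model, model_step)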
sample_data1 <- data[301:392,]
str(sample_data1)
'data.frame': 92 obs. of 8 variables:
$ GallonsPer100Miles: num 2.9 3.1 2.7 3.5 3.5 3.7 3 2.4 2.6 3.1 ...
$ MPG : num 34.5 31.8 37.3 28.4 28.8 26.8 33.5 41.5 38.1 32.1 ...
$ Cylinders : int 4 4 4 4 6 6 4 4 4 4 ...
$ Displacement100ci : num 1.05 0.85 0.91 1.51 1.73 1.73 1.51 0.98 0.89 0.98 ...
$ Horsepower100 : num 0.7 0.65 0.69 0.9 1.15 1.15 0.9 0.76 0.6 0.7 ...
$ Weight1000lb : num 2.15 2.02 2.13 2.67 2.6 ...
$ Seconds0to60 : num 14.9 19.2 14.7 16 11.3 12.9 13.2 14.7 18.8 15.5 ...
$ Name : chr "plymouth horizon tc3" "datsun 210" "fiat strada custom" "buick skylark limited" ...
# Predict values on the test set
predictions <- predict(model, newdata = sample_data1)
# Add predictions as a new column to the test set
sample_data1$predicted_value <- predictions
head(sample_data1,10)
library(ggplot2)
library(ggthemes)
# Combine data into a data frame
data1 <- data.frame(Actual_MPG = sample_data1$MPG, Predicted_MPG = sample_data1$predicted_value)
# Create line plot
ggplot(data1, aes(x = 1:nrow(data1))) +
geom_line(aes(y = Actual_MPG, color = "Actual MPG")) +
geom_line(aes(y = Predicted_MPG, color = "Predicted MPG")) +
scale_color_manual(name = "Variable", values = c("Actual MPG" = "blue", "Predicted MPG" = "red")) +
labs(x = "Time Axis", y = "MPG", title = "Actual vs Predicted MPG") +
theme_economist()
From the plot above, the model does not seem to work well, since the predicted MPG lies noticeably below the actual MPG, and this is a serious concern. It means that our model is not correctly capturing the variation in MPG as explained by the predictors.
plot(sample_data1$MPG, sample_data1$predicted_value, xlab = "Observed MPG", ylab = "Predicted MPG", main = "Scatter plot of Observed and Predicted MPG")
abline(lm(sample_data1$MPG ~ sample_data1$predicted_value), col = "red")
library(report)
attach(sample_data1)
T_Test <- t.test(MPG, predicted_value, var.equal = TRUE) # Assuming equal variances
T_Test
Two Sample t-test
data: MPG and predicted_value
t = 5.8389, df = 182, p-value = 2.369e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
2.582491 5.218629
sample estimates:
mean of x mean of y
31.94565 28.04509
The results above indicate that there is a significant difference between the observed MPG and the predicted MPG, since the p-value is less than 0.05. This means that our model is not doing well in terms of prediction.
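Because each prediction is paired with the same car's observed MPG, a paired t-test is arguably the more appropriate comparison; a minimal sketch:
# Paired comparison of observed and predicted MPG on the hold-out rows
t.test(sample_data1$MPG, sample_data1$predicted_value, paired = TRUE)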
Based on the results in (VI) and (VII), I would not revise my answer for (V), since the estimated model is not appropriate for prediction.
library(recipes)
library(lava)
library(sjmisc)
library(igraph)
library(e1071)
library(hardhat)
library(ipred)
library(caret)
library(sjPlot)
library(nnet)
library(wakefield)
library(kknn)
library(dplyr)
library(caTools)
library(ROCR)
library(stargazer)
library(ISLR)
library(ISLR2)
library(MASS)
library(splines)
library(splines2)
library(pROC)
library(randomForest)
library(rpart)
library(rpart.plot)
library(rattle)
library(party)
library(partykit)
library(ggplot2)
library(tune)
library(TunePareto)
set.seed(333)
trControl <- trainControl(method = 'repeatedcv',
number = 10,
repeats = 3)
attach(data)
FIT <- train(MPG~GallonsPer100Miles+Cylinders + Displacement100ci + Horsepower100 + Weight1000lb + Seconds0to60,
data = sample_data,
tuneGrid = expand.grid(k=1:70),
method = 'knn',
trControl = trControl,
preProc = c('center', 'scale'))
FIT
k-Nearest Neighbors
300 samples
6 predictor
Pre-processing: centered (6), scaled (6)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 270, 270, 270, 269, 269, 270, ...
Resampling results across tuning parameters:
k RMSE Rsquared MAE
1 1.574376 0.9407361 1.0289805
2 1.488331 0.9464728 0.9475791
3 1.449470 0.9494366 0.9275828
4 1.468866 0.9486649 0.9622943
5 1.506811 0.9459447 0.9980082
6 1.519485 0.9455442 1.0090691
7 1.537981 0.9447255 1.0168833
8 1.574708 0.9426251 1.0475633
9 1.590421 0.9413774 1.0827709
10 1.626097 0.9388027 1.1135395
11 1.634828 0.9383082 1.1257423
12 1.657153 0.9370251 1.1443973
13 1.683380 0.9349548 1.1649064
14 1.702267 0.9332652 1.1827724
15 1.718769 0.9317762 1.1979338
16 1.725082 0.9316209 1.2019506
17 1.737452 0.9304641 1.2172920
18 1.761103 0.9288514 1.2360456
19 1.796987 0.9262331 1.2664406
20 1.820877 0.9243558 1.2892821
21 1.846231 0.9220973 1.3098785
22 1.867696 0.9201525 1.3276490
23 1.875812 0.9195781 1.3347344
24 1.896208 0.9178536 1.3476351
25 1.915921 0.9159352 1.3635284
26 1.931918 0.9148229 1.3795251
27 1.952977 0.9128934 1.3975196
28 1.975726 0.9109653 1.4162083
29 1.998009 0.9087932 1.4358004
30 2.022123 0.9063054 1.4560267
31 2.047437 0.9037242 1.4765056
32 2.061935 0.9023895 1.4930848
33 2.083140 0.9002251 1.5148796
34 2.102949 0.8981289 1.5332530
35 2.125267 0.8957856 1.5504856
36 2.141681 0.8941163 1.5657049
37 2.158255 0.8923333 1.5829083
38 2.179925 0.8900707 1.5975942
39 2.203896 0.8874101 1.6122513
40 2.224768 0.8851749 1.6292806
41 2.244025 0.8832789 1.6469351
42 2.267840 0.8807192 1.6640996
43 2.286662 0.8786197 1.6797256
44 2.303200 0.8767907 1.6938561
45 2.317233 0.8752755 1.7058665
46 2.332390 0.8737554 1.7203403
47 2.346059 0.8723368 1.7327327
48 2.357099 0.8710910 1.7415329
49 2.372488 0.8691170 1.7523006
50 2.385921 0.8677511 1.7633392
51 2.402543 0.8657325 1.7773778
52 2.417841 0.8640395 1.7892252
53 2.433101 0.8622596 1.8011448
54 2.442844 0.8610150 1.8113901
55 2.457890 0.8592856 1.8232775
56 2.470306 0.8580062 1.8342347
57 2.481605 0.8566918 1.8441825
58 2.494053 0.8553391 1.8537080
59 2.507794 0.8538809 1.8629001
60 2.521219 0.8523376 1.8719471
61 2.533237 0.8509792 1.8815081
62 2.543044 0.8499303 1.8895953
63 2.551390 0.8491377 1.8987123
64 2.559499 0.8483665 1.9084574
65 2.570403 0.8473175 1.9182475
66 2.586347 0.8454688 1.9309562
67 2.601714 0.8438675 1.9431948
68 2.613506 0.8423891 1.9526062
69 2.624752 0.8410532 1.9625590
70 2.634447 0.8401369 1.9679395
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 3.
FIT$modelType
[1] "Regression"
FIT$coefnames
[1] "GallonsPer100Miles" "Cylinders" "Displacement100ci"
[4] "Horsepower100" "Weight1000lb" "Seconds0to60"
The results above show that we have run K-Nearest Neighbors regression with 300 samples and 6 predictors. We used ten-fold cross-validation repeated three times: the data are divided into ten folds, nine folds are used for model estimation, and the remaining fold is used for model assessment and validation, rotating through all ten folds. RMSE was used to select the optimal model using the smallest value; the final value used for the model was k = 3.
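The selected tuning value can also be read directly off the fitted caret object:
FIT$bestTune  # caret stores the winning k here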
plot(FIT, xlab = "Nearest Neighbors", main = "A Plot of Root Mean Square Error (RMSE) for the K-Nearest Neighbors")
The graph above shows that the lowest value of RMSE is found at k = 3. After that point, the value of RMSE starts to increase steadily.
varImp(FIT)
loess r-squared variable importance
Overall
GallonsPer100Miles 100.00
Weight1000lb 88.17
Displacement100ci 80.64
Horsepower100 75.39
Cylinders 64.06
Seconds0to60 0.00
plot(varImp(FIT), xlab = "Percentage Importance",
ylab ="Variables", main = "Variable Importance for the KNN Model")
The output above ranks the variables in our model from the most important to the least. GallonsPer100Miles carries the greatest importance in our KNN model, while Seconds0to60 is the least important variable.
pred <- predict(FIT, newdata = sample_data1)
Because the response variable is numeric and continuous, we calculate the root mean squared error (RMSE) for assessment.
RMSE(pred, sample_data1$MPG)
[1] 4.769806
plot(pred, sample_data1$MPG)
FIT <- train(MPG~GallonsPer100Miles+Cylinders + Displacement100ci + Horsepower100 + Weight1000lb + Seconds0to60,
data = sample_data,
tuneGrid = expand.grid(k=1:70),
method = 'knn',
metric = 'Rsquared',
trControl = trControl,
preProc = c('center', 'scale'))
FIT
k-Nearest Neighbors
300 samples
6 predictor
Pre-processing: centered (6), scaled (6)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 270, 271, 270, 270, 269, 269, ...
Resampling results across tuning parameters:
k RMSE Rsquared MAE
1 1.591920 0.9394329 1.0319675
2 1.455227 0.9498981 0.9323168
3 1.439877 0.9502025 0.9160603
4 1.485976 0.9469023 0.9541937
5 1.540147 0.9434957 0.9992122
6 1.560133 0.9426078 1.0037810
7 1.590279 0.9412527 1.0392630
8 1.598914 0.9402266 1.0558759
9 1.629941 0.9381906 1.0901669
10 1.655139 0.9365378 1.1120192
11 1.673802 0.9351588 1.1319297
12 1.696486 0.9335523 1.1490496
13 1.714676 0.9319119 1.1670087
14 1.735721 0.9302328 1.1791580
15 1.762991 0.9282030 1.2005774
16 1.773082 0.9272595 1.2116991
17 1.788123 0.9263502 1.2279748
18 1.801409 0.9254666 1.2455852
19 1.824421 0.9236937 1.2683409
20 1.854903 0.9209520 1.2912184
21 1.875529 0.9190249 1.3118706
22 1.889856 0.9181122 1.3262452
23 1.912035 0.9162811 1.3392577
24 1.934766 0.9145289 1.3586907
25 1.953593 0.9128549 1.3742508
26 1.982213 0.9099365 1.3957396
27 1.996255 0.9087167 1.4116796
28 2.014375 0.9071303 1.4268993
29 2.035595 0.9051180 1.4468520
30 2.056944 0.9033305 1.4639827
31 2.077956 0.9013076 1.4839148
32 2.099019 0.8993461 1.5025454
33 2.119217 0.8972836 1.5200710
34 2.135849 0.8956926 1.5345744
35 2.157269 0.8935631 1.5535753
36 2.181741 0.8910892 1.5741680
37 2.208008 0.8884506 1.5948985
38 2.226443 0.8866513 1.6094057
39 2.245309 0.8846252 1.6222174
40 2.262840 0.8828099 1.6384923
41 2.286683 0.8801846 1.6553593
42 2.308474 0.8777284 1.6735914
43 2.324954 0.8759073 1.6880161
44 2.337178 0.8747198 1.7002150
45 2.349809 0.8733561 1.7111410
46 2.365473 0.8716785 1.7226947
47 2.378420 0.8702897 1.7349310
48 2.391963 0.8689972 1.7447193
49 2.407233 0.8674046 1.7577276
50 2.424189 0.8654463 1.7683986
51 2.437694 0.8640195 1.7780372
52 2.454297 0.8621021 1.7918053
53 2.470363 0.8603796 1.8043192
54 2.481606 0.8590029 1.8156761
55 2.491748 0.8578865 1.8253641
56 2.498593 0.8573661 1.8317425
57 2.511403 0.8559964 1.8419442
58 2.526239 0.8541874 1.8539235
59 2.540528 0.8525444 1.8648257
60 2.552716 0.8513346 1.8743174
61 2.566598 0.8497237 1.8849405
62 2.583919 0.8477822 1.9001164
63 2.592890 0.8469463 1.9077759
64 2.604838 0.8456377 1.9184197
65 2.617315 0.8444035 1.9314042
66 2.630802 0.8426598 1.9442484
67 2.639551 0.8418271 1.9516521
68 2.653349 0.8401014 1.9616237
69 2.663703 0.8389873 1.9713350
70 2.674564 0.8377999 1.9810189
Rsquared was used to select the optimal model using the largest value.
The final value used for the model was k = 3.
plot(FIT)
varImp(FIT)
loess r-squared variable importance
Overall
GallonsPer100Miles 100.00
Weight1000lb 88.17
Displacement100ci 80.64
Horsepower100 75.39
Cylinders 64.06
Seconds0to60 0.00
plot(varImp(FIT), xlab = "Percentage Importance",
ylab ="Variables", main = "Variable Importance for the KNN Model")
pred <- predict(FIT, newdata = sample_data1)
data_pred <- data.frame(sample_data1, pred)
head(data_pred,5)
str(data_pred)
'data.frame': 92 obs. of 10 variables:
$ GallonsPer100Miles: num 2.9 3.1 2.7 3.5 3.5 3.7 3 2.4 2.6 3.1 ...
$ MPG : num 34.5 31.8 37.3 28.4 28.8 26.8 33.5 41.5 38.1 32.1 ...
$ Cylinders : int 4 4 4 4 6 6 4 4 4 4 ...
$ Displacement100ci : num 1.05 0.85 0.91 1.51 1.73 1.73 1.51 0.98 0.89 0.98 ...
$ Horsepower100 : num 0.7 0.65 0.69 0.9 1.15 1.15 0.9 0.76 0.6 0.7 ...
$ Weight1000lb : num 2.15 2.02 2.13 2.67 2.6 ...
$ Seconds0to60 : num 14.9 19.2 14.7 16 11.3 12.9 13.2 14.7 18.8 15.5 ...
$ Name : chr "plymouth horizon tc3" "datsun 210" "fiat strada custom" "buick skylark limited" ...
$ predicted_value : num 29.2 29.3 29.9 26.9 27.4 ...
$ pred : num 32.2 31 33.8 25 22.3 ...
library(ggplot2)
library(ggthemes)
# Combine data into a data frame
data2 <- data.frame(Actual_MPG = data_pred$MPG, Predicted_MPG = data_pred$pred)
# Create line plot
ggplot(data2, aes(x = 1:nrow(data2))) +
geom_line(aes(y = Actual_MPG, color = "Actual MPG")) +
geom_line(aes(y = Predicted_MPG, color = "Predicted MPG")) +
scale_color_manual(name = "Variable", values = c("Actual MPG" = "blue", "Predicted MPG" = "red")) +
labs(x = "Time Axis", y = "MPG", title = "Actual vs Predicted MPG") +
theme_economist()
Because the response variable is numeric and continuous, we again calculate the root mean squared error (RMSE) for assessment.
RMSE(pred, data2$Actual_MPG)
[1] 4.769806
plot(pred, data2$Actual_MPG, xlab = "Prediction", ylab = "mpg", main = "The scatter plot of predicted and actual mpg")
Interpreting a k-Nearest Neighbors (k-NN) regression model is somewhat different from interpreting a traditional linear regression model. In k-NN regression, instead of fitting a mathematical equation to your data, the model makes predictions based on the values of the k nearest data points in the training dataset. Here’s how you can interpret a k-NN regression model:
For each data point you want to predict, the k-NN algorithm identifies the k nearest data points in the training dataset based on some distance metric (e.g., Euclidean distance). It then calculates the average (or weighted average) of the target values (the values you're trying to predict) for those k nearest data points, and this average is used as the prediction for the new data point, as the sketch below illustrates.
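A minimal hand-rolled sketch of this averaging step (k = 3, Euclidean distance), using the first hold-out row as the query point; the intermediate variable names here are just for illustration:
# Standardize the training features, scale the query the same way,
# then average the MPG of the 3 nearest training points
predictors <- c("GallonsPer100Miles", "Cylinders", "Displacement100ci",
                "Horsepower100", "Weight1000lb", "Seconds0to60")
X <- scale(sample_data[, predictors])
query <- scale(sample_data1[1, predictors],
               center = attr(X, "scaled:center"),
               scale  = attr(X, "scaled:scale"))
d  <- sqrt(rowSums(sweep(X, 2, as.numeric(query))^2))  # Euclidean distances
nn <- order(d)[1:3]                                    # 3 nearest neighbors
mean(sample_data$MPG[nn])                              # the k-NN prediction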
Tuning parameter (k): the most important parameter in k-NN regression is k, the number of nearest neighbors to consider. A small k (e.g., 1 or 3) may result in a model that closely follows the training data but is sensitive to noise. A large k (e.g., 10 or more) may result in a smoother prediction surface but might not capture local variations. The choice of k should be based on cross-validation and the characteristics of your data.
The choice of distance metric (e.g., Euclidean, Manhattan, etc.) can impact the model’s performance. Different distance metrics can lead to different interpretations of “closeness” among data points.
Unlike linear regression, k-NN regression can capture non-linear relationships in the data. Interpretation may not involve coefficients or slope values as in linear regression.
k-NN regression captures local patterns in the data. Interpretation involves understanding the local behavior around a prediction point. The model does not provide a global equation that describes the entire dataset.
Visualizations can be helpful for interpretation. Plotting the k nearest neighbors for specific data points can provide insights into why the model made a particular prediction.
k-NN does not provide feature importance scores like some other models (e.g., Random Forest or Gradient Boosting). However, you can still analyze feature importance indirectly by examining which features are most influential in determining the nearest neighbors.
Use appropriate regression evaluation metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), or R-squared to assess model performance. These metrics can give you a sense of how well the k-NN regression model fits the data.
Scaling and normalization of features can significantly affect the results, as k-NN is sensitive to the scale of input variables, so interpretation should take feature scaling into account (see the sketch after this paragraph). In summary, interpreting a k-NN regression model involves understanding its prediction process, the choice of hyperparameters (especially k), the impact of distance metrics, and the local nature of the model's predictions. Visualizations and performance metrics play a crucial role in assessing and explaining the model's behavior on your specific dataset.
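On the scaling point, a minimal sketch, assuming we standardize with training-set statistics only and apply the same transform to the hold-out rows (this mirrors what preProc = c('center', 'scale') does internally):
# Standardize features using training-set means/SDs only
preds <- c("GallonsPer100Miles", "Cylinders", "Displacement100ci",
           "Horsepower100", "Weight1000lb", "Seconds0to60")
ctr <- colMeans(sample_data[, preds])
sds <- apply(sample_data[, preds], 2, sd)
train_scaled <- scale(sample_data[, preds],  center = ctr, scale = sds)
test_scaled  <- scale(sample_data1[, preds], center = ctr, scale = sds)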
In k-Nearest Neighbors (k-NN) regression, accuracy is not typically used as an evaluation metric because k-NN regression is a type of regression, not classification. Accuracy is more suitable for classification problems where you’re predicting discrete class labels.
For regression tasks, you typically use different evaluation metrics to assess the performance of your model. Common metrics include the following (a short sketch computing them appears after this list):
Mean Absolute Error (MAE): measures the average absolute difference between the predicted and actual values. It is less sensitive to outliers than Mean Squared Error.
Mean Squared Error (MSE): measures the average of the squared differences between predicted and actual values. It gives higher weight to larger errors.
Root Mean Squared Error (RMSE): the square root of the MSE; it provides an interpretable measure of the average error in the same units as the target variable.
R-squared: measures the proportion of the variance in the dependent variable that is predictable from the independent variables. A higher R-squared indicates a better fit.
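A minimal sketch computing all four metrics by hand for the linear model's hold-out predictions from earlier (the predictions vector created with predict() on sample_data1):
# Hand-computed regression metrics on the hold-out set
actual <- sample_data1$MPG
mae  <- mean(abs(actual - predictions))
mse  <- mean((actual - predictions)^2)
rmse <- sqrt(mse)
r2   <- 1 - sum((actual - predictions)^2) / sum((actual - mean(actual))^2)
c(MAE = mae, MSE = mse, RMSE = rmse, R2 = r2)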
Support Vector Regressor (SVR) is a supervised machine learning algorithm used for regression tasks. It’s based on the concept of Support Vector Machines (SVMs) typically used for classification problems. Here’s how SVR works:
The goal of SVR is to find a hyperplane (in high dimensional space for non-linear cases) that best fits the training data while minimizing the prediction error. This hyperplane minimizes the distance to the closest data points, called support vectors. Unlike linear regression, SVR allows for a certain margin of error around the hyperplane.
SVR can handle non-linear relationships between features and target variables using kernel functions. These functions project the data into a higher-dimensional space where a linear relationship might exist.
SVR is less sensitive to outliers compared to some other regression methods. This is because it focuses on the support vectors rather than being influenced by all data points equally.
SVR inherently performs regularization by controlling the margin of error around the hyperplane. This helps to prevent overfitting.
R code for Support Vector Regression with the kernlab package:
# Install kernlab package if not already installed
if(!require(kernlab)) install.packages("kernlab")
library(kernlab)
Split your data into training and testing sets.
Ensure your target variable is numeric.
set.seed(16)
ind <- sample(2, nrow(data), replace = T, prob = c(0.8, 0.2))
train_data <- data[ind == 1, ]
test_data <- data[ind == 2, ]
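kernlab's ksvm() can also be called directly before turning to the caret workflow below; a hedged sketch of eps-regression with an RBF kernel (caret's 'svmRadial' method wraps this same function; the C and epsilon values here are illustrative):
# Direct kernlab SVR fit (sketch)
svr_direct <- ksvm(MPG ~ GallonsPer100Miles + Cylinders + Displacement100ci +
                     Horsepower100 + Weight1000lb + Seconds0to60,
                   data = train_data, type = "eps-svr",
                   kernel = "rbfdot", C = 1, epsilon = 0.1)
svr_direct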
library(caret)
model_svm <- train(
  MPG ~ GallonsPer100Miles + Cylinders + Displacement100ci + Horsepower100 + Weight1000lb + Seconds0to60,
  data = train_data,
  method = 'svmRadial',
  preProcess = c("center", "scale")
)  # with no trControl supplied, caret's default bootstrap resampling (25 reps) is used
model_svm
Support Vector Machines with Radial Basis Function Kernel
316 samples
6 predictor
Pre-processing: centered (6), scaled (6)
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 316, 316, 316, 316, 316, 316, ...
Resampling results across tuning parameters:
C RMSE Rsquared MAE
0.25 2.181329 0.9215166 1.0832816
0.50 1.855165 0.9412598 0.9165128
1.00 1.592084 0.9555932 0.8222594
Tuning parameter 'sigma' was held constant at a value of 0.4651685
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.4651685 and C = 1.
pred_tst_svm <- predict(model_svm, test_data)
fit_ind_tst_svm <- data.frame(
R2 = R2(pred_tst_svm, test_data$MPG),
RMSE = RMSE(pred_tst_svm, test_data$MPG)
)
# Compute the same indices on the training data for comparison
pred_trn_svm <- predict(model_svm, train_data)
fit_ind_trn_svm <- data.frame(
  R2 = R2(pred_trn_svm, train_data$MPG),
  RMSE = RMSE(pred_trn_svm, train_data$MPG)
)
data.frame(
  Model = c("SVM Train", "SVM Test"),
  R2 = c(fit_ind_trn_svm$R2, fit_ind_tst_svm$R2),
  RMSE = c(fit_ind_trn_svm$RMSE, fit_ind_tst_svm$RMSE)
)
The R-squared for the SVM model on the test data is 0.9309281, which means the independent variables explain 93.09% of the variance in the dependent variable on the test data. The gap between the training and test fit indices is small, so there is no strong sign of over-fitting or under-fitting in the model.
tcr_loocv <- trainControl(method = "LOOCV")  # number/repeats do not apply to leave-one-out CV
model_loocv <- train(
MPG~GallonsPer100Miles+Cylinders + Displacement100ci + Horsepower100 + Weight1000lb + Seconds0to60,
data = train_data,
method="svmRadial",
trControl = tcr_loocv)
pred_tst_loocv <- predict(model_loocv, test_data)
fit_ind_tst_loocv <- data.frame(
R2 = R2(pred_tst_loocv, test_data$MPG),
RMSE = RMSE(pred_tst_loocv, test_data$MPG)
)
print(model_loocv)
Support Vector Machines with Radial Basis Function Kernel
316 samples
6 predictor
No pre-processing
Resampling: Leave-One-Out Cross-Validation
Summary of sample sizes: 315, 315, 315, 315, 315, 315, ...
Resampling results across tuning parameters:
C RMSE Rsquared MAE
0.25 2.249977 0.9214763 1.0396851
0.50 1.848800 0.9458065 0.8437484
1.00 1.509427 0.9640092 0.7274306
Tuning parameter 'sigma' was held constant at a value of 0.4126495
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.4126495 and C = 1.
fit_ind_tst_loocv
tcr_cv <- trainControl(method = "cv", number = 5)  # repeats does not apply to plain k-fold CV
model_cv <- train(
MPG~GallonsPer100Miles+Cylinders + Displacement100ci + Horsepower100 + Weight1000lb + Seconds0to60,
data = train_data,
method="svmRadial",
trControl = tcr_cv)
pred_tst_cv <- predict(model_cv, test_data)
fit_ind_tst_cv <- data.frame(
R2 = R2(pred_tst_cv, test_data$MPG),
RMSE = RMSE(pred_tst_cv, test_data$MPG)
)
model_cv
Support Vector Machines with Radial Basis Function Kernel
316 samples
6 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 253, 253, 252, 254, 252
Resampling results across tuning parameters:
C RMSE Rsquared MAE
0.25 2.416428 0.9117854 1.1761786
0.50 1.961931 0.9399702 0.9280455
1.00 1.617822 0.9597706 0.8042350
Tuning parameter 'sigma' was held constant at a value of 0.5265069
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.5265069 and C = 1.
fit_ind_tst_cv
tcr_cv <- trainControl(method = "repeatedcv", number = 10, repeats=3)
model_rep_cv <- train(
MPG~GallonsPer100Miles+Cylinders + Displacement100ci + Horsepower100 + Weight1000lb + Seconds0to60,
data = train_data,
method="svmRadial",
trControl = tcr_cv)
pred_tst_rep_cv <- predict(model_rep_cv, test_data)
fit_ind_tst_rep_cv <- data.frame(
R2 = R2(pred_tst_rep_cv, test_data$MPG),
RMSE = RMSE(pred_tst_rep_cv, test_data$MPG)
)
model_rep_cv
Support Vector Machines with Radial Basis Function Kernel
316 samples
6 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 285, 285, 284, 284, 283, 284, ...
Resampling results across tuning parameters:
C RMSE Rsquared MAE
0.25 2.129605 0.9265818 1.0776446
0.50 1.731314 0.9495626 0.8629829
1.00 1.449375 0.9656161 0.7528262
Tuning parameter 'sigma' was held constant at a value of 0.3911894
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.3911894 and C = 1.
fit_ind_tst_rep_cv
fit_indices_table <- data.frame(
Model = c("SVM", "LOOCV", "k-fold CV", "Repeated k-fold CV"),
R2 = c(fit_ind_tst_svm$R2, fit_ind_tst_loocv$R2, fit_ind_tst_cv$R2, fit_ind_tst_rep_cv$R2),
RMSE = c(fit_ind_tst_svm$RMSE, fit_ind_tst_loocv$RMSE, fit_ind_tst_cv$RMSE, fit_ind_tst_rep_cv$RMSE)
)
fit_indices_table
best_model <- fit_indices_table[which.max(fit_indices_table$R2), ]
print(best_model)
Model R2 RMSE
4 Repeated k-fold CV 0.9650457 1.726194
Hence, the best model is the SVR tuned with repeated k-fold cross-validation, and it can be used for better prediction results.
pred_tst_rep_cv <- predict(model_rep_cv, test_data)
pred_tst_rep_cv
[1] 17.69508 14.98023 14.31272 18.11931 24.91703 23.39855 11.15968 11.64694
[9] 24.94750 16.26864 14.03611 14.09027 18.85817 30.54781 31.45355 26.90076
[17] 20.32720 12.54158 15.13381 28.01410 26.88187 14.80701 12.57222 18.25677
[25] 12.57359 11.01399 21.21745 29.49214 22.90923 14.32734 27.94140 15.25337
[33] 18.17099 15.92608 20.47240 16.28532 29.87569 23.70514 23.37281 33.10317
[41] 15.95142 18.07220 17.79461 26.26589 20.72651 35.32813 33.78280 15.92823
[49] 33.69068 30.77930 30.90514 34.81331 35.22081 24.92834 19.53753 20.16339
[57] 19.17670 23.36154 23.48755 35.22230 27.70522 29.48482 40.75431 33.68206
[65] 28.19354 27.36794 26.22746 36.70723 32.75922 32.48806 26.69613 17.95927
[73] 36.51212 36.91017 35.35105 27.50183
# Combine data into a data frame
data3 <- data.frame(Actual_MPG = test_data$MPG, Predicted_MPG = pred_tst_rep_cv)
head(data3,15)
# Create line plot
ggplot(data3, aes(x = 1:nrow(data3))) +
geom_line(aes(y = Actual_MPG, color = "Actual MPG")) +
geom_line(aes(y = Predicted_MPG, color = "Predicted MPG")) +
scale_color_manual(name = "Variable", values = c("Actual MPG" = "blue", "Predicted MPG" = "red")) +
labs(x = "Time Axis", y = "MPG", title = "Actual vs Predicted MPG") +
theme_economist()
This predicts the target variable for the testing data (test_data); the plot above compares the actual and predicted MPG.