Use the data attached to this document and run a multiple regression to determine which explanatory variables significantly affect the dependent variable (here, vehicle MPG). Your report should include the following:

Load the data

library(stargazer)
library(ggplot2)
library(tidyverse)
library(sjPlot)
library(corrplot)
data <- read.csv("regression.csv")
head(data,5)

Check the structure of the data

str(data)
'data.frame':   392 obs. of  8 variables:
 $ GallonsPer100Miles: num  5.6 6.7 5.6 6.3 5.9 6.7 7.1 7.1 7.1 6.7 ...
 $ MPG               : num  18 15 18 16 17 15 14 14 14 15 ...
 $ Cylinders         : int  8 8 8 8 8 8 8 8 8 8 ...
 $ Displacement100ci : num  3.07 3.5 3.18 3.04 3.02 4.29 4.54 4.4 4.55 3.9 ...
 $ Horsepower100     : num  1.3 1.65 1.5 1.5 1.4 1.98 2.2 2.15 2.25 1.9 ...
 $ Weight1000lb      : num  3.5 3.69 3.44 3.43 3.45 ...
 $ Seconds0to60      : num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
 $ Name              : chr  "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...

I. The fitted linear model and an interpretation of the model equation, using 80% of the observations for training and 20% for testing.

In this report, MPG (miles per gallon) is the dependent variable. MPG measures the fuel efficiency of a vehicle, that is, how far it travels per gallon of fuel, and is a common response variable in automotive analysis. The remaining variables, such as GallonsPer100Miles and Seconds0to60, serve as the independent variables that may affect MPG.

set.seed(16)
# Randomly assign each row to training (group 1, ~80%) or testing (group 2, ~20%)
ind <- sample(2, nrow(data), replace = TRUE, prob = c(0.8, 0.2))

train_data <- data[ind == 1, ]
test_data <- data[ind == 2, ]

View the Results Using stargazer and the tab_model() Function

model <- lm(MPG ~ GallonsPer100Miles + Cylinders + Displacement100ci + Horsepower100 + Weight1000lb + Seconds0to60, data = train_data)
stargazer(model, type = "text")

===============================================
                        Dependent variable:    
                    ---------------------------
                                MPG            
-----------------------------------------------
GallonsPer100Miles           -4.496***         
                              (0.217)          
                                               
Cylinders                     -0.037           
                              (0.288)          
                                               
Displacement100ci              0.376           
                              (0.645)          
                                               
Horsepower100                 3.316**          
                              (1.288)          
                                               
Weight1000lb                  -1.180*          
                              (0.603)          
                                               
Seconds0to60                  0.205**          
                              (0.088)          
                                               
Constant                     41.183***         
                              (1.911)          
                                               
-----------------------------------------------
Observations                    316            
R2                             0.879           
Adjusted R2                    0.876           
Residual Std. Error      2.679 (df = 309)      
F Statistic          372.488*** (df = 6; 309)  
===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01

tab_model(model,
          show.se = TRUE,
          show.stat = TRUE)
  MPG
Predictors Estimates std. Error CI Statistic p
(Intercept) 41.18 1.91 37.42 – 44.94 21.55 <0.001
GallonsPer100Miles -4.50 0.22 -4.92 – -4.07 -20.74 <0.001
Cylinders -0.04 0.29 -0.60 – 0.53 -0.13 0.897
Displacement100ci 0.38 0.64 -0.89 – 1.65 0.58 0.560
Horsepower100 3.32 1.29 0.78 – 5.85 2.57 0.011
Weight1000lb -1.18 0.60 -2.37 – 0.01 -1.96 0.051
Seconds0to60 0.20 0.09 0.03 – 0.38 2.32 0.021
Observations 316
R2 / R2 adjusted 0.879 / 0.876
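
From the estimates above, the fitted regression equation on the training data is:

MPG_hat = 41.183 - 4.496(GallonsPer100Miles) - 0.037(Cylinders) + 0.376(Displacement100ci) + 3.316(Horsepower100) - 1.180(Weight1000lb) + 0.205(Seconds0to60)

Holding the other predictors constant, each additional gallon consumed per 100 miles is associated with a 4.496-unit drop in expected MPG; the remaining coefficients are interpreted the same way.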

Obtain the Performance Metrics

# Collect the fit statistics for the training model
cat(paste0("* ", c(
  paste("Root Mean Squared Error (RMSE):", sqrt(mean(model$residuals^2))),
  paste("R-squared:", summary(model)$r.squared),
  paste("Adjusted R-squared:", summary(model)$adj.r.squared),
  paste("AIC:", AIC(model)),
  paste("BIC:", BIC(model)))), sep = "\n")
* Root Mean Squared Error (RMSE): 2.64920166828244
* R-squared: 0.878534379489978
* Adjusted R-squared: 0.876
* AIC: 1528.50042219633
* BIC: 1558.54635990502

Report the Results Using the report() Function

library(report)
report_table(model)

II. The significance of the explanatory variables.

The regression analysis reveals that the dependent variable, MPG (miles per gallon), is significantly influenced by several independent variables. GallonsPer100Miles has a strong negative relationship with MPG (coef. = -4.496, p < 0.01), indicating that as gallons consumed per 100 miles increase, MPG decreases. Horsepower100 (coef. = 3.316, p < 0.05) and Seconds0to60 (coef. = 0.205, p < 0.05) both have significant positive coefficients once the other predictors are held fixed.

Weight1000lb negatively impacts MPG (coef. = -1.180, p < 0.1), implying that heavier vehicles tend to have lower fuel efficiency, although the effect is only marginally significant. Cylinders and Displacement100ci show no statistically significant relationship with MPG. The overall model is highly significant (F statistic = 372.488, df = 6; 309, p < 0.01) and explains approximately 87.9% of the variance in MPG. Therefore, these variables collectively provide a good understanding of the factors affecting fuel efficiency in vehicles.

III. The contribution of the explanatory variables toward the variations in the dependent variable.

In the model above, about 87.6% of the variation in MPG is explained by the predictors, as indicated by an adjusted R-squared of 0.876. The following is the individual contribution of each predictor to the predicted variable.

library(relaimpo)
rel_imp <- calc.relimp(model, type = "lmg")
rel_imp
Response variable: MPG 
Total response variance: 57.96331 
Analysis based on 316 observations 

6 Regressors: 
GallonsPer100Miles Cylinders Displacement100ci Horsepower100 Weight1000lb Seconds0to60 
Proportion of variance explained by model: 87.85%
Metrics are not normalized (rela=FALSE). 

Relative importance metrics: 

                          lmg
GallonsPer100Miles 0.32364246
Cylinders          0.11665614
Displacement100ci  0.12898432
Horsepower100      0.12464373
Weight1000lb       0.15154632
Seconds0to60       0.03306141

Average coefficients for different model sizes: 

                           1X         2Xs          3Xs         4Xs        5Xs
GallonsPer100Miles  -4.362938 -4.41356882 -4.445032945 -4.46237963 -4.4796383
Cylinders           -3.436396 -1.21592893 -0.356786989 -0.17496077 -0.1038768
Displacement100ci   -5.856599 -3.25581147 -1.308349081 -0.16541115  0.2844714
Horsepower100      -15.855329 -7.57203024 -3.485475510 -0.98585009  1.2567639
Weight1000lb        -7.530221 -4.99897323 -3.608579159 -2.58023926 -1.7784843
Seconds0to60         1.187825 -0.01180123 -0.003588237  0.04921163  0.1231131
                           6Xs
GallonsPer100Miles -4.49606126
Cylinders          -0.03741527
Displacement100ci   0.37604254
Horsepower100       3.31639690
Weight1000lb       -1.18041458
Seconds0to60        0.20453804

The relative importance metrics indicate how much each predictor contributes to the variance explained by the model. For example, GallonsPer100Miles alone accounts for about 0.324 of the 0.879 total R-squared, by far the largest share.

The average coefficients for different model sizes show the estimated impact of each variable on MPG at various stages of model complexity, with coefficients becoming less extreme as more variables are added to the model. These results can be visualized as shown below.

library(ggplot2)
# Convert rel_imp$lmg to a data frame
df <- data.frame(variable = names(rel_imp$lmg), importance = rel_imp$lmg)

# Arrange the data frame in descending order of relative importance
df <- df[order(df$importance, decreasing = TRUE), ]

# Create the bar plot using ggplot2
ggplot(df, aes(x = reorder(variable, -importance), y = importance)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  geom_text(aes(label = paste0(round(importance * 100, 3), "%")), vjust = -0.5, size = 3) +  # Format labels as percentages
  labs(x = "Independent variables", y = "Relative Importance", 
       title = "A Bar plot of Relative Importance") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  

IV. Determine the overall significance of the fitted model.

The overall significance of the model is determined by looking at the F-statistic and its associated p-value. From the results above, the F-statistic of 372.488 (df = 6; 309) is statistically significant (p < 0.01), indicating that the model is significant overall and can be used for prediction.

V. Conduct model diagnostics on the fitted model to see if the model is appropriate for predicting future values of MPG.

I. Normality of the residuals

Model diagnostics include checking the normality of the regression residuals, the homoscedasticity of the error term, and the correlation among the predictors. Consider the following plot of the regression residuals.

library(forecast)
checkresiduals(model)


    Breusch-Godfrey test for serial correlation of order up to 10

data:  Residuals
LM test = 74.69, df = 10, p-value = 5.466e-12

The histogram of the residuals shows a fairly normal distribution with some positive skew, which is not a major concern. We can further assess normality using a Q-Q plot together with the histogram, as shown below.

library(car)
par(mfrow = c(1, 2))
# Q-Q plot
qqPlot(model$residuals, main = "Q-Q Plot of the regression residuals")
 29 328 
 21 263 
# Histogram
hist(model$residuals, col = "lightblue",breaks = 20, main = "Histogram of Residuals")

The plots show a fairly normal distribution of the regression residuals with minor positive skew, which in this case is not a serious concern. Consider the following inferential results to check the normality of the residuals.

# Shapiro-Wilk Test (for formality)
shapiro.test(model$residuals)  # Check p-value for normality

    Shapiro-Wilk normality test

data:  model$residuals
W = 0.84066, p-value < 2.2e-16

The p-value above (p < 2.2e-16) indicates that the regression residuals do not follow a normal distribution. As noted earlier, the residuals appear positively skewed; however, this is not of great concern here.

II. Homoscedasticity

This assumption requires that the variance of the regression residuals be constant across the fitted values. If it is violated, the model suffers from heteroscedasticity. Consider the results below.

library(car)
library(tseries)

Heteroscedasticity

ncvTest(model)
Non-constant Variance Score Test 
Variance formula: ~ fitted.values 
Chisquare = 0.001997865, Df = 1, p = 0.96435

Since the p-value (0.964) is much greater than the conventional significance level of 0.05, we fail to reject the null hypothesis of constant variance. There is no evidence that the variance of the residuals changes across the range of the fitted values; in other words, the model does not suffer from heteroscedasticity.

III. Multicollinearity

vif(model)
GallonsPer100Miles          Cylinders  Displacement100ci      Horsepower100 
          5.497065          10.711602          20.007357          10.299737 
      Weight1000lb       Seconds0to60 
         11.301805           2.578910 

A VIF of 1 indicates no correlation and no multicollinearity. A VIF between 1 and 5 indicates moderate correlation, some multicollinearity that may not be a major concern. A VIF greater than 5 indicates high correlation and significant multicollinearity, which can cause issues with the model. Based on the output above, several predictors have VIFs well above 5, so our model suffers from multicollinearity: the individual coefficient estimates are unreliable even if predictions remain usable. Some variables are likely correlated; let's look at the correlation plot below.

# Select the numeric columns from train_data
numeric_data <- train_data[, sapply(train_data, is.numeric)]

# Calculate correlation matrix
correlation_matrix <- cor(numeric_data)
# Create a correlation heatmap using corrplot
corrplot(correlation_matrix,
         method = "color",            # Display correlations using color
         type = "upper",              # Show upper triangle of the correlation matrix
         tl.col = "black",            # Color of text labels
         tl.srt = 45,                 # Rotate text labels by 45 degrees
         diag = FALSE,                # Exclude diagonal elements
         addCoef.col = "black",       # Color of correlation coefficients
         col = colorRampPalette(c("blue", "white", "red"))(100),  # Custom color palette
         main = "Correlation Heatmap", # Main title
         pch = 16,                    # Use solid circles for data points
         cex.lab = 1.2,               # Adjust label size
         cex.main = 0.5 ,              # Adjust main title size
         cex.axis = 0.5 ,              # Adjust axis label size
         number.cex = 0.5)             # Adjust number size in the color legend

Some variables above are highly correlated. For instance, Displacement100ci and Weight1000lb are highly correlated, and Cylinders and Displacement100ci have a correlation coefficient of about 0.95, as shown in the graph. These correlations are the likely source of the multicollinearity in our model.

IV. AIC and BIC

library(report)
report_table(model)

VI. Use the model trained on the 80% of observations above to get predicted values for the remaining 20% of observations.

Prediction and a Plot of the Predicted and Actual MPG (Testing Data)

# Predict values on the test set
pred_reg <- predict(model, newdata = test_data)

# Add predictions as a new column to the test set
test_data$pred_reg <- pred_reg
head(test_data,10)

Plot

library(ggplot2)
library(ggthemes)
# Combine data into a data frame
data1 <- data.frame(Actual_MPG = test_data$MPG, Predicted_MPG = test_data$pred_reg)

# Create line plot
ggplot(data1, aes(x = 1:nrow(data1))) +
  geom_line(aes(y = Actual_MPG, color = "Actual MPG")) +
  geom_line(aes(y = Predicted_MPG, color = "Predicted MPG")) +
  scale_color_manual(name = "Variable", values = c("Actual MPG" = "blue", "Predicted MPG" = "red")) +
  labs(x = "Observation Index", y = "MPG", title = "Actual vs Predicted MPG") +
  theme_economist()

From the plot above, the predicted MPG broadly tracks the actual MPG, but in several places the predictions fall below the actual values, so the model underestimates some observations. Whether this gap is statistically meaningful is tested in the next section.

VII. Using an appropriate tool/method, determine whether there is a significant difference between the observed and the corresponding predicted values of MPG found in (VI), and draw your conclusion.

plot(data1$Actual_MPG, data1$Predicted_MPG, xlab = "Observed MPG", ylab = "Predicted MPG", main = "Scatter plot of Observed and Predicted MPG")
abline(lm(data1$Actual_MPG ~ data1$Predicted_MPG), col = "red")

A t-test to determine whether there is a significant difference between the observed and predicted MPG

library(report)
head(data1,5)
T_Test <- t.test(data1$Actual_MPG, data1$Predicted_MPG, var.equal = TRUE)  # assuming equal variances
T_Test

    Two Sample t-test

data:  data1$Actual_MPG and data1$Predicted_MPG
t = 0.46148, df = 150, p-value = 0.6451
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.028528  3.264819
sample estimates:
mean of x mean of y 
 24.09079  23.47264 

The results above indicate that there is no significant difference between the observed MPG and the predicted MPG, since the p-value (0.645) is greater than 0.05. On average, the model's predictions are consistent with the observed values.
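
Because each predicted value is paired with its observed value, a paired t-test on the differences is arguably the more appropriate check here; a minimal sketch using the same data frame:

# Paired version: tests whether the mean prediction error differs from zero
t.test(data1$Actual_MPG, data1$Predicted_MPG, paired = TRUE)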

VIII. Would you want to revise your answer to (v) given your results in (vi) and (vii).

Based on the results in (VI) and (VII), I would not revise my answer to (V): although the diagnostics flagged non-normal residuals and multicollinearity, the t-test shows no significant difference between the observed and predicted MPG, so the estimated model remains usable for prediction.

MACHINE LEARNING

Load the Following Libraries

library(recipes)
library(lava)
library(sjmisc)
library(igraph)
library(e1071)
library(hardhat)
library(ipred)
library(caret)
library(sjPlot)
library(nnet)
library(wakefield)
library(kknn)
library(dplyr)
library(caTools)
library(ROCR)
library(stargazer)
library(ISLR)
library(ISLR2)
library(MASS)
library(splines)
library(splines2)
library(pROC)
library(randomForest)
library(rpart)
library(rpart.plot)
library(rattle)
library(party)
library(partykit)
library(ggplot2)
library(tune)
library(TunePareto)

KNN Model

set.seed(333)
trControl <- trainControl(method = 'repeatedcv',
                          number = 10,
                          repeats = 3)
FIT <- train(MPG~GallonsPer100Miles+Cylinders + Displacement100ci + Horsepower100 + Weight1000lb + Seconds0to60, 
             data = train_data,
             tuneGrid = expand.grid(k=1:70),
             method = 'knn',
             trControl = trControl,
             preProc = c('center', 'scale'))

Model Performance

FIT
k-Nearest Neighbors 

316 samples
  6 predictor

Pre-processing: centered (6), scaled (6) 
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 286, 285, 283, 284, 284, 284, ... 
Resampling results across tuning parameters:

  k   RMSE      Rsquared   MAE     
   1  1.783395  0.9465330  1.267884
   2  1.578264  0.9588961  1.083716
   3  1.640619  0.9555809  1.118581
   4  1.771783  0.9496549  1.240077
   5  1.886566  0.9432727  1.294011
   6  1.993973  0.9362281  1.360566
   7  2.092084  0.9292225  1.444971
   8  2.168548  0.9234858  1.479690
   9  2.197972  0.9211316  1.507198
  10  2.233672  0.9181431  1.525158
  11  2.289286  0.9137305  1.558220
  12  2.324264  0.9112160  1.585555
  13  2.328740  0.9112009  1.597873
  14  2.369729  0.9077910  1.627473
  15  2.399360  0.9053807  1.657008
  16  2.427778  0.9033158  1.689657
  17  2.456775  0.9012417  1.718322
  18  2.475448  0.8999278  1.733581
  19  2.492300  0.8987327  1.750486
  20  2.505375  0.8979640  1.765105
  21  2.528126  0.8962883  1.789002
  22  2.545789  0.8949811  1.809281
  23  2.568144  0.8929809  1.830354
  24  2.593261  0.8907865  1.848483
  25  2.615129  0.8890966  1.863240
  26  2.644319  0.8865205  1.884953
  27  2.668191  0.8844273  1.902565
  28  2.695351  0.8819705  1.923651
  29  2.711865  0.8806072  1.939511
  30  2.730406  0.8789415  1.953984
  31  2.747350  0.8774241  1.966602
  32  2.760929  0.8765021  1.980827
  33  2.768751  0.8760121  1.994704
  34  2.783753  0.8745695  2.008538
  35  2.800665  0.8731480  2.022487
  36  2.817350  0.8715688  2.038481
  37  2.846050  0.8690780  2.060828
  38  2.871581  0.8667788  2.080590
  39  2.890841  0.8651714  2.092504
  40  2.915780  0.8627445  2.112087
  41  2.934438  0.8609307  2.128435
  42  2.958642  0.8585409  2.152531
  43  2.975195  0.8570089  2.169251
  44  2.991043  0.8555035  2.184206
  45  3.006775  0.8540882  2.199963
  46  3.015617  0.8534752  2.209508
  47  3.034689  0.8517103  2.223085
  48  3.047292  0.8507089  2.233995
  49  3.063336  0.8492785  2.249210
  50  3.079286  0.8479238  2.260989
  51  3.092237  0.8468482  2.276027
  52  3.111173  0.8452303  2.287992
  53  3.129631  0.8434392  2.302984
  54  3.144063  0.8421629  2.316153
  55  3.161100  0.8404277  2.329542
  56  3.178659  0.8388016  2.345005
  57  3.196145  0.8369504  2.359777
  58  3.211336  0.8355144  2.368638
  59  3.229410  0.8336502  2.379884
  60  3.244792  0.8319950  2.390782
  61  3.256934  0.8310370  2.402198
  62  3.269184  0.8298438  2.411623
  63  3.278817  0.8290036  2.418033
  64  3.290029  0.8281306  2.428655
  65  3.305502  0.8266414  2.439386
  66  3.313860  0.8261223  2.447875
  67  3.325855  0.8252619  2.456941
  68  3.340524  0.8239564  2.469927
  69  3.353495  0.8227274  2.483024
  70  3.364866  0.8217960  2.494411

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 2.

Check the model type

FIT$modelType
[1] "Regression"

Obtain the coefficient names

FIT$coefnames
[1] "GallonsPer100Miles" "Cylinders"          "Displacement100ci" 
[4] "Horsepower100"      "Weight1000lb"       "Seconds0to60"      

The results above show that we ran k-Nearest Neighbors on 316 samples with 6 predictors, using 10-fold cross-validation repeated three times: the data are divided into ten folds, nine folds are used for model estimation and the remaining fold for assessment and validation, rotating through all ten folds. RMSE was used to select the optimal model (the smallest value), and the final value used for the model was k = 2.

Plot the model

plot(FIT, xlab = "Nearest Neighbors", main = "A Plot of Root Mean Square Error (RMSE) for the K-Nearest Neighbors")


The graph above shows that the lowest value of RMSE is found at k = 2. After that point, RMSE increases steadily.

Variable Importance

varImp(FIT)
loess r-squared variable importance

                   Overall
GallonsPer100Miles  100.00
Weight1000lb         76.94
Horsepower100        74.09
Displacement100ci    71.72
Cylinders            58.38
Seconds0to60          0.00

Plot variable importance

plot(varImp(FIT), xlab = "Percentage Importance", 
     ylab ="Variables", main = "Variable Importance for the KNN Model")

The output above ranks the variables in our model from most important to least important. GallonsPer100Miles is by far the most important variable in our KNN model, while Seconds0to60 is the least important.

Prediction

pred <- predict(FIT, newdata = test_data)

NOTE

Because the response variable is numeric and continuous, we calculate the root mean squared error (RMSE) for assessment.

RMSE(pred, test_data$MPG)
[1] 2.614515

Make the Plot

plot(pred, test_data$MPG)

We Can Choose R-squared for Model Evaluation

Note that this refit uses the test split (test_data, 76 rows) as its training data, which is why the resampling summary below reports 76 samples.

FIT <- train(MPG~GallonsPer100Miles+Cylinders + Displacement100ci + Horsepower100 + Weight1000lb + Seconds0to60, 
             data = test_data,
             tuneGrid = expand.grid(k=1:70),
             method = 'knn',
             metric = 'Rsquared',
             trControl = trControl,
             preProc = c('center', 'scale'))

Model Performance

FIT
k-Nearest Neighbors 

76 samples
 6 predictor

Pre-processing: centered (6), scaled (6) 
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 68, 68, 68, 69, 68, 68, ... 
Resampling results across tuning parameters:

  k   RMSE      Rsquared   MAE     
   1  3.642680  0.8285006  2.631189
   2  3.381405  0.8510568  2.313700
   3  3.286426  0.8518784  2.296541
   4  3.308465  0.8533886  2.350619
   5  3.292164  0.8564694  2.379178
   6  3.364698  0.8492442  2.467220
   7  3.376008  0.8517287  2.499675
   8  3.369542  0.8497168  2.528384
   9  3.394021  0.8517873  2.591897
  10  3.388890  0.8519823  2.569699
  11  3.435969  0.8496311  2.593895
  12  3.496551  0.8458750  2.645470
  13  3.541179  0.8414177  2.678483
  14  3.605789  0.8358056  2.723712
  15  3.645515  0.8336030  2.779853
  16  3.707619  0.8300868  2.856962
  17  3.798858  0.8231895  2.934223
  18  3.874223  0.8162662  2.998468
  19  3.906228  0.8168928  3.053208
  20  3.965166  0.8143006  3.141060
  21  4.012164  0.8128860  3.213832
  22  4.075422  0.8092140  3.272805
  23  4.130741  0.8052150  3.334578
  24  4.130002  0.8084642  3.342645
  25  4.173709  0.8070970  3.411567
  26  4.253053  0.8014468  3.490420
  27  4.353496  0.7950761  3.608816
  28  4.456611  0.7879175  3.707909
  29  4.551243  0.7848085  3.827875
  30  4.646857  0.7821657  3.935700
  31  4.753717  0.7764822  4.058242
  32  4.831508  0.7750315  4.157505
  33  4.913362  0.7711757  4.244205
  34  4.986697  0.7685962  4.328727
  35  5.061106  0.7677898  4.404852
  36  5.111894  0.7718253  4.467672
  37  5.162866  0.7758691  4.511663
  38  5.228514  0.7772398  4.576750
  39  5.308262  0.7750717  4.656066
  40  5.401852  0.7710140  4.740367
  41  5.508070  0.7640209  4.834506
  42  5.602335  0.7580722  4.917074
  43  5.696394  0.7514668  4.995238
  44  5.782067  0.7428961  5.072626
  45  5.856428  0.7405677  5.136728
  46  5.950933  0.7354029  5.219848
  47  6.052314  0.7305431  5.308826
  48  6.150310  0.7206875  5.389123
  49  6.241525  0.7176369  5.472246
  50  6.356709  0.7112680  5.568984
  51  6.467270  0.7004606  5.668242
  52  6.584435  0.6930337  5.770255
  53  6.717015  0.6840864  5.881993
  54  6.847536  0.6727100  5.994121
  55  6.983855  0.6559024  6.119296
  56  7.116343  0.6399409  6.238169
  57  7.232595  0.6234157  6.343109
  58  7.352577  0.6112472  6.451494
  59  7.466454  0.6000516  6.556106
  60  7.578958  0.5908322  6.651006
  61  7.689233  0.5768923  6.744090
  62  7.794351  0.5739977  6.828971
  63  7.899706  0.5751545  6.913336
  64  8.019408  0.5657804  7.013417
  65  8.145226  0.5589981  7.116412
  66  8.256982  0.5323849  7.206872
  67  8.354644  0.5100389  7.283514
  68  8.436378  0.4836368  7.346707
  69  8.462588  0.4507053  7.364396
  70  8.465873  0.4424406  7.366180

Rsquared was used to select the optimal model using the largest value.
The final value used for the model was k = 5.

plot(FIT)

Variable Importance

varImp(FIT)
loess r-squared variable importance

                   Overall
GallonsPer100Miles  100.00
Weight1000lb         58.38
Displacement100ci    57.55
Horsepower100        54.10
Cylinders            47.17
Seconds0to60          0.00

Plot variable Importance

plot(varImp(FIT), xlab = "Percentage Importance", 
     ylab ="Variables", main = "Variable Importance for the KNN Model")

Prediction

pred_knn <- predict(FIT, newdata = test_data)
test_data <- data.frame(test_data, pred_knn)
head(test_data,5)

Check the Structure of the Data Set

str(test_data)
'data.frame':   76 obs. of  10 variables:
 $ GallonsPer100Miles: num  5.9 7.1 7.1 5.6 4 4.2 10 9.1 4 6.3 ...
 $ MPG               : num  17 14 14 18 25 24 10 11 25 16 ...
 $ Cylinders         : int  8 8 8 6 4 4 8 8 4 6 ...
 $ Displacement100ci : num  3.02 4.4 4.55 1.99 1.1 1.07 3.07 3.18 1.13 2.25 ...
 $ Horsepower100     : num  1.4 2.15 2.25 0.97 0.87 0.9 2 2.1 0.95 1.05 ...
 $ Weight1000lb      : num  3.45 4.31 4.42 2.77 2.67 ...
 $ Seconds0to60      : num  10.5 8.5 10 15.5 17.5 14.5 15 13.5 14 15.5 ...
 $ Name              : chr  "ford torino" "plymouth fury iii" "pontiac catalina" "amc hornet" ...
 $ pred_reg          : num  18.2 14.4 15 19.6 26.8 ...
 $ pred_knn          : num  14.6 14.4 14 19.3 25.3 ...

Plot the Actual Response Variable and Predicted Values

library(ggplot2)
library(ggthemes)
# Combine data into a data frame
data2 <- data.frame(Actual_MPG = test_data$MPG, Predicted_MPG = test_data$pred_knn)

# Create line plot
ggplot(data2, aes(x = 1:nrow(data2))) +
  geom_line(aes(y = Actual_MPG, color = "Actual MPG")) +
  geom_line(aes(y = Predicted_MPG, color = "Predicted MPG")) +
  scale_color_manual(name = "Variable", values = c("Actual MPG" = "blue", "Predicted MPG" = "red")) +
  labs(x = "Observation Index", y = "MPG", title = "Actual vs Predicted MPG") +
  theme_economist()

NOTE

Because the response variable is numeric and continuous, we calculate the root mean squared error (RMSE) for assessment.

RMSE(pred_knn, data2$Actual_MPG)
[1] 2.879546

Make the Plot

plot(pred_knn, data2$Actual_MPG, xlab = "Prediction", ylab = "mpg", main = "The scatter plot of predicted and actual mpg")

Not much improvement over the earlier KNN fit.

Interpreting a k-Nearest Neighbors (k-NN) regression model is somewhat different from interpreting a traditional linear regression model. In k-NN regression, instead of fitting a mathematical equation to your data, the model makes predictions based on the values of the k nearest data points in the training dataset. Here’s how you can interpret a k-NN regression model:

Prediction Process:

For each data point you want to predict, the k-NN algorithm identifies the k nearest data points in the training dataset based on some distance metric (e.g., Euclidean distance). It then calculates the average (or weighted average) of the target values (the values you are trying to predict) for those k nearest data points, and this average is used as the prediction for the new data point, as in the sketch below.
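
A minimal sketch of this prediction step in base R, assuming a numeric training matrix X_train, a response vector y_train, and a single new point x_new (the helper name knn_predict_one is ours, for illustration):

# Predict the response for one new point as the mean of its k nearest neighbors
knn_predict_one <- function(X_train, y_train, x_new, k = 3) {
  # Euclidean distance from x_new to every training row
  d <- sqrt(rowSums(sweep(X_train, 2, x_new)^2))
  # Average the responses of the k closest training points
  mean(y_train[order(d)[1:k]])
}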

Tuning Parameter (k):

The most important parameter in k-NN regression is k, the number of nearest neighbors to consider. A small k (e.g., 1 or 3) may result in a model that closely follows the training data but is sensitive to noise. A large k (e.g., 10 or more) may result in a smoother prediction surface but might not capture local variations. The choice of k should be based on cross-validation and the characteristics of your data.

Distance Metric:

The choice of distance metric (e.g., Euclidean, Manhattan, etc.) can impact the model’s performance. Different distance metrics can lead to different interpretations of “closeness” among data points.

Non-Linearity:

Unlike linear regression, k-NN regression can capture non-linear relationships in the data. Interpretation may not involve coefficients or slope values as in linear regression.

Local vs. Global Patterns:

k-NN regression captures local patterns in the data. Interpretation involves understanding the local behavior around a prediction point. The model does not provide a global equation that describes the entire dataset.

Visualizations:

Visualizations can be helpful for interpretation. Plotting the k nearest neighbors for specific data points can provide insights into why the model made a particular prediction.

Feature Importance:

k-NN does not provide feature importance scores like some other models (e.g., Random Forest or Gradient Boosting). However, you can still analyze feature importance indirectly by examining which features are most influential in determining the nearest neighbors.

Performance Metrics:

Use appropriate regression evaluation metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), or R-squared to assess model performance. These metrics can give you a sense of how well the k-NN regression model fits the data.

Rescaling Features:

Scaling and normalization of features can significantly affect the results, as k-NN is sensitive to the scale of input variables. Interpretation may involve considering the effects of feature scaling.

In summary, interpreting a k-NN regression model involves understanding its prediction process, the choice of hyperparameters (especially k), the impact of distance metrics, and the local nature of the model's predictions. Visualizations and performance metrics play a crucial role in assessing and explaining the model's behavior on your specific dataset.

K-Nearest Neighbors

In k-Nearest Neighbors (k-NN) regression, accuracy is not typically used as an evaluation metric because k-NN regression is a type of regression, not classification. Accuracy is more suitable for classification problems where you’re predicting discrete class labels.

For regression tasks, you typically use different evaluation metrics to assess the performance of your model. Common metrics for regression tasks include:

Mean Absolute Error (MAE):

It measures the average absolute difference between the predicted and actual values. It is less sensitive to outliers compared to Mean Squared Error.

Mean Squared Error (MSE):

It measures the average of the squared differences between predicted and actual values. It gives higher weight to larger errors.

Root Mean Squared Error (RMSE):

It is the square root of the MSE and provides an interpretable measure of the average error in the same units as the target variable.

R-squared (R2):

It measures the proportion of the variance in the dependent variable that is predictable from the independent variables. A higher R-squared indicates a better fit.
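
These metrics are simple to compute by hand; a sketch in base R, assuming numeric vectors actual and predicted of equal length (the helper name regression_metrics is ours):

# Illustrative helper computing the four regression metrics above
regression_metrics <- function(actual, predicted) {
  err <- actual - predicted
  c(MAE  = mean(abs(err)),
    MSE  = mean(err^2),
    RMSE = sqrt(mean(err^2)),
    R2   = 1 - sum(err^2) / sum((actual - mean(actual))^2))
}

# Example: regression_metrics(test_data$MPG, pred_knn)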

Support Vector Machine (Support Vector Regressor)

Support Vector Regressor (SVR) is a supervised machine learning algorithm used for regression tasks. It’s based on the concept of Support Vector Machines (SVMs) typically used for classification problems. Here’s how SVR works:

Objective:

The goal of SVR is to find a hyperplane (in high dimensional space for non-linear cases) that best fits the training data while minimizing the prediction error. This hyperplane minimizes the distance to the closest data points, called support vectors. Unlike linear regression, SVR allows for a certain margin of error around the hyperplane.

Key Points:

Non-linearity:

SVR can handle non-linear relationships between features and target variables using kernel functions. These functions project the data into a higher-dimensional space where a linear relationship might exist.

Robustness:

SVR is less sensitive to outliers compared to some other regression methods. This is because it focuses on the support vectors rather than being influenced by all data points equally.

Regularization:

SVR inherently performs regularization by controlling the margin of error around the hyperplane. This helps to prevent overfitting.

R Code for Support Vector Regression with the kernlab Package

1. Install and Load Packages:

# Install kernlab package if not already installed
if(!require(kernlab)) install.packages("kernlab")
library(kernlab)
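
For reference, the same kind of model can also be fit directly with kernlab's ksvm(); a sketch, with type, C, and epsilon chosen here for illustration only:

# Sketch: epsilon-SVR with an RBF kernel, fit directly via kernlab
svr_direct <- ksvm(MPG ~ GallonsPer100Miles + Cylinders + Displacement100ci +
                     Horsepower100 + Weight1000lb + Seconds0to60,
                   data = train_data, type = "eps-svr",
                   kernel = "rbfdot", C = 1, epsilon = 0.1)
# predict(svr_direct, test_data) would return the predicted MPG values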

2. Prepare Data:

  • Split your data into training and testing sets.

  • Ensure your target variable is numeric.

set.seed(16)
ind <- sample(2, nrow(data), replace = TRUE, prob = c(0.8, 0.2))

train_data <- data[ind == 1, ]
test_data <- data[ind == 2, ]

3. Define SVR Model

library(caret)
model_svm <- train(
  MPG ~ GallonsPer100Miles + Cylinders + Displacement100ci + Horsepower100 + Weight1000lb + Seconds0to60,
  data = train_data,
  method = 'svmRadial',
  preProcess = c("center", "scale")
  # no trControl supplied: caret's default bootstrap resampling (25 reps)
  # is used, which matches the output below
)
model_svm
Support Vector Machines with Radial Basis Function Kernel 

316 samples
  6 predictor

Pre-processing: centered (6), scaled (6) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 316, 316, 316, 316, 316, 316, ... 
Resampling results across tuning parameters:

  C     RMSE      Rsquared   MAE      
  0.25  2.181329  0.9215166  1.0832816
  0.50  1.855165  0.9412598  0.9165128
  1.00  1.592084  0.9555932  0.8222594

Tuning parameter 'sigma' was held constant at a value of 0.4651685
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.4651685 and C = 1.

Next, find the R-squared and RMSE for the training and test data.

pred_tr_svm <- predict(model_svm, train_data)
pred_tst_svm <- predict(model_svm, test_data)

fit_ind_tr_svm <- data.frame(
  R2 = R2(pred_tr_svm, train_data$MPG),
  RMSE = RMSE(pred_tr_svm, train_data$MPG)
)

fit_ind_tst_svm <- data.frame(
  R2 = R2(pred_tst_svm, test_data$MPG),
  RMSE = RMSE(pred_tst_svm, test_data$MPG)
)

Comparison Table of Train & Test Data

data.frame(
  Model = c("SVM Train", "SVM Test"),
  R2 = c(fit_ind_tr_svm$R2, fit_ind_tst_svm$R2),
  RMSE = c(fit_ind_tr_svm$RMSE, fit_ind_tst_svm$RMSE)
)

The R-squared of the SVM model on the test data is 0.9309, which means the independent variables explain about 93.09% of the variance in MPG on the test set. The fit indices for the training and test data differ by less than five percentage points, so there is no sign of over-fitting or under-fitting in the model.

Cross Validation

Leave-One-Out Cross Validation

Leave-One-Out Cross Validation (LOOCV) is a technique used to estimate the performance of a machine learning model. Here’s a breakdown of how it works:

1. Individual Removal:

LOOCV iterates through your entire dataset one sample at a time. For each iteration, it removes a single sample from the dataset. This removed sample becomes the “test set” for that particular iteration.

2. Model Training:

The remaining data points (all except the removed sample) become the “training set”. The model is then trained using this training set.

3. Prediction and Evaluation:

Once the model is trained, it’s used to predict the value of the target variable for the removed “test set” sample. This prediction is compared to the actual value of the target variable in the removed sample. The error between the predicted and actual value is calculated (e.g., squared error for regression).

4. Repetition and Averaging:

This process of removing a sample, training the model, predicting, and calculating error is repeated for every single sample in the dataset. Each sample gets a chance to be the “test set” once. Finally, the errors from all iterations are averaged to obtain a single estimate of the model’s performance on unseen data.
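
The loop is easy to write by hand; a sketch using the linear model for brevity (caret automates the equivalent procedure for the SVM below):

# Manual LOOCV: hold out one row at a time, refit, and record the error
n <- nrow(train_data)
errs <- numeric(n)
for (i in seq_len(n)) {
  fit <- lm(MPG ~ GallonsPer100Miles + Cylinders + Displacement100ci +
              Horsepower100 + Weight1000lb + Seconds0to60,
            data = train_data[-i, ])                      # train on all but row i
  errs[i] <- train_data$MPG[i] - predict(fit, train_data[i, ])  # held-out error
}
sqrt(mean(errs^2))  # LOOCV estimate of the RMSE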

tcr_loocv <- trainControl(method = "LOOCV")
model_loocv <- train(
  MPG~GallonsPer100Miles+Cylinders + Displacement100ci + Horsepower100 + Weight1000lb + Seconds0to60,
  data = train_data, 
  method="svmRadial", 
  trControl = tcr_loocv)

pred_tst_loocv <- predict(model_loocv, test_data)

fit_ind_tst_loocv <- data.frame(
  R2 = R2(pred_tst_loocv, test_data$MPG),
  RMSE = RMSE(pred_tst_loocv, test_data$MPG)
)

View the Trained Model and Its Test-Set Fit Indices

print(model_loocv)
Support Vector Machines with Radial Basis Function Kernel 

316 samples
  6 predictor

No pre-processing
Resampling: Leave-One-Out Cross-Validation 
Summary of sample sizes: 315, 315, 315, 315, 315, 315, ... 
Resampling results across tuning parameters:

  C     RMSE      Rsquared   MAE      
  0.25  2.249977  0.9214763  1.0396851
  0.50  1.848800  0.9458065  0.8437484
  1.00  1.509427  0.9640092  0.7274306

Tuning parameter 'sigma' was held constant at a value of 0.4126495
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.4126495 and C = 1.

fit_ind_tst_loocv

K-Fold Cross Validation

K-Fold Cross Validation (K-Fold CV) and Repeated K-Fold Cross Validation (Repeated K-Fold CV) are both techniques used to evaluate the performance of machine learning models. They share some similarities but also have key differences:

K-Fold Cross Validation:

Data Splitting:

K-Fold CV splits the data into k folds (groups) of (almost) equal size. A common choice for k is 10.

Iteration:

The following steps are repeated k times: one fold is chosen as the test set for evaluation; the remaining k-1 folds are combined to form the training set; the model is trained on the training set; the trained model makes predictions on the test set; and performance is evaluated with a metric such as accuracy, error rate, or AUC (for classification) or R-squared and RMSE (for regression).

Performance Estimation:

The performance metrics from all k iterations are averaged to obtain a final estimate of the model's generalizability (its performance on unseen data).

Feature              | LOOCV                                  | K-Fold CV
Data Splitting       | n folds (one sample each)              | k folds (equal size)
Iterations           | n                                      | k
Computational Cost   | High (n model trainings)               | Lower (k model trainings)
Performance Estimate | Can be pessimistic                     | Less variance
Common Use Cases     | Small datasets, understanding concepts | Most practical scenarios
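
To make the fold mechanics concrete, here is a sketch of one 10-fold pass done by hand with caret's createFolds(), again using the linear model for brevity:

# Manual 10-fold CV: createFolds() returns ten held-out index sets
folds <- createFolds(train_data$MPG, k = 10)
cv_rmse <- sapply(folds, function(idx) {
  fit <- lm(MPG ~ GallonsPer100Miles + Cylinders + Displacement100ci +
              Horsepower100 + Weight1000lb + Seconds0to60,
            data = train_data[-idx, ])        # train on the other nine folds
  sqrt(mean((train_data$MPG[idx] - predict(fit, train_data[idx, ]))^2))
})
mean(cv_rmse)  # average held-out RMSE across the ten folds
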
tcr_cv <- trainControl(method = "cv", number = 10)
model_cv <- train(
  MPG~GallonsPer100Miles+Cylinders + Displacement100ci + Horsepower100 + Weight1000lb + Seconds0to60,
  data = train_data, 
  method="svmRadial", 
  trControl = tcr_cv)

pred_tst_cv <- predict(model_cv, test_data)

fit_ind_tst_cv <- data.frame(
  R2 = R2(pred_tst_cv, test_data$MPG),
  RMSE = RMSE(pred_tst_cv, test_data$MPG)
)

View the Model

model_cv
Support Vector Machines with Radial Basis Function Kernel 

316 samples
  6 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 284, 284, 284, 286, 284, 285, ... 
Resampling results across tuning parameters:

  C     RMSE      Rsquared   MAE      
  0.25  2.218443  0.9227143  1.1399095
  0.50  1.810435  0.9461030  0.9165103
  1.00  1.484589  0.9642400  0.7832534

Tuning parameter 'sigma' was held constant at a value of 0.5265069
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.5265069 and C = 1.

View the Fit Indices on the Testing Data

fit_ind_tst_cv

Repeated K-fold Cross Validation

Repeated K-Fold Cross Validation:

Outer Loop: Repeated K-Fold CV introduces an additional outer loop that repeats the entire K-Fold CV process r times (e.g., r = 3).

Inner K-Fold CV: Within each outer-loop iteration, the regular K-Fold CV procedure (described above) is performed with the chosen value of k.

Performance Aggregation: After completing all r x k iterations, the performance metrics are collected and averaged across all folds and repetitions.

Key Differences:

Number of Iterations: K-Fold CV iterates k times, while Repeated K-Fold CV iterates k times within each of the r outer-loop repetitions.

Variance Reduction: Repeated K-Fold CV aims to reduce the variance of the performance estimate obtained from a single run of K-Fold CV. Different data splits can lead to slightly different performance estimates; by repeating the process multiple times, Repeated K-Fold CV provides a more stable estimate.

Computational Cost: Repeated K-Fold CV is computationally more expensive than K-Fold CV due to the additional outer-loop repetitions.

Choosing Between Them:

K-Fold CV is a good starting point for most cases; it is simpler to implement and less computationally expensive. Consider Repeated K-Fold CV if you suspect high variance in the performance estimates from K-Fold CV and you have sufficient computational resources. In summary, K-Fold CV provides a basic and efficient way to evaluate model performance, while Repeated K-Fold CV adds an extra layer of stability by averaging estimates from multiple runs.

Feature                  | K-Fold Cross Validation                                | Repeated K-Fold Cross Validation
Definition               | Divides the dataset into k subsets                     | Similar to k-fold, but repeated multiple times with different random splits of the data
Number of Iterations     | One iteration                                          | Multiple iterations (repeats)
Randomization            | Data is shuffled once and divided into k folds        | New random splits are created for each repetition
Variability Reduction    | Provides a single estimate of model performance       | Reduces variability by averaging metrics over multiple repetitions
Performance Estimation   | Based on one split of the data                         | More robust, averaged over multiple splits
Computational Expense    | Less expensive than repeated k-fold                    | More expensive due to multiple repetitions, especially with large datasets
Use Cases                | Initial evaluation or limited computational resources | When a more reliable estimate is required, or with small datasets
Implementation           | Easily implemented with built-in functions            | May require custom code or functions that support repeated cross-validation

tcr_cv <- trainControl(method = "repeatedcv", number = 10, repeats=3)
model_rep_cv <- train(
  MPG~GallonsPer100Miles+Cylinders + Displacement100ci + Horsepower100 + Weight1000lb + Seconds0to60,
  data = train_data, 
  method="svmRadial", 
  trControl = tcr_cv)

pred_tst_rep_cv <- predict(model_rep_cv, test_data)

fit_ind_tst_rep_cv <- data.frame(
  R2 = R2(pred_tst_rep_cv, test_data$MPG),
  RMSE = RMSE(pred_tst_rep_cv, test_data$MPG)
)

View the Model

model_rep_cv
Support Vector Machines with Radial Basis Function Kernel 

316 samples
  6 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 284, 283, 284, 284, 286, 284, ... 
Resampling results across tuning parameters:

  C     RMSE      Rsquared   MAE      
  0.25  2.369185  0.9118999  1.2007868
  0.50  1.984428  0.9363234  0.9875723
  1.00  1.668257  0.9555694  0.8582554

Tuning parameter 'sigma' was held constant at a value of 0.6495909
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.6495909 and C = 1.

View the Fit Indices on the Testing Data

fit_ind_tst_rep_cv

Compare the Four Models

fit_indices_table <- data.frame(
  Model = c("SVM", "LOOCV", "k-fold CV", "Repeated k-fold CV"),
  R2 = c(fit_ind_tst_svm$R2, fit_ind_tst_loocv$R2, fit_ind_tst_cv$R2, fit_ind_tst_rep_cv$R2),
  RMSE = c(fit_ind_tst_svm$RMSE, fit_ind_tst_loocv$RMSE, fit_ind_tst_cv$RMSE, fit_ind_tst_rep_cv$RMSE)
)

fit_indices_table

The best model is

best_model <- fit_indices_table[which.max(fit_indices_table$R2), ]
print(best_model)
  Model        R2     RMSE
2 LOOCV 0.9630812 1.770673

Hence, the best model is the one tuned with Leave-One-Out Cross Validation (LOOCV), and it can be used for better prediction results.

4. Make Predictions:

pred_tst_loocv <- predict(model_loocv, test_data)
# Combine data into a data frame
data3 <- data.frame(Actual_MPG = test_data$MPG, Predicted_MPG = pred_tst_loocv)
head(data3,15)

5. Plot the Predicted and Actual MPG

# Create line plot
ggplot(data3, aes(x = 1:nrow(data3))) +
  geom_line(aes(y = Actual_MPG, color = "Actual MPG")) +
  geom_line(aes(y = Predicted_MPG, color = "Predicted MPG")) +
  scale_color_manual(name = "Variable", values = c("Actual MPG" = "blue", "Predicted MPG" = "red")) +
  labs(x = "Observation Index", y = "MPG", title = "Actual vs Predicted MPG") +
  theme_economist()

This predicts the target variable for your testing data (test_data).