Use the data attached to this document and run a multiple regression to determine whether the vehicle characteristics significantly predict fuel efficiency (MPG). Your report should include the following:

Load the data

library(stargazer)
library(ggplot2)
library(tidyverse)
library(sjPlot)
library(corrplot)
data <- read.csv("regression.csv")
head(data,5)

Check the structure of the data

str(data)
'data.frame':   392 obs. of  8 variables:
 $ GallonsPer100Miles: num  5.6 6.7 5.6 6.3 5.9 6.7 7.1 7.1 7.1 6.7 ...
 $ MPG               : num  18 15 18 16 17 15 14 14 14 15 ...
 $ Cylinders         : int  8 8 8 8 8 8 8 8 8 8 ...
 $ Displacement100ci : num  3.07 3.5 3.18 3.04 3.02 4.29 4.54 4.4 4.55 3.9 ...
 $ Horsepower100     : num  1.3 1.65 1.5 1.5 1.4 1.98 2.2 2.15 2.25 1.9 ...
 $ Weight1000lb      : num  3.5 3.69 3.44 3.43 3.45 ...
 $ Seconds0to60      : num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
 $ Name              : chr  "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...

I. The fitted linear model and an interpretation of the model equation using up to the 300th observation.

In this paper, MPG (miles per gallon) is our dependent variable. MPG represents the fuel efficiency of the vehicles, a common dependent variable in automotive analysis: it measures how efficiently a vehicle uses fuel to travel a given distance. In this dataset, variables such as GallonsPer100Miles or Seconds0to60 can be treated as independent variables that affect MPG.

# Select the first 300 rows
sample_data <- data[1:300, ]
head(sample_data,5)
attach(sample_data)

View the Results Using stargazer() and tab_model()

model <- lm(MPG~GallonsPer100Miles+Cylinders + Displacement100ci + Horsepower100 + Weight1000lb + Seconds0to60, data = sample_data)
stargazer(model, type = "text")

===============================================
                        Dependent variable:    
                    ---------------------------
                                MPG            
-----------------------------------------------
GallonsPer100Miles           -3.336***         
                              (0.175)          
                                               
Cylinders                      0.195           
                              (0.220)          
                                               
Displacement100ci             -0.541           
                              (0.468)          
                                               
Horsepower100                3.115***          
                              (0.896)          
                                               
Weight1000lb                 -1.650***         
                              (0.441)          
                                               
Seconds0to60                  0.143**          
                              (0.070)          
                                               
Constant                     37.912***         
                              (1.481)          
                                               
-----------------------------------------------
Observations                    300            
R2                             0.902           
Adjusted R2                    0.900           
Residual Std. Error      1.999 (df = 293)      
F Statistic          450.982*** (df = 6; 293)  
===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01
tab_model(model,
          show.se = TRUE,
          show.stat = TRUE)
                                     MPG
Predictors          Estimates  std. Error  CI              Statistic  p
(Intercept)             37.91        1.48  35.00 – 40.83       25.60  <0.001
GallonsPer100Miles      -3.34        0.17  -3.68 – -2.99      -19.08  <0.001
Cylinders                0.20        0.22  -0.24 – 0.63         0.89   0.375
Displacement100ci       -0.54        0.47  -1.46 – 0.38        -1.16   0.249
Horsepower100            3.12        0.90   1.35 – 4.88         3.48   0.001
Weight1000lb            -1.65        0.44  -2.52 – -0.78       -3.74  <0.001
Seconds0to60             0.14        0.07   0.01 – 0.28         2.04   0.042
Observations           300
R2 / R2 adjusted       0.902 / 0.900

Obtain the Performance Metrics

# Collect the fit metrics in one vector and print them as a bulleted list
# (note: the adjusted R-squared is stored as adj.r.squared, not adj.r2)
metrics <- c(
  paste("Root Mean Squared Error (RMSE):", sqrt(mean(model$residuals^2))),
  paste("R-squared:", summary(model)$r.squared),
  paste("Adjusted R-squared:", summary(model)$adj.r.squared),
  paste("AIC:", AIC(model)),
  paste("BIC:", BIC(model))
)
cat(paste0("* ", metrics), sep = "\n")
* Root Mean Squared Error (RMSE): 1.97551731955362
* R-squared: 0.902297279731393
* Adjusted R-squared: 0.90029654143
* AIC: 1275.86129872615
* BIC: 1305.4915585234

Report the Results Using the report Package

library(report)
report_table(model)

II. The significance of the explanatory variables.

The regression analysis reveals that the dependent variable, MPG (miles per gallon), is significantly influenced by several independent variables. GallonsPer100Miles has a strong negative relationship with MPG (coef. = -3.336, p < 0.001), indicating that as fuel consumption per 100 miles increases, MPG decreases. Horsepower100 has a significant positive coefficient (coef. = 3.115, p = 0.001): holding the other predictors constant, higher horsepower is associated with higher MPG, a counterintuitive sign that likely reflects the multicollinearity examined later in this report.

Weight1000lb also negatively impacts MPG (coef. = -1.650, p < 0.001), implying that heavier vehicles tend to have lower fuel efficiency, and Seconds0to60 is significant at the 5% level (coef. = 0.143, p = 0.042). In contrast, Cylinders and Displacement100ci show no statistically significant relationship with MPG. The overall model is highly significant (F statistic = 450.982, p < 0.001) and explains approximately 90.2% of the variance in MPG. Therefore, these variables collectively provide a good understanding of the factors affecting fuel efficiency in vehicles.
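The coefficient table behind these statements can be pulled directly from the fitted model; a quick base-R sketch:

# Estimates, standard errors, t statistics and p-values in one table
round(summary(model)$coefficients, 3)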

III. The contribution of the explanatory variables toward the variations in the dependent variable.

In the model above, 90% of the variation in MPG is explained by the predictors, as indicated by an adjusted R-squared of 0.900. The following is the individual contribution of each predictor to the predicted variable.

library(relaimpo)
rel_imp <- calc.relimp(model, type = "lmg")
rel_imp
Response variable: MPG 
Total response variance: 40.07791 
Analysis based on 300 observations 

6 Regressors: 
GallonsPer100Miles Cylinders Displacement100ci Horsepower100 Weight1000lb Seconds0to60 
Proportion of variance explained by model: 90.23%
Metrics are not normalized (rela=FALSE). 

Relative importance metrics: 

                          lmg
GallonsPer100Miles 0.29081991
Cylinders          0.12976856
Displacement100ci  0.14284670
Horsepower100      0.12872477
Weight1000lb       0.17336539
Seconds0to60       0.03677194

Average coefficients for different model sizes: 

                           1X        2Xs         3Xs         4Xs         5Xs
GallonsPer100Miles  -3.738533 -3.4945755 -3.37725750 -3.33128765 -3.32524451
Cylinders           -2.935299 -1.1715011 -0.39684515 -0.13716122  0.03160863
Displacement100ci   -4.891259 -2.9386899 -1.52718236 -0.75064198 -0.50623353
Horsepower100      -12.582389 -5.5819036 -2.23306878 -0.29850580  1.45099326
Weight1000lb        -6.290453 -4.6067756 -3.59593620 -2.77444613 -2.12232776
Seconds0to60         1.037865 -0.1049534 -0.09138239 -0.03052353  0.05194459
                          6Xs
GallonsPer100Miles -3.3364268
Cylinders           0.1954997
Displacement100ci  -0.5408844
Horsepower100       3.1154685
Weight1000lb       -1.6502999
Seconds0to60        0.1430547

The relative importance metrics indicate how much each predictor contributes to the variance explained by the model.

The average coefficients for different model sizes show the estimated impact of each variable on MPG at various stages of model complexity, with coefficients becoming less extreme as more variables are added to the model. These results can be visualized as shown below.

library(ggplot2)
# Convert rel_imp$lmg to a data frame
df <- data.frame(variable = names(rel_imp$lmg), importance = rel_imp$lmg)

# Arrange the data frame in descending order of relative importance
df <- df[order(df$importance, decreasing = TRUE), ]

# Create the bar plot using ggplot2
ggplot(df, aes(x = reorder(variable, -importance), y = importance)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  geom_text(aes(label = paste0(round(importance * 100, 3), "%")), vjust = -0.5, size = 3) +  # Format labels as percentages
  labs(x = "Independent variables", y = "Relative Importance", 
       title = "A Bar plot of Relative Importance") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  

IV. Determine the overall significance of the fitted model.

The overall significance of the model is determined by the F statistic and its associated p-value. From the results above, the F statistic of 450.982 (df = 6; 293) is statistically significant (p < 0.001), indicating that the model is significant overall and can be used for prediction.
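For completeness, the F statistic and its p-value can be recovered directly from the model summary; a short base-R sketch:

# Overall F test: extract the statistic and compute its p-value
fstat <- summary(model)$fstatistic
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)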

V. Conduct model diagnostics on the fitted model to see if the model is appropriate for predicting future values of MPG.

I. Normality of the residuals

Model diagnostics include checking the normality of the regression residuals, the homoscedasticity of the error term, and the correlation among the predictors. Consider the following plot of the regression residuals.

library(forecast)
checkresiduals(model)


    Breusch-Godfrey test for serial correlation of order up to 10

data:  Residuals
LM test = 62.869, df = 10, p-value = 1.034e-09

The histogram shows fairly normally distributed residuals with some positive skew, which is not of much concern. The Breusch-Godfrey test, however, strongly rejects the null hypothesis of no serial correlation (p < 0.001), so the residuals are not independent. We can further assess normality using a Q-Q plot together with a histogram, as shown below.

library(car)
par(mfrow = c(1, 2))
# Q-Q plot
qqPlot(model$residuals, main = "Q-Q Plot of the regression residuals")
[1] 243 246
# Histogram
hist(model$residuals, col = "lightblue",breaks = 20, main = "Histogram of Residuals")

The plots show a fairly normal distribution of the regression residuals with a minor positive skew, which in this case is not a serious concern. Consider the following inferential results to check the normality of the residuals more formally.

# Shapiro-Wilk test (a formal normality test)
shapiro.test(model$residuals)  # Check p-value for normality

    Shapiro-Wilk normality test

data:  model$residuals
W = 0.86732, p-value = 2.179e-15

The Shapiro-Wilk test, with a p-value far below 0.05, indicates that the regression residuals do not follow a normal distribution. As noted earlier, the residuals appear positively skewed; given the sample size, however, this departure is not of great concern.

II. Homoscedasticity

This assumption requires that the variance of the regression residuals be constant across observations. If it is violated, the model suffers from heteroscedasticity. Consider the results below.

library(car)
library(tseries)

Heteroscedasticity

ncvTest(model)
Non-constant Variance Score Test 
Variance formula: ~ fitted.values 
Chisquare = 14.1872, Df = 1, p = 0.00016549

Since the p-value is less than the conventional significance level of 0.05, we reject the null hypothesis of constant variance. There is enough evidence to conclude that the variance of the residuals changes across the range of the fitted values; in other words, our model suffers from a non-constant error variance (heteroscedasticity).
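One common response to heteroscedasticity, sketched below under the assumption that the lmtest and sandwich packages are installed, is to re-test the coefficients with heteroscedasticity-robust standard errors:

library(lmtest)
library(sandwich)
# Re-test the coefficients using HC3 robust standard errors
coeftest(model, vcov = vcovHC(model, type = "HC3"))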

III. Multicollinearity

vif(model)
GallonsPer100Miles          Cylinders  Displacement100ci      Horsepower100 
          5.849554          11.193871          19.488669           9.754632 
      Weight1000lb       Seconds0to60 
         11.374802           2.807855 

A VIF of 1 indicates no correlation and hence no multicollinearity. A VIF between 1 and 5 indicates moderate correlation: some multicollinearity, but it may not be a major concern. A VIF greater than 5 indicates high correlation and significant multicollinearity, which can cause issues with the model. Based on the output above, our model suffers from multicollinearity, so the individual coefficient estimates should be treated with caution. Some variables are likely correlated. Let's look at the correlation plot below.

# Select the numeric columns from the data
numeric_data <- data[, sapply(data, is.numeric)]

# Calculate correlation matrix
correlation_matrix <- cor(numeric_data)
# Create a correlation heatmap using corrplot
corrplot(correlation_matrix,
         method = "color",            # Display correlations using color
         type = "upper",              # Show upper triangle of the correlation matrix
         tl.col = "black",            # Color of text labels
         tl.srt = 45,                 # Rotate text labels by 45 degrees
         diag = FALSE,                # Exclude diagonal elements
         addCoef.col = "black",       # Color of correlation coefficients
         col = colorRampPalette(c("blue", "white", "red"))(100),  # Custom color palette
         main = "Correlation Heatmap", # Main title
         pch = 16,                    # Use solid circles for data points
         cex.lab = 1.2,               # Adjust label size
         cex.main = 0.5 ,              # Adjust main title size
         cex.axis = 0.5 ,              # Adjust axis label size
         number.cex = 0.5)             # Adjust number size in the color legend

Some variables above are highly correlated. For instance, Displacement100ci and Weight1000lb are highly correlated, and Cylinders and Displacement100ci have a correlation coefficient of about 0.95, as shown in the graph. These are possible reasons why our model suffers from multicollinearity.
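One possible remedy, sketched below as an illustration rather than as part of the original specification, is to drop the most collinear predictors and re-check the VIFs:

# Illustrative reduced model: keep roughly one predictor per collinear group
model_reduced <- lm(MPG ~ GallonsPer100Miles + Horsepower100 + Seconds0to60,
                    data = sample_data)
vif(model_reduced)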

IV. AIC and BIC

library(report)
report_table(model)

VI. Use the model trained with the first 300 observations above to get the predicted values for the 301st to 392nd observations.

sample_data1 <- data[301:392,]
str(sample_data1)
'data.frame':   92 obs. of  8 variables:
 $ GallonsPer100Miles: num  2.9 3.1 2.7 3.5 3.5 3.7 3 2.4 2.6 3.1 ...
 $ MPG               : num  34.5 31.8 37.3 28.4 28.8 26.8 33.5 41.5 38.1 32.1 ...
 $ Cylinders         : int  4 4 4 4 6 6 4 4 4 4 ...
 $ Displacement100ci : num  1.05 0.85 0.91 1.51 1.73 1.73 1.51 0.98 0.89 0.98 ...
 $ Horsepower100     : num  0.7 0.65 0.69 0.9 1.15 1.15 0.9 0.76 0.6 0.7 ...
 $ Weight1000lb      : num  2.15 2.02 2.13 2.67 2.6 ...
 $ Seconds0to60      : num  14.9 19.2 14.7 16 11.3 12.9 13.2 14.7 18.8 15.5 ...
 $ Name              : chr  "plymouth horizon tc3" "datsun 210" "fiat strada custom" "buick skylark limited" ...

Prediction and a Plot of the Predicted and Actual MPG (Testing Data)

# Predict values on the test set
predictions <- predict(model, newdata = sample_data1)

# Add predictions as a new column to the test set
sample_data1$predicted_value <- predictions
head(sample_data1,10)

Plot

library(ggplot2)
library(ggthemes)
# Combine data into a data frame
data1 <- data.frame(Actual_MPG = sample_data1$MPG, Predicted_MPG = sample_data1$predicted_value)

# Create line plot
ggplot(data1, aes(x = 1:nrow(data1))) +
  geom_line(aes(y = Actual_MPG, color = "Actual MPG")) +
  geom_line(aes(y = Predicted_MPG, color = "Predicted MPG")) +
  scale_color_manual(name = "Variable", values = c("Actual MPG" = "blue", "Predicted MPG" = "red")) +
  labs(x = "Time Axis", y = "MPG", title = "Actual vs Predicted MPG") +
  theme_economist()

From the plot above, the model does not seem to perform well: the predicted MPG generally lies below the actual MPG, which is a serious concern. It means that our model is not correctly capturing the variation in MPG on the held-out data.

VII. Using an appropriate tool/method, determine whether there is a significant difference between the observed and the corresponding predicted values of MPG found in (VI), and draw your conclusion.

plot(sample_data1$MPG, sample_data1$predicted_value, xlab = "Observed MPG", ylab = "Predicted MPG", main = "Scatter plot of Observed and Predicted MPG")
abline(lm(sample_data1$MPG ~ sample_data1$predicted_value), col = "red")

A t-test to determine whether there is a significant difference between the observed and predicted MPG

T_Test <- t.test(sample_data1$MPG, sample_data1$predicted_value,
                 var.equal = TRUE)  # two-sample t-test assuming equal variances
T_Test

    Two Sample t-test

data:  MPG and predicted_value
t = 5.8389, df = 182, p-value = 2.369e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 2.582491 5.218629
sample estimates:
mean of x mean of y 
 31.94565  28.04509 

The results above indicate that there is a significant difference between the observed MPG and the predicted MPG, since the p-value is less than 0.05. This means that our model is not doing well in terms of prediction.
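Since each predicted value is paired with its observed value, a paired t-test is arguably the more appropriate comparison; a minimal sketch (the conclusion is expected to point the same way):

# Paired t-test on the matched observed/predicted values
t.test(sample_data1$MPG, sample_data1$predicted_value, paired = TRUE)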

VIII. Would you want to revise your answer to (v) given your results in (vi) and (vii).

Based on the results in (VI) and (VII), I would not revise my answer to (V), since those results confirm that the estimated model is not appropriate for prediction.

MACHINE LEARNING

Load the Following Libraries

library(recipes)
library(lava)
library(sjmisc)
library(igraph)
library(e1071)
library(hardhat)
library(ipred)
library(caret)
library(sjPlot)
library(nnet)
library(wakefield)
library(kknn)
library(dplyr)
library(caTools)
library(ROCR)
library(stargazer)
library(ISLR)
library(ISLR2)
library(MASS)
library(splines)
library(splines2)
library(pROC)
library(randomForest)
library(rpart)
library(rpart.plot)
library(rattle)
library(party)
library(partykit)
library(ggplot2)
library(tune)
library(TunePareto)

KNN Model

set.seed(333)
trControl <- trainControl(method = 'repeatedcv',
                          number = 10,
                          repeats = 3)
FIT <- train(MPG~GallonsPer100Miles+Cylinders + Displacement100ci + Horsepower100 + Weight1000lb + Seconds0to60, 
             data = sample_data,
             tuneGrid = expand.grid(k=1:70),
             method = 'knn',
             trControl = trControl,
             preProc = c('center', 'scale'))

Model Performance

FIT
k-Nearest Neighbors 

300 samples
  6 predictor

Pre-processing: centered (6), scaled (6) 
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 270, 270, 270, 269, 269, 270, ... 
Resampling results across tuning parameters:

  k   RMSE      Rsquared   MAE      
   1  1.574376  0.9407361  1.0289805
   2  1.488331  0.9464728  0.9475791
   3  1.449470  0.9494366  0.9275828
   4  1.468866  0.9486649  0.9622943
   5  1.506811  0.9459447  0.9980082
   6  1.519485  0.9455442  1.0090691
   7  1.537981  0.9447255  1.0168833
   8  1.574708  0.9426251  1.0475633
   9  1.590421  0.9413774  1.0827709
  10  1.626097  0.9388027  1.1135395
  11  1.634828  0.9383082  1.1257423
  12  1.657153  0.9370251  1.1443973
  13  1.683380  0.9349548  1.1649064
  14  1.702267  0.9332652  1.1827724
  15  1.718769  0.9317762  1.1979338
  16  1.725082  0.9316209  1.2019506
  17  1.737452  0.9304641  1.2172920
  18  1.761103  0.9288514  1.2360456
  19  1.796987  0.9262331  1.2664406
  20  1.820877  0.9243558  1.2892821
  21  1.846231  0.9220973  1.3098785
  22  1.867696  0.9201525  1.3276490
  23  1.875812  0.9195781  1.3347344
  24  1.896208  0.9178536  1.3476351
  25  1.915921  0.9159352  1.3635284
  26  1.931918  0.9148229  1.3795251
  27  1.952977  0.9128934  1.3975196
  28  1.975726  0.9109653  1.4162083
  29  1.998009  0.9087932  1.4358004
  30  2.022123  0.9063054  1.4560267
  31  2.047437  0.9037242  1.4765056
  32  2.061935  0.9023895  1.4930848
  33  2.083140  0.9002251  1.5148796
  34  2.102949  0.8981289  1.5332530
  35  2.125267  0.8957856  1.5504856
  36  2.141681  0.8941163  1.5657049
  37  2.158255  0.8923333  1.5829083
  38  2.179925  0.8900707  1.5975942
  39  2.203896  0.8874101  1.6122513
  40  2.224768  0.8851749  1.6292806
  41  2.244025  0.8832789  1.6469351
  42  2.267840  0.8807192  1.6640996
  43  2.286662  0.8786197  1.6797256
  44  2.303200  0.8767907  1.6938561
  45  2.317233  0.8752755  1.7058665
  46  2.332390  0.8737554  1.7203403
  47  2.346059  0.8723368  1.7327327
  48  2.357099  0.8710910  1.7415329
  49  2.372488  0.8691170  1.7523006
  50  2.385921  0.8677511  1.7633392
  51  2.402543  0.8657325  1.7773778
  52  2.417841  0.8640395  1.7892252
  53  2.433101  0.8622596  1.8011448
  54  2.442844  0.8610150  1.8113901
  55  2.457890  0.8592856  1.8232775
  56  2.470306  0.8580062  1.8342347
  57  2.481605  0.8566918  1.8441825
  58  2.494053  0.8553391  1.8537080
  59  2.507794  0.8538809  1.8629001
  60  2.521219  0.8523376  1.8719471
  61  2.533237  0.8509792  1.8815081
  62  2.543044  0.8499303  1.8895953
  63  2.551390  0.8491377  1.8987123
  64  2.559499  0.8483665  1.9084574
  65  2.570403  0.8473175  1.9182475
  66  2.586347  0.8454688  1.9309562
  67  2.601714  0.8438675  1.9431948
  68  2.613506  0.8423891  1.9526062
  69  2.624752  0.8410532  1.9625590
  70  2.634447  0.8401369  1.9679395

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 3.
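The selected tuning value can also be read straight off the caret object; a one-line sketch:

FIT$bestTune   # returns the chosen k (here, k = 3)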

Check the model type

FIT$modelType
[1] "Regression"

Obtain the coefficient names

FIT$coefnames
[1] "GallonsPer100Miles" "Cylinders"          "Displacement100ci" 
[4] "Horsepower100"      "Weight1000lb"       "Seconds0to60"      

The results above show that we ran k-Nearest Neighbors with 300 samples and 6 predictors. We used cross-validation with ten folds, repeated three times: the data is divided into ten folds, nine of which are used for model estimation while the remaining fold is used for assessment and validation. RMSE was used to select the optimal model using the smallest value; the final value used for the model was k = 3.

Plot the model

plot(FIT, xlab = "Nearest Neighbors", main = "A Plot of Root Mean Square Error (RMSE) for the K-Nearest Neighbors Model")


The graph above shows that the lowest value of RMSE occurs at k = 3. After that point, the RMSE increases steadily.

Variable Importance

varImp(FIT)
loess r-squared variable importance

                   Overall
GallonsPer100Miles  100.00
Weight1000lb         88.17
Displacement100ci    80.64
Horsepower100        75.39
Cylinders            64.06
Seconds0to60          0.00

Plot variable importance

plot(varImp(FIT), xlab = "Percentage Importance", 
     ylab ="Variables", main = "Variable Importance for the KNN Model")

The output above ranks the variables in our model from most to least important. GallonsPer100Miles is the most important variable in our k-NN model, while Seconds0to60 is the least important.

Prediction

pred <- predict(FIT, newdata = sample_data1)

NOTE

Because the response variable is numeric and continuous, we calculate the root mean squared error (RMSE) for assessment.

RMSE(pred, sample_data1$MPG)
[1] 4.769806

Make the Plot

plot(pred, sample_data1$MPG)

We can choose R-squared for Model Evaluation

FIT <- train(MPG~GallonsPer100Miles+Cylinders + Displacement100ci + Horsepower100 + Weight1000lb + Seconds0to60, 
             data = sample_data,
             tuneGrid = expand.grid(k=1:70),
             method = 'knn',
             metric = 'Rsquared',
             trControl = trControl,
             preProc = c('center', 'scale'))

Model Performance

FIT
k-Nearest Neighbors 

300 samples
  6 predictor

Pre-processing: centered (6), scaled (6) 
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 270, 271, 270, 270, 269, 269, ... 
Resampling results across tuning parameters:

  k   RMSE      Rsquared   MAE      
   1  1.591920  0.9394329  1.0319675
   2  1.455227  0.9498981  0.9323168
   3  1.439877  0.9502025  0.9160603
   4  1.485976  0.9469023  0.9541937
   5  1.540147  0.9434957  0.9992122
   6  1.560133  0.9426078  1.0037810
   7  1.590279  0.9412527  1.0392630
   8  1.598914  0.9402266  1.0558759
   9  1.629941  0.9381906  1.0901669
  10  1.655139  0.9365378  1.1120192
  11  1.673802  0.9351588  1.1319297
  12  1.696486  0.9335523  1.1490496
  13  1.714676  0.9319119  1.1670087
  14  1.735721  0.9302328  1.1791580
  15  1.762991  0.9282030  1.2005774
  16  1.773082  0.9272595  1.2116991
  17  1.788123  0.9263502  1.2279748
  18  1.801409  0.9254666  1.2455852
  19  1.824421  0.9236937  1.2683409
  20  1.854903  0.9209520  1.2912184
  21  1.875529  0.9190249  1.3118706
  22  1.889856  0.9181122  1.3262452
  23  1.912035  0.9162811  1.3392577
  24  1.934766  0.9145289  1.3586907
  25  1.953593  0.9128549  1.3742508
  26  1.982213  0.9099365  1.3957396
  27  1.996255  0.9087167  1.4116796
  28  2.014375  0.9071303  1.4268993
  29  2.035595  0.9051180  1.4468520
  30  2.056944  0.9033305  1.4639827
  31  2.077956  0.9013076  1.4839148
  32  2.099019  0.8993461  1.5025454
  33  2.119217  0.8972836  1.5200710
  34  2.135849  0.8956926  1.5345744
  35  2.157269  0.8935631  1.5535753
  36  2.181741  0.8910892  1.5741680
  37  2.208008  0.8884506  1.5948985
  38  2.226443  0.8866513  1.6094057
  39  2.245309  0.8846252  1.6222174
  40  2.262840  0.8828099  1.6384923
  41  2.286683  0.8801846  1.6553593
  42  2.308474  0.8777284  1.6735914
  43  2.324954  0.8759073  1.6880161
  44  2.337178  0.8747198  1.7002150
  45  2.349809  0.8733561  1.7111410
  46  2.365473  0.8716785  1.7226947
  47  2.378420  0.8702897  1.7349310
  48  2.391963  0.8689972  1.7447193
  49  2.407233  0.8674046  1.7577276
  50  2.424189  0.8654463  1.7683986
  51  2.437694  0.8640195  1.7780372
  52  2.454297  0.8621021  1.7918053
  53  2.470363  0.8603796  1.8043192
  54  2.481606  0.8590029  1.8156761
  55  2.491748  0.8578865  1.8253641
  56  2.498593  0.8573661  1.8317425
  57  2.511403  0.8559964  1.8419442
  58  2.526239  0.8541874  1.8539235
  59  2.540528  0.8525444  1.8648257
  60  2.552716  0.8513346  1.8743174
  61  2.566598  0.8497237  1.8849405
  62  2.583919  0.8477822  1.9001164
  63  2.592890  0.8469463  1.9077759
  64  2.604838  0.8456377  1.9184197
  65  2.617315  0.8444035  1.9314042
  66  2.630802  0.8426598  1.9442484
  67  2.639551  0.8418271  1.9516521
  68  2.653349  0.8401014  1.9616237
  69  2.663703  0.8389873  1.9713350
  70  2.674564  0.8377999  1.9810189

Rsquared was used to select the optimal model using the largest value.
The final value used for the model was k = 3.
plot(FIT)

Variables Importance

varImp(FIT)
loess r-squared variable importance

                   Overall
GallonsPer100Miles  100.00
Weight1000lb         88.17
Displacement100ci    80.64
Horsepower100        75.39
Cylinders            64.06
Seconds0to60          0.00

Plot variable Importance

plot(varImp(FIT), xlab = "Percentage Importance", 
     ylab ="Variables", main = "Variable Importance for the KNN Model")

Prediction

pred <- predict(FIT, newdata = sample_data1)
data_pred <- data.frame(sample_data1, pred)
head(data_pred,5)

Check the Structure of the Data Set

str(data_pred)
'data.frame':   92 obs. of  10 variables:
 $ GallonsPer100Miles: num  2.9 3.1 2.7 3.5 3.5 3.7 3 2.4 2.6 3.1 ...
 $ MPG               : num  34.5 31.8 37.3 28.4 28.8 26.8 33.5 41.5 38.1 32.1 ...
 $ Cylinders         : int  4 4 4 4 6 6 4 4 4 4 ...
 $ Displacement100ci : num  1.05 0.85 0.91 1.51 1.73 1.73 1.51 0.98 0.89 0.98 ...
 $ Horsepower100     : num  0.7 0.65 0.69 0.9 1.15 1.15 0.9 0.76 0.6 0.7 ...
 $ Weight1000lb      : num  2.15 2.02 2.13 2.67 2.6 ...
 $ Seconds0to60      : num  14.9 19.2 14.7 16 11.3 12.9 13.2 14.7 18.8 15.5 ...
 $ Name              : chr  "plymouth horizon tc3" "datsun 210" "fiat strada custom" "buick skylark limited" ...
 $ predicted_value   : num  29.2 29.3 29.9 26.9 27.4 ...
 $ pred              : num  32.2 31 33.8 25 22.3 ...

Plot the Actual Response Variable and Predicted Values

library(ggplot2)
library(ggthemes)
# Combine data into a data frame
data2 <- data.frame(Actual_MPG = data_pred$MPG, Predicted_MPG = data_pred$pred)

# Create line plot
ggplot(data2, aes(x = 1:nrow(data2))) +
  geom_line(aes(y = Actual_MPG, color = "Actual MPG")) +
  geom_line(aes(y = Predicted_MPG, color = "Predicted MPG")) +
  scale_color_manual(name = "Variable", values = c("Actual MPG" = "blue", "Predicted MPG" = "red")) +
  labs(x = "Time Axis", y = "MPG", title = "Actual vs Predicted MPG") +
  theme_economist()

NOTE

Because the response variable is numeric and continuous, we calculate the root mean squared error (RMSE) for assessment.

RMSE(pred, data2$Actual_MPG)
[1] 4.769806

Make the Plot

plot(pred, data2$Actual_MPG, xlab = "Prediction", ylab = "mpg", main = "The scatter plot of predicted and actual mpg")

Not much improvement.

Interpreting a k-Nearest Neighbors (k-NN) regression model is somewhat different from interpreting a traditional linear regression model. In k-NN regression, instead of fitting a mathematical equation to your data, the model makes predictions based on the values of the k nearest data points in the training dataset. Here’s how you can interpret a k-NN regression model:

Prediction Process:

For each data point you want to predict, the k-NN algorithm identifies the k nearest data points in the training dataset based on some distance metric (e.g., Euclidean distance). It then calculates the average (or weighted average) of the target values (the values you're trying to predict) for those k nearest data points, and this average is used as the prediction for the new data point.
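To make the prediction step concrete, here is a minimal sketch of a one-point k-NN prediction written from scratch (illustrative only, not caret's internals; the helper name knn_predict_one is ours):

# Predict the target for one new point as the mean of its k nearest neighbours
knn_predict_one <- function(train_x, train_y, new_x, k = 3) {
  d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))  # Euclidean distance to each training row
  mean(train_y[order(d)[1:k]])                    # average target of the k closest rows
}

# Example: predict the first test row with k = 3, using the training-set scaling
x_tr  <- scale(as.matrix(sample_data[, FIT$coefnames]))
x_new <- scale(as.matrix(sample_data1[, FIT$coefnames]),
               center = attr(x_tr, "scaled:center"),
               scale  = attr(x_tr, "scaled:scale"))
knn_predict_one(x_tr, sample_data$MPG, x_new[1, ], k = 3)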

Tuning Parameter (k):

The most important parameter in k-NN regression is "k," which represents the number of nearest neighbors to consider. A small k (e.g., 1 or 3) may result in a model that closely follows the training data but is sensitive to noise. A large k (e.g., 10 or more) may result in a smoother prediction surface but might not capture local variations. The choice of k should be based on cross-validation and the characteristics of your data.

Distance Metric:

The choice of distance metric (e.g., Euclidean, Manhattan, etc.) can impact the model’s performance. Different distance metrics can lead to different interpretations of “closeness” among data points.
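As a small illustration (a sketch using base R's dist()), the same pair of scaled observations yields different distances under different metrics:

x_scaled <- scale(as.matrix(sample_data[, FIT$coefnames]))
dist(x_scaled[1:2, ], method = "euclidean")  # straight-line distance
dist(x_scaled[1:2, ], method = "manhattan")  # sum of absolute coordinate differences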

Non-Linearity:

Unlike linear regression, k-NN regression can capture non-linear relationships in the data. Interpretation may not involve coefficients or slope values as in linear regression.

Local vs. Global Patterns:

k-NN regression captures local patterns in the data. Interpretation involves understanding the local behavior around a prediction point. The model does not provide a global equation that describes the entire dataset.

Visualizations:

Visualizations can be helpful for interpretation. Plotting the k nearest neighbors for specific data points can provide insights into why the model made a particular prediction.

Feature Importance:

k-NN does not provide feature importance scores like some other models (e.g., Random Forest or Gradient Boosting). However, you can still analyze feature importance indirectly by examining which features are most influential in determining the nearest neighbors.

Performance Metrics:

Use appropriate regression evaluation metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), or R-squared to assess model performance. These metrics can give you a sense of how well the k-NN regression model fits the data.

Rescaling Features:

Scaling and normalization of features can significantly affect the results, as k-NN is sensitive to the scale of input variables. Interpretation may involve considering the effects of feature scaling.

In summary, interpreting a k-NN regression model involves understanding its prediction process, the choice of hyperparameters (especially k), the impact of distance metrics, and the local nature of the model's predictions. Visualizations and performance metrics play a crucial role in assessing and explaining the model's behavior on your specific dataset.

In k-Nearest Neighbors (k-NN) regression, accuracy is not typically used as an evaluation metric because k-NN regression is a type of regression, not classification. Accuracy is more suitable for classification problems where you’re predicting discrete class labels.

For regression tasks, you typically use different evaluation metrics to assess the performance of your model. Common metrics for regression tasks include:

Mean Absolute Error (MAE):

It measures the average absolute difference between the predicted and actual values. It is less sensitive to outliers compared to Mean Squared Error.

Mean Squared Error (MSE):

It measures the average of the squared differences between predicted and actual values. It gives higher weight to larger errors.

Root Mean Squared Error (RMSE):

It is the square root of the MSE and provides an interpretable measure of the average error in the same units as the target variable.

R-squared (R2):

It measures the proportion of the variance in the dependent variable that is predictable from the independent variables. A higher R-squared indicates a better fit.
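A minimal sketch computing all four metrics for the k-NN test-set predictions from above (reusing pred and sample_data1$MPG):

actual <- sample_data1$MPG
mae  <- mean(abs(pred - actual))                                     # Mean Absolute Error
mse  <- mean((pred - actual)^2)                                      # Mean Squared Error
rmse <- sqrt(mse)                                                    # Root Mean Squared Error
r2   <- 1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)  # R-squared (1 - SSE/SST)
# Note: caret's R2() instead reports the squared correlation, so values can differ slightly
c(MAE = mae, MSE = mse, RMSE = rmse, R2 = r2)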

Support Vector Machine (Support Vector Regressor)

Support Vector Regressor (SVR) is a supervised machine learning algorithm used for regression tasks. It’s based on the concept of Support Vector Machines (SVMs) typically used for classification problems. Here’s how SVR works:

Objective:

The goal of SVR is to find a hyperplane (in high dimensional space for non-linear cases) that best fits the training data while minimizing the prediction error. This hyperplane minimizes the distance to the closest data points, called support vectors. Unlike linear regression, SVR allows for a certain margin of error around the hyperplane.
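To make these ideas concrete, here is a short sketch using e1071 (loaded earlier): epsilon sets the error margin around the fitted function within which points incur no loss, and cost controls the regularization trade-off. The predictor subset here is arbitrary and purely illustrative.

svr_fit <- svm(MPG ~ GallonsPer100Miles + Weight1000lb + Horsepower100,
               data = sample_data, type = "eps-regression",
               kernel = "radial", cost = 1, epsilon = 0.1)
length(svr_fit$index)  # number of training points retained as support vectors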

Key Points:

Non-linearity:

SVR can handle non-linear relationships between features and target variables using kernel functions. These functions project the data into a higher-dimensional space where a linear relationship might exist.

Robustness:

SVR is less sensitive to outliers compared to some other regression methods. This is because it focuses on the support vectors rather than being influenced by all data points equally.

Regularization:

SVR inherently performs regularization by controlling the margin of error around the hyperplane. This helps to prevent overfitting.

R code for Support Vector Regression with the kernlab package

1. Install and Load Packages:

# Install kernlab package if not already installed
if(!require(kernlab)) install.packages("kernlab")
library(kernlab)

2. Prepare Data:

  • Split your data into training and testing sets.

  • Ensure your target variable is numeric.

set.seed(16)
ind <- sample(2, nrow(data), replace = T, prob = c(0.8, 0.2))

train_data <- data[ind == 1, ]
test_data <- data[ind == 2, ]

3. Define SVR Model

library(caret)
model_svm <- train(
  MPG ~ GallonsPer100Miles + Cylinders + Displacement100ci + Horsepower100 + Weight1000lb + Seconds0to60,
  data = train_data,
  method = 'svmRadial',
  preProcess = c("center", "scale")
)  # no trControl supplied, so caret's default bootstrap resampling (25 reps) is used
model_svm
Support Vector Machines with Radial Basis Function Kernel 

316 samples
  6 predictor

Pre-processing: centered (6), scaled (6) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 316, 316, 316, 316, 316, 316, ... 
Resampling results across tuning parameters:

  C     RMSE      Rsquared   MAE      
  0.25  2.181329  0.9215166  1.0832816
  0.50  1.855165  0.9412598  0.9165128
  1.00  1.592084  0.9555932  0.8222594

Tuning parameter 'sigma' was held constant at a value of 0.4651685
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.4651685 and C = 1.

Next, find the R-squared and RMSE on the test data:

pred_tst_svm <- predict(model_svm, test_data)

fit_ind_tst_svm <- data.frame(
  R2 = R2(pred_tst_svm, test_data$MPG),
  RMSE = RMSE(pred_tst_svm, test_data$MPG)
)

Comparison table of train & test data

pred_trn_svm <- predict(model_svm, train_data)

data.frame(
  Model = c("SVM Train", "SVM Test"),
  R2 = c(R2(pred_trn_svm, train_data$MPG), fit_ind_tst_svm$R2),
  RMSE = c(RMSE(pred_trn_svm, train_data$MPG), fit_ind_tst_svm$RMSE)
)

The R-squared for the test data of the SVM model is 0.9309281, which means the independent variables are able to explain about 93.09% of the variance in the dependent variable on the test data. The difference between the train and test R-squared is less than 5 percentage points, so we can say there is no serious over-fitting or under-fitting in the model.

Cross Validation

Leave-One-Out Cross Validation

tcr_loocv <- trainControl(method = "LOOCV")  # number/repeats are ignored for LOOCV
model_loocv <- train(
  MPG~GallonsPer100Miles+Cylinders + Displacement100ci + Horsepower100 + Weight1000lb + Seconds0to60,
  data = train_data, 
  method="svmRadial", 
  trControl = tcr_loocv)

pred_tst_loocv <- predict(model_loocv, test_data)

fit_ind_tst_loocv <- data.frame(
  R2 = R2(pred_tst_loocv, test_data$MPG),
  RMSE = RMSE(pred_tst_loocv, test_data$MPG)
)

View the trained model and the test-data fit indices

print(model_loocv)
Support Vector Machines with Radial Basis Function Kernel 

316 samples
  6 predictor

No pre-processing
Resampling: Leave-One-Out Cross-Validation 
Summary of sample sizes: 315, 315, 315, 315, 315, 315, ... 
Resampling results across tuning parameters:

  C     RMSE      Rsquared   MAE      
  0.25  2.249977  0.9214763  1.0396851
  0.50  1.848800  0.9458065  0.8437484
  1.00  1.509427  0.9640092  0.7274306

Tuning parameter 'sigma' was held constant at a value of 0.4126495
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.4126495 and C = 1.
fit_ind_tst_loocv

K-fold Cross Validation

tcr_cv <- trainControl(method = "cv", number = 5)  # repeats is ignored for plain k-fold CV
model_cv <- train(
  MPG~GallonsPer100Miles+Cylinders + Displacement100ci + Horsepower100 + Weight1000lb + Seconds0to60,
  data = train_data, 
  method="svmRadial", 
  trControl = tcr_cv)

pred_tst_cv <- predict(model_cv, test_data)

fit_ind_tst_cv <- data.frame(
  R2 = R2(pred_tst_cv, test_data$MPG),
  RMSE = RMSE(pred_tst_cv, test_data$MPG)
)

View the Model

model_cv
Support Vector Machines with Radial Basis Function Kernel 

316 samples
  6 predictor

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 253, 253, 252, 254, 252 
Resampling results across tuning parameters:

  C     RMSE      Rsquared   MAE      
  0.25  2.416428  0.9117854  1.1761786
  0.50  1.961931  0.9399702  0.9280455
  1.00  1.617822  0.9597706  0.8042350

Tuning parameter 'sigma' was held constant at a value of 0.5265069
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.5265069 and C = 1.

View the Performance on the Testing Data

fit_ind_tst_cv

Repeated K-fold Cross Validation

tcr_rep_cv <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
model_rep_cv <- train(
  MPG~GallonsPer100Miles+Cylinders + Displacement100ci + Horsepower100 + Weight1000lb + Seconds0to60,
  data = train_data, 
  method="svmRadial", 
  trControl = tcr_rep_cv)

pred_tst_rep_cv <- predict(model_rep_cv, test_data)

fit_ind_tst_rep_cv <- data.frame(
  R2 = R2(pred_tst_rep_cv, test_data$MPG),
  RMSE = RMSE(pred_tst_rep_cv, test_data$MPG)
)

View the Model

model_rep_cv
Support Vector Machines with Radial Basis Function Kernel 

316 samples
  6 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 285, 285, 284, 284, 283, 284, ... 
Resampling results across tuning parameters:

  C     RMSE      Rsquared   MAE      
  0.25  2.129605  0.9265818  1.0776446
  0.50  1.731314  0.9495626  0.8629829
  1.00  1.449375  0.9656161  0.7528262

Tuning parameter 'sigma' was held constant at a value of 0.3911894
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.3911894 and C = 1.

View the Performance on the Testing Data

fit_ind_tst_rep_cv

Compare the three Models

fit_indices_table <- data.frame(
  Model = c("SVM", "LOOCV", "k-fold CV", "Repeated k-fold CV"),
  R2 = c(fit_ind_tst_svm$R2, fit_ind_tst_loocv$R2, fit_ind_tst_cv$R2, fit_ind_tst_rep_cv$R2),
  RMSE = c(fit_ind_tst_svm$RMSE, fit_ind_tst_loocv$RMSE, fit_ind_tst_cv$RMSE, fit_ind_tst_rep_cv$RMSE)
)

fit_indices_table

The best model is

best_model <- fit_indices_table[which.max(fit_indices_table$R2), ]
print(best_model)
               Model        R2     RMSE
4 Repeated k-fold CV 0.9650457 1.726194

Hence, the best model is Repeated k-fold CV and can be used for better prediction results.

4. Make Predictions:

pred_tst_rep_cv <- predict(model_rep_cv, test_data)
pred_tst_rep_cv
 [1] 17.69508 14.98023 14.31272 18.11931 24.91703 23.39855 11.15968 11.64694
 [9] 24.94750 16.26864 14.03611 14.09027 18.85817 30.54781 31.45355 26.90076
[17] 20.32720 12.54158 15.13381 28.01410 26.88187 14.80701 12.57222 18.25677
[25] 12.57359 11.01399 21.21745 29.49214 22.90923 14.32734 27.94140 15.25337
[33] 18.17099 15.92608 20.47240 16.28532 29.87569 23.70514 23.37281 33.10317
[41] 15.95142 18.07220 17.79461 26.26589 20.72651 35.32813 33.78280 15.92823
[49] 33.69068 30.77930 30.90514 34.81331 35.22081 24.92834 19.53753 20.16339
[57] 19.17670 23.36154 23.48755 35.22230 27.70522 29.48482 40.75431 33.68206
[65] 28.19354 27.36794 26.22746 36.70723 32.75922 32.48806 26.69613 17.95927
[73] 36.51212 36.91017 35.35105 27.50183
# Combine data into a data frame
data3 <- data.frame(Actual_MPG = test_data$MPG, Predicted_MPG = pred_tst_rep_cv)
head(data3,15)

Plot the Predicted and Actual MPG

# Create line plot
ggplot(data3, aes(x = 1:nrow(data3))) +
  geom_line(aes(y = Actual_MPG, color = "Actual MPG")) +
  geom_line(aes(y = Predicted_MPG, color = "Predicted MPG")) +
  scale_color_manual(name = "Variable", values = c("Actual MPG" = "blue", "Predicted MPG" = "red")) +
  labs(x = "Time Axis", y = "MPG", title = "Actual vs Predicted MPG") +
  theme_economist()

This predicts the target variable for your testing data (test_data).

5. Plot the Predicted and Actual MPG: see the line plot above.