library(GGally)
library(ggplot2)
library(caret)
library(mice)
library(tidyverse)
library(DataExplorer)
library(MASS)
library(naniar)
library(kableExtra)
library(skimr)
library(randomForest)
library(fitdistrplus)
library(corrplot)
library(recipes)

Introduction

Nutritional factors play a key role in both innate and adaptive immunity. Furthermore, individuals with comorbidities are disproportionately affected by severe COVID-19 disease and mortality: obesity, type 2 diabetes, atherosclerotic cardiovascular disease, and hypertension are all risk factors for severe COVID-19. The aetiology of these conditions is largely driven by poor nutrition and unfavorable lifestyle choices (e.g., physical inactivity or sedentary behaviour).

The goal of this analysis is to find out how a country’s diet correlates with its COVID-19 mortality rate. With food cultures differing across the world, it is interesting to see which food categories best predict a country’s rate of deaths.

The Dataset

Data for this project was taken from this very interesting Kaggle dataset. The dataset combines data on different types of food, world population obesity and undernourished rates, and global COVID-19 case counts from around the world, in order to learn more about how a healthy eating style could help combat the coronavirus.

There are 5 files in the dataset:

  • Fat_Supply_Quantity_Data.csv: percentage of fat intake from different food groups for 170 different countries.
  • Food_Supply_Quantity_kg_Data.csv: percentage of food intake (in kg) from different food groups for 170 different countries.
  • Food_Supply_kcal_Data.csv: percentage of energy intake (in kcal) from different food groups for 170 different countries.
  • Protein_Supply_Quantity_Data.csv: percentage of protein intake from different food groups for 170 different countries. Each of these four files also has columns for obesity, undernourishment, and COVID-19 cases, expressed as percentages of total population.
  • Supply_Food_Data_Descriptions.csv: this file is obtained from FAO.org and shows the specific types of food that belong to each category in the above datasets.

Each of the four datasets above contains fat quantity, energy intake (kcal), food supply quantity (kg), and protein intake for different categories of food, all expressed as a percentage of total intake. The obesity and undernourished rates (also percentages) are included for comparison, and the final columns give the most up-to-date confirmed/deaths/recovered/active case counts, expressed as a percentage of each country’s current population.

Data Collection

  • Data for different food group supply quantities, nutrition values, obesity, and undernourished percentages were obtained from the Food and Agriculture Organization of the United Nations (FAO) website, which also lists the specific types of food included in each category.
  • Data for the population count of each country comes from the Population Reference Bureau (PRB) website.
  • Data for COVID-19 confirmed, deaths, recovered, and active cases are obtained from the Johns Hopkins Center for Systems Science and Engineering (CSSE) website.
  • The USDA Center for Nutrition Policy and Promotion diet intake guidelines can be found at ChooseMyPlate.gov.

Methodology

Before applying machine learning models, we first worked to understand the data through Exploratory Data Analysis (EDA): the initial investigation of the dataset to understand its distributions, anomalies, correlations, and other characteristics.

In this project we focus especially on tree-based algorithms such as random forests, gradient boosting, and Cubist. Decision tree analysis is a general predictive modelling tool and one of the most widely used and practical methods for supervised learning; our goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. We then proceed with a random forest analysis, which is commonly reported as one of the most accurate learning algorithms and which reduces the variance seen in individual decision trees. Finally, to check predictive accuracy more comprehensively, we fit a Cubist model and compare the results of all three models to see which has the highest prediction performance.

Data Exploration and Processing

Data overview

fat_data <- read.csv("./data/Fat_Supply_Quantity_Data.csv")
supply_kcal_data <- read.csv("./data/Food_Supply_kcal_Data.csv")
supply_kg_data <- read.csv("./data/Food_Supply_Quantity_kg_Data.csv")
protein_data <- read.csv("./data/Protein_Supply_Quantity_Data.csv")

After importing the files we noticed that the datasets are similar, so we chose to focus on the easiest to interpret: the kcal dataset. Caloric intake from specific food groups makes it easy to see which food groups contribute most to a country’s diet and overall calorie count; diets can be high in calories, but traditional thinking dictates that diets rich in vegetables and fruits will be healthier than those dominated by meat and sugars. The Unit column contains only the % sign, indicating that every column except Population is expressed as a percentage; it is important to know the unit used.
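As a quick sanity check on the units (a minimal sketch; the column name follows the names() output below), we can confirm that the Unit column contains nothing but the % sign:

# Quick unit check: the Unit column should contain only "%".
unique(supply_kcal_data$Unit..all.except.Population.)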

The data are organized by country. There are 170 countries in these datasets and 32 variables, as follows:

names(supply_kcal_data)
##  [1] "Country"                      "Alcoholic.Beverages"         
##  [3] "Animal.Products"              "Animal.fats"                 
##  [5] "Aquatic.Products..Other"      "Cereals...Excluding.Beer"    
##  [7] "Eggs"                         "Fish..Seafood"               
##  [9] "Fruits...Excluding.Wine"      "Meat"                        
## [11] "Milk...Excluding.Butter"      "Miscellaneous"               
## [13] "Offals"                       "Oilcrops"                    
## [15] "Pulses"                       "Spices"                      
## [17] "Starchy.Roots"                "Stimulants"                  
## [19] "Sugar.Crops"                  "Sugar...Sweeteners"          
## [21] "Treenuts"                     "Vegetal.Products"            
## [23] "Vegetable.Oils"               "Vegetables"                  
## [25] "Obesity"                      "Undernourished"              
## [27] "Confirmed"                    "Deaths"                      
## [29] "Recovered"                    "Active"                      
## [31] "Population"                   "Unit..all.except.Population."

Missing Data

Missing data in substantive research is common and can be problematic for statistical modelling if not handled correctly. A well recognised statistical benchmark suggests that datasets with less than 5% missingness are not a problem, so we checked whether each variable in this dataset falls below that threshold. A quick screening shows that certain columns contain null values. The plot below, generated with the R plot_missing function, shows that almost 4% of the data are missing in the “Recovered”, “Deaths”, “Confirmed”, “Undernourished” and “Active” variables, and almost 2% in “Obesity”. We can therefore impute the missing values using predictive mean matching, a widely used statistical imputation method, to get better predictive accuracy.

plot_missing(supply_kcal_data,  missing_only = TRUE)

Imputation
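The imputation chunk itself is not echoed in the report; a minimal sketch of what likely produced the mice log below, assuming the Country and Unit columns are dropped first, predictive mean matching is used with two imputations and two iterations (matching the log), and the Confirmed and Active case columns are removed afterwards, would be:

# Drop the identifier and unit columns, impute the remaining numeric variables
# with predictive mean matching, then drop Confirmed and Active, which are not
# used in the later modelling of Deaths. The settings and seed are assumptions.
num_data <- subset(supply_kcal_data,
                   select = -c(Country, Unit..all.except.Population.))
imp <- mice(num_data, method = "pmm", m = 2, maxit = 2, seed = 150)
complete_data2 <- subset(mice::complete(imp, 1), select = -c(Confirmed, Active))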

## 
##  iter imp variable
##   1   1  Obesity  Undernourished  Confirmed  Deaths  Recovered  Active
##   1   2  Obesity  Undernourished  Confirmed  Deaths  Recovered  Active
##   2   1  Obesity  Undernourished  Confirmed  Deaths  Recovered  Active
##   2   2  Obesity  Undernourished  Confirmed  Deaths  Recovered  Active

Statistical Analyses

Numerical Variables Distribution

The statistical analysis began with an investigation of the distributions of the independent and dependent variables (here, recovery from COVID-19), testing different types of probability distribution.

First we will quickly display a broad overview of a data frame, using the skim function

skim(complete_data2)
Data summary
Name complete_data2
Number of rows 170
Number of columns 28
_______________________
Column type frequency:
numeric 28
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Alcoholic.Beverages 0 1 1.33 1.06 0.00 0.36 1.24 2.03 5.160000e+00 ▇▆▃▁▁
Animal.Products 0 1 9.29 4.75 1.62 5.08 9.03 13.17 2.229000e+01 ▇▇▆▆▁
Animal.fats 0 1 1.27 1.28 0.00 0.34 0.88 1.76 7.800000e+00 ▇▂▁▁▁
Aquatic.Products..Other 0 1 0.00 0.03 0.00 0.00 0.00 0.00 4.000000e-01 ▇▁▁▁▁
Cereals...Excluding.Beer 0 1 20.37 6.47 8.96 15.31 19.62 24.84 3.753000e+01 ▆▇▇▅▂
Eggs 0 1 0.43 0.30 0.02 0.14 0.40 0.63 1.450000e+00 ▇▅▅▁▁
Fish..Seafood 0 1 0.63 0.58 0.00 0.24 0.48 0.87 4.420000e+00 ▇▂▁▁▁
Fruits...Excluding.Wine 0 1 2.01 1.42 0.15 1.22 1.69 2.37 8.850000e+00 ▇▅▁▁▁
Meat 0 1 3.90 2.22 0.30 2.08 3.69 5.28 1.057000e+01 ▇▇▆▃▁
Milk...Excluding.Butter 0 1 2.92 2.02 0.12 1.11 2.72 4.32 9.940000e+00 ▇▆▅▁▁
Miscellaneous 0 1 0.16 0.22 0.00 0.02 0.09 0.19 1.180000e+00 ▇▂▁▁▁
Offals 0 1 0.14 0.11 0.00 0.08 0.12 0.18 8.000000e-01 ▇▃▁▁▁
Oilcrops 0 1 1.10 1.59 0.02 0.30 0.64 1.19 1.048000e+01 ▇▁▁▁▁
Pulses 0 1 1.11 1.20 0.00 0.30 0.71 1.55 7.560000e+00 ▇▂▁▁▁
Spices 0 1 0.18 0.24 0.00 0.04 0.09 0.23 1.220000e+00 ▇▂▁▁▁
Starchy.Roots 0 1 3.08 3.89 0.29 1.11 1.54 2.92 1.968000e+01 ▇▁▁▁▁
Stimulants 0 1 0.31 0.31 0.00 0.08 0.21 0.42 2.010000e+00 ▇▂▁▁▁
Sugar.Crops 0 1 0.02 0.07 0.00 0.00 0.00 0.00 5.900000e-01 ▇▁▁▁▁
Sugar...Sweeteners 0 1 4.82 2.14 0.68 3.42 4.68 6.35 9.550000e+00 ▅▇▇▇▂
Treenuts 0 1 0.26 0.29 0.00 0.05 0.17 0.39 1.420000e+00 ▇▂▁▁▁
Vegetal.Products 0 1 40.71 4.75 27.71 36.83 40.97 44.94 4.839000e+01 ▁▆▇▇▇
Vegetable.Oils 0 1 4.87 2.16 0.93 3.13 4.66 6.43 1.038000e+01 ▅▇▆▃▂
Vegetables 0 1 1.09 0.65 0.10 0.60 1.00 1.37 3.350000e+00 ▇▇▃▁▁
Obesity 0 1 18.90 9.70 2.10 8.62 21.30 25.70 4.560000e+01 ▆▃▇▂▁
Undernourished 0 1 11.17 12.38 2.00 2.00 6.65 14.22 5.960000e+01 ▇▁▁▁▁
Deaths 0 1 0.04 0.05 0.00 0.00 0.01 0.07 1.900000e-01 ▇▂▁▁▁
Recovered 0 1 1.44 1.90 0.00 0.10 0.48 2.46 9.040000e+00 ▇▂▁▁▁
Population 0 1 44523641.18 156418200.32 54000.00 2816250.00 10181500.00 32716250.00 1.402385e+09 ▇▁▁▁▁

From the histogram plots below we can see that the majority of the predictors are skewed, and that all of the values for Aquatic.Products..Other are essentially 0, making it a prime candidate for elimination from further analysis. This data set will benefit from a Box-Cox transformation.

plot_histogram(complete_data2, ggtheme = theme_linedraw())
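As an illustration of the Box-Cox step (a sketch only; the full transformation is applied later with caret’s preProcess), the lambda estimated for a single strictly positive, right-skewed predictor can be inspected directly. Starchy.Roots is used here purely as an illustrative example:

# Estimate a Box-Cox lambda for one right-skewed, strictly positive predictor.
caret::BoxCoxTrans(complete_data2$Starchy.Roots)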

Other graphical methods may be more helpful than the simple histogram. We used the fitdistrplus package in R to visualize the recovery variable together with some possible theoretical distributions in a skewness-kurtosis space:

plotdist(complete_data2$Recovered, histo = TRUE, demp = TRUE)

From the empirical density above, our distribution is right skewed and appears to be an exponential type of distribution. The Cullen and Frey graph below, produced with the descdist function, is a good way to rule out some distributions based on their skewness and kurtosis; the orange values around the blue (data) point are based on bootstrapping. From this Cullen and Frey graph and the empirical plots above, our choices for good fits would seem to be limited to the following distributions available in the fitdistrplus package:

  • Weibull
  • gamma
  • exponential
library(fitdistrplus)
descdist(complete_data2$Recovered, boot=1000) 

## summary statistics
## ------
## min:  0   max:  9.039871 
## median:  0.4823613 
## mean:  1.437799 
## estimated sd:  1.899684 
## estimated skewness:  1.764737 
## estimated kurtosis:  6.033477
# I added the > 0 because the gamma dist doesn't allow zero values 
fw <- fitdist(complete_data2$Recovered[complete_data2$Recovered > 0], distr = "weibull", lower = 0.0)
fg <- fitdist(complete_data2$Recovered[complete_data2$Recovered > 0], distr = "gamma", lower = 0.0)
fe <- fitdist(complete_data2$Recovered[complete_data2$Recovered > 0], distr = "exp", lower = 0.0)
par(mfrow = c(2, 2))
plot.legend <- c("Weibull", "gamma", "expo")
denscomp(list(fw, fg, fe), legendtext = plot.legend)
qqcomp(list(fw, fg, fe), legendtext = plot.legend)
cdfcomp(list(fw, fg, fe), legendtext = plot.legend)
ppcomp(list(fw, fg, fe), legendtext = plot.legend)

From the fitting plots above, it appears that Weibull and gamma are the best contenders, but it is a little unclear from the density plot which is the winner. Let’s take a closer look with a larger density plot for these two distributions.

denscomp(list(fw, fg), legendtext = c("Weibull", "gamma"))

It still seems that both distributions fit this data well. Let us check this against the Akaike and Bayesian Information Criteria (AIC and BIC), which rank the goodness of fit of the models passed to the gofstat function, as well as the goodness-of-fit statistics, which give distances between the fits and the empirical data.

gofstat(list(fw, fg))
## Goodness-of-fit statistics
##                              1-mle-weibull 2-mle-gamma
## Kolmogorov-Smirnov statistic     0.0789407  0.07033146
## Cramer-von Mises statistic       0.2074007  0.16509460
## Anderson-Darling statistic       1.2759014  0.91398920
## 
## Goodness-of-fit criteria
##                                1-mle-weibull 2-mle-gamma
## Akaike's Information Criterion      379.9996    374.4918
## Bayesian Information Criterion      386.2236    380.7158

Since the gamma distribution has the minimum AIC, BIC, and goodness-of-fit statistics, we choose the gamma distribution.
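The fitted parameters of the chosen gamma distribution can be inspected directly (a small follow-up sketch; fg is the gamma fit created above):

# Shape/rate estimates, standard errors, and AIC/BIC for the gamma fit.
summary(fg)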

# The top_* data frames below are assumed to be country subsets (e.g. the
# highest values of each variable) created earlier in the analysis.
# par(mfrow = ...) has no effect on ggplot2 output, so each bar chart is drawn
# on its own; geom_col() is the idiomatic way to plot pre-computed values.
ggplot(top_obese_countries, aes(x = Obesity, y = Country)) + geom_col()

ggplot(top_deaths_countries, aes(x = Deaths, y = Country)) + geom_col()

ggplot(top_veggies_countries, aes(x = Vegetables, y = Country)) + geom_col()

ggplot(top_aniproducts_countries, aes(x = Animal.Products, y = Country)) + geom_col()

ggplot(top_meat_countries, aes(x = Meat, y = Country)) + geom_col()

ggplot(top_sugar_countries, aes(x = Sugar...Sweeteners, y = Country)) + geom_col()

Data Manipulation

Split the data

#Creating a non-normalized and scaled data set (only BoxCox transformation, eliminate near zero variance and highly correlated variables)
Processed <- preProcess(complete_data2, method = c("BoxCox", "nzv","corr"))
Processed
## Created from 170 samples and 16 variables
## 
## Pre-processing:
##   - Box-Cox transformation (14)
##   - ignored (0)
##   - removed (2)
## 
## Lambda estimates for Box-Cox transformation:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -0.200   0.125   0.300   0.400   0.500   1.800
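To see exactly which two predictors the near-zero-variance and correlation filters removed, the preProcess object can be queried (a sketch; the $method element holding per-method column lists is an assumption about recent caret versions):

# Columns dropped by the "nzv" and "corr" filters.
Processed$method$remove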
treeData <- predict(Processed, complete_data2)
set.seed(150)
predictors <- subset(treeData, select = -Deaths)
Deaths <- subset(treeData, select="Deaths")
initsplit <- createDataPartition(Deaths$Deaths, p=0.8, list=FALSE)

#Create Training Data to tune the model
X.train <- predictors[initsplit,]
Y.train <- Deaths[initsplit,]

#Create testing data to evaluate the model
X.test <- predictors[-initsplit,]
Y.test <- Deaths[-initsplit,]
#Creating a normalized, scaled data set along with the other transformations
Processed2 <- preProcess(complete_data2, method = c("center", "scale","BoxCox","nzv","corr"))
Processed2
## Created from 170 samples and 28 variables
## 
## Pre-processing:
##   - Box-Cox transformation (14)
##   - centered (26)
##   - ignored (0)
##   - removed (2)
##   - scaled (26)
## 
## Lambda estimates for Box-Cox transformation:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -0.200   0.125   0.300   0.400   0.500   1.800
nrmData <- predict(Processed2, complete_data2)
set.seed(150)

predictors2 <- subset(nrmData, select = -Deaths)
Deaths2 <- subset(nrmData, select="Deaths")
initsplit2 <- createDataPartition(Deaths2$Deaths, p=0.8, list=FALSE)

#Create Training Data to tune the model
X.train.cs <- predictors2[initsplit2,]
Y.train.cs <- Deaths2[initsplit2,]

#Create testing data to evaluate the model
X.test.cs <- predictors2[-initsplit2,]
Y.test.cs <- Deaths2[-initsplit2,]

Modeling

PCA

Variable correlation

Before we start the PCA we selected the variables we are interested in and checked whether there are significant correlations among them. We want to list the strongest correlations between diet variables and two COVID-19 outcomes: deaths and recovered cases. We will not use the active and confirmed case variables.

M<- cor(complete_data2[, c('Alcoholic.Beverages','Animal.fats', 'Eggs', 'Fish..Seafood', 'Fish..Seafood', 'Fruits...Excluding.Wine','Meat', 'Milk...Excluding.Butter', 'Starchy.Roots','Sugar...Sweeteners','Vegetal.Products','Vegetable.Oils','Vegetables','Recovered', 'Deaths')])

#M <- cor.test(eng_imputed)

head(round(M,2))
##                         Alcoholic.Beverages Animal.fats  Eggs Fish..Seafood
## Alcoholic.Beverages                    1.00        0.56  0.41          0.06
## Animal.fats                            0.56        1.00  0.47          0.03
## Eggs                                   0.41        0.47  1.00          0.20
## Fish..Seafood                          0.06        0.03  0.20          1.00
## Fish..Seafood.1                        0.06        0.03  0.20          1.00
## Fruits...Excluding.Wine               -0.03       -0.14 -0.07          0.08
##                         Fish..Seafood.1 Fruits...Excluding.Wine Meat
## Alcoholic.Beverages                0.06                   -0.03 0.49
## Animal.fats                        0.03                   -0.14 0.46
## Eggs                               0.20                   -0.07 0.51
## Fish..Seafood                      1.00                    0.08 0.20
## Fish..Seafood.1                    1.00                    0.08 0.20
## Fruits...Excluding.Wine            0.08                    1.00 0.00
##                         Milk...Excluding.Butter Starchy.Roots
## Alcoholic.Beverages                        0.40         -0.11
## Animal.fats                                0.48         -0.27
## Eggs                                       0.46         -0.42
## Fish..Seafood                             -0.08         -0.01
## Fish..Seafood.1                           -0.08         -0.01
## Fruits...Excluding.Wine                   -0.02          0.20
##                         Sugar...Sweeteners Vegetal.Products Vegetable.Oils
## Alcoholic.Beverages                   0.21            -0.59           0.12
## Animal.fats                           0.29            -0.73           0.22
## Eggs                                  0.42            -0.65           0.26
## Fish..Seafood                         0.09            -0.20          -0.14
## Fish..Seafood.1                       0.09            -0.20          -0.14
## Fruits...Excluding.Wine              -0.01             0.04          -0.08
##                         Vegetables Recovered Deaths
## Alcoholic.Beverages           0.03      0.40   0.49
## Animal.fats                   0.10      0.45   0.54
## Eggs                          0.31      0.39   0.46
## Fish..Seafood                -0.03     -0.09  -0.15
## Fish..Seafood.1              -0.03     -0.09  -0.15
## Fruits...Excluding.Wine       0.03      0.00  -0.03

We also added a significance test to the correlogram by computing p-values. In the correlogram below, if the p-value is greater than 0.01 the correlation is considered insignificant and the corresponding cell is left blank or crossed out.

library(RColorBrewer)
cor.mtest <- function(mat, ...) 
{
  mat <- as.matrix(mat)
  n <- ncol(mat)
  p.mat<- matrix(NA, n, n)
  diag(p.mat) <- 0
  for (i in 1:(n - 1)) 
  {
    for (j in (i + 1):n)
    {
      tmp <- cor.test(mat[, i], mat[, j], ...)
      p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
    }
  }
  colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
  p.mat
}

p.mat <- cor.mtest(complete_data2[, c('Alcoholic.Beverages','Animal.fats', 'Eggs', 'Fish..Seafood', 'Fish..Seafood', 'Fruits...Excluding.Wine','Meat', 'Milk...Excluding.Butter', 'Starchy.Roots','Sugar...Sweeteners','Vegetal.Products','Vegetable.Oils','Vegetables','Recovered', 'Deaths')])

# Correlation 
M <- cor(complete_data2[, c('Alcoholic.Beverages','Animal.fats', 'Eggs', 'Fish..Seafood', 'Fish..Seafood', 'Fruits...Excluding.Wine','Meat', 'Milk...Excluding.Butter', 'Starchy.Roots','Sugar...Sweeteners','Vegetal.Products','Vegetable.Oils','Vegetables','Recovered', 'Deaths')])


# Specialized the insignificant value
# according to the significant level
corrplot(M, method="number", type = "upper", order = "hclust", 
         p.mat = p.mat, sig.level = 0.01, tl.col = "black")

From the plot we can see that, for both deaths and recovered cases, the animal fats and animal products variables show noticeably higher correlations than the other food groups, and that recovered cases have a weaker correlation with meat intake.

PCA Summary

Data.pca1 <- prcomp(complete_data2, center=TRUE, scale.=TRUE)
summary(Data.pca1)
## Importance of components:
##                           PC1     PC2     PC3     PC4    PC5     PC6     PC7
## Standard deviation     2.9034 1.47529 1.36569 1.32158 1.1833 1.15260 1.07568
## Proportion of Variance 0.3011 0.07773 0.06661 0.06238 0.0500 0.04745 0.04132
## Cumulative Proportion  0.3011 0.37880 0.44541 0.50779 0.5578 0.60524 0.64657
##                            PC8     PC9   PC10    PC11   PC12    PC13    PC14
## Standard deviation     1.03489 0.99988 0.9742 0.96614 0.8663 0.83569 0.77697
## Proportion of Variance 0.03825 0.03571 0.0339 0.03334 0.0268 0.02494 0.02156
## Cumulative Proportion  0.68481 0.72052 0.7544 0.78775 0.8145 0.83950 0.86106
##                           PC15    PC16    PC17    PC18    PC19   PC20    PC21
## Standard deviation     0.73837 0.69250 0.65579 0.63972 0.63625 0.6033 0.57680
## Proportion of Variance 0.01947 0.01713 0.01536 0.01462 0.01446 0.0130 0.01188
## Cumulative Proportion  0.88053 0.89765 0.91301 0.92763 0.94209 0.9551 0.96697
##                           PC22    PC23    PC24    PC25    PC26     PC27
## Standard deviation     0.53723 0.50452 0.45911 0.41348 0.00224 0.001914
## Proportion of Variance 0.01031 0.00909 0.00753 0.00611 0.00000 0.000000
## Cumulative Proportion  0.97728 0.98637 0.99389 1.00000 1.00000 1.000000
##                             PC28
## Standard deviation     1.047e-05
## Proportion of Variance 0.000e+00
## Cumulative Proportion  1.000e+00

The score matrix Z = XV, i.e. the centered and scaled data multiplied by the eigenvector (rotation) matrix, gives the same output as the x component returned by prcomp:

Data.pca1$x [,1:10] %>% head(1) 
##            PC1      PC2        PC3       PC4       PC5      PC6        PC7
## [1,] -3.641806 1.688469 -0.7698948 -1.319318 -0.766444 1.042338 -0.0207832
##             PC8        PC9       PC10
## [1,] -0.6304699 -0.2632205 -0.1823317
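This equivalence can be checked numerically (a quick sketch; it relies on prcomp having been run with center = TRUE and scale. = TRUE):

# Z = XV: the scaled data times the rotation matrix reproduces the PC scores.
Z <- scale(as.matrix(complete_data2)) %*% Data.pca1$rotation
all.equal(Z, Data.pca1$x, check.attributes = FALSE)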

V matrix - Eigenvectors

head(Data.pca1$rotation)
##                                   PC1         PC2          PC3         PC4
## Alcoholic.Beverages       0.219602129 -0.01537951  0.221369867 -0.13095379
## Animal.Products           0.324856113 -0.09143975 -0.002199066 -0.15088002
## Animal.fats               0.250201694  0.07969506  0.122690074 -0.04331723
## Aquatic.Products..Other   0.005301169 -0.05802452 -0.198875692 -0.01126273
## Cereals...Excluding.Beer -0.235137813  0.24948407 -0.304501959 -0.09070498
## Eggs                      0.253318421  0.05185512 -0.158211592  0.06506757
##                                  PC5          PC6         PC7         PC8
## Alcoholic.Beverages       0.09827804 -0.008915206 -0.37548141  0.12467997
## Animal.Products          -0.03190862  0.010944437  0.02986672  0.05132614
## Animal.fats              -0.03521919 -0.054158797 -0.27381638  0.12270023
## Aquatic.Products..Other  -0.15283765 -0.431502011 -0.24066080 -0.24111676
## Cereals...Excluding.Beer -0.13656675  0.261388705 -0.06919340 -0.11348237
## Eggs                     -0.04508541 -0.112989613 -0.02678657  0.17283507
##                                  PC9         PC10        PC11        PC12
## Alcoholic.Beverages      -0.04151871 -0.178046737  0.11022080  0.06398933
## Animal.Products           0.03549022  0.001844158  0.01864656  0.12190697
## Animal.fats               0.09560538  0.099282406  0.03199443  0.12624994
## Aquatic.Products..Other   0.10337514 -0.638612369  0.17191032 -0.12849680
## Cereals...Excluding.Beer -0.02219595 -0.018024066 -0.15785792 -0.09177325
## Eggs                     -0.03021296 -0.037163159 -0.01499499 -0.12714657
##                                 PC13        PC14         PC15        PC16
## Alcoholic.Beverages       0.13394950 -0.11309573 -0.157228559  0.03405505
## Animal.Products          -0.16970483  0.03110275  0.008531113  0.07278086
## Animal.fats              -0.34953717  0.03838275  0.071296628  0.03285303
## Aquatic.Products..Other  -0.04987398 -0.07406443  0.251251687 -0.02025771
## Cereals...Excluding.Beer  0.02826212  0.07928799 -0.009400133  0.15460258
## Eggs                      0.08101779  0.23351057 -0.389053136 -0.07440349
##                                   PC17        PC18        PC19        PC20
## Alcoholic.Beverages       0.0141599626  0.19732674  0.40724557 -0.29528334
## Animal.Products           0.0002523726 -0.00976709 -0.03673126  0.05894300
## Animal.fats              -0.1227210548 -0.06707090  0.25230645  0.26248376
## Aquatic.Products..Other  -0.1808619663 -0.01060931 -0.10659687 -0.01637220
## Cereals...Excluding.Beer -0.0375543049  0.22092457  0.05221613 -0.02555296
## Eggs                      0.4489640077  0.05404104 -0.21429854 -0.23883145
##                                  PC21         PC22         PC23        PC24
## Alcoholic.Beverages      -0.172095430  0.434709384  0.007017531  0.29075815
## Animal.Products          -0.056969779 -0.158811223 -0.035860604  0.03307295
## Animal.fats               0.635940610 -0.001760916  0.029226005 -0.13261271
## Aquatic.Products..Other  -0.009765745 -0.111069214  0.181585504 -0.02038854
## Cereals...Excluding.Beer  0.072589118 -0.022840406  0.037916747 -0.04269639
## Eggs                      0.224549721 -0.146634423  0.465207219  0.07987972
##                                 PC25         PC26         PC27        PC28
## Alcoholic.Beverages      -0.05727171  0.007231468  0.073973925 0.093120150
## Animal.Products          -0.03259480 -0.733171797 -0.244154537 0.416034129
## Animal.fats               0.01465639  0.252970243 -0.048118942 0.112422887
## Aquatic.Products..Other  -0.07008190  0.005937122 -0.001265903 0.002704231
## Cereals...Excluding.Beer -0.15486020  0.041901853  0.449271357 0.566932746
## Eggs                      0.12326396  0.059968210 -0.011676285 0.026650790

Principal components

Since the principal components are orthogonal, there is no correlation between them whatsoever; the correlation plot is perfectly white apart from the diagonal, where each component is perfectly correlated with itself.

res1 <- cor(Data.pca1$x, method="pearson")
corrplot::corrplot(res1, method= "color", order = "hclust", tl.pos = 'n')

PCA exploration

The plot below shows the cumulative proportion of variance explained as a function of the number of principal components.

plot(summary(Data.pca1)$importance[3,])
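A common follow-up is to ask how many components are needed to reach a given share of the variance (a sketch using the cumulative proportions from the summary above):

# Smallest number of principal components whose cumulative proportion of
# variance explained reaches 90%.
cumvar <- summary(Data.pca1)$importance[3, ]
which(cumvar >= 0.90)[1]

From the cumulative proportions printed above, this is the 17th component (0.898 at PC16 versus 0.913 at PC17).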

Plots

PC1-PC2

library(factoextra)
fviz_pca_var(Data.pca1,axes = c(1, 2))

PC3-PC4

fviz_pca_var(Data.pca1,axes = c(3, 4))

PC5-PC6

fviz_pca_var(Data.pca1,axes = c(5, 6))

PC7-PC8

fviz_pca_var(Data.pca1,axes = c(7, 8))

Linear regression with principal components

Below are scatterplots showing the relation between PCs and y. Some of the principal components have a high correlation and others a lower correlation with the observed dependent variable.

Scatterplots

par(mfrow=c(2,2))
pcs <- as.data.frame(Data.pca1$x)
plot(complete_data2$Deaths, pcs$PC1)
plot(complete_data2$Deaths, pcs$PC2)
plot(complete_data2$Deaths, pcs$PC3)
plot(complete_data2$Deaths, pcs$PC4)

PCR

In order to perform the regression, we combine the explanatory variables (the PCs) with the explained variable y. This is the point where dimensionality reduction would normally take place; here, however, we keep all principal components, since we want to compare the results with pcr from the pls package.

ols.data <- cbind(complete_data2$Deaths, pcs)

Using the lm function we perform the linear regression.

lmodel <- lm(complete_data2$Deaths ~ ., data = ols.data)
summary(lmodel)
## 
## Call:
## lm(formula = complete_data2$Deaths ~ ., data = ols.data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -7.997e-17 -2.888e-18 -1.680e-19  3.134e-18  1.941e-17 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              7.451e-03  1.062e-03   7.014 9.08e-11 ***
## `complete_data2$Deaths`  8.093e-01  2.719e-02  29.766  < 2e-16 ***
## PC1                      2.049e-03  2.921e-04   7.014 9.08e-11 ***
## PC2                      2.509e-03  3.577e-04   7.014 9.08e-11 ***
## PC3                      1.499e-03  2.137e-04   7.014 9.08e-11 ***
## PC4                      7.695e-04  1.097e-04   7.014 9.08e-11 ***
## PC5                      1.111e-03  1.583e-04   7.014 9.08e-11 ***
## PC6                     -2.360e-04  3.365e-05  -7.014 9.08e-11 ***
## PC7                     -1.766e-03  2.518e-04  -7.014 9.08e-11 ***
## PC8                      3.164e-04  4.512e-05   7.014 9.08e-11 ***
## PC9                      9.827e-04  1.401e-04   7.014 9.08e-11 ***
## PC10                     5.380e-04  7.670e-05   7.014 9.08e-11 ***
## PC11                    -1.173e-03  1.672e-04  -7.014 9.08e-11 ***
## PC12                    -2.134e-03  3.043e-04  -7.014 9.08e-11 ***
## PC13                     1.009e-03  1.438e-04   7.014 9.08e-11 ***
## PC14                     5.432e-04  7.745e-05   7.014 9.08e-11 ***
## PC15                     1.084e-03  1.546e-04   7.014 9.08e-11 ***
## PC16                     1.149e-03  1.638e-04   7.014 9.08e-11 ***
## PC17                     1.236e-03  1.762e-04   7.014 9.08e-11 ***
## PC18                     3.442e-04  4.907e-05   7.014 9.08e-11 ***
## PC19                     5.339e-04  7.612e-05   7.014 9.08e-11 ***
## PC20                     3.549e-03  5.061e-04   7.014 9.08e-11 ***
## PC21                    -3.953e-03  5.636e-04  -7.014 9.08e-11 ***
## PC22                     6.678e-04  9.521e-05   7.014 9.08e-11 ***
## PC23                     2.606e-03  3.716e-04   7.014 9.08e-11 ***
## PC24                    -4.235e-03  6.038e-04  -7.014 9.08e-11 ***
## PC25                     1.500e-04  2.139e-05   7.014 9.08e-11 ***
## PC26                     2.972e-06  4.238e-07   7.014 9.08e-11 ***
## PC27                    -8.264e-07  1.178e-07  -7.014 9.08e-11 ***
## PC28                    -7.751e-09  1.105e-09  -7.014 9.07e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.41e-18 on 140 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.524e+32 on 29 and 140 DF,  p-value: < 2.2e-16
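Note that ols.data above also contains the Deaths column itself, so the fit is trivially perfect (R-squared of 1) simply because the response appears among the predictors. A corrected sketch of principal component regression, regressing Deaths on the scores alone, would look like this (object names here are illustrative):

# Principal component regression: Deaths regressed on the PC scores only.
pcr.data <- cbind(Deaths = complete_data2$Deaths, pcs)
pcr.model <- lm(Deaths ~ ., data = pcr.data)
summary(pcr.model)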

K-means

complete_data2 %>% recipe(~.) %>% 
  step_dummy(all_nominal(), one_hot = TRUE) %>% 
  step_normalize(all_predictors()) %>%
  step_nzv(all_predictors()) %>% 
  step_corr(all_predictors()) %>% 
  prep()
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##  predictor         28
## 
## Training data contained 170 data points and no missing data.
## 
## Operations:
## 
## Dummy variables were *not* created since no columns were selected. [trained]
## Centering and scaling for Alcoholic.Beverages, ... [trained]
## Sparse, unbalanced variable filter removed Aquatic.Products..Other [trained]
## Correlation filter removed Animal.Products [trained]

The classification of observations into groups requires some method for computing the distance or (dis)similarity between each pair of observations. The choice of distance measure is a critical step in clustering: it defines how the similarity of two elements (x, y) is calculated, it influences the shape of the clusters, and it has a strong influence on the clustering results.

The default distance measure is the Euclidean distance. However, depending on the type of data and the research questions, other dissimilarity measures might be preferred, and you should be aware of the options.

Within R it is simple to compute and visualize the distance matrix using the functions get_dist and fviz_dist from the factoextra R package.

distance <- get_dist(complete_data2, method = "spearman")
#head(as.matrix(distance), 2)[, 1:6]
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))

df <- complete_data2

k2 <- kmeans(df, centers = 2, nstart = 25)
k3 <- kmeans(df, centers = 3, nstart = 25)
k4 <- kmeans(df, centers = 4, nstart = 25)
k5 <- kmeans(df, centers = 5, nstart = 25)
k6 <- kmeans(df, centers = 6, nstart = 25)
k7 <- kmeans(df, centers = 7, nstart = 25)
k8 <- kmeans(df, centers = 8, nstart = 25)
k9 <- kmeans(df, centers = 9, nstart = 25)

str(k2)
## List of 9
##  $ cluster     : int [1:170] 1 1 1 1 1 1 1 1 1 1 ...
##  $ centers     : num [1:2, 1:28] 1.332 0.767 9.276 10.873 1.267 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "1" "2"
##   .. ..$ : chr [1:28] "Alcoholic.Beverages" "Animal.Products" "Animal.fats" "Aquatic.Products..Other" ...
##  $ totss       : num 4.13e+18
##  $ withinss    : num [1:2] 4.10e+17 2.61e+12
##  $ tot.withinss: num 4.1e+17
##  $ betweenss   : num 3.73e+18
##  $ size        : int [1:2] 168 2
##  $ iter        : int 1
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"
p1 <- fviz_cluster(k2, geom = "point", data = df) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point",  data = df) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point",  data = df) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point",  data = df) + ggtitle("k = 5")
p5 <- fviz_cluster(k6, geom = "point",  data = df) + ggtitle("k = 6")
p6 <- fviz_cluster(k7, geom = "point",  data = df) + ggtitle("k = 7")
p7 <- fviz_cluster(k8, geom = "point",  data = df) + ggtitle("k = 8")
p8 <- fviz_cluster(k9, geom = "point",  data = df) + ggtitle("k = 9")

library(gridExtra)
grid.arrange(p1, p2, p3, p4,p5,p6,p7,p8, nrow = 4)

In order to determine the optimal number of clusters, we can use one of the following methods:

  • Elbow method
  • Silhouette method
  • Gap statistic

Here, however, we see overlap between the clusters that is hard to interpret, most likely because the variables were not scaled before clustering, so the large-valued Population column dominates the Euclidean distances; a rescaled sketch follows below.
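A minimal sketch of the rescaled clustering, assuming the same complete_data2 and factoextra’s fviz_nbclust for the elbow method (the choice of k = 3 below is purely illustrative):

# Scale all variables so Population no longer dominates the distances,
# then look for an elbow in the within-cluster sums of squares and re-run k-means.
df_scaled <- scale(complete_data2)
set.seed(150)
fviz_nbclust(df_scaled, kmeans, method = "wss")
k3_scaled <- kmeans(df_scaled, centers = 3, nstart = 25)
fviz_cluster(k3_scaled, geom = "point", data = df_scaled) + ggtitle("k = 3 (scaled data)")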

Trees

Decision trees do not need a normalized and scaled data set in order to build a model; scaling can obscure nonlinear relationships in the data. Tree-based models are robust enough not to need normalization and can capture nonlinear relationships in the fitted model.
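Before the ensemble models, a single regression tree gives a quick baseline on the same training split (a sketch assuming the rpart package, which is not loaded above; the report’s tuned models follow):

library(rpart)
# Fit one regression tree on the un-normalized training data and evaluate it
# on the held-out test set with the same metrics used for the other models.
treeFit <- rpart(Deaths ~ ., data = cbind(X.train, Deaths = Y.train))
treePred <- predict(treeFit, newdata = X.test)
postResample(pred = treePred, obs = Y.test)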

Random Forest

Building the random forest model

set.seed(150)
randoModel <- randomForest(X.train, Y.train, importance = TRUE, ntree = 1000)

Creating a prediction using the previously created model.

randomPred <- predict(randoModel, newdata = X.test)
randomResample <- postResample(pred=randomPred, obs = Y.test)

Below is a plot of the top ten most important variables for the random forest model. The predictor Eggs is important for both node purity (IncNodePurity) and the increase in mean squared error (%IncMSE). The other important variables are Oilcrops, Alcoholic.Beverages, Vegetal.Products, and Animal.Products. Both vegetal and animal products include raw, cooked, and processed items, and their inclusion is not surprising since they are common and somewhat necessary staples of any human diet. What is most surprising is Eggs, a common food item but also a known allergen.

varImpPlot(randoModel, n.var=10)

Cubist

The Cubist model is a rule-based algorithm that fits linear regression models at the terminal rules of the tree to determine the outcome.

set.seed(100)
cubeModel <- train(X.train, Y.train,
    method = "cubist",
    verbose = FALSE)

Plotting the most important features according to the Cubist model, we once again see Eggs as the most important predictor. After Obesity and Population, the model identifies Alcoholic.Beverages and Milk...Excluding.Butter as the other important food predictors.

plot(varImp(cubeModel), top = 10)

cubistPred <-predict(cubeModel, newdata=X.test)
cubistSample <- postResample(pred=cubistPred, obs = Y.test)

Gradient Boosted

A gradient boosted machine builds multiple decision trees sequentially, with each new tree helping to correct the errors made by the previous trees. In theory, the gradient boosted model should produce a more accurate model than the random forest.

gbmGrid <- expand.grid(
         interaction.depth = seq(1, 7, by = 2),
         n.trees = seq(100, 1000, by = 50),
         shrinkage = c(0.01, 0.1),
         n.minobsinnode = 10
         )

set.seed(100)

gbmModel <- train(X.train, Y.train,
    method = "gbm",
    tuneGrid = gbmGrid,
    verbose = FALSE)

Below are the most important predictors determined by the gradient boosted model. Much like the other tree-based models, we see the importance of Eggs, Alcoholic.Beverages, and Oilcrops.

summary(gbmModel)

##                                               var    rel.inf
## Recovered                               Recovered 35.1600955
## Alcoholic.Beverages           Alcoholic.Beverages  9.0825219
## Eggs                                         Eggs  6.9010083
## Milk...Excluding.Butter   Milk...Excluding.Butter  5.3945097
## Vegetal.Products                 Vegetal.Products  5.2604965
## Animal.fats                           Animal.fats  4.6215363
## Undernourished                     Undernourished  4.4023647
## Miscellaneous                       Miscellaneous  3.5102497
## Population                             Population  3.2351117
## Vegetable.Oils                     Vegetable.Oils  2.8498912
## Oilcrops                                 Oilcrops  2.6793660
## Pulses                                     Pulses  2.3313082
## Obesity                                   Obesity  2.1397723
## Treenuts                                 Treenuts  2.0770761
## Cereals...Excluding.Beer Cereals...Excluding.Beer  1.7274599
## Fish..Seafood                       Fish..Seafood  1.4909817
## Offals                                     Offals  1.2420637
## Stimulants                             Stimulants  1.1062169
## Meat                                         Meat  0.9992101
## Vegetables                             Vegetables  0.9882710
## Spices                                     Spices  0.9373554
## Fruits...Excluding.Wine   Fruits...Excluding.Wine  0.7179831
## Sugar...Sweeteners             Sugar...Sweeteners  0.6502176
## Starchy.Roots                       Starchy.Roots  0.4949325
## Sugar.Crops                           Sugar.Crops  0.0000000
gbmPred <-predict(gbmModel, newdata=X.test)
gbmResample <- postResample(pred=gbmPred, obs=Y.test)

Prediction Results:

display <- rbind(
  "Random Forest" = randomResample,
  "Gradient Boosted Tree" = gbmResample,
  "Cubist" = cubistSample)


display %>% kable() %>% kable_paper()
                           RMSE   Rsquared        MAE
Random Forest         0.0224314  0.7182063  0.0153533
Gradient Boosted Tree 0.0228734  0.7221370  0.0165111
Cubist                0.0278355  0.6732597  0.0170036

Using R-squared and RMSE as the primary selection criteria, the random forest produces the optimal model; the gradient boosted tree performs only marginally worse, with a slightly higher R-squared but worse RMSE and MAE, and the Cubist model trails both. When it comes to determining the most influential food predictor of deaths related to COVID-19, it appears to be Eggs, according to these models.

Non-Linear

Due to the nonlinear nature of the relationship between the target variable and the predictors, it is important to explore other nonlinear models such as the support vector machine. SVM is typically used for classification, but it can also be used for regression: the algorithm finds a hyperplane that separates the space between the target classes, and for regression the principle is similar, finding a well-fitting hyperplane.

Since our target variable Deaths is numerical, we will have to perform support vector regression.

Support Vector Regression

Use the svm function with a radial kernel to build a model. We’re going to use the normalized (centered and scaled) training and testing sets.

library(e1071)
svr_model <- svm(X.train.cs,
                   Y.train.cs,
                   kernel = "radial",
                   cost = 10)
print(svr_model)
## 
## Call:
## svm.default(x = X.train.cs, y = Y.train.cs, cost = 10, kernel = "radial")
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  10 
##       gamma:  0.04 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  111

Plot of Actual Points vs Line of Predicted Points

pred <- predict(svr_model, X.test.cs)  # prediction step, not echoed above
x <- 1:length(Y.test.cs)
plot(x, Y.test.cs, pch=18, col="red")
lines(x, pred, lwd=1, col="blue")

### Check Support Vector Regression Accuracy 
library(MLmetrics)
mse = MSE(Y.test.cs, pred)
mae = MAE(Y.test.cs, pred)
rmse = RMSE(Y.test.cs, pred)

r2 = R2(Y.test.cs, pred, form = "traditional")

cat(" MAE:", mae, "\n", "MSE", mse, "\n",
     "RMSE:", rmse, "\n", "R-Squared:", r2)
##  MAE: 0.3288479 
##  MSE 0.2374864 
##  RMSE: 0.4873258 
##  R-Squared: 0.673371

The basic support vector regression model produces an R-squared of about 0.67 and an RMSE of about 0.49 on the test set, and the plot shows a mediocre fit. We can tune the model with different values of epsilon and cost to create a better model.

Tune SVM:

The tune function identifies the best parameters by performing a grid search.

tuneResult <- tune(svm, Deaths~., data=nrmData, ranges = list(epsilon=seq(0,1,0.1), cost = 2^(seq(0.5,8,.5))))

Here is the plot of the tuned model. The darker shades of the plot show the optimal parameters.

plot(tuneResult)

print(tuneResult)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  epsilon     cost
##        0 1.414214
## 
## - best performance: 0.4380552
tunedVals <- tuneResult$best.model
predict2 <- predict(tunedVals, X.test.cs)

The updated plot of the tuned model. It shows a better fit than the non-tuned model.

x1 = 1:length(Y.test.cs)

plot(x1, Y.test.cs, pch=18, col="red")
lines(x1, predict2, lwd="1", col="blue")

Checking the accuracy of the tuned model:

mse = MSE(Y.test.cs, predict2)
mae = MAE(Y.test.cs, predict2)
rmse = RMSE(Y.test.cs, predict2)

r2 = R2(Y.test.cs, predict2, form = "traditional")

cat(" MAE:", mae, "\n", "MSE", mse, "\n",
     "RMSE:", rmse, "\n", "R-Squared:", r2)
##  MAE: 0.1001404 
##  MSE 0.06744831 
##  RMSE: 0.2597081 
##  R-Squared: 0.8997206

The tuned model’s R-squared is much higher, at about 0.90, with a better RMSE of about 0.26. We can manually check the tuned model’s most important features:

w <- t(tunedVals$coefs) %*% tunedVals$SV

w <- apply(w, 2, function(v){sqrt(sum(v^2))})

w <- sort(w, decreasing = T)

print(w)
##                Recovered      Alcoholic.Beverages              Animal.fats 
##               28.2607712               13.4241681                8.3170989 
##               Stimulants           Vegetable.Oils                  Obesity 
##                7.5151268                7.3795706                6.6640575 
##         Vegetal.Products            Fish..Seafood  Milk...Excluding.Butter 
##                6.6186768                6.6145386                6.4610278 
##                 Oilcrops                     Meat       Sugar...Sweeteners 
##                6.4095564                6.3819978                5.7479848 
## Cereals...Excluding.Beer            Starchy.Roots           Undernourished 
##                5.6192914                5.3327217                5.2363520 
##                   Offals                   Pulses                     Eggs 
##                4.7150095                4.3883377                4.2977024 
##              Sugar.Crops  Fruits...Excluding.Wine                 Treenuts 
##                4.2012758                3.3345405                3.1727651 
##               Vegetables            Miscellaneous                   Spices 
##                2.2829954                0.9078458                0.6570656 
##               Population 
##                0.4452825

The tuned SVR model places a high level of importance on Alcoholic.Beverages, Animal.fats, Obesity, Stimulants, and Oilcrops when predicting deaths related to COVID-19. This makes logical sense since these factors are known (when consumed in large amounts) to be associated with overall poor health.

Conclusion & Findings

In this analysis, we used several models to study the correlation between countries’ diet and COVID-19 mortality rate.

By exploring each of these methods (detailed in the related sections), we found that food culture is an important factor in predicting the COVID-19 mortality rate.

The models show that foods associated with overall poor health increase the COVID-19 mortality rate of a given country. That is, a population with a healthy diet (based on vegetal products, cereals, etc.) has a lower death rate in comparison with a population that has a higher obesity rate and consumes more animal products. In other words, a population with a healthy diet has a low rate of deaths related to COVID-19.

Sources:

  • https://www.jmlr.org/papers/volume3/guyon03a/guyon03a.pdf
  • https://alex.smola.org/papers/2004/SmoSch04.pdf
  • https://stackoverflow.com/questions/34781495/how-to-find-important-factors-in-support-vector-machine