Introduction

1. Project Objective

In this Project, we want to predict the Obesity Level of individuals. We use the Machine Learning algorithms to create a model to estimate obesity levels in individuals based on their eating habits and physical condition. In the project, we use Multinomial Classification Model in predicting multiple outputs simultaneously. The dataset was found on Kaggle and data contains 17 attributes and 2111 records. The first part of the project involves exploring the GGPLOT package and then second part of the project involves building a model for predictions.

Below are the outcomes of the Obesity Levels:
1. Insufficient Weight
2. Normal Weight
3. Overweight Level I
4. Overweight Level II
5. Obesity Type I
6. Obesity Type II
7. Obesity Type III

Load the packages used in the project

Data Preprocessing

## Rows: 2111 Columns: 17
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (9): Gender, family_history_with_overweight, FAVC, CAEC, SMOKE, SCC, CAL...
## dbl (8): Age, Height, Weight, FCVC, NCP, CH2O, FAF, TUE
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
## [1] 0
# Convert the measure of height from meters into centimeters
obesity$Height <- obesity$Height * 100

#round off some numeric columns to get discrete values
obesity$NCP <- round(obesity$NCP)
obesity$CH2O <- round(obesity$CH2O)
obesity$TUE <- round(obesity$TUE)
obesity$FAF <- round(obesity$FAF)

Data Exploration with GGPLOT PACKAGE

theme_set(theme_bw())
gg1 <- ggplot(obesity, aes(x=Height, y=Weight)) + 
  geom_point(aes(col=NObeyesdad)) + 
  geom_smooth(method="loess", se=F) +
  theme_set(theme_bw()) +
  scale_color_manual(values=c("#3a9efd","#ffc14d","#64379f","#ddacf5","#ff6150","#00ff00","#ffff4d"))+
  labs(subtitle="Body Mass Index : Weight vs Height", 
       y="Weight", 
       x="Height", 
       title="Scatterplot", 
       caption = "Source: Keith Mpala")

gg1
## `geom_smooth()` using formula 'y ~ x'

Height and weight measurements are used to calculate body mass index of individuals, or BMI, a measure that uses your height and weigh to work out if your weight is healthy or unhealthy.
For most adults, an ideal BMI is in the 18.5 to 24.9 range.

average_male_height <- obesity[obesity$Height & obesity$Gender=='Male',]$Height
average_male_weight <- obesity[obesity$Weight & obesity$Gender=='Male',]$Weight
average_female_height <- obesity[obesity$Height & obesity$Gender=='Female',]$Height
average_female_weight <- obesity[obesity$Weight & obesity$Gender=='Female',]$Weight

#average male and female heights
sum(average_male_height)/length(average_male_height)
## [1] 175.869
sum(average_female_height)/length(average_female_height)
## [1] 164.3298
#average male and female weights
sum(average_male_weight)/length(average_male_weight)
## [1] 90.76948
sum(average_female_weight)/length(average_female_weight)
## [1] 82.30236

Distribution of Obesity Level with BMI.

obesity_select <- obesity[obesity$Weight > 125 & 
                            obesity$Weight < 170 & 
                            obesity$Height > 165 & 
                            obesity$Height < 185, ]

gg2 <- ggplot(obesity, aes(x=Height, y=Weight)) + 
  geom_point(aes(col=Gender, size=Age)) + 
  theme_set(theme_bw()) +
  geom_encircle(aes(x=Height, y=Weight), 
                data=obesity_select, 
                color="red", 
                size=2, 
                expand=0.08) +
  scale_color_manual(values=c("#410f70","#ffff00"))+
  labs(subtitle="Body Mass Index : Gender Distribution", 
       y="Weight", 
       x="Height", 
       title="Scatterplot", 
       caption = "Source: Keith Mpala") 

gg2

Most men have a height greater than 175cm as compared to women who have an average height of 164cm.
We also realise that, women who have a weight greater than 120 fall in the category of Obesity level of Type 3.
Men who have Weight greater than 100 generally fall in the Obesity Level of Type 1 & 2.

DENSITY PLOT

gg3 <- ggplot(obesity, aes(Age))
gg3 + geom_density(aes(fill=factor(NObeyesdad)), alpha=0.8) + 
  theme_set(theme_classic()) +
  labs(title="Density plot", 
       subtitle="Age Grouped by Obesity Level",
       caption="Source: Keith Mpala",
       x="Age (years)",
       fill="Obesity Level")

The density curve compares the distribution of Obesity Level with Age variable. It uses the kernel density estimate to show the probability density function of the variable. The density curve gives us a good idea of the shape of Age distribution by Obesity Level, whether or not the distribution is left skewed or right skewed and lets us visually see where the mean and the median of Age distribution are located with each Obesity Level.

PIE Chart

gg4 <- ggplot(obesity, aes(x = "", fill = factor(MTRANS))) + 
  geom_bar(width = 1) +
  theme(axis.line = element_blank(), plot.title = element_text(hjust=0.5)) + 
  theme_set(theme_classic()) +
  scale_fill_manual(values=c("#002db3","#ffff00","#410f70","#00ff00","#ff0000"))+
  labs(fill="MTRANS", 
       x=NULL, 
       y=NULL, 
       title="Pie Chart",
       subtitle = "Mode of transportation Used",
       caption="Source: Keith Mpala")
plot(gg4 + coord_polar(theta = "y", start=0))

The pie chart illustrates the proportion of the mode of transportation used. The most used transportation method is public transport followed by Automobile. A small fraction of individuals use either a motorbike of a bike.

HISTOGRAM

gg5 <- ggplot(obesity, aes(CAEC))
gg5 + geom_bar(aes(fill=FAVC), width = 0.3) + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  labs(title="Histogram on Categorical Variable", 
       subtitle="Consumption of Food Between Meals",
       caption="Source: Keith Mpala") 

Histogram shows the count of the consumption between meals (CAEC). The diagram shows that most individuals “sometimes” take food between meals followed by individuals who frequently take food between meals. The diagram also shows frequency of high caloric food (FAVC) that most people take high caloric food.

gg6 <- ggplot(obesity, aes(NObeyesdad))
gg6 + geom_bar(aes(fill=family_history_with_overweight), width = 0.5) + 
  theme(axis.text.x = element_text(angle=65, vjust=0.5)) + 
  labs(title="Histogram on Categorical Variable", 
       subtitle="Obesity Level & Family History with Overweight",
       caption="Source: Keith Mpala") 

The histogram shows Obesity Levels with the distribution of family history with overweight. It can be noted that most individuals who fall under the obesity Level of Type 1, 2 or 3 are most likely to have a family background of someone who has endured overweight. All individuals with Obesity Level of Type 3 have/ had a family who suffered from overweight.

Bar Chart

We use the Bar Chart to check the distribution frequency of physical activity consumption of Vegetables, water and number of main meals based on the Obesity Levels.

ggplot(obesity, aes(x=NObeyesdad, y=FCVC)) + 
  geom_bar(stat="identity", width=.4, fill="#70c8aa") + 
  labs(title="Bar Chart", 
       subtitle="Obesity Levels vs Average Frequency consumption of Vegetables", 
       caption="source: Keith Mpala") + 
  theme(axis.text.x = element_text(angle=65, vjust=0.5))

Individuals who have an obesity Level of Type 3 have a much higher consumption of Vegetables followed by individuals who have obesity Level of Type 1.

ggplot(obesity, aes(x=NObeyesdad, y=CH2O)) + 
  geom_bar(stat="identity", width=.5, fill="#009E73") + 
  labs(title="Bar Chart", 
       subtitle="Obesity Levels vs Average Consumption of Water", 
       caption="source: Keith Mpala") + 
  theme(axis.text.x = element_text(angle=65, vjust=0.5))

Individuals with Obesity levels of Type 1 & 3 on average consume more water.

ggplot(obesity, aes(x=NObeyesdad, y=FAF)) + 
  geom_bar(stat="identity", width=.5, fill="#429579") + 
  labs(title="Bar Chart", 
       subtitle="Obesity Level vs Average Physical Activity Frequency", 
       caption="source: Keith Mpala") + 
  theme(axis.text.x = element_text(angle=65, vjust=0.5))

Individuals with Obesity Level of Type 3 on average have more meals a day.

ggplot(obesity, aes(x=NObeyesdad, y=NCP)) + 
  geom_bar(stat="identity", width=.5, fill="#3d6c5c") + 
  labs(title="Bar Chart", 
       subtitle="Obesity Levels vs Number Of Main Meals a Day", 
       caption="source: Keith Mpala") + 
  theme(axis.text.x = element_text(angle=65, vjust=0.5))

Individuals with Obesity Level of type 3 on average is the least group of individuals that conduct Physical Activity, followed by individuals with Overweight Level 2 and Obesity Level 2.

Data Preprocessing for Machine Learning model

# Check the structure of the dataset.
glimpse(obesity)
## Rows: 2,111
## Columns: 17
## $ Gender                         <chr> "Female", "Female", "Male", "Male", "Ma~
## $ Age                            <dbl> 21, 21, 23, 27, 22, 29, 23, 22, 24, 22,~
## $ Height                         <dbl> 162, 152, 180, 180, 178, 162, 150, 164,~
## $ Weight                         <dbl> 64.0, 56.0, 77.0, 87.0, 89.8, 53.0, 55.~
## $ family_history_with_overweight <chr> "yes", "yes", "yes", "no", "no", "no", ~
## $ FAVC                           <chr> "no", "no", "no", "no", "no", "yes", "y~
## $ FCVC                           <dbl> 2, 3, 2, 3, 2, 2, 3, 2, 3, 2, 3, 2, 3, ~
## $ NCP                            <dbl> 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, ~
## $ CAEC                           <chr> "Sometimes", "Sometimes", "Sometimes", ~
## $ SMOKE                          <chr> "no", "yes", "no", "no", "no", "no", "n~
## $ CH2O                           <dbl> 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 3, ~
## $ SCC                            <chr> "no", "yes", "no", "no", "no", "no", "n~
## $ FAF                            <dbl> 0, 3, 2, 2, 0, 0, 1, 3, 1, 1, 2, 2, 2, ~
## $ TUE                            <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 2, 1, 0, ~
## $ CALC                           <chr> "no", "Sometimes", "Frequently", "Frequ~
## $ MTRANS                         <chr> "Public_Transportation", "Public_Transp~
## $ NObeyesdad                     <chr> "Normal_Weight", "Normal_Weight", "Norm~

We combine some variables to improve the performance of our model. The variables combined are those that are least represented in a categorical variable.

# combine some variables to improve accuracy of the model
obesity$CAEC[obesity$CAEC == "Always"] <- "Frequently"
obesity$CALC[obesity$CALC == "Always"] <- "Frequently"
obesity$MTRANS[obesity$MTRANS == "Bike"] <- "Motorbike"
obesity$MTRANS[obesity$MTRANS == "Motorbike"] <- "Bike/Motorbike"

Categorical variables that have a logical/ natural order are converted into ordered factors using the factor() function in R.

# converted some categorical variables into ordered factors
obesity$CAEC <- factor(obesity$CAEC,
                       levels = c("Frequently",
                                  "Sometimes",
                                  "no"),
                       ordered = TRUE)

obesity$CALC <- factor(obesity$CALC,
                       levels = c("Frequently",
                                  "Sometimes",
                                  "no"),
                       ordered = TRUE)

The rest of categorical variables are converted into factors using the as.factor() function in R. We make use of a for loop to convert all categorical variables into factors.

# Convert the rest of categorical variables into factors
obes_char_var <- 
  sapply(obesity, is.character) %>% 
  which() %>% 
  names()

for (variable in obes_char_var) {
  obesity[[variable]] <- as.factor(obesity[[variable]])
}

Structure of the dataset after Preprocessing

# Structure of the dataset.
glimpse(obesity)
## Rows: 2,111
## Columns: 17
## $ Gender                         <fct> Female, Female, Male, Male, Male, Male,~
## $ Age                            <dbl> 21, 21, 23, 27, 22, 29, 23, 22, 24, 22,~
## $ Height                         <dbl> 162, 152, 180, 180, 178, 162, 150, 164,~
## $ Weight                         <dbl> 64.0, 56.0, 77.0, 87.0, 89.8, 53.0, 55.~
## $ family_history_with_overweight <fct> yes, yes, yes, no, no, no, yes, no, yes~
## $ FAVC                           <fct> no, no, no, no, no, yes, yes, no, yes, ~
## $ FCVC                           <dbl> 2, 3, 2, 3, 2, 2, 3, 2, 3, 2, 3, 2, 3, ~
## $ NCP                            <dbl> 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, ~
## $ CAEC                           <ord> Sometimes, Sometimes, Sometimes, Someti~
## $ SMOKE                          <fct> no, yes, no, no, no, no, no, no, no, no~
## $ CH2O                           <dbl> 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 3, ~
## $ SCC                            <fct> no, yes, no, no, no, no, no, no, no, no~
## $ FAF                            <dbl> 0, 3, 2, 2, 0, 0, 1, 3, 1, 1, 2, 2, 2, ~
## $ TUE                            <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 2, 1, 0, ~
## $ CALC                           <ord> no, Sometimes, Frequently, Frequently, ~
## $ MTRANS                         <fct> Public_Transportation, Public_Transport~
## $ NObeyesdad                     <fct> Normal_Weight, Normal_Weight, Normal_We~
# check the distribution of our dependent variable
ggplot(obesity, aes(x = NObeyesdad)) +
  geom_bar(stat="count", width=.5,fill = "#2c698d")+
  labs(title="Distribution of our dependent variable",
       caption="source: Keith Mpala") + 
  theme(axis.text.x = element_text(angle=65, vjust=0.5))

The graph above shows the distribution of our dependent variable (NObeyesdad) Obesity Level. The predicting variables are evenly distributed and the model will not result to any bias.

APPLYING THE MACHINE LEARNING ALGORITHM

Our data set is divided into 80% train data and 20% test data using the dependent variable.

which_train <- createDataPartition(obesity$NObeyesdad,
                                   p = 0.8, 
                                   list = FALSE) 
obesity_train <- obesity[which_train,]
obesity_test <- obesity[-which_train,]

We apply the 5 fold cross validation technique in our model to overcome overfitting problems

# Apply cross validation
set.seed(7000)
ctrl_cv5 <- trainControl(method = "cv",
                         number = 5,
                         classProbs = TRUE,
                         summaryFunction = multiClassSummary)

We train the model using the Multinom method to predict multiple variables simultaneously

Model summary

#model summary
model_predictions
## Penalized Multinomial Regression 
## 
## 1691 samples
##   16 predictor
##    7 classes: 'Insufficient_Weight', 'Normal_Weight', 'Obesity_Type_I', 'Obesity_Type_II', 'Obesity_Type_III', 'Overweight_Level_I', 'Overweight_Level_II' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1351, 1353, 1353, 1355, 1352 
## Resampling results across tuning parameters:
## 
##   decay  logLoss    AUC        prAUC      Accuracy   Kappa      Mean_F1  
##   0e+00  0.5059173  0.9913959  0.8118194  0.9461916  0.9371563  0.9452843
##   1e-04  0.4595262  0.9916075  0.8696153  0.9414525  0.9316209  0.9403178
##   1e-01  0.2619662  0.9915663  0.9299725  0.9130269  0.8984037  0.9099142
##   Mean_Sensitivity  Mean_Specificity  Mean_Pos_Pred_Value  Mean_Neg_Pred_Value
##   0.9457984         0.9910405         0.9463941            0.9910916          
##   0.9408992         0.9902543         0.9416811            0.9903156          
##   0.9106010         0.9855484         0.9122414            0.9856749          
##   Mean_Precision  Mean_Recall  Mean_Detection_Rate  Mean_Balanced_Accuracy
##   0.9463941       0.9457984    0.1351702            0.9684194             
##   0.9416811       0.9408992    0.1344932            0.9655768             
##   0.9122414       0.9106010    0.1304324            0.9480747             
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was decay = 0.

Predict test data

# predict using test data
obesity_forecast <- predict(model_predictions,obesity_test)

Results of Predictions

# results of prediction from test data
confusionMatrix(obesity_forecast,obesity_test$NObeyesdad)
## Confusion Matrix and Statistics
## 
##                      Reference
## Prediction            Insufficient_Weight Normal_Weight Obesity_Type_I
##   Insufficient_Weight                  53             1              0
##   Normal_Weight                         1            55              0
##   Obesity_Type_I                        0             0             67
##   Obesity_Type_II                       0             0              1
##   Obesity_Type_III                      0             0              0
##   Overweight_Level_I                    0             1              0
##   Overweight_Level_II                   0             0              2
##                      Reference
## Prediction            Obesity_Type_II Obesity_Type_III Overweight_Level_I
##   Insufficient_Weight               0                0                  0
##   Normal_Weight                     0                0                  3
##   Obesity_Type_I                    3                0                  0
##   Obesity_Type_II                  55                0                  0
##   Obesity_Type_III                  0               64                  0
##   Overweight_Level_I                0                0                 51
##   Overweight_Level_II               1                0                  4
##                      Reference
## Prediction            Overweight_Level_II
##   Insufficient_Weight                   0
##   Normal_Weight                         0
##   Obesity_Type_I                        1
##   Obesity_Type_II                       0
##   Obesity_Type_III                      1
##   Overweight_Level_I                    2
##   Overweight_Level_II                  54
## 
## Overall Statistics
##                                           
##                Accuracy : 0.95            
##                  95% CI : (0.9246, 0.9688)
##     No Information Rate : 0.1667          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9416          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Insufficient_Weight Class: Normal_Weight
## Sensitivity                              0.9815               0.9649
## Specificity                              0.9973               0.9890
## Pos Pred Value                           0.9815               0.9322
## Neg Pred Value                           0.9973               0.9945
## Prevalence                               0.1286               0.1357
## Detection Rate                           0.1262               0.1310
## Detection Prevalence                     0.1286               0.1405
## Balanced Accuracy                        0.9894               0.9769
##                      Class: Obesity_Type_I Class: Obesity_Type_II
## Sensitivity                         0.9571                 0.9322
## Specificity                         0.9886                 0.9972
## Pos Pred Value                      0.9437                 0.9821
## Neg Pred Value                      0.9914                 0.9890
## Prevalence                          0.1667                 0.1405
## Detection Rate                      0.1595                 0.1310
## Detection Prevalence                0.1690                 0.1333
## Balanced Accuracy                   0.9729                 0.9647
##                      Class: Obesity_Type_III Class: Overweight_Level_I
## Sensitivity                           1.0000                    0.8793
## Specificity                           0.9972                    0.9917
## Pos Pred Value                        0.9846                    0.9444
## Neg Pred Value                        1.0000                    0.9809
## Prevalence                            0.1524                    0.1381
## Detection Rate                        0.1524                    0.1214
## Detection Prevalence                  0.1548                    0.1286
## Balanced Accuracy                     0.9986                    0.9355
##                      Class: Overweight_Level_II
## Sensitivity                              0.9310
## Specificity                              0.9807
## Pos Pred Value                           0.8852
## Neg Pred Value                           0.9889
## Prevalence                               0.1381
## Detection Rate                           0.1286
## Detection Prevalence                     0.1452
## Balanced Accuracy                        0.9558

Conclusion