Name Matric Number
WANG RUI 23067445
JIN YUXIN 23061984
ZHENG KEXIN 22093955
HE JUNFENG 22120634

Introduction

According to the World Obesity Federation, obesity has been a major worldwide public health problem since 1987. Many underlying factors, such as eating habits, amount of exercise, and genetic background, can cause people to become overweight. Our goal is therefore to examine the relationship between obesity and these factors and to find the most accurate model for predicting obesity.

Objectives

Based on this background, we set the following two research objectives:

Regression objective: To understand the impact of these variables on an individual’s weight status by accurately predicting BMI

Classification objective: To build a model to categorize individuals into their respective obesity levels, helping to identify and target interventions for different weight categories

Data collection and Data Understanding

We obtained the datasets from Kaggle (https://www.kaggle.com/). The train and test sets were generated by a deep learning model trained on the Obesity or CVD risk dataset. These datasets contain data for estimating the obesity levels of individuals from Mexico, Peru and Colombia based on their eating habits and physical condition. The original data contains 17 attributes and 2111 records, and each record is labelled with the class variable NObesity (Obesity Level), which takes the values Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II and Obesity Type III.
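The package loading and file reading steps are not echoed in this report; a minimal setup sketch is given below, assuming the Kaggle files are saved in the working directory as train.csv, test.csv and obesity.csv (the file names and paths are assumptions).

# Packages assumed to be used throughout the analysis
library(dplyr)         # bind_rows, mutate_if
library(ggplot2)       # plotting
library(reshape2)      # melt() for the correlation heatmap
library(caret)         # createDataPartition, confusionMatrix, train
library(rpart)         # decision trees
library(rpart.plot)    # tree plotting
library(randomForest)  # random forest
library(xgboost)       # gradient boosting
library(Rtsne)         # t-SNE

# Read the datasets (file names are assumptions)
train   <- read.csv("train.csv", stringsAsFactors = FALSE)
test    <- read.csv("test.csv", stringsAsFactors = FALSE)
obesity <- read.csv("obesity.csv", stringsAsFactors = FALSE)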

Note:

NObesity values are: Insufficient_Weight, Normal_Weight, Overweight_Level_I, Overweight_Level_II, Obesity_Type_I, Obesity_Type_II and Obesity_Type_III.

Let’s get some basic information from the datasets:

train.csv

head(train)
##   id Gender      Age   Height    Weight family_history_with_overweight FAVC
## 1  0   Male 24.44301 1.699998  81.66995                            yes  yes
## 2  1 Female 18.00000 1.560000  57.00000                            yes  yes
## 3  2 Female 18.00000 1.711460  50.16575                            yes  yes
## 4  3 Female 20.95274 1.710730 131.27485                            yes  yes
## 5  4   Male 31.64108 1.914186  93.79806                            yes  yes
## 6  5   Male 18.12825 1.748524  51.55259                            yes  yes
##       FCVC      NCP       CAEC SMOKE     CH2O SCC      FAF      TUE      CALC
## 1 2.000000 2.983297  Sometimes    no 2.763573  no 0.000000 0.976473 Sometimes
## 2 2.000000 3.000000 Frequently    no 2.000000  no 1.000000 1.000000        no
## 3 1.880534 1.411685  Sometimes    no 1.910378  no 0.866045 1.673584        no
## 4 3.000000 3.000000  Sometimes    no 1.674061  no 1.467863 0.780199 Sometimes
## 5 2.679664 1.971472  Sometimes    no 1.979848  no 1.967973 0.931721 Sometimes
## 6 2.919751 3.000000  Sometimes    no 2.137550  no 1.930033 1.000000 Sometimes
##                  MTRANS          NObeyesdad
## 1 Public_Transportation Overweight_Level_II
## 2            Automobile       Normal_Weight
## 3 Public_Transportation Insufficient_Weight
## 4 Public_Transportation    Obesity_Type_III
## 5 Public_Transportation Overweight_Level_II
## 6 Public_Transportation Insufficient_Weight
dim(train)
## [1] 20758    18
str(train)
## 'data.frame':    20758 obs. of  18 variables:
##  $ id                            : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ Gender                        : chr  "Male" "Female" "Female" "Female" ...
##  $ Age                           : num  24.4 18 18 21 31.6 ...
##  $ Height                        : num  1.7 1.56 1.71 1.71 1.91 ...
##  $ Weight                        : num  81.7 57 50.2 131.3 93.8 ...
##  $ family_history_with_overweight: chr  "yes" "yes" "yes" "yes" ...
##  $ FAVC                          : chr  "yes" "yes" "yes" "yes" ...
##  $ FCVC                          : num  2 2 1.88 3 2.68 ...
##  $ NCP                           : num  2.98 3 1.41 3 1.97 ...
##  $ CAEC                          : chr  "Sometimes" "Frequently" "Sometimes" "Sometimes" ...
##  $ SMOKE                         : chr  "no" "no" "no" "no" ...
##  $ CH2O                          : num  2.76 2 1.91 1.67 1.98 ...
##  $ SCC                           : chr  "no" "no" "no" "no" ...
##  $ FAF                           : num  0 1 0.866 1.468 1.968 ...
##  $ TUE                           : num  0.976 1 1.674 0.78 0.932 ...
##  $ CALC                          : chr  "Sometimes" "no" "no" "Sometimes" ...
##  $ MTRANS                        : chr  "Public_Transportation" "Automobile" "Public_Transportation" "Public_Transportation" ...
##  $ NObeyesdad                    : chr  "Overweight_Level_II" "Normal_Weight" "Insufficient_Weight" "Obesity_Type_III" ...

From the output above, we can draw the following information about train.csv:

This dataset consists of estimated obesity levels for people aged between 14 and 61 with diverse eating habits and physical conditions; after processing, it contains 20758 rows and 18 columns. The attributes related to eating habits are: frequent consumption of high-caloric food (FAVC), frequency of consumption of vegetables (FCVC), number of main meals (NCP), consumption of food between meals (CAEC), daily water consumption (CH2O), and consumption of alcohol (CALC). The attributes related to physical condition are: calorie consumption monitoring (SCC), physical activity frequency (FAF), time using technology devices (TUE), and transportation used (MTRANS). The output also shows the type of each column.
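The quoted age range (14 to 61) can be checked directly against the raw data, for example:

# Verify the age range mentioned above
range(train$Age)
summary(train$Age)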

obesity.csv

head(obesity)
##   Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
## 1 Female  21   1.62   64.0                            yes   no    2   3
## 2 Female  21   1.52   56.0                            yes   no    3   3
## 3   Male  23   1.80   77.0                            yes   no    2   3
## 4   Male  27   1.80   87.0                             no   no    3   3
## 5   Male  22   1.78   89.8                             no   no    2   1
## 6   Male  29   1.62   53.0                             no  yes    2   3
##        CAEC SMOKE CH2O SCC FAF TUE       CALC                MTRANS
## 1 Sometimes    no    2  no   0   1         no Public_Transportation
## 2 Sometimes   yes    3 yes   3   0  Sometimes Public_Transportation
## 3 Sometimes    no    2  no   2   1 Frequently Public_Transportation
## 4 Sometimes    no    2  no   2   0 Frequently               Walking
## 5 Sometimes    no    2  no   0   0  Sometimes Public_Transportation
## 6 Sometimes    no    2  no   0   0  Sometimes            Automobile
##            NObeyesdad
## 1       Normal_Weight
## 2       Normal_Weight
## 3       Normal_Weight
## 4  Overweight_Level_I
## 5 Overweight_Level_II
## 6       Normal_Weight
dim(obesity)
## [1] 2111   17
str(obesity)
## 'data.frame':    2111 obs. of  17 variables:
##  $ Gender                        : chr  "Female" "Female" "Male" "Male" ...
##  $ Age                           : num  21 21 23 27 22 29 23 22 24 22 ...
##  $ Height                        : num  1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
##  $ Weight                        : num  64 56 77 87 89.8 53 55 53 64 68 ...
##  $ family_history_with_overweight: chr  "yes" "yes" "yes" "no" ...
##  $ FAVC                          : chr  "no" "no" "no" "no" ...
##  $ FCVC                          : num  2 3 2 3 2 2 3 2 3 2 ...
##  $ NCP                           : num  3 3 3 3 1 3 3 3 3 3 ...
##  $ CAEC                          : chr  "Sometimes" "Sometimes" "Sometimes" "Sometimes" ...
##  $ SMOKE                         : chr  "no" "yes" "no" "no" ...
##  $ CH2O                          : num  2 3 2 2 2 2 2 2 2 2 ...
##  $ SCC                           : chr  "no" "yes" "no" "no" ...
##  $ FAF                           : num  0 3 2 2 0 0 1 3 1 1 ...
##  $ TUE                           : num  1 0 1 0 0 0 0 0 1 1 ...
##  $ CALC                          : chr  "no" "Sometimes" "Frequently" "Frequently" ...
##  $ MTRANS                        : chr  "Public_Transportation" "Public_Transportation" "Public_Transportation" "Walking" ...
##  $ NObeyesdad                    : chr  "Normal_Weight" "Normal_Weight" "Normal_Weight" "Overweight_Level_I" ...

From the output above, we can draw the following information about obesity.csv:

This dataset contains data for estimating the obesity levels of individuals from Mexico, Peru and Colombia based on their eating habits and physical condition. It has 17 attributes and 2111 records, each labelled with the class variable NObesity (Obesity Level). The output also shows the type of each column.

Data preparation

Data preparation is the process of cleaning, transforming, and reorganizing data so that it can be easily analyzed; it is an important step to ensure that the data is accurate, complete, and in the right format for analysis in R.

Here are some steps involved:

Data Cleaning: Data cleaning involves finding and fixing or removing any incorrect, incomplete, inaccurate, irrelevant, or missing parts of the data as needed.

Data Transformation: Data transformation is the process of converting raw data into a suitable format or structure for analysis which involves changing the original form of the data to make it more understandable or useful for specific analytical tasks.

Data Integration: Data integration is the process of combining data from multiple sources into a unified view to provide a complete and accurate picture of data.

Our first step is to remove rows with missing values. However, removing such rows may reduce the amount of data, so this step should be assessed to ensure that valuable information is not lost.

train <- na.omit(train)
obesity <- na.omit(obesity)

dim(train)
## [1] 20758    18
dim(obesity)
## [1] 2111   17
sum(duplicated(train))
## [1] 0
sum(duplicated(obesity))
## [1] 24

As we can see, these datasets have no missing values, but the obesity dataset contains some duplicated rows.
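The repeated records in obesity.csv can be listed directly, for example:

# Show the 24 repeated rows in the obesity dataset
obesity[duplicated(obesity), ]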

# Combine the datasets: drop the id column from train, then stack train and obesity row-wise
train <- train[,!names(train) %in% 'id']
train <- data.frame(train)
combined <- bind_rows(train,obesity)

dim(combined)
## [1] 22869    17
head(combined)
##   Gender      Age   Height    Weight family_history_with_overweight FAVC
## 1   Male 24.44301 1.699998  81.66995                            yes  yes
## 2 Female 18.00000 1.560000  57.00000                            yes  yes
## 3 Female 18.00000 1.711460  50.16575                            yes  yes
## 4 Female 20.95274 1.710730 131.27485                            yes  yes
## 5   Male 31.64108 1.914186  93.79806                            yes  yes
## 6   Male 18.12825 1.748524  51.55259                            yes  yes
##       FCVC      NCP       CAEC SMOKE     CH2O SCC      FAF      TUE      CALC
## 1 2.000000 2.983297  Sometimes    no 2.763573  no 0.000000 0.976473 Sometimes
## 2 2.000000 3.000000 Frequently    no 2.000000  no 1.000000 1.000000        no
## 3 1.880534 1.411685  Sometimes    no 1.910378  no 0.866045 1.673584        no
## 4 3.000000 3.000000  Sometimes    no 1.674061  no 1.467863 0.780199 Sometimes
## 5 2.679664 1.971472  Sometimes    no 1.979848  no 1.967973 0.931721 Sometimes
## 6 2.919751 3.000000  Sometimes    no 2.137550  no 1.930033 1.000000 Sometimes
##                  MTRANS          NObeyesdad
## 1 Public_Transportation Overweight_Level_II
## 2            Automobile       Normal_Weight
## 3 Public_Transportation Insufficient_Weight
## 4 Public_Transportation    Obesity_Type_III
## 5 Public_Transportation Overweight_Level_II
## 6 Public_Transportation Insufficient_Weight
# Remove repeated rows after combining
any(duplicated(combined))
## [1] TRUE
sum(duplicated(combined))
## [1] 24
train_cleaned <- unique(combined)
dim(train_cleaned)
## [1] 22845    17

Then we identify and categorize the numerical and character feature columns in the data frames, ensuring that specific columns such as id are not mistakenly included among the numerical features.

numerical_feats_train <- names(train_cleaned)[sapply(train_cleaned, is.numeric)]
numerical_feats_test <- names(test)[sapply(test, is.numeric)]
categorical_feats_train <- names(train_cleaned)[sapply(train_cleaned, is.character)]
categorical_feats_test <- names(test)[sapply(test, is.character)]
numerical_feats_test <- setdiff(numerical_feats_test, "id")

y <- train_cleaned$NObeyesdad
train <- train_cleaned[, !names(train_cleaned) %in% "NObeyesdad"]
y_le <- as.numeric(as.factor(y)) - 1
set.seed(42)
trainIndex <- createDataPartition(y_le, p = 0.8, list = FALSE, times = 1)
X_train <- train[trainIndex, ]
X_val <- train[-trainIndex, ]
y_train <- y_le[trainIndex]
y_val <- y_le[-trainIndex]
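A BMI column appears later in the head() output of train_cleaned, but its construction is not echoed above; assuming the standard definition, it can be derived from the raw Weight (kg) and Height (m) columns before the numerical features are standardized:

# BMI = weight (kg) / height (m)^2, computed on the raw (unstandardized) columns
train_cleaned$BMI <- train_cleaned$Weight / train_cleaned$Height^2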

Exploratory Data Analysis

Exploratory Data Analysis (EDA) allows data scientists to examine data sets and summarize their main characteristics, often using data visualization techniques. It aids in manipulating data sources to extract the needed answers, enabling the discovery of patterns, identification of anomalies, hypothesis testing, and assumption checking. EDA enhances the understanding of data set variables and their relationships, providing insights based on size, patterns, and other attributes.

Data Exploration

We should analyze the conditions of our participants, which can help us understand the basic structure and background information of the data.

Let’s first analyze the frequency of NObesity among the participants.

We can see that Obesity Type III is the most common category among participants, followed by Obesity Type II.
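The frequency plot itself is not echoed in this report; a hedged ggplot2 sketch of how such a chart can be produced (assuming NObeyesdad is still stored as character labels at this point):

# Bar chart of obesity level frequencies
ggplot(train_cleaned, aes(x = NObeyesdad, fill = NObeyesdad)) +
  geom_bar() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none") +
  labs(title = "Frequency of NObesity Levels", x = "NObeyesdad", y = "Count")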

Moreover, let’s analyze other characteristics of the participants, including alcohol consumption, smoking, family history of overweight, gender, and even the transportation they choose.

From those plots, we can obtain the following information (a short sketch for reproducing these percentages follows the list):

  • Most participants (71.99%) consume alcohol only occasionally ("Sometimes").

  • 25.38% of participants do not drink alcohol at all, and 2.62% drink frequently.

  • Most participants (98.73%) do not smoke; only 1.27% have a smoking habit.

  • 82.01% of participants have a family history of overweight, and 17.99% do not.

  • 50.15% of participants are female and 49.85% of participants are male.

  • Most participants use public transportation, while very few use bikes or motorbikes.
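As noted above, the percentages in this list can be reproduced with prop.table(); a short sketch computed on train_cleaned:

# Percentage breakdown of the categorical participant characteristics
round(prop.table(table(train_cleaned$CALC)) * 100, 2)                            # alcohol
round(prop.table(table(train_cleaned$SMOKE)) * 100, 2)                           # smoking
round(prop.table(table(train_cleaned$family_history_with_overweight)) * 100, 2)  # family history
round(prop.table(table(train_cleaned$Gender)) * 100, 2)                          # gender
round(prop.table(table(train_cleaned$MTRANS)) * 100, 2)                          # transportation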

Let’s compare the influence of different factors on NObesity:

We can see people who consume high-calorie foods and those with a family history of obesity are more likely to develop Obesity Type III and Obesity Type II. In contrast, individuals with normal weight and insufficient weight tend not to have a family history of obesity and do not consume high-calorie foods.
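The comparison plots themselves are not echoed; a hedged sketch of one such breakdown (obesity level split by family history of overweight) could look like this:

# Proportion of each obesity level within the family-history groups
ggplot(train_cleaned, aes(x = NObeyesdad, fill = family_history_with_overweight)) +
  geom_bar(position = "fill") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "NObeyesdad", y = "Proportion", fill = "Family history")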

Correlation Analysis

for (i in categorical_feats_train) {
  print(table(train_cleaned[[i]]))
  print("-----------")
}
## 
## Female   Male 
##  11457  11388 
## [1] "-----------"
## 
##    no   yes 
##  4109 18736 
## [1] "-----------"
## 
##    no   yes 
##  2019 20826 
## [1] "-----------"
## 
##     Always Frequently         no  Sometimes 
##        531       2708        316      19290 
## [1] "-----------"
## 
##    no   yes 
## 22556   289 
## [1] "-----------"
## 
##    no   yes 
## 22062   783 
## [1] "-----------"
## 
##     Always Frequently         no  Sometimes 
##          1        599       5799      16446 
## [1] "-----------"
## 
##            Automobile                  Bike             Motorbike 
##                  3990                    39                    49 
## Public_Transportation               Walking 
##                 18245                   522 
## [1] "-----------"
## 
## Insufficient_Weight       Normal_Weight      Obesity_Type_I     Obesity_Type_II 
##                2790                3364                3261                3545 
##    Obesity_Type_III  Overweight_Level_I Overweight_Level_II 
##                4370                2703                2812 
## [1] "-----------"
standardize <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}
train_cleaned[, numerical_feats_train] <- lapply(train_cleaned[, numerical_feats_train], standardize)
test[, numerical_feats_test] <- lapply(test[, numerical_feats_test], standardize)

train_cleaned <- train_cleaned %>% mutate_if(is.character,as.factor)
test <- test %>% mutate_if(is.character,as.factor)
str(train)
## 'data.frame':    22845 obs. of  16 variables:
##  $ Gender                        : chr  "Male" "Female" "Female" "Female" ...
##  $ Age                           : num  24.4 18 18 21 31.6 ...
##  $ Height                        : num  1.7 1.56 1.71 1.71 1.91 ...
##  $ Weight                        : num  81.7 57 50.2 131.3 93.8 ...
##  $ family_history_with_overweight: chr  "yes" "yes" "yes" "yes" ...
##  $ FAVC                          : chr  "yes" "yes" "yes" "yes" ...
##  $ FCVC                          : num  2 2 1.88 3 2.68 ...
##  $ NCP                           : num  2.98 3 1.41 3 1.97 ...
##  $ CAEC                          : chr  "Sometimes" "Frequently" "Sometimes" "Sometimes" ...
##  $ SMOKE                         : chr  "no" "no" "no" "no" ...
##  $ CH2O                          : num  2.76 2 1.91 1.67 1.98 ...
##  $ SCC                           : chr  "no" "no" "no" "no" ...
##  $ FAF                           : num  0 1 0.866 1.468 1.968 ...
##  $ TUE                           : num  0.976 1 1.674 0.78 0.932 ...
##  $ CALC                          : chr  "Sometimes" "no" "no" "Sometimes" ...
##  $ MTRANS                        : chr  "Public_Transportation" "Automobile" "Public_Transportation" "Public_Transportation" ...
for (categorical_feat in categorical_feats_train) {
  train_cleaned[[categorical_feat]] <- as.integer(train_cleaned[[categorical_feat]])
}

for (categorical_feat in categorical_feats_test) {
  test[[categorical_feat]] <- as.integer(test[[categorical_feat]])
}
X_test <- test[, !names(test) %in% "id"]
head(train_cleaned)
##   Gender         Age       Height     Weight family_history_with_overweight
## 1      2  0.09634504 -0.005336032 -0.2322849                              2
## 2      1 -1.02313940 -1.598658082 -1.1680511                              2
## 3      1 -1.02313940  0.125113384 -1.4272838                              2
## 4      1 -0.51009618  0.116805229  1.6492996                              2
## 5      2  1.34702230  2.432344957  0.2277514                              2
## 6      2 -1.00085592  0.546940042 -1.3746790                              2
##   FAVC       FCVC        NCP CAEC SMOKE        CH2O SCC        FAF       TUE
## 1    2 -0.8318003  0.3198325    4     1  1.21024543   1 -1.1725038 0.5897277
## 2    2 -0.8318003  0.3433186    2     1 -0.04464353   1  0.0183577 0.6287571
## 3    2 -1.0557745 -1.8900180    4     1 -0.19193221   1 -0.1411642 1.7461791
## 4    2  1.0429941  0.3433186    4     1 -0.58030581   1  0.5755177 0.2641249
## 5    2  0.4424300 -1.1028990    4     1 -0.07776220   1  1.1710795 0.5154877
## 6    2  0.8925437  0.3433186    4     1  0.18141210   1  1.1258982 0.6287571
##   CALC MTRANS NObeyesdad      BMI
## 1    4      4          7 28.25956
## 2    3      1          2 23.42209
## 3    3      4          1 17.12671
## 4    4      4          5 44.85580
## 5    4      4          7 25.59915
## 6    4      4          1 16.86193
head(test)
##   id Gender        Age       Height    Weight family_history_with_overweight
## 1  0      2  0.1056960 -0.002828196 -0.235707                              2
## 2  1      1 -1.0270272 -1.606252138 -1.170903                              2
## 3  2      1 -1.0270272  0.128448288 -1.429978                              2
## 4  3      1 -0.5079168  0.120087458  1.644731                              2
## 5  4      2  1.3711635  2.450308044  0.224049                              2
## 6  5      2 -1.0044802  0.552949388 -1.377405                              2
##   FAVC       FCVC        NCP CAEC SMOKE        CH2O SCC         FAF       TUE
## 1    2 -0.8362588  0.3146763    4     1  1.20656459   1 -1.17111325 0.5974236
## 2    2 -0.8362588  0.3383560    2     1 -0.04834813   1  0.02177431 0.6364976
## 3    2 -1.0603060 -1.9133767    4     1 -0.19563960   1 -0.13801895 1.7551970
## 4    2  1.0391462  0.3383560    4     1 -0.58402055   1  0.57988226 0.2714485
## 5    2  0.4383864 -1.1197742    4     1 -0.08146743   1  1.17645726 0.5230987
## 6    2  0.8886468  0.3383560    4     1  0.17771178   1  1.13119910 0.6364976
##   CALC MTRANS NObeyesdad
## 1    3      4          7
## 2    2      1          2
## 3    2      4          1
## 4    3      4          5
## 5    3      4          7
## 6    3      4          1

The above charts reveal the relationship between different features and target categories, leading to the following conclusions:

  • The distribution of different categories of NObeyesdad is relatively balanced, but the frequency of category 4 is the highest, and category 1 is the lowest. This indicates that there are more samples of category 4 in the dataset, while category 1 has relatively fewer samples.

  • From a gender perspective, there are differences in the gender distribution across different categories. The proportion of males and females varies significantly in some categories, suggesting that gender may have some influence on category prediction.

  • Most samples have high values for the FAVC feature, indicating that high caloric food intake may have an important impact on predicting target categories.

  • For the family history with overweight feature, most samples have a family history of overweight, suggesting that family history of obesity is also a common phenomenon in the dataset and may have a significant impact on predicting target categories.

  • The vast majority of samples have low values for the SCC feature, indicating that the SCC feature may have a smaller impact on predicting target categories in the dataset.

## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning in geom_boxplot(aes(fill = target), whis = c(1, 99), color = "#756bb1"): Ignoring unknown parameters: `whis`
## Ignoring unknown parameters: `whis`
## Ignoring unknown parameters: `whis`
## Ignoring unknown parameters: `whis`
## Ignoring unknown parameters: `whis`
## Ignoring unknown parameters: `whis`
## Ignoring unknown parameters: `whis`
## Ignoring unknown parameters: `whis`

These box plots provide a visual representation of the distribution and variability of each feature under the “NObeyesdad” category. The presence of outliers and the width of the interquartile ranges indicate the variability within the data for each feature. Based on these box plots, we can draw the following conclusions:

  • Most individuals’ ages are concentrated in the lower range, with some outliers indicating a few individuals of older age.

  • The height distribution is relatively normal, with most individuals’ heights around the median and a few high-end outliers.

  • Weight distribution is quite broad, with many individuals near the median and some high-end outliers.

  • The frequency of vegetable consumption is relatively concentrated, with a narrow interquartile range, indicating similar vegetable consumption habits among most individuals.

  • The number of main meals consumed shows a very narrow distribution, with most individuals consuming the same number of main meals, but there are some extreme outliers.

  • Daily water intake is relatively concentrated, with some outliers, indicating similar patterns of daily water consumption among individuals.

  • The distribution of physical activity frequency is relatively uniform, with a wide range and some outliers, indicating diversity in physical activity levels.

  • The distribution of time spent using electronic devices is also relatively uniform, with a wide range and some outliers, suggesting diverse usage patterns of electronic devices.
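The whis warnings above arise because whis is a matplotlib/seaborn boxplot argument that ggplot2 ignores. A hedged ggplot2-only sketch of one of these boxplots (Age by obesity level, using the unscaled feature table train and the saved labels y):

# Boxplot of Age by obesity level, without the unsupported `whis` argument
ggplot(data.frame(train, target = y), aes(x = target, y = Age, fill = target)) +
  geom_boxplot(color = "#756bb1") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none") +
  labs(x = "NObeyesdad", y = "Age")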

corr_matrix <- cor(train_cleaned[, numerical_feats_train])

p <- ggplot(reshape2::melt(corr_matrix), aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "#fee08b", mid = "#ffffbf", high = "#1f78b4",
                       midpoint = 0, limits = c(-1,1), name = "Correlation") +
  theme_minimal() +
  labs(title = "Correlation Matrix of Numerical Features")

print(p)

Here are some key observations:

  • Age and Height have a high correlation.

  • Weight and Height also show a high correlation.

  • FCVC and CH2O have a moderate correlation.

  • TUE and CH2O show some positive correlation.

  • FAF and TUE have a certain negative correlation.

We perform Principal Component Analysis (PCA) on the train_cleaned data frame and extract the first two principal components to reduce the dimensionality of the data. This simplifies the data, making it easier to visualize and analyze, while retaining as much of the original information as possible.

pca <- prcomp(train_cleaned, center = TRUE, scale. = TRUE)
reduced_X <- data.frame(pca$x[, 1:2])
colnames(reduced_X) <- c("principal component 1", "principal component 2")
head(reduced_X)
##   principal component 1 principal component 2
## 1            -0.9596962            -0.4941391
## 2             2.4321077             0.4620528
## 3             2.4649425            -0.3535129
## 4            -1.9662601             0.7165956
## 5            -1.4319683            -1.9214633
## 6             1.5763951            -1.5898702
explained_variance_ratio <- summary(pca)$importance[2,]
print(sprintf('Explained variation per principal component: %s', paste(round(explained_variance_ratio, 4), collapse = ', ')))
## [1] "Explained variation per principal component: 0.1976, 0.1136, 0.0982, 0.0707, 0.0629, 0.0562, 0.0534, 0.0503, 0.0472, 0.0432, 0.0406, 0.0367, 0.0358, 0.0327, 0.0271, 0.0182, 0.0153, 2e-04"
reduced_X$y <- y
ggplot(reduced_X, aes(x = `principal component 1`, y = `principal component 2`, color = as.factor(y))) +
  geom_point(alpha = 0.3) +
  scale_color_manual(values = scales::hue_pal()(length(unique(reduced_X$y)))) +
  theme_minimal() +
  labs(title = "PCA Scatter Plot",
       x = "Principal Component 1",
       y = "Principal Component 2",
       color = "Class")

cumvar <- cumsum(pca$sdev^2 / sum(pca$sdev^2))
n_components <- which(cumvar >= 0.95)[1]
reduced_X2 <- data.frame(pca$x[, 1:n_components])
colnames(reduced_X2) <- paste0("principal component ", 1:n_components)
print(dim(reduced_X2))
## [1] 22845    15
# Perform t-SNE
tsne_results <- Rtsne(as.matrix(train_cleaned), dims = 2, perplexity = 3)
# Create a data frame
tsne_df <- data.frame(Component_1= tsne_results$Y[, 1], Component_2 = tsne_results$Y[, 2], y = y)

# Plot the t-SNE results
ggplot(tsne_df, aes(x = Component_1, y = Component_2, color = factor(y))) +
  geom_point() +
  labs(x = 'Component 1', y = 'Component 2', color = 'Class') +
  ggtitle('t-SNE Visualization') +
  theme_minimal()

# Perform t-SNE
tsne_results2 <- Rtsne(as.matrix(train_cleaned),dims=2)
#Create a data frame
tsne_df2 <- data.frame(Component_1 = tsne_results2$Y[,1], Component_2 = tsne_results2$Y[,2], y = y)
# Plot the tSNE results
p <- ggplot(tsne_df2, aes(x = Component_1, y = Component_2, color = factor(y))) +
  geom_point() +
  labs(x = 'Component 1', y = 'Component 2', title = 't-SNE Visualization')
print(p)

Modelling and Results

Model training is a crucial step in a machine learning project. In this process, we use the cleaned and processed data to train the model, enabling it to recognize patterns and relationships within the data. By continuously adjusting the model’s parameters and structure, we can enhance its predictive power and accuracy.

Decision tree model

A decision tree is a machine learning model used for classification and regression tasks. It splits data into subsets based on feature values, creating a tree-like decision structure that clearly illustrates the decision-making process, making it easy to interpret and understand. It can handle both numerical and categorical data and performs well in dealing with missing values and high-dimensional data.

decision_tree_model <- rpart(y_train ~ ., data = X_train, method = "class")

rpart.plot(decision_tree_model)
## Warning: All boxes will be white (the box.palette argument will be ignored) because
## the number of classes in the response 7 is greater than length(box.palette) 6.
## To silence this warning use box.palette=0 or trace=-1.

y_pred <- predict(decision_tree_model, newdata = X_val, type = "class")

accuracy <- sum(y_pred == y_val) / length(y_val)
cat("Decision Tree Accuracy:", accuracy, "\n")
## Decision Tree Accuracy: 0.8404116
conf_matrix <- table(y_pred, y_val)
print(conf_matrix)
##       y_val
## y_pred   0   1   2   3   4   5   6
##      0 542  51   1   0   0   4   0
##      1  46 527   0   0   0  59   9
##      2   0   1 456  50   0  15  96
##      3   0   0  66 691   0   0   4
##      4   0   0   5   2 874   0   0
##      5   1  59  19   1   0 404 122
##      6   0   3  70   0   0  45 345

We can see that the accuracy of the decision tree model is 0.8404116, which is approximately 84.04%. And from the confusion matrix, it is evident that most samples are correctly classified, but there is some confusion between certain classes. The model’s ability to distinguish between these classes is insufficient and requires further optimization.
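One routine way to pursue the optimization mentioned above is to examine rpart's cross-validated complexity-parameter table and prune the tree at the best value; a hedged sketch:

# Inspect cross-validated error versus the complexity parameter (cp)
printcp(decision_tree_model)
plotcp(decision_tree_model)

# Prune at the cp value with the lowest cross-validated error (xerror)
best_cp <- decision_tree_model$cptable[which.min(decision_tree_model$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(decision_tree_model, cp = best_cp)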

Random Forest

Random forest is an ensemble learning method that improves prediction accuracy and stability by constructing multiple decision trees and averaging or voting on their results. It has the advantage of handling high-dimensional data and preventing overfitting. Random forest excels in dealing with data noise and missing values, making it a powerful tool for classification and regression tasks.

y_train <- factor(y_train)
random_forest_model <- randomForest(y_train~. , data = cbind(X_train,y_train),
                         ntree = 1000, # total number of decision trees in the ensemble --> larger is more accurate but slower to run
                         mtry = round(sqrt(ncol(train))), # number of features tried at each split (potential for hyperparameter tuning)
                         nodesize = 20 # minimum size of terminal nodes --> the larger the value, the smaller the tree 
                        )

# Plot the error
plot(random_forest_model, main = "Random Forest: Error per number of trees")

y_probs <- predict(random_forest_model, newdata = X_val, type = "prob") #probability 
y_preds <- predict(random_forest_model, newdata = X_val, type = "class")
y_preds<-as.vector(y_preds)
y_preds <-factor(as.numeric(gsub("\"", "", y_preds)))
y_val<-factor(y_val)

cm<- confusionMatrix(y_preds, y_val)
cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4   5   6
##          0 536  18   1   0   0   5   0
##          1  52 592   1   0   0  75  14
##          2   0   0 547  20   0  13  52
##          3   0   0  21 720   0   0   5
##          4   0   0   1   0 874   0   0
##          5   1  23   7   0   0 348  36
##          6   0   8  39   4   0  86 469
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8945          
##                  95% CI : (0.8852, 0.9032)
##     No Information Rate : 0.1913          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8762          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            0.9100   0.9236   0.8865   0.9677   1.0000  0.66034
## Specificity            0.9940   0.9638   0.9785   0.9932   0.9997  0.98342
## Pos Pred Value         0.9571   0.8065   0.8655   0.9651   0.9989  0.83855
## Neg Pred Value         0.9868   0.9872   0.9822   0.9937   1.0000  0.95690
## Prevalence             0.1289   0.1403   0.1351   0.1629   0.1913  0.11537
## Detection Rate         0.1173   0.1296   0.1197   0.1576   0.1913  0.07618
## Detection Prevalence   0.1226   0.1607   0.1384   0.1633   0.1915  0.09085
## Balanced Accuracy      0.9520   0.9437   0.9325   0.9805   0.9999  0.82188
##                      Class: 6
## Sensitivity            0.8142
## Specificity            0.9657
## Pos Pred Value         0.7739
## Neg Pred Value         0.9730
## Prevalence             0.1261
## Detection Rate         0.1027
## Detection Prevalence   0.1327
## Balanced Accuracy      0.8900
importance_data <- as.data.frame(random_forest_model$importance)
importance_data$Feature <- rownames(importance_data)
rownames(importance_data) <- NULL


ggplot(importance_data, aes(x = reorder(Feature, -MeanDecreaseGini), y = MeanDecreaseGini,fill = factor(Feature))) +  
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Feature", y = "Importance", title = "Feature Importances from Random Forest")

y_preds <- as.factor(y_preds)
y_val <- as.factor(y_val)

all_levels <- union(levels(y_preds), levels(y_val))


y_preds <- factor(y_preds, levels = all_levels)
y_val <- factor(y_val, levels = all_levels)


confusionMatrix(y_preds, y_val)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4   5   6
##          0 536  18   1   0   0   5   0
##          1  52 592   1   0   0  75  14
##          2   0   0 547  20   0  13  52
##          3   0   0  21 720   0   0   5
##          4   0   0   1   0 874   0   0
##          5   1  23   7   0   0 348  36
##          6   0   8  39   4   0  86 469
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8945          
##                  95% CI : (0.8852, 0.9032)
##     No Information Rate : 0.1913          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8762          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            0.9100   0.9236   0.8865   0.9677   1.0000  0.66034
## Specificity            0.9940   0.9638   0.9785   0.9932   0.9997  0.98342
## Pos Pred Value         0.9571   0.8065   0.8655   0.9651   0.9989  0.83855
## Neg Pred Value         0.9868   0.9872   0.9822   0.9937   1.0000  0.95690
## Prevalence             0.1289   0.1403   0.1351   0.1629   0.1913  0.11537
## Detection Rate         0.1173   0.1296   0.1197   0.1576   0.1913  0.07618
## Detection Prevalence   0.1226   0.1607   0.1384   0.1633   0.1915  0.09085
## Balanced Accuracy      0.9520   0.9437   0.9325   0.9805   0.9999  0.82188
##                      Class: 6
## Sensitivity            0.8142
## Specificity            0.9657
## Pos Pred Value         0.7739
## Neg Pred Value         0.9730
## Prevalence             0.1261
## Detection Rate         0.1027
## Detection Prevalence   0.1327
## Balanced Accuracy      0.8900

In this analysis, the random forest model performed well, achieving an accuracy of 89.45%, which is significantly higher than that of the decision tree model. The feature importance analysis showed that Weight is the most important feature for predicting the class, followed by FCVC, Height, and Age.

XGBoost

XGBoost is a powerful gradient boosting algorithm designed for efficiency and performance. It provides high accuracy through optimized handling of sparse data, regularization to prevent overfitting, and parallel processing capabilities. XGBoost is widely chosen for its speed, scalability, and superior performance in various machine learning competitions.

X_train[] <- lapply(X_train, function(x) {
  if (is.character(x)) {
    as.numeric(as.factor(x))
  } else {
    x
  }
})
X_train_matrix <- as.matrix(X_train)

X_val[] <- lapply(X_val, function(x) {
  if (is.character(x)) {
    as.numeric(as.factor(x))
  } else {
    x
  }
})
X_val_matrix <- as.matrix(X_val)


dval <- xgb.DMatrix(data = X_val_matrix, label = y_val)
dtrain <- xgb.DMatrix(data = X_train_matrix, label = y_train)


dtest <- xgb.DMatrix(as.matrix(X_val), label=y_val)

watchlist <- list(train=dtrain, test=dtest)

param <- list(
  objective = "multi:softmax",
  eta = 0.3,
  max_depth = 8,
    num_class = 8)

xgb_model <- xgb.train(params = param, data = dtrain, nrounds = 1000, watchlist = watchlist, early_stopping_rounds = 10)
## [1]  train-mlogloss:1.223446 test-mlogloss:1.247614 
## Multiple eval metrics are present. Will use test_mlogloss for early stopping.
## Will train until test_mlogloss hasn't improved in 10 rounds.
## 
## [2]  train-mlogloss:0.909093 test-mlogloss:0.962844 
## [3]  train-mlogloss:0.717562 test-mlogloss:0.789666 
## [4]  train-mlogloss:0.584507 test-mlogloss:0.666956 
## [5]  train-mlogloss:0.485385 test-mlogloss:0.577298 
## [6]  train-mlogloss:0.412554 test-mlogloss:0.510495 
## [7]  train-mlogloss:0.356395 test-mlogloss:0.461828 
## [8]  train-mlogloss:0.313149 test-mlogloss:0.423535 
## [9]  train-mlogloss:0.279527 test-mlogloss:0.393768 
## [10] train-mlogloss:0.253957 test-mlogloss:0.374398 
## [11] train-mlogloss:0.231790 test-mlogloss:0.355831 
## [12] train-mlogloss:0.214276 test-mlogloss:0.341758 
## [13] train-mlogloss:0.200700 test-mlogloss:0.331954 
## [14] train-mlogloss:0.189028 test-mlogloss:0.324934 
## [15] train-mlogloss:0.179011 test-mlogloss:0.317822 
## [16] train-mlogloss:0.170610 test-mlogloss:0.312430 
## [17] train-mlogloss:0.163308 test-mlogloss:0.308711 
## [18] train-mlogloss:0.156174 test-mlogloss:0.304329 
## [19] train-mlogloss:0.150731 test-mlogloss:0.302678 
## [20] train-mlogloss:0.144329 test-mlogloss:0.299519 
## [21] train-mlogloss:0.140092 test-mlogloss:0.297199 
## [22] train-mlogloss:0.134562 test-mlogloss:0.294934 
## [23] train-mlogloss:0.130177 test-mlogloss:0.293503 
## [24] train-mlogloss:0.126419 test-mlogloss:0.291261 
## [25] train-mlogloss:0.122862 test-mlogloss:0.289810 
## [26] train-mlogloss:0.118999 test-mlogloss:0.288696 
## [27] train-mlogloss:0.115531 test-mlogloss:0.288332 
## [28] train-mlogloss:0.112421 test-mlogloss:0.287921 
## [29] train-mlogloss:0.109026 test-mlogloss:0.286756 
## [30] train-mlogloss:0.106142 test-mlogloss:0.286445 
## [31] train-mlogloss:0.103784 test-mlogloss:0.286149 
## [32] train-mlogloss:0.101143 test-mlogloss:0.285863 
## [33] train-mlogloss:0.098532 test-mlogloss:0.286178 
## [34] train-mlogloss:0.096340 test-mlogloss:0.285950 
## [35] train-mlogloss:0.094500 test-mlogloss:0.286541 
## [36] train-mlogloss:0.092808 test-mlogloss:0.287382 
## [37] train-mlogloss:0.090787 test-mlogloss:0.287563 
## [38] train-mlogloss:0.088297 test-mlogloss:0.289106 
## [39] train-mlogloss:0.086840 test-mlogloss:0.288789 
## [40] train-mlogloss:0.084415 test-mlogloss:0.289494 
## [41] train-mlogloss:0.081818 test-mlogloss:0.289674 
## [42] train-mlogloss:0.080206 test-mlogloss:0.289739 
## Stopping. Best iteration:
## [32] train-mlogloss:0.101143 test-mlogloss:0.285863
y_preds <- predict(xgb_model, as.matrix(X_val))
confusionMatrix(as.factor(y_preds),as.factor(as.numeric(y_val)))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3   4   5   6   7
##          1 548  20   1   0   0   6   0
##          2  39 583   0   0   0  57   9
##          3   0   0 549  18   0  16  58
##          4   0   0  22 721   0   0   6
##          5   0   0   1   0 874   0   0
##          6   2  28   7   1   0 365  30
##          7   0  10  37   4   0  83 473
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9004          
##                  95% CI : (0.8913, 0.9089)
##     No Information Rate : 0.1913          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8832          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity            0.9304   0.9095   0.8898   0.9691   1.0000  0.69260
## Specificity            0.9932   0.9733   0.9767   0.9927   0.9997  0.98317
## Pos Pred Value         0.9530   0.8474   0.8565   0.9626   0.9989  0.84296
## Neg Pred Value         0.9897   0.9851   0.9827   0.9940   1.0000  0.96082
## Prevalence             0.1289   0.1403   0.1351   0.1629   0.1913  0.11537
## Detection Rate         0.1200   0.1276   0.1202   0.1578   0.1913  0.07990
## Detection Prevalence   0.1259   0.1506   0.1403   0.1640   0.1915  0.09479
## Balanced Accuracy      0.9618   0.9414   0.9333   0.9809   0.9999  0.83789
##                      Class: 7
## Sensitivity            0.8212
## Specificity            0.9664
## Pos Pred Value         0.7792
## Neg Pred Value         0.9740
## Prevalence             0.1261
## Detection Rate         0.1035
## Detection Prevalence   0.1329
## Balanced Accuracy      0.8938

From the output above, the accuracy of the XGBoost model is about 90.04%. The confusion matrix provides a detailed view of the classification results for each class, including true positives, false positives, true negatives, and false negatives. The statistical metrics include sensitivity, specificity, positive predictive value, and negative predictive value for each class, showcasing the model’s performance across different categories.
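For comparison with the random forest importances, XGBoost's own gain-based feature importances can also be extracted; a short sketch:

# Gain-based feature importance from the fitted booster
importance_matrix <- xgb.importance(feature_names = colnames(X_train_matrix), model = xgb_model)
head(importance_matrix)
xgb.plot.importance(importance_matrix, top_n = 10)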

Regression

The project involves a multi-class classification task with ordered target classes, so it can also be framed as a regression task. In general, the regression approach does not achieve higher accuracy than direct classification; however, it can require less training time, since a single regression model is fitted rather than one model per category.

set.seed(42)

x <- train_cleaned[, c("Age", "Height", "Weight")]
y <- train_cleaned$NObeyesdad

index <- createDataPartition(y, times = 1, p = 0.8, list = FALSE)
x_train <- x[index, ]
x_val <- x[-index, ]
y_train <- y[index]
y_val <- y[-index]

cat("Train set:", length(index), "Test set:", length(-index))
## Train set: 18277 Test set: 18277
# A decision tree regression model was used to fit the data
decision_tree_model <- rpart(y_train ~ ., data = cbind(x_train, y_train), method = "anova")
print(decision_tree_model)
## n= 18277 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 18277 65667.9700 3.992504  
##    2) Weight< -0.996845 3743  3258.4010 1.537804 *
##    3) Weight>=-0.996845 14534 34047.5900 4.624673  
##      6) Height>=0.5553336 5416 10747.3300 4.223966  
##       12) Weight< -0.4733502 549  1088.7540 2.530055 *
##       13) Weight>=-0.4733502 4867  7905.6190 4.415040  
##         26) Weight>=0.2333382 3733  2433.9130 3.995178 *
##         27) Weight< 0.2333382 1134  2647.3510 5.797178 *
##      7) Height< 0.5553336 9118 21914.0900 4.862689  
##       14) Weight>=-0.04518605 4335  3079.1530 4.610150  
##         28) Weight< 0.5201781 859  1338.2400 3.599534 *
##         29) Weight>=0.5201781 3476   646.7696 4.859896 *
##       15) Weight< -0.04518605 4783 18307.8900 5.091574  
##         30) Weight< -0.5970722 1637  7112.1990 4.417838  
##           60) Height>=-0.8035392 786  2652.7580 3.119593 *
##           61) Height< -0.8035392 851  1911.1160 5.616921 *
##         31) Weight>=-0.5970722 3146 10065.9700 5.442149  
##           62) Height< -0.5790597 967  1789.9190 3.600827 *
##           63) Height>=-0.5790597 2179  3542.4990 6.259293 *
folds <- createFolds(y_train, k = 3, list = TRUE, returnTrain = FALSE)

# Create the trainControl object
ctrl <- trainControl(method = "cv", index = folds)

# Use the train function for cross-validation
model <- train(x = x_train,
               y = y_train,
               method = "rpart",
               trControl = ctrl,
               metric = "Rsquared")

# Output the R² score for each fold
print(paste("Scores for each fold:", model$results$Rsquared))
## [1] "Scores for each fold: 0.644799780731035"
## [2] "Scores for each fold: 0.523370430826407"
## [3] "Scores for each fold: 0.428312173615942"
print(paste("Average R²:", mean(model$results$Rsquared), collapse = NULL))
## [1] "Average R²: 0.532160795057795"

Conclusion

In this project, we compared various factors in the dataset that influence obesity and evaluated several models (decision tree, random forest, XGBoost). We found that XGBoost had the highest accuracy, significantly outperforming both the random forest and decision tree models. Although the current sample is limited to people aged 14 to 61, the generalizability and prediction accuracy of the model can be further improved by expanding the sample range, optimizing features, experimenting with more models and parameter tuning, and using model interpretation tools to analyze the impact of each feature on the model’s prediction results.