| Name | Matric Number |
|---|---|
| WANG RUI | 23067445 |
| JIN YUXIN | 23061984 |
According to the World Obesity Federation, obesity has been a major public health problem worldwide since 1987. Many underlying factors, such as eating habits, amount of exercise, and genetic background, may cause people to become overweight. Our goal is therefore to examine the relationship between obesity and these factors and to find the most accurate model for predicting obesity.
Based on this background, we set two research objectives as follows:
Regression objective: To understand the impact of these variables on an individual’s weight status by accurately predicting BMI
Classification objective: To build a model to categorize individuals into their respective obesity levels, helping to identify and target interventions for different weight categories
We obtained the datasets from Kaggle (https://www.kaggle.com/). The train and test sets were generated by a deep learning model trained on the Obesity or CVD risk dataset. The datasets cover the estimation of obesity levels in individuals from Mexico, Peru, and Colombia, based on their eating habits and physical condition. The original data contains 17 attributes and 2111 records, and each record is labeled with the class variable NObesity (obesity level), which takes the values Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II, and Obesity Type III.
Note: the NObesity categories correspond to the following BMI ranges:

| NObesity category | BMI range |
|---|---|
| Underweight | Less than 18.5 |
| Normal | 18.5 to 24.9 |
| Overweight | 25.0 to 29.9 |
| Obesity I | 30.0 to 34.9 |
| Obesity II | 35.0 to 39.9 |
| Obesity III | Higher than 40 |
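For illustration, these thresholds can be expressed directly in R. Below is a minimal sketch (the helper name classify_bmi is ours, not part of the dataset or of any package) that maps a BMI value to the bands in the table above:
# Hypothetical helper: map BMI values to the bands listed above
classify_bmi <- function(bmi) {
  cut(bmi,
      breaks = c(-Inf, 18.5, 25, 30, 35, 40, Inf),
      labels = c("Underweight", "Normal", "Overweight", "Obesity I", "Obesity II", "Obesity III"),
      right = FALSE)  # left-closed intervals, so a BMI of exactly 30 falls in "Obesity I"
}
classify_bmi(c(17.9, 22.3, 41.2))  # expected: Underweight, Normal, Obesity III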
Let's get some basic information from the datasets:
head(train)
## id Gender Age Height Weight family_history_with_overweight FAVC
## 1 0 Male 24.44301 1.699998 81.66995 yes yes
## 2 1 Female 18.00000 1.560000 57.00000 yes yes
## 3 2 Female 18.00000 1.711460 50.16575 yes yes
## 4 3 Female 20.95274 1.710730 131.27485 yes yes
## 5 4 Male 31.64108 1.914186 93.79806 yes yes
## 6 5 Male 18.12825 1.748524 51.55259 yes yes
## FCVC NCP CAEC SMOKE CH2O SCC FAF TUE CALC
## 1 2.000000 2.983297 Sometimes no 2.763573 no 0.000000 0.976473 Sometimes
## 2 2.000000 3.000000 Frequently no 2.000000 no 1.000000 1.000000 no
## 3 1.880534 1.411685 Sometimes no 1.910378 no 0.866045 1.673584 no
## 4 3.000000 3.000000 Sometimes no 1.674061 no 1.467863 0.780199 Sometimes
## 5 2.679664 1.971472 Sometimes no 1.979848 no 1.967973 0.931721 Sometimes
## 6 2.919751 3.000000 Sometimes no 2.137550 no 1.930033 1.000000 Sometimes
## MTRANS NObeyesdad
## 1 Public_Transportation Overweight_Level_II
## 2 Automobile Normal_Weight
## 3 Public_Transportation Insufficient_Weight
## 4 Public_Transportation Obesity_Type_III
## 5 Public_Transportation Overweight_Level_II
## 6 Public_Transportation Insufficient_Weight
dim(train)
## [1] 20758 18
str(train)
## 'data.frame': 20758 obs. of 18 variables:
## $ id : int 0 1 2 3 4 5 6 7 8 9 ...
## $ Gender : chr "Male" "Female" "Female" "Female" ...
## $ Age : num 24.4 18 18 21 31.6 ...
## $ Height : num 1.7 1.56 1.71 1.71 1.91 ...
## $ Weight : num 81.7 57 50.2 131.3 93.8 ...
## $ family_history_with_overweight: chr "yes" "yes" "yes" "yes" ...
## $ FAVC : chr "yes" "yes" "yes" "yes" ...
## $ FCVC : num 2 2 1.88 3 2.68 ...
## $ NCP : num 2.98 3 1.41 3 1.97 ...
## $ CAEC : chr "Sometimes" "Frequently" "Sometimes" "Sometimes" ...
## $ SMOKE : chr "no" "no" "no" "no" ...
## $ CH2O : num 2.76 2 1.91 1.67 1.98 ...
## $ SCC : chr "no" "no" "no" "no" ...
## $ FAF : num 0 1 0.866 1.468 1.968 ...
## $ TUE : num 0.976 1 1.674 0.78 0.932 ...
## $ CALC : chr "Sometimes" "no" "no" "Sometimes" ...
## $ MTRANS : chr "Public_Transportation" "Automobile" "Public_Transportation" "Public_Transportation" ...
## $ NObeyesdad : chr "Overweight_Level_II" "Normal_Weight" "Insufficient_Weight" "Obesity_Type_III" ...
From the output above, we can summarize the following about train.csv:
This dataset consists of estimated obesity levels for people aged between 14 and 61 with diverse eating habits and physical conditions; after processing, it contains 20758 rows and 18 columns. The attributes related to eating habits are: frequent consumption of high-caloric food (FAVC), frequency of consumption of vegetables (FCVC), number of main meals (NCP), consumption of food between meals (CAEC), daily water consumption (CH2O), and consumption of alcohol (CALC). The attributes related to physical condition are: calorie consumption monitoring (SCC), physical activity frequency (FAF), time using technology devices (TUE), and transportation used (MTRANS). The output also shows the type of each column.
head(obesity)
## Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
## 1 Female 21 1.62 64.0 yes no 2 3
## 2 Female 21 1.52 56.0 yes no 3 3
## 3 Male 23 1.80 77.0 yes no 2 3
## 4 Male 27 1.80 87.0 no no 3 3
## 5 Male 22 1.78 89.8 no no 2 1
## 6 Male 29 1.62 53.0 no yes 2 3
## CAEC SMOKE CH2O SCC FAF TUE CALC MTRANS
## 1 Sometimes no 2 no 0 1 no Public_Transportation
## 2 Sometimes yes 3 yes 3 0 Sometimes Public_Transportation
## 3 Sometimes no 2 no 2 1 Frequently Public_Transportation
## 4 Sometimes no 2 no 2 0 Frequently Walking
## 5 Sometimes no 2 no 0 0 Sometimes Public_Transportation
## 6 Sometimes no 2 no 0 0 Sometimes Automobile
## NObeyesdad
## 1 Normal_Weight
## 2 Normal_Weight
## 3 Normal_Weight
## 4 Overweight_Level_I
## 5 Overweight_Level_II
## 6 Normal_Weight
dim(obesity)
## [1] 2111 17
str(obesity)
## 'data.frame': 2111 obs. of 17 variables:
## $ Gender : chr "Female" "Female" "Male" "Male" ...
## $ Age : num 21 21 23 27 22 29 23 22 24 22 ...
## $ Height : num 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
## $ Weight : num 64 56 77 87 89.8 53 55 53 64 68 ...
## $ family_history_with_overweight: chr "yes" "yes" "yes" "no" ...
## $ FAVC : chr "no" "no" "no" "no" ...
## $ FCVC : num 2 3 2 3 2 2 3 2 3 2 ...
## $ NCP : num 3 3 3 3 1 3 3 3 3 3 ...
## $ CAEC : chr "Sometimes" "Sometimes" "Sometimes" "Sometimes" ...
## $ SMOKE : chr "no" "yes" "no" "no" ...
## $ CH2O : num 2 3 2 2 2 2 2 2 2 2 ...
## $ SCC : chr "no" "yes" "no" "no" ...
## $ FAF : num 0 3 2 2 0 0 1 3 1 1 ...
## $ TUE : num 1 0 1 0 0 0 0 0 1 1 ...
## $ CALC : chr "no" "Sometimes" "Frequently" "Frequently" ...
## $ MTRANS : chr "Public_Transportation" "Public_Transportation" "Public_Transportation" "Walking" ...
## $ NObeyesdad : chr "Normal_Weight" "Normal_Weight" "Normal_Weight" "Overweight_Level_I" ...
From the output above, we can summarize the following about obesity.csv:
This dataset contains data for the estimation of obesity levels in individuals from Mexico, Peru, and Colombia, based on their eating habits and physical condition. It has 17 attributes and 2111 records, and each record is labeled with the class variable NObesity (obesity level). The output also shows the type of each column.
Data preparation is the process of cleaning, transforming, and reorganizing data so that it can be analyzed easily. It is an important step to ensure that the data is accurate, complete, and in the right format for analysis in R.
Here are some steps involved:
Data Cleaning: Data cleaning involves finding and fixing or removing any incorrect, incomplete, inaccurate, irrelevant, or missing parts of the data as needed.
Data Transformation: Data transformation is the process of converting raw data into a suitable format or structure for analysis which involves changing the original form of the data to make it more understandable or useful for specific analytical tasks.
Data Integration: Data integration is the process of combining data from multiple sources into a unified view to provide a complete and accurate picture of data.
Our first step is to remove rows with missing values. The caveat is that dropping such rows reduces the amount of data, so the impact should be assessed to ensure that valuable information is not lost.
train <- na.omit(train)
obesity <- na.omit(obesity)
dim(train)
## [1] 20758 18
dim(obesity)
## [1] 2111 17
sum(duplicated(train))
## [1] 0
sum(duplicated(obesity))
## [1] 24
As we can see, these datasets have no missing values, but the obesity dataset contains some duplicated rows.
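If we wanted to see where any missing values would sit before dropping rows, a minimal base-R check on the two data frames loaded above could be:
colSums(is.na(train))    # number of missing values per column of train
colSums(is.na(obesity))  # number of missing values per column of obesity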
#Combine datasets
train <- train[,!names(train) %in% 'id']
train <- data.frame(train)
combined <- bind_rows(train,obesity)
dim(combined)
## [1] 22869 17
head(combined)
## Gender Age Height Weight family_history_with_overweight FAVC
## 1 Male 24.44301 1.699998 81.66995 yes yes
## 2 Female 18.00000 1.560000 57.00000 yes yes
## 3 Female 18.00000 1.711460 50.16575 yes yes
## 4 Female 20.95274 1.710730 131.27485 yes yes
## 5 Male 31.64108 1.914186 93.79806 yes yes
## 6 Male 18.12825 1.748524 51.55259 yes yes
## FCVC NCP CAEC SMOKE CH2O SCC FAF TUE CALC
## 1 2.000000 2.983297 Sometimes no 2.763573 no 0.000000 0.976473 Sometimes
## 2 2.000000 3.000000 Frequently no 2.000000 no 1.000000 1.000000 no
## 3 1.880534 1.411685 Sometimes no 1.910378 no 0.866045 1.673584 no
## 4 3.000000 3.000000 Sometimes no 1.674061 no 1.467863 0.780199 Sometimes
## 5 2.679664 1.971472 Sometimes no 1.979848 no 1.967973 0.931721 Sometimes
## 6 2.919751 3.000000 Sometimes no 2.137550 no 1.930033 1.000000 Sometimes
## MTRANS NObeyesdad
## 1 Public_Transportation Overweight_Level_II
## 2 Automobile Normal_Weight
## 3 Public_Transportation Insufficient_Weight
## 4 Public_Transportation Obesity_Type_III
## 5 Public_Transportation Overweight_Level_II
## 6 Public_Transportation Insufficient_Weight
#Remove repeated rows after combine
any(duplicated(combined))
## [1] TRUE
sum(duplicated(combined))
## [1] 24
train_cleaned <- unique(combined)
dim(train_cleaned)
## [1] 22845 17
Then we should identify and categorize the numerical and character feature columns in each data frame, making sure that columns such as id are not mistakenly included among the numerical features.
numerical_feats_train <- names(train_cleaned)[sapply(train_cleaned, is.numeric)]
numerical_feats_test <- names(test)[sapply(test, is.numeric)]
categorical_feats_train <- names(train_cleaned)[sapply(train_cleaned, is.character)]
categorical_feats_test <- names(test)[sapply(test, is.character)]
numerical_feats_test <- setdiff(numerical_feats_test, "id")
y <- train_cleaned$NObeyesdad
train <- train_cleaned[, !names(train_cleaned) %in% "NObeyesdad"]
y_le <- as.numeric(as.factor(y)) - 1
set.seed(42)
trainIndex <- createDataPartition(y_le, p = 0.8, list = FALSE, times = 1)
X_train <- train[trainIndex, ]
X_val <- train[-trainIndex, ]
y_train <- y_le[trainIndex]
y_val <- y_le[-trainIndex]
Exploratory Data Analysis (EDA) allows data scientists to examine data sets and summarize their main characteristics, often using data visualization techniques. It aids in manipulating data sources to extract the needed answers, enabling the discovery of patterns, identification of anomalies, hypothesis testing, and assumption checking. EDA enhances the understanding of data set variables and their relationships, providing insights based on size, patterns, and other attributes.
We should first analyze the characteristics of our participants, which helps us understand the basic structure and background of the data.
Let's start by looking at the frequency of each NObesity class among participants.
ggplot(train_cleaned, aes(x = NObeyesdad)) +
geom_bar(aes(fill = NObeyesdad)) +
scale_fill_brewer(palette = "Blues") +
labs(y = 'Frequency') +
theme_bw() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 45, hjust = 1))
We can see that Obesity Type III has the most participants, followed by Obesity Type II.
Next, let's examine other participant characteristics, including the obesity target itself, alcohol consumption, smoking, family history, gender, and the mode of transportation they use.
#Target plot
#Create the Data Frame for pie chart
obs_pie <- train_cleaned %>%
count(NObeyesdad) %>%
mutate(per = round(n / sum(n) * 100, 2),
value_label = paste0(per, "%\n", n))
#Create pie chart
ggplot(obs_pie,aes(x = '', y = per,fill = NObeyesdad)) +
geom_bar(width = 1,stat = 'identity', color = 'white') +
coord_polar('y') +
geom_text(aes(label = value_label),
position = position_stack(vjust = 0.5),
color = 'black', size = 3) +
scale_fill_brewer(palette = 'GnBu') +
theme_void() +
theme(legend.position = 'right',
plot.title = element_text(hjust = 0.5)) +
ggtitle(label = 'Obesity (Target) in Participants')
#Alcohol consumption
CALC_pie <- train_cleaned %>%
count(CALC) %>%
mutate(per = round(n / sum(n) * 100, 2),
value_label = paste0(per, "%\n", n))
CALC_pie_filtered <- CALC_pie %>% filter(per > 0)
ggplot(CALC_pie_filtered,aes(x = '', y = per,fill = CALC)) +
geom_bar(width = 1,stat = 'identity', color = 'white') +
coord_polar('y') +
geom_text(aes(label = value_label),
position = position_stack(vjust = 0.5),
color = 'black', size = 3) +
scale_fill_brewer(palette = 'Oranges') +
theme_void() +
theme(legend.position = 'right',
plot.title = element_text(hjust = 0.5)) +
ggtitle(label = 'Alcohol Consumption among Participants')
#Smoking
SMOKE_pie <- train_cleaned %>%
count(SMOKE) %>%
mutate(per = round(n / sum(n) * 100, 2),
value_label = paste0(per, "%\n", n))
ggplot(SMOKE_pie,aes(x = '', y = per,fill = SMOKE)) +
geom_bar(width = 1,stat = 'identity', color = 'white') +
coord_polar('y') +
geom_text(aes(label = value_label),
position = position_stack(vjust = 0.5),
color = 'black', size = 3) +
scale_fill_brewer(palette = 'PuRd') +
theme_void() +
theme(legend.position = 'right',
plot.title = element_text(hjust = 0.5)) +
ggtitle(label = 'People with Smoking Habit')
#Family history
family_history_pie <- train_cleaned %>%
count(family_history_with_overweight) %>%
mutate(per = round(n / sum(n) * 100, 2),
value_label = paste0(per,"%\n",n))
ggplot(family_history_pie,aes(x = '',y = per, fill = family_history_with_overweight)) +
geom_bar(width = 1,stat = 'identity', color = 'white') +
coord_polar('y') +
geom_text(aes(label = value_label),
position = position_stack(vjust = 0.5),
color = 'black', size = 3) +
scale_fill_brewer(palette = 'YlGn') +
theme_void() +
theme(legend.position = 'right',
plot.title = element_text(hjust = 0.5)) +
ggtitle(label = 'Family history with Overweight')
#Gender
gender_pie <- train_cleaned %>%
count(Gender) %>%
mutate(per = round(n / sum(n) * 100, 2),
value_label = paste0(per,"%\n",n))
ggplot(gender_pie,aes(x = '',y = per, fill = Gender)) +
geom_bar(width = 1,stat = 'identity', color = 'white') +
coord_polar('y') +
geom_text(aes(label = value_label),
position = position_stack(vjust = 0.5),
color = 'black', size = 3) +
scale_fill_brewer(palette = 'PuBu') +
theme_void() +
theme(legend.position = 'right',
plot.title = element_text(hjust = 0.5)) +
ggtitle(label = 'Gender of Participants')
# Transportation
ggplot(train_cleaned,aes(x = MTRANS)) +
geom_bar(aes(fill = MTRANS)) +
scale_fill_brewer(palette = "Pastel2") +
labs(title='Different Modes of transportation of participants') +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
From these plots, we can draw the following observations:
Most participants (71.99%) drink alcohol only sometimes; 25.38% do not drink at all, and 2.62% drink frequently.
Most participants (98.73%) do not smoke; only 1.27% have a smoking habit.
82.01% of participants have a family history of overweight, and 17.99% do not.
50.15% of participants are female and 49.85% are male.
Most participants use public transportation; very few use bikes or motorbikes.
Let's compare the influence of different factors on NObesity:
# Gender with Target
ggplot(train_cleaned, aes(y = NObeyesdad, fill = Gender)) +
geom_bar(position = position_dodge(), stat = 'count') +
geom_text(stat = 'count', aes(label = after_stat(count)), position = position_dodge(width = 1), hjust = -0.5) +
scale_fill_brewer(palette = "Blues") +
labs(title = 'Target w.r.t Gender', x = 'Count', y = 'NObeyesdad') +
scale_y_discrete(limits = c('Insufficient_Weight', 'Normal_Weight', 'Overweight_Level_I', 'Overweight_Level_II', 'Obesity_Type_I', 'Obesity_Type_II', 'Obesity_Type_III')) +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
# Target distribution in SMOKE
SMOKE_YES <- subset(train_cleaned,SMOKE == 'yes')
ggplot(SMOKE_YES, aes(x = NObeyesdad)) +
geom_bar(aes(fill = NObeyesdad)) +
scale_fill_brewer(palette = "Blues") +
labs(title = 'Target distribution in SMOKE=YES') +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1))
ggplot(train_cleaned, aes(y = FAVC, fill = NObeyesdad)) +
geom_bar(position = "dodge") +
scale_fill_brewer(palette = "Blues") +
labs(title = "Effect of FAVC (High Caloric Food Intake) on Target",
y = "FAVC",
x = "count",
fill = "NObeyesdad") +
theme_minimal()+
theme(plot.title = element_text(hjust = 0.5))
ggplot(train_cleaned, aes(x = family_history_with_overweight, fill = NObeyesdad)) +
geom_bar(position = "dodge") +
scale_fill_brewer(palette = "Blues") +
labs(title = "Target with Family Hist. with Overweight",
y = "count",
x = "family history with overweight",
fill = "NObeyesdad") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
ggplot(train_cleaned, aes(x = SCC, fill = NObeyesdad)) +
geom_bar(position = "dodge") +
scale_fill_brewer(palette = "Blues") +
labs(title = "SCC with Targe",
y = "count",
x = "SCC",
fill = "NObeyesdad") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
#formula: BMI = Weight/(Height^2)
train_cleaned$BMI <- with(train_cleaned, Weight / (Height^2))
ggplot(train_cleaned, aes(x = BMI, fill = NObeyesdad)) +
geom_histogram(position = "dodge", bins = 30) +
scale_fill_brewer(palette = "Blues") +
labs(title = "BMI Distribution",
y = "Count",
x = "BMI",
fill = "NObeyesdad") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
We can see that people who consume high-calorie foods and those with a family history of obesity are more likely to develop Obesity Type III and Obesity Type II. In contrast, individuals with normal or insufficient weight tend not to have a family history of obesity and do not consume high-calorie foods.
for (i in categorical_feats_train) {
print(table(train_cleaned[[i]]))
print("-----------")
}
##
## Female Male
## 11457 11388
## [1] "-----------"
##
## no yes
## 4109 18736
## [1] "-----------"
##
## no yes
## 2019 20826
## [1] "-----------"
##
## Always Frequently no Sometimes
## 531 2708 316 19290
## [1] "-----------"
##
## no yes
## 22556 289
## [1] "-----------"
##
## no yes
## 22062 783
## [1] "-----------"
##
## Always Frequently no Sometimes
## 1 599 5799 16446
## [1] "-----------"
##
## Automobile Bike Motorbike
## 3990 39 49
## Public_Transportation Walking
## 18245 522
## [1] "-----------"
##
## Insufficient_Weight Normal_Weight Obesity_Type_I Obesity_Type_II
## 2790 3364 3261 3545
## Obesity_Type_III Overweight_Level_I Overweight_Level_II
## 4370 2703 2812
## [1] "-----------"
standardize <- function(x) {
(x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}
train_cleaned[, numerical_feats_train] <- lapply(train_cleaned[, numerical_feats_train], standardize)
test[, numerical_feats_test] <- lapply(test[, numerical_feats_test], standardize)
train_cleaned <- train_cleaned %>% mutate_if(is.character,as.factor)
test <- test %>% mutate_if(is.character,as.factor)
str(train)
## 'data.frame': 22845 obs. of 16 variables:
## $ Gender : chr "Male" "Female" "Female" "Female" ...
## $ Age : num 24.4 18 18 21 31.6 ...
## $ Height : num 1.7 1.56 1.71 1.71 1.91 ...
## $ Weight : num 81.7 57 50.2 131.3 93.8 ...
## $ family_history_with_overweight: chr "yes" "yes" "yes" "yes" ...
## $ FAVC : chr "yes" "yes" "yes" "yes" ...
## $ FCVC : num 2 2 1.88 3 2.68 ...
## $ NCP : num 2.98 3 1.41 3 1.97 ...
## $ CAEC : chr "Sometimes" "Frequently" "Sometimes" "Sometimes" ...
## $ SMOKE : chr "no" "no" "no" "no" ...
## $ CH2O : num 2.76 2 1.91 1.67 1.98 ...
## $ SCC : chr "no" "no" "no" "no" ...
## $ FAF : num 0 1 0.866 1.468 1.968 ...
## $ TUE : num 0.976 1 1.674 0.78 0.932 ...
## $ CALC : chr "Sometimes" "no" "no" "Sometimes" ...
## $ MTRANS : chr "Public_Transportation" "Automobile" "Public_Transportation" "Public_Transportation" ...
for (categorical_feat in categorical_feats_train) {
train_cleaned[[categorical_feat]] <- as.integer(train_cleaned[[categorical_feat]])
}
for (categorical_feat in categorical_feats_test) {
test[[categorical_feat]] <- as.integer(test[[categorical_feat]])
}
X_test <- test[, !names(test) %in% "id"]
head(train_cleaned)
## Gender Age Height Weight family_history_with_overweight
## 1 2 0.09634504 -0.005336032 -0.2322849 2
## 2 1 -1.02313940 -1.598658082 -1.1680511 2
## 3 1 -1.02313940 0.125113384 -1.4272838 2
## 4 1 -0.51009618 0.116805229 1.6492996 2
## 5 2 1.34702230 2.432344957 0.2277514 2
## 6 2 -1.00085592 0.546940042 -1.3746790 2
## FAVC FCVC NCP CAEC SMOKE CH2O SCC FAF TUE
## 1 2 -0.8318003 0.3198325 4 1 1.21024543 1 -1.1725038 0.5897277
## 2 2 -0.8318003 0.3433186 2 1 -0.04464353 1 0.0183577 0.6287571
## 3 2 -1.0557745 -1.8900180 4 1 -0.19193221 1 -0.1411642 1.7461791
## 4 2 1.0429941 0.3433186 4 1 -0.58030581 1 0.5755177 0.2641249
## 5 2 0.4424300 -1.1028990 4 1 -0.07776220 1 1.1710795 0.5154877
## 6 2 0.8925437 0.3433186 4 1 0.18141210 1 1.1258982 0.6287571
## CALC MTRANS NObeyesdad BMI
## 1 4 4 7 28.25956
## 2 3 1 2 23.42209
## 3 3 4 1 17.12671
## 4 4 4 5 44.85580
## 5 4 4 7 25.59915
## 6 4 4 1 16.86193
head(test)
## id Gender Age Height Weight family_history_with_overweight
## 1 0 2 0.1056960 -0.002828196 -0.235707 2
## 2 1 1 -1.0270272 -1.606252138 -1.170903 2
## 3 2 1 -1.0270272 0.128448288 -1.429978 2
## 4 3 1 -0.5079168 0.120087458 1.644731 2
## 5 4 2 1.3711635 2.450308044 0.224049 2
## 6 5 2 -1.0044802 0.552949388 -1.377405 2
## FAVC FCVC NCP CAEC SMOKE CH2O SCC FAF TUE
## 1 2 -0.8362588 0.3146763 4 1 1.20656459 1 -1.17111325 0.5974236
## 2 2 -0.8362588 0.3383560 2 1 -0.04834813 1 0.02177431 0.6364976
## 3 2 -1.0603060 -1.9133767 4 1 -0.19563960 1 -0.13801895 1.7551970
## 4 2 1.0391462 0.3383560 4 1 -0.58402055 1 0.57988226 0.2714485
## 5 2 0.4383864 -1.1197742 4 1 -0.08146743 1 1.17645726 0.5230987
## 6 2 0.8886468 0.3383560 4 1 0.17771178 1 1.13119910 0.6364976
## CALC MTRANS NObeyesdad
## 1 3 4 7
## 2 2 1 2
## 3 2 4 1
## 4 3 4 5
## 5 3 4 7
## 6 3 4 1
p <- ggplot(train_cleaned, aes(x = factor(NObeyesdad), fill = factor(NObeyesdad))) + # wrap in factor() since NObeyesdad is now integer-encoded
geom_bar() +
labs(y = "Frequency") +
scale_fill_brewer(palette = "Blues") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggsave("photo1.png", plot = p, width = 14, height = 6)
print(p)
p <- ggplot(train_cleaned, aes(x = factor(NObeyesdad), fill = factor(Gender))) +
geom_bar(position = "fill") +
labs(title = "Target w.r.t Gender") +
scale_fill_manual(values = c("#1f78b4", "#a6cee3"), labels = c("Male", "Female")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(p)
p <- ggplot(train_cleaned, aes(x = factor(FAVC), fill = factor(NObeyesdad))) + # wrap in factor() since both columns are now integer-encoded
geom_bar(position = "dodge") +
labs(title = "Effect of FAVC (High Caloric Food Intake) on Target") +
scale_fill_manual(values = c("#ffffcc", "#c7e9b4", "#7fcdbb", "#41b6c4", "#1d91c0", "#225ea8", "#0c2c84"),
labels = c('Insufficient_Weight', 'Normal_Weight', 'Obesity_Type_I', 'Obesity_Type_II', 'Obesity_Type_III', 'Overweight_Level_I', 'Overweight_Level_II')) + # label order follows the integer encoding (alphabetical factor levels)
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(p)
p <- ggplot(train_cleaned, aes(x = factor(family_history_with_overweight), fill = factor(NObeyesdad))) + # wrap in factor() since both columns are now integer-encoded
geom_bar(position = "dodge") +
labs(title = "Target with Family Hist. with Overweight") +
scale_fill_manual(values = c("#ffffcc", "#c7e9b4", "#7fcdbb", "#41b6c4", "#1d91c0", "#225ea8", "#0c2c84"),
labels = c('Insufficient_Weight', 'Normal_Weight', 'Obesity_Type_I', 'Obesity_Type_II', 'Obesity_Type_III', 'Overweight_Level_I', 'Overweight_Level_II')) + # label order follows the integer encoding (alphabetical factor levels)
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(p)
The above charts reveal the relationship between different features and the target categories, leading to the following conclusions:
The distribution of the NObeyesdad categories is relatively balanced, but Obesity_Type_III (encoded as 5) has the highest frequency and Overweight_Level_I (encoded as 6) the lowest, so the dataset contains more Obesity_Type_III samples and relatively few Overweight_Level_I samples.
From a gender perspective, there are differences in the gender distribution across different categories. The proportion of males and females varies significantly in some categories, suggesting that gender may have some influence on category prediction.
Most samples have high values for the FAVC feature, indicating that high caloric food intake may have an important impact on predicting target categories.
For the family history with overweight feature, most samples have a family history of overweight, suggesting that family history of obesity is also a common phenomenon in the dataset and may have a significant impact on predicting target categories.
The vast majority of samples have low values for the SCC feature, indicating that the SCC feature may have a smaller impact on predicting target categories in the dataset.
plots <- list()
target <- 'NObeyesdad'
# Use the .data pronoun instead of the deprecated aes_string(), and drop the unsupported `whis` argument
for (feature in numerical_feats_train) {
p <- ggplot(train_cleaned, aes(x = factor(.data[[target]]), y = .data[[feature]], fill = factor(.data[[target]]))) +
geom_boxplot(color = '#756bb1') +
coord_flip() +
ggtitle(feature) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
plots[[feature]] <- p
}
do.call(grid.arrange, c(plots, ncol = 2))
These box plots provide a visual representation of the distribution and variability of each feature under the “NObeyesdad” category. The presence of outliers and the width of the interquartile ranges indicate the variability within the data for each feature. Based on these box plots, we can draw the following conclusions:
Most individuals’ ages are concentrated in the lower range, with some outliers indicating a few individuals of older age.
The height distribution is relatively normal, with most individuals’ heights around the median and a few high-end outliers.
Weight distribution is quite broad, with many individuals near the median and some high-end outliers.
The frequency of vegetable consumption is relatively concentrated, with a narrow interquartile range, indicating similar vegetable consumption habits among most individuals.
The number of main meals consumed shows a very narrow distribution, with most individuals consuming the same number of main meals, but there are some extreme outliers.
Daily water intake is relatively concentrated, with some outliers, indicating similar patterns of daily water consumption among individuals.
The distribution of physical activity frequency is relatively uniform, with a wide range and some outliers, indicating diversity in physical activity levels.
The distribution of time spent using electronic devices is also relatively uniform, with a wide range and some outliers, suggesting diverse usage patterns of electronic devices.
corr_matrix <- cor(train_cleaned[, numerical_feats_train])
p <- ggplot(reshape2::melt(corr_matrix), aes(Var1, Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "#fee08b", mid = "#ffffbf", high = "#1f78b4",
midpoint = 0, limits = c(-1,1), name = "Correlation") +
theme_minimal() +
labs(title = "Correlation Matrix of Numerical Features")
print(p)
Here are some key observations:
Age and Height have a high correlation.
Weight and Height also show a high correlation.
FCVC and CH2O have a moderate correlation.
TUE and CH2O show some positive correlation.
FAF and TUE have a certain negative correlation.
We perform Principal Component Analysis (PCA) on the train_cleaned data frame and extract the first two principal components to reduce the dimensionality of the data. This simplifies the data, making it easier to visualize and analyze, while retaining as much of the original information as possible.
pca <- prcomp(train_cleaned, center = TRUE, scale. = TRUE)
reduced_X <- data.frame(pca$x[, 1:2])
colnames(reduced_X) <- c("principal component 1", "principal component 2")
head(reduced_X)
## principal component 1 principal component 2
## 1 -0.9596962 -0.4941391
## 2 2.4321077 0.4620528
## 3 2.4649425 -0.3535129
## 4 -1.9662601 0.7165956
## 5 -1.4319683 -1.9214633
## 6 1.5763951 -1.5898702
explained_variance_ratio <- summary(pca)$importance[2,]
print(sprintf('Explained variation per principal component: %s', paste(round(explained_variance_ratio, 4), collapse = ', ')))
## [1] "Explained variation per principal component: 0.1976, 0.1136, 0.0982, 0.0707, 0.0629, 0.0562, 0.0534, 0.0503, 0.0472, 0.0432, 0.0406, 0.0367, 0.0358, 0.0327, 0.0271, 0.0182, 0.0153, 2e-04"
reduced_X$y <- y
ggplot(reduced_X, aes(x = `principal component 1`, y = `principal component 2`, color = as.factor(y))) +
geom_point(alpha = 0.3) +
scale_color_manual(values = scales::hue_pal()(length(unique(reduced_X$y)))) +
theme_minimal() +
labs(title = "PCA Scatter Plot",
x = "Principal Component 1",
y = "Principal Component 2",
color = "Class")
cumvar <- cumsum(pca$sdev^2 / sum(pca$sdev^2))
n_components <- which(cumvar >= 0.95)[1]
reduced_X2 <- data.frame(pca$x[, 1:n_components])
colnames(reduced_X2) <- paste0("principal component ", 1:n_components)
print(dim(reduced_X2))
## [1] 22845 15
# Perform t-SNE
tsne_results <- Rtsne(as.matrix(train_cleaned), dims = 2, perplexity = 3)
# Create a data frame
tsne_df <- data.frame(Component_1= tsne_results$Y[, 1], Component_2 = tsne_results$Y[, 2], y = y)
# Plot the t-SNE results
ggplot(tsne_df, aes(x = Component_1, y = Component_2, color = factor(y))) +
geom_point() +
labs(x = 'Component 1', y = 'Component 2', color = 'Class') +
ggtitle('t-SNE Visualization') +
theme_minimal()
# Perform t-SNE
tsne_results2 <- Rtsne(as.matrix(train_cleaned),dims=2)
#Create a data frame
tsne_df2 <- data.frame(Component_1 = tsne_results2$Y[,1], Component_2 = tsne_results2$Y[,2], y = y)
# Plot the tSNE results
p <- ggplot(tsne_df2, aes(x = Component_1, y = Component_2, color = factor(y))) +
geom_point() +
labs(x = 'Component 1', y = 'Component 2', title = 't-SNE Visualization')
print(p)
Model training is a crucial step in a machine learning project. In this process, we use the cleaned and processed data to train the model, enabling it to recognize patterns and relationships within the data. By continuously adjusting the model’s parameters and structure, we can enhance its predictive power and accuracy.
A decision tree is a machine learning model used for classification and regression tasks. It splits data into subsets based on feature values, creating a tree-like decision structure that clearly illustrates the decision-making process, making it easy to interpret and understand. It can handle both numerical and categorical data and performs well in dealing with missing values and high-dimensional data.
decision_tree_model <- rpart(y_train ~ ., data = X_train, method = "class")
rpart.plot(decision_tree_model)
## Warning: All boxes will be white (the box.palette argument will be ignored) because
## the number of classes in the response 7 is greater than length(box.palette) 6.
## To silence this warning use box.palette=0 or trace=-1.
y_pred <- predict(decision_tree_model, newdata = X_val, type = "class")
accuracy <- sum(y_pred == y_val) / length(y_val)
cat("Decision Tree Accuracy:", accuracy, "\n")
## Decision Tree Accuracy: 0.8404116
conf_matrix <- table(y_pred, y_val)
print(conf_matrix)
## y_val
## y_pred 0 1 2 3 4 5 6
## 0 542 51 1 0 0 4 0
## 1 46 527 0 0 0 59 9
## 2 0 1 456 50 0 15 96
## 3 0 0 66 691 0 0 4
## 4 0 0 5 2 874 0 0
## 5 1 59 19 1 0 404 122
## 6 0 3 70 0 0 45 345
We can see that the accuracy of the decision tree model is 0.8404, approximately 84.04%. From the confusion matrix, most samples are correctly classified, but there is noticeable confusion between certain classes. The model's ability to distinguish these classes is limited and could benefit from further optimization.
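One possible route for that optimization is cost-complexity pruning. The following is a minimal sketch, assuming the decision_tree_model fitted above is still in memory; the names best_cp and pruned_tree are ours:
printcp(decision_tree_model)  # cross-validated error (xerror) for each complexity parameter
best_cp <- decision_tree_model$cptable[which.min(decision_tree_model$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(decision_tree_model, cp = best_cp)  # cut the tree back at the best cp
y_pred_pruned <- predict(pruned_tree, newdata = X_val, type = "class")
mean(y_pred_pruned == y_val)  # validation accuracy of the pruned tree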
Random forest is an ensemble learning method that improves prediction accuracy and stability by constructing multiple decision trees and averaging or voting on their results. It has the advantage of handling high-dimensional data and preventing overfitting. Random forest excels in dealing with data noise and missing values, making it a powerful tool for classification and regression tasks.
y_train <- factor(y_train)
random_forest_model <- randomForest(y_train~. , data = cbind(X_train,y_train),
ntree = 1000, # total number of decision trees in ensemble --> larger is more accurate but slower to run
mtry= round(sqrt(ncol(train))), # number of features (potential for hyperparameter tuning)
nodesize = 20 # minimum number of terminal leaf nodes --> larger the number the smaller the tree
)
# Plot the error
plot(random_forest_model, main = "Random Forest: Error per number of trees")
y_probs <- predict(random_forest_model, newdata = X_val, type = "prob") #probability
y_preds <- predict(random_forest_model, newdata = X_val, type = "class")
y_preds<-as.vector(y_preds)
y_preds <-factor(as.numeric(gsub("\"", "", y_preds)))
y_val<-factor(y_val)
cm<- confusionMatrix(y_preds, y_val)
cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3 4 5 6
## 0 536 18 1 0 0 5 0
## 1 52 592 1 0 0 75 14
## 2 0 0 547 20 0 13 52
## 3 0 0 21 720 0 0 5
## 4 0 0 1 0 874 0 0
## 5 1 23 7 0 0 348 36
## 6 0 8 39 4 0 86 469
##
## Overall Statistics
##
## Accuracy : 0.8945
## 95% CI : (0.8852, 0.9032)
## No Information Rate : 0.1913
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8762
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.9100 0.9236 0.8865 0.9677 1.0000 0.66034
## Specificity 0.9940 0.9638 0.9785 0.9932 0.9997 0.98342
## Pos Pred Value 0.9571 0.8065 0.8655 0.9651 0.9989 0.83855
## Neg Pred Value 0.9868 0.9872 0.9822 0.9937 1.0000 0.95690
## Prevalence 0.1289 0.1403 0.1351 0.1629 0.1913 0.11537
## Detection Rate 0.1173 0.1296 0.1197 0.1576 0.1913 0.07618
## Detection Prevalence 0.1226 0.1607 0.1384 0.1633 0.1915 0.09085
## Balanced Accuracy 0.9520 0.9437 0.9325 0.9805 0.9999 0.82188
## Class: 6
## Sensitivity 0.8142
## Specificity 0.9657
## Pos Pred Value 0.7739
## Neg Pred Value 0.9730
## Prevalence 0.1261
## Detection Rate 0.1027
## Detection Prevalence 0.1327
## Balanced Accuracy 0.8900
importance_data <- as.data.frame(random_forest_model$importance)
importance_data$Feature <- rownames(importance_data)
rownames(importance_data) <- NULL
ggplot(importance_data, aes(x = reorder(Feature, -MeanDecreaseGini), y = MeanDecreaseGini,fill = factor(Feature))) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "Feature", y = "Importance", title = "Feature Importances from Random Forest")
y_preds <- as.factor(y_preds)
y_val <- as.factor(y_val)
all_levels <- union(levels(y_preds), levels(y_val))
y_preds <- factor(y_preds, levels = all_levels)
y_val <- factor(y_val, levels = all_levels)
confusionMatrix(y_preds, y_val)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3 4 5 6
## 0 536 18 1 0 0 5 0
## 1 52 592 1 0 0 75 14
## 2 0 0 547 20 0 13 52
## 3 0 0 21 720 0 0 5
## 4 0 0 1 0 874 0 0
## 5 1 23 7 0 0 348 36
## 6 0 8 39 4 0 86 469
##
## Overall Statistics
##
## Accuracy : 0.8945
## 95% CI : (0.8852, 0.9032)
## No Information Rate : 0.1913
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8762
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.9100 0.9236 0.8865 0.9677 1.0000 0.66034
## Specificity 0.9940 0.9638 0.9785 0.9932 0.9997 0.98342
## Pos Pred Value 0.9571 0.8065 0.8655 0.9651 0.9989 0.83855
## Neg Pred Value 0.9868 0.9872 0.9822 0.9937 1.0000 0.95690
## Prevalence 0.1289 0.1403 0.1351 0.1629 0.1913 0.11537
## Detection Rate 0.1173 0.1296 0.1197 0.1576 0.1913 0.07618
## Detection Prevalence 0.1226 0.1607 0.1384 0.1633 0.1915 0.09085
## Balanced Accuracy 0.9520 0.9437 0.9325 0.9805 0.9999 0.82188
## Class: 6
## Sensitivity 0.8142
## Specificity 0.9657
## Pos Pred Value 0.7739
## Neg Pred Value 0.9730
## Prevalence 0.1261
## Detection Rate 0.1027
## Detection Prevalence 0.1327
## Balanced Accuracy 0.8900
In this analysis, the random forest model performed well, achieving an accuracy of 89.45%, which is significantly higher than that of the decision tree model. The feature importance analysis showed that Weight is the most important feature for predicting the class, followed by FCVC, Height, and Age.
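If we wanted to push the random forest a little further, one option is to tune mtry with the randomForest package's tuneRF helper. The sketch below uses illustrative parameter values (not the settings used above), and converts any remaining character columns to factors first, since tuneRF expects numeric or factor predictors:
# Convert character predictors to factors so tuneRF can handle them
X_train_rf <- as.data.frame(lapply(X_train, function(col) if (is.character(col)) as.factor(col) else col))
set.seed(42)
tuned <- tuneRF(x = X_train_rf, y = y_train,
                mtryStart = round(sqrt(ncol(X_train_rf))), # start from the value used above
                ntreeTry = 200,      # a small forest is enough for the search
                stepFactor = 1.5, improve = 0.01,
                trace = TRUE, plot = FALSE)
best_mtry <- tuned[which.min(tuned[, "OOBError"]), "mtry"]  # mtry with the lowest out-of-bag error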
XGBoost is a powerful gradient boosting algorithm designed for efficiency and performance. It provides high accuracy through optimized handling of sparse data, regularization to prevent overfitting, and parallel processing capabilities. XGBoost is widely chosen for its speed, scalability, and superior performance in various machine learning competitions.
X_train[] <- lapply(X_train, function(x) {
if (is.character(x)) {
as.numeric(as.factor(x))
} else {
x
}
})
X_train_matrix <- as.matrix(X_train)
X_val[] <- lapply(X_val, function(x) {
if (is.character(x)) {
as.numeric(as.factor(x))
} else {
x
}
})
X_val_matrix <- as.matrix(X_val)
dval <- xgb.DMatrix(data = X_val_matrix, label = y_val)
dtrain <- xgb.DMatrix(data = X_train_matrix, label = y_train)
dtest <- xgb.DMatrix(as.matrix(X_val), label=y_val)
watchlist <- list(train=dtrain, test=dtest)
param <- list(
objective = "multi:softmax",
eta = 0.3,
max_depth = 8,
num_class = 8)
xgb_model <- xgb.train(params = param, data = dtrain, nrounds = 1000, watchlist = watchlist, early_stopping_rounds = 10)
## [1] train-mlogloss:1.223446 test-mlogloss:1.247614
## Multiple eval metrics are present. Will use test_mlogloss for early stopping.
## Will train until test_mlogloss hasn't improved in 10 rounds.
##
## [2] train-mlogloss:0.909093 test-mlogloss:0.962844
## [3] train-mlogloss:0.717562 test-mlogloss:0.789666
## [4] train-mlogloss:0.584507 test-mlogloss:0.666956
## [5] train-mlogloss:0.485385 test-mlogloss:0.577298
## [6] train-mlogloss:0.412554 test-mlogloss:0.510495
## [7] train-mlogloss:0.356395 test-mlogloss:0.461828
## [8] train-mlogloss:0.313149 test-mlogloss:0.423535
## [9] train-mlogloss:0.279527 test-mlogloss:0.393768
## [10] train-mlogloss:0.253957 test-mlogloss:0.374398
## [11] train-mlogloss:0.231790 test-mlogloss:0.355831
## [12] train-mlogloss:0.214276 test-mlogloss:0.341758
## [13] train-mlogloss:0.200700 test-mlogloss:0.331954
## [14] train-mlogloss:0.189028 test-mlogloss:0.324934
## [15] train-mlogloss:0.179011 test-mlogloss:0.317822
## [16] train-mlogloss:0.170610 test-mlogloss:0.312430
## [17] train-mlogloss:0.163308 test-mlogloss:0.308711
## [18] train-mlogloss:0.156174 test-mlogloss:0.304329
## [19] train-mlogloss:0.150731 test-mlogloss:0.302678
## [20] train-mlogloss:0.144329 test-mlogloss:0.299519
## [21] train-mlogloss:0.140092 test-mlogloss:0.297199
## [22] train-mlogloss:0.134562 test-mlogloss:0.294934
## [23] train-mlogloss:0.130177 test-mlogloss:0.293503
## [24] train-mlogloss:0.126419 test-mlogloss:0.291261
## [25] train-mlogloss:0.122862 test-mlogloss:0.289810
## [26] train-mlogloss:0.118999 test-mlogloss:0.288696
## [27] train-mlogloss:0.115531 test-mlogloss:0.288332
## [28] train-mlogloss:0.112421 test-mlogloss:0.287921
## [29] train-mlogloss:0.109026 test-mlogloss:0.286756
## [30] train-mlogloss:0.106142 test-mlogloss:0.286445
## [31] train-mlogloss:0.103784 test-mlogloss:0.286149
## [32] train-mlogloss:0.101143 test-mlogloss:0.285863
## [33] train-mlogloss:0.098532 test-mlogloss:0.286178
## [34] train-mlogloss:0.096340 test-mlogloss:0.285950
## [35] train-mlogloss:0.094500 test-mlogloss:0.286541
## [36] train-mlogloss:0.092808 test-mlogloss:0.287382
## [37] train-mlogloss:0.090787 test-mlogloss:0.287563
## [38] train-mlogloss:0.088297 test-mlogloss:0.289106
## [39] train-mlogloss:0.086840 test-mlogloss:0.288789
## [40] train-mlogloss:0.084415 test-mlogloss:0.289494
## [41] train-mlogloss:0.081818 test-mlogloss:0.289674
## [42] train-mlogloss:0.080206 test-mlogloss:0.289739
## Stopping. Best iteration:
## [32] train-mlogloss:0.101143 test-mlogloss:0.285863
y_preds <- predict(xgb_model, as.matrix(X_val))
confusionMatrix(as.factor(y_preds),as.factor(as.numeric(y_val)))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5 6 7
## 1 548 20 1 0 0 6 0
## 2 39 583 0 0 0 57 9
## 3 0 0 549 18 0 16 58
## 4 0 0 22 721 0 0 6
## 5 0 0 1 0 874 0 0
## 6 2 28 7 1 0 365 30
## 7 0 10 37 4 0 83 473
##
## Overall Statistics
##
## Accuracy : 0.9004
## 95% CI : (0.8913, 0.9089)
## No Information Rate : 0.1913
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8832
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity 0.9304 0.9095 0.8898 0.9691 1.0000 0.69260
## Specificity 0.9932 0.9733 0.9767 0.9927 0.9997 0.98317
## Pos Pred Value 0.9530 0.8474 0.8565 0.9626 0.9989 0.84296
## Neg Pred Value 0.9897 0.9851 0.9827 0.9940 1.0000 0.96082
## Prevalence 0.1289 0.1403 0.1351 0.1629 0.1913 0.11537
## Detection Rate 0.1200 0.1276 0.1202 0.1578 0.1913 0.07990
## Detection Prevalence 0.1259 0.1506 0.1403 0.1640 0.1915 0.09479
## Balanced Accuracy 0.9618 0.9414 0.9333 0.9809 0.9999 0.83789
## Class: 7
## Sensitivity 0.8212
## Specificity 0.9664
## Pos Pred Value 0.7792
## Neg Pred Value 0.9740
## Prevalence 0.1261
## Detection Rate 0.1035
## Detection Prevalence 0.1329
## Balanced Accuracy 0.8938
According to the output above, the accuracy of the XGBoost model is about 90.04%. The confusion matrix provides a detailed view of the classification results for each class, including true positives, false positives, true negatives, and false negatives. The per-class statistics include sensitivity, specificity, positive predictive value, and negative predictive value, showing the model's performance across the different categories.
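For completeness, feature importance can also be inspected for the boosted model. The sketch below assumes the fitted xgb_model and the X_train_matrix built earlier, and uses xgboost's built-in importance helpers:
importance_matrix <- xgb.importance(feature_names = colnames(X_train_matrix), model = xgb_model)
head(importance_matrix)                             # Gain, Cover and Frequency for each feature
xgb.plot.importance(importance_matrix, top_n = 10)  # bar chart of the most influential features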
Because the target classes are ordered, this multi-class classification task can also be framed as a regression problem. In general, the regression approach does not achieve higher accuracy than direct classification, but it can require less training time, since it avoids schemes that build one model per category.
set.seed(42)
x <- train_cleaned[, c("Age", "Height", "Weight")]
y <- train_cleaned$NObeyesdad
index <- createDataPartition(y, times = 1, p = 0.8, list = FALSE)
x_train <- x[index, ]
x_val <- x[-index, ]
y_train <- y[index]
y_val <- y[-index]
cat("Train set:", length(index), "Test set:", length(-index))
## Train set: 18277 Test set: 18277
# A decision tree regression model was used to fit the data
decision_tree_model <- rpart(y_train ~ ., data = cbind(x_train, y_train), method = "anova")
print(decision_tree_model)
## n= 18277
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 18277 65667.9700 3.992504
## 2) Weight< -0.996845 3743 3258.4010 1.537804 *
## 3) Weight>=-0.996845 14534 34047.5900 4.624673
## 6) Height>=0.5553336 5416 10747.3300 4.223966
## 12) Weight< -0.4733502 549 1088.7540 2.530055 *
## 13) Weight>=-0.4733502 4867 7905.6190 4.415040
## 26) Weight>=0.2333382 3733 2433.9130 3.995178 *
## 27) Weight< 0.2333382 1134 2647.3510 5.797178 *
## 7) Height< 0.5553336 9118 21914.0900 4.862689
## 14) Weight>=-0.04518605 4335 3079.1530 4.610150
## 28) Weight< 0.5201781 859 1338.2400 3.599534 *
## 29) Weight>=0.5201781 3476 646.7696 4.859896 *
## 15) Weight< -0.04518605 4783 18307.8900 5.091574
## 30) Weight< -0.5970722 1637 7112.1990 4.417838
## 60) Height>=-0.8035392 786 2652.7580 3.119593 *
## 61) Height< -0.8035392 851 1911.1160 5.616921 *
## 31) Weight>=-0.5970722 3146 10065.9700 5.442149
## 62) Height< -0.5790597 967 1789.9190 3.600827 *
## 63) Height>=-0.5790597 2179 3542.4990 6.259293 *
folds <- createFolds(y_train, k = 3, list = TRUE, returnTrain = FALSE)
# Create the trainControl object
ctrl <- trainControl(method = "cv", index = folds)
# Use the train function for cross-validation
model <- train(x = x_train,
y = y_train,
method = "rpart",
trControl = ctrl,
metric = "Rsquared")
# Output the R² score for each fold
print(paste("Scores for each fold:", model$results$Rsquared))
## [1] "Scores for each fold: 0.644799780731035"
## [2] "Scores for each fold: 0.523370430826407"
## [3] "Scores for each fold: 0.428312173615942"
print(paste("Average R²:", mean(model$results$Rsquared), collapse = NULL))
## [1] "Average R²: 0.532160795057795"
In this project, we compared various factors in the dataset that influence obesity and evaluated several models (decision tree, random forest, XGBoost). XGBoost achieved the highest accuracy (about 90.0%), slightly ahead of the random forest (89.5%) and clearly ahead of the decision tree (84.0%). Although the current sample is limited to people aged 14 to 61, the generalizability and prediction accuracy of the models could be further improved by expanding the sample range, refining the features, experimenting with more models and parameter tuning, and using model interpretation tools to analyze how each feature affects the predictions.