Obesity is defined by the CDC as “having a body mass index (BMI) of 30.0 or higher, with severe obesity being defined as a BMI of 40.0 or higher” [1]. The prevalence of both adult and childhood obesity and severe obesity have been rising steadily since 1990 [1, 2, 3]. In 2024, a staggering 35 million children under the age of 5 were considered overweight according to the World Health Organization. While once thought to be a symptom of excess, obesity rates have been rising in developing countries, indicating the severity of the obesity epidemic worldwide. In Africa alone the number of overweight children under 5 years of age rose a staggering 12.1% since 2000 [3].
A novel challenge has been raised by these global spikes in obesity; children are facing obesity and malnutrition simultaneously. Due to the increase in exposure to poor nutrient quality, low cost, high fat and high sugar foods paired with lowered participation in physical exercise, children are developing elevated weights while simultaneously experiencing malnutrition [3]. Due to the decreased access to healthy foods, the obesity rate in children increases the closer their family gets to the federal poverty line. According to CDC and the US Department of Agriculture, currently at least 25.8% of children living with a family income at 130% of the federal poverty line or lower are experiencing obesity [2].
Obesity care places a massive burden on hospitals, healthcare institutions, and healthcare systems as a whole due to its high comorbidity rates [1, 2, 3]. According to the WHO, in 2021, elevated BMI was linked to approximately 3.7 million deaths from non-communicable diseases including but not limited to heart diseases, diabetes, cancers, digestive disorders, chronic respiratory diseases, and neurological disorders [3]. In 2017 the CDC found that 58% of adults in the United States that are considered obese also have comorbid high blood pressure, and 23% have comorbid diabetes. Additionally, the CDC identified that in 2019 annual medical costs for those with obesity were on average $1,861 higher per person per year for those with obesity, and $3,097 per person per year for those with severe (a.k.a. morbid) obesity.
As outlined by the UConn Rudd Center for Food Policy and Health those with overweight or obese weight status experience weight based stigma in a broad variety of life aspects. Many persons with obesity have gone as far as requesting weight no longer be collected at routine medical appointments in attempt to combat these stigmas and reduce potential inequalities in healthcare [4]. While this band-aid approach seeks to return comfort and dignity to the patient, the categorization of patients into weight rankings is pivotal for targeted preventative care and reducing future healthcare burden for patients and the healthcare system as a whole. In order to ensure adequate care is initiated at the appropriate time, researchers have set out to identify alternative ways to predict a patient’s weight status or future weight status. While some patients may not currently be overweight or obese, their behaviors and daily activities may indicate they are on track to develop overweight or obese weight status, further underpinning the need for weight based predictions that do not rely on patients’ current weight.
obese <- read.csv("https://nlepera.github.io/sta552/HW05/data/obesity.csv")
colnames(obese)[colnames(obese) == "family_history_with_overweight"] <- "FH.Overweight"
colnames(obese)[colnames(obese) == "NObeyesdad"] <- "Ob.Rank"
In order to address this problem of behavioral based weight status
identification, the researchers at Universidad de la Costa (CUC)
compiled personal health and daily behavioral data from persons ages 14
through 61 in Mexico, Peru, and Columbia. Participants were presented
with a web survey to anonymously capture their eating habits, behavioral
choices, and demographic information. This survey collection resulted in
2111 unique observations over 16 variables, and the subsequent
calculation of 1 variable to classify the respondent’s obesity rank [5].
More information on the variables collected is included below in
Table 1
vars <- data.frame(
Name = c("Gender", "Age", "Height", "Weight", "Family History Overweight", "High Calorie Foods Often", "Vegetables in Meals", "Meals Per Day", "Food Between Meals", "Smoking Status", "Daily Water Consumption", "Calorie Counting", "Physical Activity Rate", "Technology Use", "Alcohol", "Transportation", "Obesity Level"),
Type = c("Continuous", "Continuous", "Continuous", "Continuous", "Binary", "Binary", "Continuous", "Continuous", "Categorical", "Binary", "Continuous", "Binary", "Continuous", "Continuous", "Categorical", "Categorical", "Categorical"),
Questions = c("", "", "", "", "Has a family member suffered or suffers from overweight?", "Do you eat high caloric food frequently?", "Do you usually eat vegetables in your meals?", "How many main meals do you have daily?", "Do you eat any food between meals?", "Do you smoke?", "How much water do you drink daily?", "Do you monitor the calories you eat daily?", "How often do you have physical activity?", "How much time do you use technological devices such as cell phone, videogames, television, computer and others?", "How often do you drink alcohol?", "Which transportation do you usually use?", ""),
Responses = c("Male/Female", "Age (y)", "Height (m)", "Weight (kg)", "Yes/No", "Yes/No", "", "", "No/Sometimes/Frequently/Always", "Yes/No", "", "Yes/No", "", "", "Never/Sometimes/Frequently/Always", "Auto/Motorbike/Bike/Public Trans/Walking", "Insuf. Weight/Normal Weight/Overweight 1/Overweight 2/Obese 1/Obese 2/Obese 3"),
Var = c("Gender", "Age", "Height", "Weight", "fh.overweight", "FAVC", "FCVC", "NCP", "CAEC", "SMOKE", "CH20", "SCC", "FAF", "TUE", "CALC", "MTRANS", "Ob.Rank")
)
kable(vars, align = c("l", "c", "c", "c"), col.names = c("Variable Name", "Type", "Associated Survey Question", "Response Options / Levels", "Coded Var. Name"), caption = "Table 1:
<br>Summary details of all variables in the obesity data.") %>%
kable_styling()
Variable Name | Type | Associated Survey Question | Response Options / Levels | Coded Var. Name |
---|---|---|---|---|
Gender | Continuous | Male/Female | Gender | |
Age | Continuous | Age (y) | Age | |
Height | Continuous | Height (m) | Height | |
Weight | Continuous | Weight (kg) | Weight | |
Family History Overweight | Binary | Has a family member suffered or suffers from overweight? | Yes/No | fh.overweight |
High Calorie Foods Often | Binary | Do you eat high caloric food frequently? | Yes/No | FAVC |
Vegetables in Meals | Continuous | Do you usually eat vegetables in your meals? | FCVC | |
Meals Per Day | Continuous | How many main meals do you have daily? | NCP | |
Food Between Meals | Categorical | Do you eat any food between meals? | No/Sometimes/Frequently/Always | CAEC |
Smoking Status | Binary | Do you smoke? | Yes/No | SMOKE |
Daily Water Consumption | Continuous | How much water do you drink daily? | CH20 | |
Calorie Counting | Binary | Do you monitor the calories you eat daily? | Yes/No | SCC |
Physical Activity Rate | Continuous | How often do you have physical activity? | FAF | |
Technology Use | Continuous | How much time do you use technological devices such as cell phone, videogames, television, computer and others? | TUE | |
Alcohol | Categorical | How often do you drink alcohol? | Never/Sometimes/Frequently/Always | CALC |
Transportation | Categorical | Which transportation do you usually use? | Auto/Motorbike/Bike/Public Trans/Walking | MTRANS |
Obesity Level | Categorical | Insuf. Weight/Normal Weight/Overweight 1/Overweight 2/Obese 1/Obese 2/Obese 3 | Ob.Rank |
obese <- obese %>% mutate_if(is.character, as.factor)
The goals of this analysis are to develop and tune various prediction models to utilize the dietary, behavioral, and demographic data (without patient weight) to predict a patients’ overweight/obesity status. This will allow researchers to better identify patients that are overweight/obese that may not normally have access to preventative healthcare and to predict patients that may not yet be suffering from overweight or obesity yet their behaviors and dietary habits place them on track to developing these conditions.
As the overweight/obesity rank as captured by Palechor and Manotas was encoded as a factor variable with 7 levels both linear and multinomial regression approaches were utilized. In addition to standard regression approaches, CART Regression, CART Classification, BAGGING regression, and BAGGING classification were employed for assessment.
In order to accommodate for both linear regression and categorical
classification/multinomial regression the data was copied in to
duplicate data sets. One data set retained the categorical values as
factors, while the other first converted categorical values to factors,
then again converted these values to numeric from factor. Summary
statistics of the factor based data set are included below in
Table 2
.
As the purpose of this analysis is to predict the participant’s
overweight/obese status without utilizing participants weight, the
weight variable was dropped from subsequent analysis following
Table 2
. Additionally, a cursory evaluation of
Table 2
indicates there are no missing values in the data
set, therefore imputation is not required.
kable(summary(obese), align = "c" , caption = "Table 2:
<br>Summary statistics of obesity data.") %>%
kable_styling() %>%
scroll_box(width = "100%")
Gender | Age | Height | Weight | FH.Overweight | FAVC | FCVC | NCP | CAEC | SMOKE | CH2O | SCC | FAF | TUE | CALC | MTRANS | Ob.Rank | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Female:1043 | Min. :14.00 | Min. :1.450 | Min. : 39.00 | no : 385 | no : 245 | Min. :1.000 | Min. :1.000 | Always : 53 | no :2067 | Min. :1.000 | no :2015 | Min. :0.0000 | Min. :0.0000 | Always : 1 | Automobile : 457 | Insufficient_Weight:272 | |
Male :1068 | 1st Qu.:19.95 | 1st Qu.:1.630 | 1st Qu.: 65.47 | yes:1726 | yes:1866 | 1st Qu.:2.000 | 1st Qu.:2.659 | Frequently: 242 | yes: 44 | 1st Qu.:1.585 | yes: 96 | 1st Qu.:0.1245 | 1st Qu.:0.0000 | Frequently: 70 | Bike : 7 | Normal_Weight :287 | |
NA | Median :22.78 | Median :1.700 | Median : 83.00 | NA | NA | Median :2.386 | Median :3.000 | no : 51 | NA | Median :2.000 | NA | Median :1.0000 | Median :0.6253 | no : 639 | Motorbike : 11 | Obesity_Type_I :351 | |
NA | Mean :24.31 | Mean :1.702 | Mean : 86.59 | NA | NA | Mean :2.419 | Mean :2.686 | Sometimes :1765 | NA | Mean :2.008 | NA | Mean :1.0103 | Mean :0.6579 | Sometimes :1401 | Public_Transportation:1580 | Obesity_Type_II :297 | |
NA | 3rd Qu.:26.00 | 3rd Qu.:1.768 | 3rd Qu.:107.43 | NA | NA | 3rd Qu.:3.000 | 3rd Qu.:3.000 | NA | NA | 3rd Qu.:2.477 | NA | 3rd Qu.:1.6667 | 3rd Qu.:1.0000 | NA | Walking : 56 | Obesity_Type_III :324 | |
NA | Max. :61.00 | Max. :1.980 | Max. :173.00 | NA | NA | Max. :3.000 | Max. :4.000 | NA | NA | Max. :3.000 | NA | Max. :3.0000 | Max. :2.0000 | NA | NA | Overweight_Level_I :290 | |
NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | Overweight_Level_II:290 |
The pruned obesity data was then assessed for potential correlation
between all variables. The below Figure 1
demonstrates both
the magnitude and direction through the respective sizes and colors of
the potted circles. Variables sharing a pearson correlation coefficient
(r) of greater than or equal to 0.60 are considered to have at least
moderate correlation. A total of four (4) variables in two (2) unique
combinations were identified as at least moderately correlated, as
captured below in Figure 2
and Table 3
.
#drop weight from obese
obese <- obese[-4]
#create all numeric version of obese for correlation plot
obese.numeric <- obese %>% mutate_if(is.factor, as.numeric)
cor.obese <- cor(obese.numeric)
corrplot(cor.obese, type = "upper", order = "hclust", tl.col = "black", tl.srt = 65, diag = T)
title(main = "Variable Correlation", line = -3.75, cex.main =2)
title(sub = "Figure 1:
Pairwise correlation visualization of all variables in obese data. Character variables
were transformed to factors then numeric.
Size and color of circles are determined by strength and direction of correlation.", line = 0, cex.sub = 0.85)
#set up correlation table for filtering
cor.table <- data.frame(
row = rownames(cor.obese)[row(cor.obese)],
col = colnames(cor.obese)[col(cor.obese)],
cor = c(cor.obese)
)
#filter for 0.6 < cor < 1.0 to filter for high cor but ignore cor = 1 (same variable match)
cor.t2 <- filter(cor.table, abs(cor) >= 0.6) %>% filter(abs(cor) != 1)
cor.t2$cor <- round(cor.t2$cor, 3)
cor.t2 <- cor.t2[order(cor.t2$cor),]
cor.mtrans_age <- ggplot(obese, aes (x = MTRANS, y = Age)) +
geom_boxplot(aes(fill = MTRANS), alpha = 0.6) +
theme_bw() +
theme(axis.text.x = element_text(angle = 35, vjust = 1, hjust=1), legend.position = "none")
cor.gen_height <- ggplot(obese, aes (x = Gender, y = Height)) +
geom_boxplot(aes(fill = Gender), alpha = 0.6) +
theme_bw() +
theme(axis.text.x = element_text(angle = 35, vjust = 1, hjust=1), legend.position = "none")
grid.arrange(
arrangeGrob(cor.mtrans_age, cor.gen_height, ncol = 2, nrow = 1),
top = textGrob("Pairwise Correlation Plots
", gp = gpar(fontsize = 18, fontface = "bold")),
bottom = textGrob("Figure 2:
Visualizatons of Obesity variable correlations where |r| > 0.6")
)
kable(cor.t2[c(1,4),], align = "c", row.names = F, col.names = c("Variable 1 (x)", "Variable 2 (y)", "Correlation (r)"), caption = "Table 3:
<br>Identified moderate to strong correlations from obese data.
<br>Duplicate entries have been removed, favoring categorical values as the x variable to allow for easy visualization in Figure 2.
<br>Moderate to strong correlations defined as 0.6 ≤ r ≤ 1") %>%
kable_styling()
Variable 1 (x) | Variable 2 (y) | Correlation (r) |
---|---|---|
MTRANS | Age | -0.602 |
Gender | Height | 0.618 |
The correlated variables as outlined in Figre 2
and
Table 3
prove to be logical and do not require further
investigation.
Figure 2
.In order to properly utilize the correlated variables without data loss, feature variables were created to allow their inclusion without the implications of collinearity. Additionally, a third feature variable was created to allow for additional dimensionality to the prediction models. The following feature variables were created prior to analysis:
Summary statistics of the newly created feature variables are
included below in Table 4
.
#adding feature engineered variables
obese$veg.day <- obese$FCVC*obese$NCP
obese$gen.height <- obese$Height/as.numeric(obese$Gender)
obese$trans.age <- as.numeric(obese$MTRANS)/as.numeric(obese$Age)
#pulling an all numeric copy just in case needed later
obese.numeric <- obese %>% mutate_if(is.factor, as.numeric)
kable(summary(obese[c(17:19)]), align = "c", col.names = c("Vegetables per Day", "Height by Gender", "Transport Method by Age"),caption = "Table 4:
<br>Summary statistics of feature engineered variables (veg.day, gen.height, trans.age") %>%
kable_styling()
Vegetables per Day | Height by Gender | Transport Method by Age | |
---|---|---|---|
Min. : 1.000 | Min. :0.7800 | Min. :0.01786 | |
1st Qu.: 5.018 | 1st Qu.:0.8799 | 1st Qu.:0.13120 | |
Median : 6.196 | Median :0.9571 | Median :0.17391 | |
Mean : 6.514 | Mean :1.2568 | Mean :0.15248 | |
3rd Qu.: 9.000 | 3rd Qu.:1.6376 | 3rd Qu.:0.20000 | |
Max. :12.000 | Max. :1.8434 | Max. :0.35714 |
In order to create the prediction models for patient obesity status, the data was first split in to 80:20 training:test groups to allow for model training and tuning on the test group. This split selection process will ensure the prediction models are not over-fitted to the initial data. Preventing over fitting of the models will allow for use with future collected data and participant data sourced from other countries by ensuring flexibility.
#80:20 split for CV
#numeric
cv.index <- sample(1:nrow(obese.numeric), size = 0.8*nrow(obese.numeric))
ob.train <- obese.numeric[cv.index,]
ob.test <- obese.numeric[-cv.index,]
#categorical
cv.index1 <- sample(1:nrow(obese), size = 0.8*nrow(obese))
ob.train1 <- obese[cv.index,]
ob.test1 <- obese[-cv.index,]
Rather than creating a regression model the CART algorithm can be utilized to create prediction models for both regression and classification. The CART algorithm functions through creating decision trees utilizing supervised learning algorithms based on the target response variable. The model initiates by splitting the first identified variable based on an algorithmic defined cutoff value creating the root node that branches into two inner nodes. The inner nodes are subsequently split into either an additional two inner nodes or two terminal (leaf) nodes. This process continues until the stop point is reached; the stop point is algorithmically determined to prevent over fitting and ensure the model is generalized enough that newly collected data can be utilized with the same model.
As the obesity rank variable is an integer value, both CART Regression and CART Classification trees were created to determine the best fit prediction model.
CART regression was initiated utilizing the rpart() package utilizing the full model. The regression tree was created utilizing the ANOVA method indicating that the tree splits are completed utilizing Mean Square Error (MSE) reduction techniques. This MSE reduction approach follows the below equation to ensure child (inner & terminal) nodes are selected based on splits that minimize the MSE of each node.
\[ \underline{\mbox{Mean Squared Error (MSE) Reduction}} \\ MSE = \frac{\sum_{i = 1}^{n} (y_{i} - \hat{y})^2}{n} \]
As the data was preemptively split into training and test data, the training data was utilized for 10-fold cross validation as part of the initial full model creation process. This cross validation ensures the produced model is the best fit for the training data as provided. Initially the hyperparameters were initiated as:
#create CART regression tree (initial full model)
reg.tree <- rpart(Ob.Rank ~ ., data = ob.train, method = "anova",
control = rpart.control(
minsplit = 20,
minbucket = 7,
cp = 0.01, #reduction in MSE needed for split
maxdepth = 5,
xval = 10
))
#plot full model regression tree
fancyRpartPlot(reg.tree, type = 5, tweak = 1.01, main = "Initial CART Regression Tree" , caption = "Figure 3:
Initial CART regression tree model (unpruned)")
With the initial CART regression tree model created above in
Figure 3
the model was then tuned to assign better fit
hyperparameter values ensuring improved fit and an overall reduction in
model size. The model’s CP values were compared with respect to the
cross validated error (xerror) and standard error of the cross validated
error (xstd) to identify new CP values for use with the initial model.
Both the CP associated with the minimum xerror (known as the min CP) and
the largest CP associated with the xerror within 1 standard error of the
minimum xerror value (known as the 1se CP value) were collected for
subsequent tuning.
|
|
The calculated CP values from the previous section were then fed in
to the model pruning function prune() to create two pruned CART
regression tree models. The first tree model (Figure 5
) was
created using the 1se CP value while the second (Figure 6
)
was created using the minimum CP value. As demonstrated by the below
regression trees, the 1se method of tuning the CP hyperparameter results
in a more significantly reduced regression tree model.
cp.reg.table <- reg.tree$cptable
#get values for pruned trees
xerror.min <- min(cp.reg.table[,"xerror"]) #min xerror
#extract row with minimum xerror as standalone data frame, then pull CP value out of that row
min.cp.row <- which.min(cp.reg.table[, "xerror"])
min.cp <- cp.reg.table[min.cp.row, "CP"] #min.cp for mincp tree
#get xstd value from minimum xerror row
xerror.xstd <- cp.reg.table[min.cp.row, "xstd"]
threshold <- xerror.min + xerror.xstd #add xstd value to min xerror to get 1 se from min
best.cp.row <- which(cp.reg.table[,"xerror"] <= threshold)[1] #pull row where xerror is closest to but below min(xerror) + 1se threshold
best.cp <- cp.reg.table[best.cp.row, "CP"] #pull CP value from row where xerror w.in 1se
reg.tree.bestcp <- prune(reg.tree, cp = best.cp) #prune tree with 1se parameter
reg.tree.mincp <- prune(reg.tree, cp = min.cp) #prune tree with min cp parameter
fancyRpartPlot(reg.tree.bestcp, type = 5, tweak = 1.01, main = "Pruned CART Regression Tree
CP = 1se", caption = paste("Figure 5:
Pruned CART Regression tree (1se model). CP =", round(best.cp, 5)))
fancyRpartPlot(reg.tree.mincp, type = 5, tweak = 1.01, main = "Pruned CART Regression Tree
CP = min", caption = paste("Figure 6:
Pruned CART Regression tree (min model). CP =", round(min.cp, 5)))
As demonstrated in the above Figure 5
, the 1se CART
regression tree resulted in a model that incorporated the following
feature variables:
Meanwhile, as demonstrated in Figure 6
the minimum CART
regression tree resulted in a model that incorporated the following
variables:
As the minimum CP model utilized a lower CP value, this model required less of a change in MSE before nodes were split, thus resulting in a broader model that incorporates a larger number of feature variables.
Both the 1se and minimum CP CART regression tree models were then utilized to generate prediction values for obesity ranks in utilizing the test data set. These predicted values were compared to the actual values in the test data set to assess for model accuracy and reliability. These CART regression models were additionally compared to two linear regression models to determine their efficacy in relation to other more established statistical prediction methods. The first regression model was comprised utilizing the variables as identified in the 1se CART regression tree model, while the second was constructed utilizing the stepAIC() package to automatically identify the best variables for inclusion.
The linear regression equations utilized are as follows:
\[ \underline{\mbox{1se Linear Regression Model}} \\ \\ \begin{align} \hat{Y}_{Ob.Rank} \sim \mbox{CAEC}*x_{1} + &\mbox{gen.height}*x_2 + \mbox{Age}*x_3 + \mbox{NCP}*x_4 + \mbox{FH.Overweight}*x_5 \end{align} \]
\[ \underline{\mbox{stepAIC Linear Regression Model}} \\ \\ \begin{align} \hat{Y}_{Ob.Rank} \sim \mbox{CAEC}*x_{1} + &\mbox{gen.height}*x_2 + \mbox{Age}*x_3 + \\ \mbox{NCP}*x_4 + &\mbox{FH.Overweight}*x_5 + \mbox{MTRANS}*x_6 + \mbox{FAF}*x_7 + \mbox{CH2O}*x_8 \end{align} \]
obrank.reg.tree.best <- predict(reg.tree.bestcp, newdata = ob.test)
obrank.reg.tree.min <- predict(reg.tree.mincp, newdata = ob.test)
#BestCP Model
mse.reg.best <- mean((ob.test$Ob.Rank - obrank.reg.tree.best)**2)
rmse.reg.best <- sqrt(mse.reg.best)
mae.reg.best <- mean(abs(ob.test$Ob.Rank - obrank.reg.tree.best))
rsq.reg.best <- (cor(as.numeric(ob.test$Ob.Rank), obrank.reg.tree.best))**2
#MinCP Model
mse.reg.min <- mean((ob.test$Ob.Rank - obrank.reg.tree.min)**2)
rmse.reg.min <- sqrt(mse.reg.min)
mae.reg.min <- mean(abs(ob.test$Ob.Rank - obrank.reg.tree.min))
rsq.reg.min <- (cor(as.numeric(ob.test$Ob.Rank), obrank.reg.tree.min))**2
#make linear regression models
#model using variables from 1se tree
lin.1se <- lm(Ob.Rank ~ CAEC + gen.height + Age + NCP + FH.Overweight, data = ob.train)
lin.pred <- predict(lin.1se, newdata = ob.test)
mse.lin.min <- mean((ob.test$Ob.Rank - lin.pred)**2)
rmse.lin.min <- sqrt(mse.lin.min)
mae.lin.min <- mean(abs(ob.test$Ob.Rank - lin.pred))
rsq.lin.min <- (cor(as.numeric(ob.test$Ob.Rank), lin.pred))**2
#stepwise selection from all variables
step.lin <- lm(Ob.Rank ~., data = ob.train)
AIC.lin <- stepAIC(step.lin, direction = "both", trace = F)
step.pred <- predict(AIC.lin, newdata = ob.test)
mse.step.min <- mean((ob.test$Ob.Rank - step.pred)**2)
rmse.step.min <- sqrt(mse.step.min)
mae.step.min <- mean(abs(ob.test$Ob.Rank - step.pred))
rsq.step.min <- (cor(as.numeric(ob.test$Ob.Rank), step.pred))**2
reg.errors <- data.frame(
measures = c("Mean Square Error (MSE)", "Root Mean Square Error (RMSE)", "Mean Absolute Error (MAE)", "R Squared"),
tree.r1se = c(mse.reg.best, rmse.reg.best, mae.reg.best, rsq.reg.best),
tree.rmin = c(mse.reg.min, rmse.reg.min, mae.reg.min, rsq.reg.min),
lin.1se = c(mse.lin.min, rmse.lin.min, mae.lin.min, rsq.lin.min),
step.lin = c(mse.step.min, rmse.step.min, mae.step.min, rsq.step.min)
)
The model assessment statistics are included below in
Table 6
. As demonstrated by the lowest MSE, RMSE, and MAE
with the highest r\(^2\) value, the Min
CP CART tree resulted in the best fit prediction model. Albeit, while
the Min CP CART Tree preformed the best of the four available regression
based models, none of the models preformed well. All returned an r\(^2\) value of less than 0.26 indicating
these models only capture less than 26% of the variance in the obesity
data.
kable(reg.errors, align = c("l", "c","c","c","c") , col.names = c("Asm. Measures", "1se CART Tree", "Min CP CART Tree", "1se Linear", "stepAIC Linear") , caption = "Table 6:
<br>Model assessment statistics") %>%
kable_styling()
Asm. Measures | 1se CART Tree | Min CP CART Tree | 1se Linear | stepAIC Linear |
---|---|---|---|---|
Mean Square Error (MSE) | 2.7792004 | 2.7090165 | 3.2114684 | 2.8910271 |
Root Mean Square Error (RMSE) | 1.6670934 | 1.6459090 | 1.7920570 | 1.7003021 |
Mean Absolute Error (MAE) | 1.2933708 | 1.2571838 | 1.5385074 | 1.4111261 |
R Squared | 0.2762515 | 0.2996549 | 0.1662987 | 0.2512238 |
A summary of the variable importance for both the 1se CART regression
and min CP CART regression trees is included below in
Figure 7
. While a majority of the variables as represented
in the above trees in Figure 5
and Figure 6
are represented in the below importance graphs, some variables are
considered important but omitted. This omission of important variables
can be caused by a number of reasons including but not limited to:
imp.tree1se <- reg.tree.bestcp$variable.importance
imp.treemin <- reg.tree.mincp$variable.importance
par(mfrow = c(1,2), oma = c(5,2,3,2))
barplot(imp.tree1se,
horiz = T,
main = list("1se cp Regression Tree Model", font = 1),
xlab = "Importance",
las = 1)
barplot(imp.treemin,
horiz = T,
main = list("Min cp Regression Tree Model", font = 1),
xlab = "Importance",
las = 1)
mtext("Variable Importance", side = 3, line = 1, outer = T, cex = 1.5, font = 2)
mtext("Figure 7:
Variable importance plots for both CART Regression tree models", side = 1, line = 1, outer = T)
CART classification was initiated utilizing the rpart() package utilizing the full model as a starting point. The regression tree was created utilizing the Class method and Gini impurity method indicating that the tree splits are completed utilizing splits to maximize the Gini gain. This Gini reduction approach follows the below equation to ensure child (inner & terminal) nodes are selected based on splits that maximize the Gini gain (purity) of each node.
\[ \underline{\mbox{Gini Impurity Reduction}} \\ \begin{align*} \mbox{Gini} &= 1 - \sum_{i = 1}^{k} p_{i}^{2} \ \ \ &\mbox{where} \ \ &p_{i} \ \ \mbox{is a proportion of class} \ i \\ \\ \Delta G &= \mbox{Gini} * S - \frac{n_L}{n} * \mbox{Gini}*S_L - \frac{n_R}{n}*\mbox{Gini}*S_r &\mbox{where} \ &n_L \ = \mbox{left node sample size} \\ &&&n_R = \mbox{right node sample size} \end{align*} \]
As the data was preemptively split into training and test data, the training data was utilized for 10-fold cross validation as part of the initial full model creation process. This cross validation ensures the produced model is the best fit for the training data as provided. Initially the hyperparameters were initiated as:
#create CART regression tree (initial full model)
class.tree <- rpart(Ob.Rank~., data = ob.train1, method = "class",
parms = list(split = "gini"),
control = rpart.control(
minsplit = 20,
minbucket = 10,
cp = 0.001, #reduction in complexity param needed for split
maxdepth = 5,
xval = 10
))
#plot full model class tree
fancyRpartPlot(class.tree, type = 5, tweak = 1.1, main = "Initial CART Gini Classification Tree" , caption = "Figure 8:
Initial CART Gini classification tree model (unpruned)")
With the initial CART classification tree model created above in
Figure 8
the model was then tuned to assign better fit
hyperparameter values ensuring improved fit and an overall reduction in
model size. The model’s CP values were compared with respect to the
cross validated error (xerror) and standard error of the cross validated
error (xstd) to identify new CP values for use with the initial model.
Both the CP associated with the minimum xerror (known as the min CP) and
the largest CP associated with the xerror within 1 standard error of the
minimum xerror value (known as the 1se CP value) were collected for
subsequent tuning.
|
|
The calculated CP values from the previous section were then fed in
to the model pruning function prune() to create two pruned CART
Gini classification tree models. The first tree model
(Figure 10
) was created using the 1se CP value while the
second (Figure 11
) was created using the minimum CP value.
As demonstrated by the below regression trees, the 1se method of tuning
the CP hyperparameter results in a more significantly reduced regression
tree model.
cp.class.table <- class.tree$cptable
#get values for pruned trees
xerror.min2 <- min(cp.class.table[,"xerror"]) #min xerror
#extract row with minimum xerror as standalone data frame, then pull CP value out of that row
min.cp.row2 <- which.min(cp.class.table[, "xerror"])
min.cp2 <- cp.class.table[min.cp.row2, "CP"] #min.cp for mincp tree
#get xstd value from minimum xerror row
xerror.xstd2 <- cp.class.table[min.cp.row2, "xstd"]
threshold2 <- xerror.min2 + xerror.xstd2 #add xstd value to min xerror to get 1 se from min
best.cp.row2 <- which(cp.class.table[,"xerror"] <= threshold2)[1] #pull row where xerror is closest to but below min(xerror) + 1se threshold
best.cp2 <- cp.class.table[best.cp.row2, "CP"] #pull CP value from row where xerror w.in 1se
class.bestcp <- prune(class.tree, cp = best.cp2) #prune tree with 1se parameter
class.mincp <- prune(class.tree, cp = min.cp2) #prune tree with min cp parameter
fancyRpartPlot(class.bestcp, type = 5, tweak = 1.02, main = "Pruned CART Gini classification Tree
CP = 1se", caption = paste("Figure 10:
Pruned CART Regression tree (1se model). CP =", round(best.cp2, 5)))
fancyRpartPlot(class.mincp, type = 5, tweak = 1.02, main = "Pruned CART Gini classification Tree
CP = min", caption = paste("Figure 11:
Pruned CART Regression tree (min model). CP =", round(min.cp2, 5)))
As demonstrated in the above Figure 10
, the 1se CART
Gini classification tree resulted in a model that incorporated the
following feature variables:
Meanwhile, as demonstrated in Figure 11
the minimum CP
CART Gini classification tree resulted in a model that incorporated the
following variables:
As the minimum CP model utilized a lower CP value, this model required less of a change in the Gini gain before nodes were split, thus resulting in a broader model that incorporates a larger number of feature variables.
Both the 1se and minimum CP CART Gini classification tree models were then utilized to generate prediction values for obesity ranks in utilizing the test data set. These predicted values were compared to the actual values in the test data set to assess for model accuracy and reliability. These CART Gini classification models were additionally compared to a multinomial regression model to determine their efficacy in relation to other more established statistical prediction methods. The multinomial model was comprised utilizing the variables as identified in the 1se CART regression tree model.
#preds from tree model (class)
obrank.class.best.lab <- predict(class.bestcp, newdata = ob.test1, type = "class")
obrank.class.best.prob <- predict(class.bestcp, newdata = ob.test1, type = "prob")[,2]
obrank.class.min.lab <- predict(class.mincp, newdata = ob.test1, type = "class")
obrank.class.min.prob <- predict(class.mincp, newdata = ob.test1, type = "prob")[,2]
#make logistic regression models
#model using variables from 1se tree
log.1se <- multinom(Ob.Rank ~ veg.day + FH.Overweight + gen.height + Age + NCP + CAEC + trans.age + Height + FAVC + MTRANS, data = ob.train1, trace=F)
#log model predictions
log.pred.lab <- predict(log.1se, newdata = ob.test1, type = "class")
log.pred.prob <- predict(log.1se, newdata = ob.test1, type = "prob")[,2]
#make ROC curves
ROC.class.1se <- roc(ob.test1$Ob.Rank, obrank.class.best.prob)
ROC.class.min <- roc(ob.test1$Ob.Rank, obrank.class.min.prob)
ROC.log.1se <- roc(ob.test1$Ob.Rank, log.pred.prob)
#sensitivities
roc.c1.sen <- ROC.class.1se$sensitivities
roc.cm.sen <- ROC.class.min$sensitivities
roc.log.sen <- ROC.log.1se$sensitivities
#1 - specificities
roc.c1.spe <- (1 - ROC.class.1se$specificities)
roc.cm.spe <- (1 - ROC.class.min$specificities)
roc.log.spe <- (1 - ROC.log.1se$specificities)
#AUC
roc.c1.auc <- ROC.class.1se$auc
roc.cm.auc <- ROC.class.min$auc
roc.log.auc <- ROC.log.1se$auc
#best cutoffs
c1.cp <- coords(ROC.class.1se, "best", ret = c("threshold", "sensitivity", "specificity"))
cm.cp <- coords(ROC.class.min, "best", ret = c("threshold", "sensitivity", "specificity"))
l1.cp <- coords(ROC.log.1se, "best", ret = c("threshold", "sensitivity", "specificity"))
cutpoffs.c <- NULL
cutoffs.c <- data.frame(
measure = c("Threshold", "Sensitivity", "Specificity", "AUC"),
c1.cp = c(c1.cp[,1], c1.cp[,2], c1.cp[,3], roc.c1.auc),
cm.cp = c(cm.cp[,1], cm.cp[,2], cm.cp[,3], roc.cm.auc),
l1.cp = c(l1.cp[,1], l1.cp[,2], l1.cp[,3], roc.log.auc)
)
In order to compare the efficacy of each CART Gini classification
tree and the multinomial regression model, an ROC plot was constructed
and area under the curve (AUC) values were calculated in the below
Figure 12
. Optimal cutoff points to maximize threshold,
sensitivity, and specificity were calculated and plotted in the below
ROC plot. These cutoff values are additionally listed in
Table 8
for ease of reference. While neither model proved
to incredibly reliable, the CART Gini classification tree with the 1se
CP value returned the best AUC value while retaining the highest
possible sensitivity and specificity.
par(oma = c(6,2,2,2), cex.main = 2)
plot(roc.c1.spe, roc.c1.sen, type = "l", xlim = c(0,1), xlab = "1 - Specificity (False Positive)", ylim = c(0,1), ylab = "Sensitivity (True Positive)", main = "ROC Curves") +
lines(roc.c1.spe, roc.c1.sen, col = "darkviolet") +
lines(roc.log.spe, roc.log.sen, col = "coral3") +
lines(roc.cm.spe, roc.cm.sen, col = "blue3") +
lines(x = c(0:1), y=c(0:1), col = "darkred", lty = 4)
FALSE integer(0)
points((1-c1.cp[1,3]), c1.cp[1,2], pch = 22, bg = adjustcolor("darkviolet", alpha = 0.5), cex = 1.75) +
points((1-cm.cp[1,3]), cm.cp[1,2], pch = 23, bg = adjustcolor("blue3", alpha = 0.5), cex = 1.5) +
points((1-l1.cp[1,3]), l1.cp[1,2], pch = 21, bg = adjustcolor("coral3", alpha = 0.5), cex = 1.25)
FALSE integer(0)
legend("bottomright", c(paste("Tree.1se ", "AUC =", round(roc.c1.auc, 4)),
paste("Tree.min ", "AUC =", round(roc.cm.auc, 4)),
paste("Logi.1se ", "AUC =", round(roc.log.auc, 4))),
col = c("darkviolet", "blue3", "coral3"),
lwd = 2,
lty = 1,
cex = 0.8,
bty = "o")
mtext("Figure 12:
ROC Comparrison of CART Gini classification trees & multinomial regression model", side = 1, line = 6, outside = T)
kable(cutoffs.c, align = c("l", "c", "c", "c"), col.names = c("Cutoff Details", "Class 1se Model", "Class Min Model", "Logistic Model 1se Vars"), caption = "Table 8:
<br>A summary of calculated best/maximum cutoff points and model AUC values.") %>%
kable_styling()
Cutoff Details | Class 1se Model | Class Min Model | Logistic Model 1se Vars |
---|---|---|---|
Threshold | 0.2582288 | 0.1588050 | 0.2118908 |
Sensitivity | 0.5178571 | 0.5714286 | 0.5535714 |
Specificity | 0.8412698 | 0.8253968 | 0.6507937 |
AUC | 0.6502268 | 0.6676587 | 0.5847506 |
To further improve upon the CART regression and classification trees created to predict obesity ranking, additional BAGGING algorithms were employed. Bootstrap aggregation with CART models (BAGGING) is a process that consists of repeatedly re-sampling the test data with replacement and calculating the mean for each sample collected. The distribution of these bootstrap means are then analysed to determine the regression coefficients to build the prediction model, with weighting considered for the frequency of these calculated means. Through this averaging of multiple models the BAGGING process reduces model variance without leading to increased bias. This approach introduces a layer of randomness to the model building process further preventing overfitting. The introduction of randomness in the model building process is essential to ensuring a model’s adaptability to new population data for prediction.
As the data was preemptively split into training and test data, the training data was utilized for 10-fold cross validation as part of the initial BAGGING regression model creation process. This cross validation ensures the produced model is the best fit for the training data as provided. Initially the hyperparameters were initiated as multiple values with a for loop to identify the best fit parameter values:
#CV control
ctrl.bag <- trainControl(method = "cv", number = 5, verboseIter = T)
#define params to test
nbagg.val <- c(10, 25)
cp.val <- c(0.01, 0.5)
maxdepth.val <- c(5, 10)
bag <- data.frame(NULL)
for (nbagg in nbagg.val) {
for (cp in cp.val) {
for (maxdepth in maxdepth.val) {
set.seed(1245)
model <- bagging(Ob.Rank ~., data = ob.train,
nbagg = nbagg,
coob = T,
trControl = ctrl.bag,
control = rpart.control(cp = cp,
maxdepth = maxdepth)
)
oob.error <- model$err
bag <- rbind(bag, data.frame(nbagg = nbagg,
cp = cp,
maxdepth = maxdepth,
oob.error = oob.error))
}
}
}
#pull out best values
best.bagg <- bag[which.min(bag$oob.error),]
kable(best.bagg, caption = "Table 9:
<br>A summary of best hyperparameter values.") %>%
kable_styling()
nbagg | cp | maxdepth | oob.error | |
---|---|---|---|---|
6 | 25 | 0.01 | 10 | 1.415986 |
Utilizing the best identified hyperparameter values as listed above
in Table 9
the initial BAGGING regression model was
trained. Once trained, the model was utilized to obtain model assessment
statistics such as RMSE, R squared and MAE. As demonstrated below in
Table 9
the BAGGING regression model has proven to be the
best fit of all five regression based models, returning the greatest
r\(^2\) value and lowest RMSE and MSE
values of all models compared.
#train final model
bag.final <- bagging(
Ob.Rank ~.,
ob.train,
nbagg = best.bagg$nbagg,
coob = T,
trControl = ctrl.bag,
control = rpart.control(cp = best.bagg$cp,
maxdepth = best.bagg$maxdepth),
importance = T
)
bagg.pred <- predict(bag.final, newdata = ob.test)
bag.error <- postResample(pred = bagg.pred, obs = ob.test$Ob.Rank)
join.error <- reg.errors[-1,]
join.error$bag.reg <- c(as.numeric(bag.error[1]), as.numeric(bag.error[3]), as.numeric(bag.error[2]))
kable(join.error, col.names = c("Asm. Measures", "1se CART Tree", "Min CP CART Tree", "1se Linear", "stepAIC Linear", "BAGGING Regression") , caption = "Table 9:
<br>Model assessment statistics") %>%
kable_styling()
Asm. Measures | 1se CART Tree | Min CP CART Tree | 1se Linear | stepAIC Linear | BAGGING Regression | |
---|---|---|---|---|---|---|
2 | Root Mean Square Error (RMSE) | 1.6670934 | 1.6459090 | 1.7920570 | 1.7003021 | 1.4262659 |
3 | Mean Absolute Error (MAE) | 1.2933708 | 1.2571838 | 1.5385074 | 1.4111261 | 1.0760547 |
4 | R Squared | 0.2762515 | 0.2996549 | 0.1662987 | 0.2512238 | 0.4795678 |
A summary of the variable importance for both the BAGGING regression
is included below in Figure 13
. Unlike what was seen with
the CART regression models, the BAGGING regression model places
significantly greater importance on all variables in the data set.
bag.import <- varImp(bag.final)
get.bagging.importance <- function(model) {
trees <- model$mtrees
imp <- numeric(length(trees[[1]]$btree$variable.importance))
names(imp) <- names(trees[[1]]$btree$variable.importance)
for(tree in trees){
imp[names(tree$btree$variable.importance)] <- imp[names(tree$btree$variable.importance)] + tree$btree$variable.importance
}
imp <- imp/length(trees)
return(imp)
}
bag.imp.sc <- sort(get.bagging.importance(bag.final), decreasing = T)
par(oma = c(5,2,3,2))
barplot(bag.imp.sc,
horiz = T,
main = list("Bagging Regression Tree Model", font = 1),
xlab = "Importance",
las = 1)
mtext("Variable Importance", side = 3, line = 1, outer = T, cex = 1.5, font = 2)
mtext("Figure 13:
Variable importance plots for BAGGING Regression tree model", side = 1, line = 1, outer = T)
As the data was preemptively split into training and test data, the training data was utilized for 10-fold cross validation as part of the initial BAGGING classification model creation process. This cross validation ensures the produced model is the best fit for the training data as provided. Initially the hyperparameters were initiated as multiple values with a for loop to identify the best fit parameter values:
baghp.grid <- expand.grid(
nbagg = c(25, 50),
minsplit = c(5, 10),
maxdepth = c(5, 10),
cp = c(0.01, 0.001)
)
bag.results <- data.frame(NULL)
bag.ac <- 0
bag.param <- list(NULL)
for(i in 1:nrow(baghp.grid)) {
params <- baghp.grid[i,]
rpart.control <- rpart.control(
minsplit = params$minsplit,
maxdepth = params$maxdepth,
cp = params$cp
)
bag.model <- bagging(Ob.Rank ~., ob.train1,
nbagg = params$nbagg,
coob = T,
control = rpart.control
)
bag.preds <- predict(bag.model, newdata = ob.test1)
bag.conf <- confusionMatrix(bag.preds, ob.test1$Ob.Rank)
accuracy <- bag.conf$overall["Accuracy"]
bag.results <- rbind(bag.results, data.frame(
nbagg = params$nbagg,
minsplit = params$minsplit,
maxdepth = params$maxdepth,
cp = params$cp,
Accuracy = accuracy
))
if(accuracy > bag.ac){
bag.ac <- accuracy
bag.param <- params
}
}
kable(bag.param, caption = "Table 10:
<br>A summary of best hyperparameter values.") %>%
kable_styling
nbagg | minsplit | maxdepth | cp | |
---|---|---|---|---|
16 | 50 | 10 | 10 | 0.001 |
The BAGGING classification model was then utilized to generate prediction values for obesity ranks in utilizing the test data set. These predicted values were compared to the actual values in the test data set to assess for model accuracy and reliability. This BAGGING classification model was additionally compared to the previously constructed CART Gini classification model and the previously utilized multinomial regression model to determine their efficacy in relation to other more established statistical prediction methods.
A confusion matrix of the predicted obesity rank values produced by
the BAGGING classification model compared to the test data values is
included below in Table 11
.
best.control <- rpart.control(
minsplit = bag.param$minsplit,
maxdepth = bag.param$maxdepth,
cp = bag.param$cp
)
final.bag.clas <- bagging(
Ob.Rank ~., ob.train1,
nbagg = bag.param$nbagg,
coob = T,
control = best.control
)
bag.c.preds <- predict(final.bag.clas, newdata = ob.test1)
final.conf <- confusionMatrix(bag.c.preds, ob.test1$Ob.Rank)
kable(final.conf$table, col.names = c("Predicted Values", "Actual Insf. Weight", "Actual Normal Weight", "Actual Obesity 1", "Actual Obesity 2", "Actual Obesity 3", "Actual Overweight 1", "Actual Overweight 2"), caption = "Table 11:
<br>Confusion matrix of BAGGING classification predicted obesity rank values.") %>%
kable_styling() %>%
scroll_box("100%")
Predicted Values | Actual Insf. Weight | Actual Normal Weight | Actual Obesity 1 | Actual Obesity 2 | Actual Obesity 3 | Actual Overweight 1 | Actual Overweight 2 |
---|---|---|---|---|---|---|---|
Insufficient_Weight | 50 | 5 | 0 | 0 | 0 | 4 | 2 |
Normal_Weight | 9 | 36 | 4 | 1 | 0 | 8 | 5 |
Obesity_Type_I | 2 | 5 | 60 | 0 | 0 | 4 | 4 |
Obesity_Type_II | 0 | 1 | 2 | 61 | 0 | 0 | 2 |
Obesity_Type_III | 0 | 2 | 1 | 0 | 59 | 0 | 0 |
Overweight_Level_I | 1 | 3 | 0 | 0 | 0 | 33 | 1 |
Overweight_Level_II | 1 | 4 | 7 | 3 | 0 | 2 | 41 |
#make ROC curves
roc.bag <- roc(ob.test1$Ob.Rank, as.numeric(bag.c.preds))
#sensitivities
bag.sen <- roc.bag$sensitivities
#1 - specificities
bag.spe <- (1 - roc.bag$specificities)
#AUC
bag.auc <- ROC.class.1se$auc
#best cutoffs
bag.cp <- coords(roc.bag, "best", ret = c("threshold", "sensitivity", "specificity"))
cutpoffs.b <- NULL
cutoffs.b <- data.frame(
measure = c("Threshold", "Sensitivity", "Specificity", "AUC"),
bag.cp = c(bag.cp[,1], bag.cp[,2], bag.cp[,3], bag.auc)
)
In order to compare the efficacy of the model, an ROC plot was
constructed and area under the curve (AUC) values were calculated in the
below Figure 13
. Optimal cutoff points to maximize
threshold, sensitivity, and specificity were calculated and plotted in
the below ROC plot. These cutoff values are additionally listed in
Table 12
for ease of reference. While neither model proved
to incredibly reliable, the BAGGING classification returned the best AUC
value while retaining the highest possible sensitivity and
specificity.
par(oma = c(6,2,2,2), cex.main = 2)
plot(roc.c1.spe, roc.c1.sen, type = "l", xlim = c(0,1), xlab = "1 - Specificity (False Positive)", ylim = c(0,1), ylab = "Sensitivity (True Positive)", main = "ROC Curves", col = "darkviolet") +
lines(roc.cm.spe, roc.cm.sen, col = "blue3") +
lines(roc.log.spe, roc.log.sen, col = "coral3") +
lines(bag.spe, bag.sen, col = "yellow4") +
lines(x = c(0:1), y = c(0,1), col = "darkred", lty = 3) +
points((1-c1.cp[1,3]), c1.cp[1,2], pch = 21, bg = adjustcolor("darkviolet", alpha = 0.5), cex = 1.75) +
points((1-cm.cp[1,3]), cm.cp[1,2], pch = 22, bg = adjustcolor("blue3", alpha = 0.5), cex = 1.5) +
points((1-l1.cp[1,3]), l1.cp[1,2], pch = 23, bg = adjustcolor("coral3", alpha = 0.5), cex = 1.25) +
points((1-bag.cp[1,3]), bag.cp[1,2], pch = 24, bg = adjustcolor("yellow4", alpha = 0.5), cex = 1.5)
FALSE integer(0)
legend("bottomright", c(paste("Tree.1se ", "AUC =", round(roc.c1.auc, 4)),
paste("Tree.min ", "AUC =", round(roc.cm.auc, 4)),
paste("Logi.1se ", "AUC =", round(roc.log.auc, 4)),
paste("BAGGING", "AUC =", round(bag.auc, 4))),
col = c("darkviolet", "blue3", "coral3", "yellow4"),
lwd = 2,
lty = 1,
cex = 0.8,
bty = "o")
mtext("Figure 13:
ROC Comparrison of BAGGING classification model,
CART Gini classification trees & multinomial
regression model", side = 1, line = 7.5, outside = T)
cutoffs <- cbind(cutoffs.c, cutoffs.b[2])
kable(cutoffs, align = c("l", "c", "c", "c", "c"), col.names = c("Cutoff Details", "Class 1se Model", "Class Min Model", "Logistic Model 1se Vars", "BAGGING Class Model"), caption = "Table 12:
<br>A summary of calculated best/maximum cutoff points & AUC values") %>%
kable_styling()
Cutoff Details | Class 1se Model | Class Min Model | Logistic Model 1se Vars | BAGGING Class Model |
---|---|---|---|---|
Threshold | 0.2582288 | 0.1588050 | 0.2118908 | 1.5000000 |
Sensitivity | 0.5178571 | 0.5714286 | 0.5535714 | 0.9107143 |
Specificity | 0.8412698 | 0.8253968 | 0.6507937 | 0.7936508 |
AUC | 0.6502268 | 0.6676587 | 0.5847506 | 0.6502268 |
Overall, the BAGGING regression and classification models produced the most reliable predictions for individuals’ obesity ratings when omitting weight data. This approach introduces randomness to the models as well, further improving the models’ performance as new test data is introduced. Additionally, treating the obesity rank values as categorical proved to yield more accurate prediction results, therefore when building future prediction models for obesity rank this approach should be incorporated.
While these models are far from perfect in their predictive capabilities, they provide a starting point to identify both individuals that are overweight and or obese that would not traditionally seek routine medical care due to financial or societal barriers through utilization of simple web surveys. Additionally, as weight was removed from these prediction models, they can also be utilized to identify individuals that may not yet be overweight or obese, but their behavior and dietary choices align with those that are overweight or obese, acting essentially as an obesity predictor. As weight is a sensitive subject for most adults, it is ideal that these prediction models do not include participant weight removing the potential for deceptive answers and encouraging survey completion. Through these identification and obesity prediction methods, communities can be targeted for outreach programs to increase access to care or targeted marketing to promote healthy lifestyles promotions (i.e. gym membership discounts from helth insurance providers).
Information
[1] https://www.cdc.gov/obesity/adult-obesity-facts/index.html
[2] https://www.cdc.gov/obesity/childhood-obesity-facts/childhood-obesity-facts.html
[3] https://www.who.int/news-room/fact-sheets/detail/obesity-and-overweight
[4] https://uconnruddcenter.org/research/weight-bias-stigma/
Data
[5] “Estimation of Obesity Levels Based On Eating Habits and Physical
Condition .” UCI Machine Learning Repository, 2019, https://doi.org/10.24432/C5H31Z.
(obtained from: https://archive.ics.uci.edu/dataset/544/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition)
https://www.sciencedirect.com/science/article/pii/S2352340919306985?via%3Dihub