Purpose of the Study

Obesity and overweight conditions are major public health concerns. This study aims to analyze the effectiveness of a 12-week weight loss program using a dataset collected from 35 participants.I will examine the relationships or impact between diet adherence, exercise intensity, sleep quality, and weight loss using statistical methods in R.


Load Required Libraries

# Load necessary libraries
library(ggplot2)  
library(dplyr)     
library(corrplot)
library(readxl)
library(naniar)
library(car)
library(gtsummary)

** Import Dataset **

data <- read_excel("Data_WeightLoss.xlsx")
head(data)
## # A tibble: 6 × 9
##   Participant Gender   Age BaselineWeight DietAdherence ExerciseIntensity
##         <dbl> <chr>  <dbl>          <dbl>         <dbl>             <dbl>
## 1           1 Male      39           92.6           9.6               6.6
## 2           2 Male      48           88.9           9.7               7  
## 3           3 Male      31           91.3           8.4               5.1
## 4           4 Female    35           70             7.6               5.9
## 5           5 Male      32           84.3           6.6               9.2
## 6           6 Female    29           76.6           9.7               6.2
## # ℹ 3 more variables: SleepQuality <dbl>, WeightLoss <dbl>, AfterWeight <dbl>

** EDA **

isnull <- colSums(is.na(data))
print(paste("Missing value count:", isnull))
## [1] "Missing value count: 0" "Missing value count: 0" "Missing value count: 0"
## [4] "Missing value count: 0" "Missing value count: 0" "Missing value count: 1"
## [7] "Missing value count: 0" "Missing value count: 0" "Missing value count: 0"
gg_miss_var(data,show_pct = TRUE)+
  labs(y = "Displaying the proportion of missings")

numeric_cols <- data[sapply(data, is.numeric)]
stats <- data.frame(
  Mean = sapply(numeric_cols, mean, na.rm = TRUE),
  Std = sapply(numeric_cols, sd, na.rm = TRUE),
  Min = sapply(numeric_cols, min, na.rm = TRUE),
  Q1 = sapply(numeric_cols, function(x) quantile(x, 0.25, na.rm = TRUE)),
  Median = sapply(numeric_cols, median, na.rm = TRUE),
  Q3 = sapply(numeric_cols, function(x) quantile(x, 0.75, na.rm = TRUE)),
  Max = sapply(numeric_cols, max, na.rm = TRUE),
  IQR = sapply(numeric_cols, IQR, na.rm = TRUE)
)

stats$Lower_Bound = stats$Q1 - 1.5*stats$IQR
stats$Upper_Bound = stats$Q3 + 1.5*stats$IQR

print(stats)
##                        Mean       Std  Min    Q1 Median    Q3  Max   IQR
## Participant       18.000000 10.246951  1.0  9.50   18.0 26.50 35.0 17.00
## Age               38.085714  7.875395 26.0 31.50   37.0 46.50 52.0 15.00
## BaselineWeight    84.434286  8.069694 70.0 78.45   84.3 89.85 99.5 11.40
## DietAdherence      8.242857  1.171797  6.2  7.25    8.4  9.20  9.9  1.95
## ExerciseIntensity  7.129412  1.164508  5.1  6.20    6.9  8.05  9.3  1.85
## SleepQuality       7.688571  1.265788  5.8  6.60    7.6  8.40  9.9  1.80
## WeightLoss         9.314286  1.334198  6.6  8.15    9.4 10.20 12.0  2.05
## AfterWeight       75.120000  8.436224 60.9 68.60   74.0 81.65 91.6 13.05
##                   Lower_Bound Upper_Bound
## Participant           -16.000      52.000
## Age                     9.000      69.000
## BaselineWeight         61.350     106.950
## DietAdherence           4.325      12.125
## ExerciseIntensity       3.425      10.825
## SleepQuality            3.900      11.100
## WeightLoss              5.075      13.275
## AfterWeight            49.025     101.225

Data Inspection and Observations

Upon inspecting the dataset, the following observations were made:

  • Missing values: The dataset has 1 missing value in the ExerciseIntensity column.

  • Gender column: The Gender column is in character format. It may be beneficial to convert this column into a numeric format for further analysis.

  • Participant column: The Participant column appears to be a serial number rather than a categorical or numeric variable of interest.

  • Outlier: As per the Lower_Bound and Upper_Bound of the each column shows that there is no outlier value

# Drop Participant col
data <- data %>% select(-Participant)
# Convert Gender to factor
data$Gender <- as.factor(data$Gender)
head(data)
## # A tibble: 6 × 8
##   Gender   Age BaselineWeight DietAdherence ExerciseIntensity SleepQuality
##   <fct>  <dbl>          <dbl>         <dbl>             <dbl>        <dbl>
## 1 Male      39           92.6           9.6               6.6          8.1
## 2 Male      48           88.9           9.7               7            6.9
## 3 Male      31           91.3           8.4               5.1          7.6
## 4 Female    35           70             7.6               5.9          9.4
## 5 Male      32           84.3           6.6               9.2          7.9
## 6 Female    29           76.6           9.7               6.2          9.2
## # ℹ 2 more variables: WeightLoss <dbl>, AfterWeight <dbl>
# checking up the correlation with columns and convert the gender factor to numeric and store in "Gender_numeric" column
data$Gender_numeric <- as.numeric(factor(data$Gender))

cor_matrix <- cor(data[, c("WeightLoss", "BaselineWeight","Age", "AfterWeight", "DietAdherence", "ExerciseIntensity", "SleepQuality", "Gender_numeric")], use="complete.obs")

corrplot(cor_matrix, method="number", type="full",order = 'AOE',col= COL2('PuOr', 40),)

print(cor_matrix)
##                   WeightLoss BaselineWeight         Age  AfterWeight
## WeightLoss         1.0000000    -0.17557639 -0.32758865 -0.328599400
## BaselineWeight    -0.1755764     1.00000000  0.06870527  0.987492122
## Age               -0.3275886     0.06870527  1.00000000  0.118379318
## AfterWeight       -0.3285994     0.98749212  0.11837932  1.000000000
## DietAdherence      0.5823951    -0.10995272 -0.01491977 -0.198759837
## ExerciseIntensity  0.5855719     0.08816111 -0.08191880 -0.009203663
## SleepQuality       0.2370278    -0.06149738 -0.18388617 -0.096960379
## Gender_numeric     0.2818622     0.09047096  0.01148646  0.041653359
##                   DietAdherence ExerciseIntensity SleepQuality Gender_numeric
## WeightLoss           0.58239514       0.585571938   0.23702783     0.28186215
## BaselineWeight      -0.10995272       0.088161111  -0.06149738     0.09047096
## Age                 -0.01491977      -0.081918801  -0.18388617     0.01148646
## AfterWeight         -0.19875984      -0.009203663  -0.09696038     0.04165336
## DietAdherence        1.00000000       0.010902604  -0.06819197     0.10297619
## ExerciseIntensity    0.01090260       1.000000000  -0.23645092     0.12689491
## SleepQuality        -0.06819197      -0.236450924   1.00000000     0.01795168
## Gender_numeric       0.10297619       0.126894912   0.01795168     1.00000000

Interpretation of Correlation Results

From the correlation matrix, we observe that the WeightLoss attribute has significant correlations with several other numeric variables. Specifically:

  • Positive correlations:
    • DietAdherence and ExerciseIntensity show strong positive correlations with WeightLoss.
    • SleepQuality and Gender_numeric also exhibit positive correlations with WeightLoss.
  • Negative correlations:
    • BaselineWeight, AfterWeight and Age have negative correlations with WeightLoss, indicating that as these variables increase, weight loss tends to decrease.

Overall, we can conclude that DietAdherence and ExerciseIntensity are key factors positively influencing weight loss, while BaselineWeight and Age are negatively related.

Data Visualization

for (i in names(data)){
  if (i %in% c("WeightLoss","BaselineWeight","Age", "DietAdherence", "ExerciseIntensity", "SleepQuality")){
   print(
     ggplot(data, aes_string(x = i)) +
        geom_histogram(color = "black", bins = 5, alpha = 0.6)  +
        labs(title = paste("Histogram of ", i), x = i, y = "Frequency") +
        theme_minimal()
   )
  }
}

** Regression with threshold = 0.3 **

In here if i select threshold = 0.3 than columns will:

  • DietAdherence
  • ExerciseIntensity
  • Age
  • AfterWeight
model <- lm(WeightLoss ~ Age + AfterWeight + DietAdherence + ExerciseIntensity, data = data)

model %>%
  tbl_regression() %>%
  bold_labels() %>%
  bold_p(t=.1)
Characteristic Beta 95% CI p-value
Age -0.04 -0.07, -0.01 0.007
AfterWeight -0.03 -0.06, 0.00 0.044
DietAdherence 0.61 0.40, 0.81 <0.001
ExerciseIntensity 0.64 0.44, 0.85 <0.001
Abbreviation: CI = Confidence Interval
summary(model)
## 
## Call:
## lm(formula = WeightLoss ~ Age + AfterWeight + DietAdherence + 
##     ExerciseIntensity, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.62075 -0.31320 -0.05088  0.38025  1.37831 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.56128    1.71251   2.080  0.04651 *  
## Age               -0.04238    0.01474  -2.875  0.00749 ** 
## AfterWeight       -0.02999    0.01426  -2.103  0.04427 *  
## DietAdherence      0.60710    0.10029   6.053 1.38e-06 ***
## ExerciseIntensity  0.64168    0.10011   6.410 5.21e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6674 on 29 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.7822, Adjusted R-squared:  0.7521 
## F-statistic: 26.03 on 4 and 29 DF,  p-value: 3.117e-09
qqnorm(resid(model))
qqline(resid(model))

#plot(model$fitted.values, resid(model))
#abline(h=0, col="red")
print(vif(model))
##               Age       AfterWeight     DietAdherence ExerciseIntensity 
##          1.021073          1.055782          1.041316          1.006857
barplot(vif(model), main = "VIF Plot", col = "lightblue", ylim = c(0, max(vif(model)) + 10))
abline(h = 10, col = "red", lty = 2)

Finding on threshold = 0.3

My Model demonstrates exceptional statistical significance with an F-statistic of 26.03 and a p-value of 3.117e-09.With an R-squared value of 0.78 indicates that our model explains 78% of the variance in weight loss outcomes. [4][5]

In here the residuals range from -1.62 to 1.38, with a median close to zero, which is generally a good sign.

All Variance Inflation Factor (VIF) values in our model are below 1.06, this ensure that model coefficient estimates are stable and that model can confidently interpret the individual effects of each predictor on weight loss outcomes. [1][2][3]

The Q-Q plot of residuals demonstrates a normal distribution , further validating the model’s assumptions and strengthening the credibility of our findings. This normal distribution of residuals indicates that our model is well-specified and that the relationships between variables are appropriately captured.[6]

Characteristic Interpretation
Age As age increases, weight loss decreases slightly.
AfterWeight Higher after-weight slightly reduces weight loss.
DietAdherence Higher diet adherence significantly increases weight loss.
ExerciseIntensity Higher exercise intensity significantly increases weight loss.

** Regression with threshold = 0.2 **

In here if i select threshold = 0.2 than columns will:

  • DietAdherence
  • ExerciseIntensity
  • Age
  • AfterWeight
  • SleepQuality
  • Gender_numeric
model2 <- lm(WeightLoss ~ Age + AfterWeight + DietAdherence + ExerciseIntensity +SleepQuality+ Gender_numeric, data = data)

model2 %>%
  tbl_regression() %>%
  bold_labels() %>%
  bold_p(t=.1)
Characteristic Beta 95% CI p-value
Age -0.03 -0.05, -0.01 0.001
AfterWeight -0.03 -0.04, -0.01 0.004
DietAdherence 0.62 0.51, 0.74 <0.001
ExerciseIntensity 0.73 0.61, 0.85 <0.001
SleepQuality 0.40 0.29, 0.52 <0.001
Gender_numeric 0.39 0.11, 0.66 0.007
Abbreviation: CI = Confidence Interval
summary(model2)
## 
## Call:
## lm(formula = WeightLoss ~ Age + AfterWeight + DietAdherence + 
##     ExerciseIntensity + SleepQuality + Gender_numeric, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.81299 -0.18907 -0.04931  0.20005  0.81801 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.638318   1.210923  -1.353  0.18729    
## Age               -0.030569   0.008611  -3.550  0.00144 ** 
## AfterWeight       -0.025912   0.008216  -3.154  0.00393 ** 
## DietAdherence      0.624733   0.058015  10.768 2.83e-11 ***
## ExerciseIntensity  0.728287   0.059870  12.164 1.82e-12 ***
## SleepQuality       0.404399   0.057209   7.069 1.34e-07 ***
## Gender_numeric     0.388966   0.133572   2.912  0.00712 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3818 on 27 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.9336, Adjusted R-squared:  0.9189 
## F-statistic: 63.32 on 6 and 27 DF,  p-value: 1.231e-14
qqnorm(resid(model2))
qqline(resid(model2))

print(vif(model2))
##               Age       AfterWeight     DietAdherence ExerciseIntensity 
##          1.064681          1.070766          1.064953          1.100533 
##      SleepQuality    Gender_numeric 
##          1.129739          1.036881
barplot(vif(model2), main = "VIF Plot", col = "lightblue", ylim = c(0, max(vif(model2)) + 10),las = 2,      cex.names = 0.7)
abline(h = 10, col = "red", lty = 2)

Finding on threshold = 0.2

My Model demonstrates exceptional statistical significance with an F-statistic of 63.32 and ap-value: 1.231e-14.With anR-squared value of 0.93indicates that our model explains93%` of the variance in weight loss outcomes [4].The highly significant F-statistic (p < 0.001) provides robust evidence for the model’s overall validity.[5]

In here the residuals range from -0.81299 to 0.81801, with a median close to zero, which is generally a good sign.

All Variance Inflation Factor (VIF) values in our model are below 1.12 and below the threshold, this ensure that model coefficient estimates are stable and that model can confidently interpret the individual effects of each predictor on weight loss outcomes. [1][2][3]

The Q-Q plot of residuals demonstrates a normal distribution , further validating the model’s assumptions and strengthening the credibility of our findings. This normal distribution of residuals indicates that our model is well-specified and that the relationships between variables are appropriately captured.[6]

Characteristic Interpretation
Age As age increases, weight loss decreases slightly.
AfterWeight Higher after-weight slightly reduces weight loss.
Gender_numeric Indicates potential differences in weight loss outcomes between genders.
SleepQuality Moderate positive effect
DietAdherence Second strongest positive effect
ExerciseIntensity Strongest positive predictor of weight loss.

Final Finding

Model Comparison and Key Findings

Predictor Coefficient (Threshold 0.3) Coefficient (Threshold 0.2) p-value (Threshold 0.2) p-value (Threshold 0.3) Interpretation
Age -0.04238 -0.030569 0.00144 0.007 Older age decreases weight loss.
AfterWeight -0.02999 -0.025912 0.00393 0.044 Higher starting weight decreases weight loss.
DietAdherence 0.60710 0.624733 <0.001 <0.001 Higher diet adherence increases weight loss.
ExerciseIntensity 0.64168 0.728287 <0.001 <0.001 Higher exercise intensity increases weight loss.
SleepQuality N/A 0.404399 <0.001 N/A Better sleep quality increases weight loss.
Gender_numeric N/A 0.388966 0.00712 N/A Gender positively impacts weight loss.

VIF (Variance Inflation Factor):

All predictors have low VIF values (< 1.2), indicating no significant multicollinearity.

Statistical Significance:

  • DietAdherence and ExerciseIntensity are highly significant in both models.
  • Age and AfterWeight are significant predictors, with negative effects on weight loss.
  • SleepQuality and Gender show positive effects in the extended model.

Practical Implications:

  • Focus on improving DietAdherence and ExerciseIntensity for better weight loss outcomes.
  • Incorporate SleepQuality as part of the program for additional benefits.
  • Age and AfterWeight should be considered for personalized interventions.

Limitation:

  • Sample Size: With only 35 participants, the generalizability of the findings might be limited. Larger sample sizes are needed for more robust conclusions.
  • External Factors: The analysis only considers a few lifestyle factors, but other factors such as psychological health, stress, and external environmental influences were not accounted for.

Conclusion:

Both models emphasize the importance of DietAdherence and ExerciseIntensity for weight loss. The extended model with SleepQuality and Gender provides a more comprehensive understanding. Tailored programs based on these factors will be more effective in helping individuals achieve weight loss goals.