library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
obesity <- read.csv("C:\\Users\\saisr\\Downloads\\statistics using R\\estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition\\obesity.csv"
)

# View the first few rows of the dataset
head(obesity)
##   Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
## 1 Female  21   1.62   64.0                            yes   no    2   3
## 2 Female  21   1.52   56.0                            yes   no    3   3
## 3   Male  23   1.80   77.0                            yes   no    2   3
## 4   Male  27   1.80   87.0                             no   no    3   3
## 5   Male  22   1.78   89.8                             no   no    2   1
## 6   Male  29   1.62   53.0                             no  yes    2   3
##        CAEC SMOKE CH2O SCC FAF TUE       CALC                MTRANS
## 1 Sometimes    no    2  no   0   1         no Public_Transportation
## 2 Sometimes   yes    3 yes   3   0  Sometimes Public_Transportation
## 3 Sometimes    no    2  no   2   1 Frequently Public_Transportation
## 4 Sometimes    no    2  no   2   0 Frequently               Walking
## 5 Sometimes    no    2  no   0   0  Sometimes Public_Transportation
## 6 Sometimes    no    2  no   0   0  Sometimes            Automobile
##            NObeyesdad
## 1       Normal_Weight
## 2       Normal_Weight
## 3       Normal_Weight
## 4  Overweight_Level_I
## 5 Overweight_Level_II
## 6       Normal_Weight

Numeric Variable Pairs

Pair 1: Body Mass Index (BMI) and Calories Intake

We derive BMI from weight and height and explore its relationship with calorie intake, represented here by FAVC (frequent consumption of high-caloric food, recorded as yes or no). BMI is calculated using the following formula:

\[ \text{BMI} = \frac{\text{Weight (kg)}}{\text{Height (m)}^2} \]

  • Weight is in kilograms (kg)
  • Height is in meters (m)
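As a quick check of the formula, the sketch below applies it to the first three Weight and Height values shown in the head(obesity) output. The helper name bmi() is hypothetical and introduced only for illustration.

```r
# Hypothetical helper: BMI from weight in kilograms and height in meters
bmi <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}

# First three rows of the head(obesity) output
weights <- c(64, 56, 77)
heights <- c(1.62, 1.52, 1.80)

bmi_values <- bmi(weights, heights)
round(bmi_values, 2)

# A height recorded in centimeters would need converting first (162 cm -> 1.62 m)
bmi_from_cm <- bmi(64, 162 / 100)
bmi_from_cm
```

Adult BMI values normally fall somewhere between the teens and the fifties, so results that are orders of magnitude larger usually point to a unit mismatch in Height.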

Deriving BMI variable

# Height is already recorded in meters (e.g. 1.62), so it is used directly
obesity <- obesity %>%
  mutate(BMI = Weight / Height^2)

# View the first few rows with the new BMI column
head(select(obesity, Height, Weight, BMI))
##   Height Weight      BMI
## 1   1.62   64.0 24.38653
## 2   1.52   56.0 24.23823
## 3   1.80   77.0 23.76543
## 4   1.80   87.0 26.85185
## 5   1.78   89.8 28.34238
## 6   1.62   53.0 20.19509

Visualization for pair 1

FAVC: frequent consumption of high-caloric food (yes/no), used as a proxy for calorie intake

# Plot BMI against FAVC (proxy for calorie intake)
ggplot(obesity, aes(x = FAVC, y = BMI)) +
  geom_point() +
  # group = 1 fits a single line across both FAVC categories
  geom_smooth(aes(group = 1), method = "lm", col = "red") +
  labs(title = "Scatterplot of BMI vs. FAVC", 
       x = "FAVC", 
       y = "BMI")
## `geom_smooth()` using formula = 'y ~ x'

Conclusions from the plot:

  • Scatterplot Shape: FAVC takes only two values (yes and no), so the points fall in two vertical strips rather than a cloud. The relationship shows up as a difference in the position and spread of BMI between the two strips. Strips that overlap almost completely suggest little association between BMI and FAVC.

  • Linear Trend: The red line added by geom_smooth() is a fitted linear model. With a two-level predictor, its slope is the difference in mean BMI between the two groups. A line that rises from no to yes implies a positive association, meaning frequent high-calorie food consumption goes with higher BMI. A falling line implies the opposite.

  • Outliers: Points far from the red line are individuals whose BMI is unusually high or low for their FAVC group, which could stem from other health or lifestyle factors. Outliers can pull the fitted line, so it is important to identify them and check their influence.

  • Interpretation: An upward-sloping line supports a positive relationship between frequent high-calorie food consumption and BMI. If the two strips overlap heavily, FAVC alone is not a strong predictor of BMI, and other factors need to be considered.
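The link between the fitted line and the group means can be checked directly. The sketch below uses a small made-up data frame (not the obesity data; the names toy, favc and bmi are illustrative) to show that, for a predictor coded 0/1, the regression slope equals the difference in group means.

```r
# Toy data (hypothetical values): binary predictor and numeric response
toy <- data.frame(
  favc = c("no", "no", "no", "yes", "yes", "yes"),
  bmi  = c(21, 23, 22, 26, 28, 27)
)
toy$favc_numeric <- ifelse(toy$favc == "yes", 1, 0)

# Slope of the simple linear regression of bmi on the 0/1 predictor
fit <- lm(bmi ~ favc_numeric, data = toy)
slope <- unname(coef(fit)["favc_numeric"])

# Difference in mean bmi between the "yes" and "no" groups
group_means <- tapply(toy$bmi, toy$favc, mean)
mean_diff <- unname(group_means["yes"] - group_means["no"])

c(slope = slope, mean_diff = mean_diff)
```

Both values agree, which is why comparing the centers of the two strips conveys the same information as the slope of the red line.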

Pair 2: Obesity Level (Response Variable) and Physical Activity (Explanatory Variable)

# FAF (physical activity frequency) is already numeric: 0 = none, 1 = low, 2 = medium, 3 = high.
# Round it to whole levels so it can serve as a grouping variable.
obesity <- obesity %>%
  mutate(FAF = round(FAF))

# View the first few rows of the physical activity and obesity level columns
head(select(obesity, FAF, NObeyesdad))
##   FAF          NObeyesdad
## 1   0       Normal_Weight
## 2   3       Normal_Weight
## 3   2       Normal_Weight
## 4   2  Overweight_Level_I
## 5   0 Overweight_Level_II
## 6   0       Normal_Weight

Visualization for pair 2

  • FAF: physical activity frequency
  • NObeyesdad: obesity level

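# Obesity levels in order of increasing severity (assumes the dataset's seven standard labels)
obesity_levels <- c("Insufficient_Weight", "Normal_Weight",
                    "Overweight_Level_I", "Overweight_Level_II",
                    "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III")

# Boxplot of obesity level (as a severity rank from 1 to 7) for each FAF level
ggplot(obesity, aes(x = factor(FAF),
                    y = as.numeric(factor(NObeyesdad, levels = obesity_levels)))) +
  geom_boxplot() +
  labs(title = "Boxplot of NObeyesdad by FAF",
       x = "FAF",
       y = "NObeyesdad (severity rank)")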

Conclusions from the plot:

  • Distribution Across FAF Levels: Each box shows the distribution of NObeyesdad levels within one FAF category. Comparing the vertical position of the boxes shows whether obesity levels shift as physical activity frequency changes.

  • Median Lines: The horizontal line inside each box is the median NObeyesdad level for that FAF category. Medians that decrease as FAF increases suggest that more frequent physical activity is associated with lower obesity levels.

  • Interquartile Range (IQR): The height of each box is the IQR, the range covering the middle 50% of individuals in that FAF category. Taller boxes indicate greater variability in NObeyesdad levels, while shorter boxes indicate more consistency.

  • Outliers: Points beyond the whiskers are outliers, that is, individuals whose obesity level differs markedly from others in the same FAF category. Many outliers at a particular FAF level indicate that obesity levels vary widely among individuals with similar physical activity frequencies.
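The quantities a boxplot draws can also be computed numerically, which is useful when boxes are hard to compare by eye. The sketch below uses made-up values (the vectors and the column names faf and obesity_rank are hypothetical) and base R only.

```r
# Toy data (hypothetical): obesity rank for two activity levels
toy <- data.frame(
  faf          = c(0, 0, 0, 0, 1, 1, 1, 1),
  obesity_rank = c(3, 4, 5, 6, 1, 2, 3, 4)
)

# Median and interquartile range of the rank within each activity level
medians <- tapply(toy$obesity_rank, toy$faf, median)
iqrs    <- tapply(toy$obesity_rank, toy$faf, IQR)
medians
iqrs

# boxplot.stats() reports the points a boxplot would draw beyond the whiskers
x <- c(2, 2, 3, 3, 3, 4, 4, 7)
outliers <- boxplot.stats(x)$out
outliers
```

In this toy example the median drops from 4.5 to 2.5 as activity rises while the IQR stays at 1.5, the pattern the median and IQR bullets describe.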

Correlation coefficients

Pair 1: BMI vs. Calories Intake (Pearson Correlation)

Because the FAVC column is categorical, we convert it to a numeric column, coding “yes” as 1 and “no” as 0.

# Convert FAVC (Frequent consumption of high-caloric food) to numeric: Yes = 1, No = 0
obesity$FAVC_numeric <- ifelse(obesity$FAVC == "yes", 1, 0)

# Pearson correlation for BMI and the numeric version of FAVC
cor_bmi_calories <- cor(obesity$BMI, obesity$FAVC_numeric, method = "pearson")
cor_bmi_calories
## [1] 0.2460967

Numeric Conversion of FAVC:

We converted the FAVC variable into a numeric format where “yes” is coded as 1 and “no” is coded as 0, so 1 indicates frequent consumption of high-caloric food and 0 indicates infrequent consumption. With a binary variable coded this way, the Pearson coefficient is the point-biserial correlation.

  • The correlation is positive (r ≈ 0.25), which indicates that individuals who frequently consume high-caloric foods (FAVC = 1) tend to have a higher BMI. The association is weak to moderate: FAVC accounts for only about 6% of the variation in BMI (r² ≈ 0.06). This is consistent with the expectation that a diet high in calories can contribute to weight gain, although a correlation alone does not establish causation.
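A single coefficient says nothing about its uncertainty. One way to add that, sketched here on made-up numbers rather than the obesity data, is cor.test() from base R's stats package, which returns the same Pearson estimate together with a p-value and a confidence interval. The sketch also checks that rescaling one variable leaves the coefficient unchanged.

```r
# Toy data (hypothetical values): 0/1 predictor and numeric response
favc_numeric <- c(0, 0, 0, 1, 1, 1)
bmi          <- c(21, 23, 22, 26, 28, 27)

# Pearson correlation, as in the analysis above
r <- cor(bmi, favc_numeric, method = "pearson")

# Same estimate, plus a p-value and a 95% confidence interval
test <- cor.test(bmi, favc_numeric, method = "pearson")
test$estimate
test$p.value
test$conf.int

# Pearson's r does not depend on the measurement scale of either variable
r_rescaled <- cor(bmi * 10000, favc_numeric, method = "pearson")
```

Because r is scale-invariant, a unit error in BMI would not change the correlation, even though it would change means and confidence intervals.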

Pair 2: Obesity Level vs. Physical Activity Level (Spearman Correlation)

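Here we convert the obesity level column to numeric using factor() and as.numeric(). The factor levels are listed explicitly in order of increasing severity, because the default alphabetical order would rank the Obesity types below the Overweight levels.

# Convert NObeyesdad (Obesity Level) to an ordinal numeric variable.
# Levels are in order of severity (assumes the dataset's seven standard labels);
# any label missing from this list becomes NA.
obesity_levels <- c("Insufficient_Weight", "Normal_Weight",
                    "Overweight_Level_I", "Overweight_Level_II",
                    "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III")
obesity$Obesity_numeric <- as.numeric(factor(obesity$NObeyesdad, levels = obesity_levels))

# Check if conversion worked
head(obesity$Obesity_numeric)
## [1] 2 2 2 3 4 2
# Spearman correlation between Obesity Level (numeric) and Physical Activity Level (FAF)
cor_obesity_activity <- cor(obesity$Obesity_numeric, obesity$FAF, method = "spearman")

# Print the result
cor_obesity_activity

Numeric Conversion of NObeyesdad:

We converted the NObeyesdad variable into a numeric format using as.numeric(factor(...)) with explicit levels, so each obesity category receives its rank in order of severity: “Insufficient_Weight” is 1, “Normal_Weight” is 2, and so on up to “Obesity_Type_III” at 7. This ordinal coding is what makes a rank correlation meaningful.

  • The Spearman correlation between Obesity_numeric and FAF measures how the ranks of obesity levels relate to the ranks of physical activity frequency. A negative coefficient would indicate that more frequent activity goes with lower obesity levels. Note that cor() returns NA if either vector contains missing values, so FAF must be free of NA before this step, or use = "complete.obs" must be passed.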
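The effect of level order on as.numeric(factor(...)) can be seen on a handful of labels. In the sketch below, the four labels are a subset chosen for illustration and the vector names are hypothetical.

```r
# A few obesity labels, in arbitrary order
labels <- c("Normal_Weight", "Overweight_Level_I",
            "Obesity_Type_I", "Insufficient_Weight")

# Default factor(): levels are sorted alphabetically,
# so Obesity_Type_I (3) ranks below Overweight_Level_I (4)
alphabetical <- as.numeric(factor(labels))
alphabetical

# Explicit levels: codes follow increasing severity instead
severity_order <- c("Insufficient_Weight", "Normal_Weight",
                    "Overweight_Level_I", "Obesity_Type_I")
ordinal <- as.numeric(factor(labels, levels = severity_order))
ordinal
```

Spearman's coefficient depends only on ranks, so an alphabetical coding that places obesity below overweight would distort it.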

To build confidence intervals for the response variables in the dataset, we will focus on two response variables:

  • BMI (from the first pair: BMI vs. Calories Intake)
  • Obesity Level (from the second pair: Obesity Level vs. Physical Activity Level)

For continuous variables like BMI, we will use a confidence interval based on the sample mean. For ordinal variables like Obesity Level, the interval is less straightforward, but we can still estimate the central tendency with a confidence interval based on mean ranks or proportions (for example, using the proportion of individuals in each obesity class).

Pair 1: Confidence Interval for BMI

# Recalculate BMI; Height is already in meters, so it is used directly
obesity <- obesity %>%
  mutate(BMI = Weight / Height^2)

# Calculate the mean and standard error of BMI
mean_bmi <- mean(obesity$BMI, na.rm = TRUE)
sd_bmi <- sd(obesity$BMI, na.rm = TRUE)
n_bmi <- sum(!is.na(obesity$BMI))
se_bmi <- sd_bmi / sqrt(n_bmi)

# Calculate the 95% confidence interval for the mean BMI
ci_bmi_lower <- mean_bmi - 1.96 * se_bmi
ci_bmi_upper <- mean_bmi + 1.96 * se_bmi

ci_bmi <- c(ci_bmi_lower, ci_bmi_upper)
ci_bmi
## [1] 29.35840 30.04192

Conclusion Based on the Confidence Interval - Interpretation of the Confidence Interval: The output variable ci_bmi contains two values, ci_bmi_lower and ci_bmi_upper, which are the lower and upper bounds of the 95% confidence interval for the mean BMI: roughly 29.36 to 30.04.

  • Implications for the Population: A healthy BMI is generally considered to be 18.5 to 24.9, and values of 25.0 and above fall in the overweight or obese categories. The whole interval lies well above 25 and reaches the usual obesity threshold of 30, so the mean BMI of the population this sample represents sits at the upper end of the overweight range. This may suggest a public health concern related to weight management and associated health risks.

  • Considerations: The width of the confidence interval reflects the precision of the mean estimate. Here the interval is less than 0.7 BMI units wide, which indicates a precise estimate. Sample size matters as well: larger samples generally give more accurate estimates and narrower confidence intervals.

  • Further Investigations: Because the interval points to a concerning mean BMI, further investigation into the dietary habits, physical activity levels, and other lifestyle factors of the population could provide additional insights. Comparing these results with national averages or historical data might also help in assessing trends in obesity and health risks.
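The interval above uses the normal critical value 1.96. A common alternative is the t-based interval, which is slightly wider for small samples and nearly identical for large ones. The sketch below compares the two on made-up values (the vector x is hypothetical) and checks the manual t interval against t.test().

```r
# Toy sample (hypothetical BMI-like values)
x <- c(22.1, 24.5, 27.3, 30.2, 31.8, 26.4, 29.9, 35.0)
n <- length(x)
se <- sd(x) / sqrt(n)

# Normal-approximation interval (qnorm(0.975) is about 1.96)
z_ci <- mean(x) + c(-1, 1) * qnorm(0.975) * se

# t-based interval with n - 1 degrees of freedom
t_ci <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * se

# t.test() reports the same t-based interval
t_test_ci <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

rbind(z_ci, t_ci, t_test_ci)
```

With only eight observations the t interval is visibly wider. With a couple of thousand observations the two critical values differ only in the third decimal place, so the choice has little practical effect.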

Pair 2: Confidence Interval for Obesity Level

# Count the individuals in the two overweight categories (Overweight_Level_I and Overweight_Level_II)
overweight_count <- sum(obesity$NObeyesdad %in% c("Overweight_Level_I", "Overweight_Level_II"), na.rm = TRUE)

# Total number of individuals (excluding NAs)
total_count <- sum(!is.na(obesity$NObeyesdad))

# Print the counts as a sanity check
print(paste("Overweight Count:", overweight_count))
## [1] "Overweight Count: 580"
print(paste("Total Count:", total_count))
## [1] "Total Count: 2111"
# Calculate the proportion of overweight individuals
if (total_count > 0) {
    prop_overweight <- overweight_count / total_count

    # Calculate the standard error for the proportion
    se_overweight <- sqrt((prop_overweight * (1 - prop_overweight)) / total_count)

    # Calculate the 95% confidence interval for the proportion of overweight individuals
    ci_overweight_lower <- prop_overweight - 1.96 * se_overweight  # Lower bound
    ci_overweight_upper <- prop_overweight + 1.96 * se_overweight  # Upper bound

    # Output the results
    ci_overweight <- c(ci_overweight_lower, ci_overweight_upper)
    print(ci_overweight)
} else {
    print("Total count is zero, cannot calculate proportion or confidence interval.")
}
## [1] 0.2557087 0.2937939

Conclusion Based on the Confidence Interval - Interpretation of the Confidence Interval: The output variable ci_overweight contains the lower and upper bounds of the 95% confidence interval for the proportion of overweight individuals. In the sample, 580 of 2111 individuals (about 27.5%) fall in the two overweight categories, and the interval runs from about 25.6% to 29.4%.

  • Implications for the Population: The interval indicates that between roughly a quarter and three in ten individuals in the population this sample represents are overweight. This figure counts only Overweight_Level_I and Overweight_Level_II; the obesity categories are excluded, so the share of individuals above a normal weight is higher. A prevalence of this size may indicate a public health issue related to weight management and associated health risks such as diabetes, cardiovascular diseases, and metabolic syndrome.

  • Considerations: The width of the confidence interval indicates the precision of the proportion estimate. Here the interval is about 3.8 percentage points wide, which is fairly precise. The sample size plays a critical role: larger sample sizes typically yield more accurate estimates and narrower confidence intervals.

  • Further Investigations: Given this level of overweight prevalence, further investigation into the factors contributing to overweight status could be beneficial. It may also be helpful to compare these findings with national averages or historical data on obesity trends to assess whether this sample reflects broader societal trends.
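The interval computed above is the normal-approximation (Wald) interval. As a cross-check, the sketch below recomputes it from the reported counts (580 of 2111) and compares it with two alternatives from base R's stats package: the Wilson score interval from prop.test() with correct = FALSE, and the exact Clopper-Pearson interval from binom.test(). The variable names are illustrative.

```r
# Counts reported in the analysis above
overweight_count <- 580
total_count <- 2111

# Wald (normal-approximation) interval, as computed by hand above
p_hat <- overweight_count / total_count
se <- sqrt(p_hat * (1 - p_hat) / total_count)
wald_ci <- p_hat + c(-1, 1) * 1.96 * se

# Wilson score interval (no continuity correction)
wilson_ci <- as.numeric(
  prop.test(overweight_count, total_count, correct = FALSE)$conf.int
)

# Exact (Clopper-Pearson) interval
exact_ci <- as.numeric(binom.test(overweight_count, total_count)$conf.int)

rbind(wald_ci, wilson_ci, exact_ci)
```

With a sample this large and a proportion far from 0 or 1, the three intervals should agree closely, so the simpler Wald formula is adequate here. The alternatives matter more for small samples or extreme proportions.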

Future Questions for Investigation:

  • How do calorie intake, physical activity, and age interact to influence BMI? Are there synergistic or antagonistic effects that complicate the relationships observed?
  • Could socioeconomic status or psychological factors explain variability in physical activity levels or diet, especially among different age groups?
  • Would a longitudinal study reveal different insights into how these variables interact over time, especially concerning aging and physical activity?