library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
obesity <- read.csv("C:\\Users\\saisr\\Downloads\\statistics using R\\estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition\\obesity.csv"
)
# View the first few rows of the dataset
head(obesity)
## Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
## 1 Female 21 1.62 64.0 yes no 2 3
## 2 Female 21 1.52 56.0 yes no 3 3
## 3 Male 23 1.80 77.0 yes no 2 3
## 4 Male 27 1.80 87.0 no no 3 3
## 5 Male 22 1.78 89.8 no no 2 1
## 6 Male 29 1.62 53.0 no yes 2 3
## CAEC SMOKE CH2O SCC FAF TUE CALC MTRANS
## 1 Sometimes no 2 no 0 1 no Public_Transportation
## 2 Sometimes yes 3 yes 3 0 Sometimes Public_Transportation
## 3 Sometimes no 2 no 2 1 Frequently Public_Transportation
## 4 Sometimes no 2 no 2 0 Frequently Walking
## 5 Sometimes no 2 no 0 0 Sometimes Public_Transportation
## 6 Sometimes no 2 no 0 0 Sometimes Automobile
## NObeyesdad
## 1 Normal_Weight
## 2 Normal_Weight
## 3 Normal_Weight
## 4 Overweight_Level_I
## 5 Overweight_Level_II
## 6 Normal_Weight
We derive BMI from weight and height and explore its relationship with FAVC, the dataset's yes/no indicator of frequent high-calorie food consumption. BMI is calculated with the following formula:
\[ \text{BMI} = \frac{\text{Weight (kg)}}{\text{Height (m)}^2} \]
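As a quick sanity check of the formula, the sketch below applies it to the first three rows shown by head() above. It assumes that Height is already in meters (values such as 1.62 and 1.80 suggest this) and that Weight is in kilograms. The helper name `bmi` is hypothetical and not part of the original analysis.

```r
# Hypothetical helper: BMI from weight in kg and height in meters
bmi <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}

# First three rows from head(obesity)
weight <- c(64.0, 56.0, 77.0)
height <- c(1.62, 1.52, 1.80)

bmi_values <- bmi(weight, height)
round(bmi_values, 1)

# Plausible adult BMI values fall roughly between 10 and 80;
# values far outside that range usually signal a unit mix-up
stopifnot(all(bmi_values > 10 & bmi_values < 80))
```

A range check like the last line is a cheap way to catch a wrong unit conversion before any plots or statistics are built on the derived column.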
# Height is already recorded in meters (e.g. 1.62), so no unit conversion is needed
# Calculate BMI
obesity <- obesity %>%
  mutate(BMI = Weight / Height^2)
# View the first few rows with the new BMI column
head(obesity)
## Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
## 1 Female 21 1.62 64.0 yes no 2 3
## 2 Female 21 1.52 56.0 yes no 3 3
## 3 Male 23 1.80 77.0 yes no 2 3
## 4 Male 27 1.80 87.0 no no 3 3
## 5 Male 22 1.78 89.8 no no 2 1
## 6 Male 29 1.62 53.0 no yes 2 3
## CAEC SMOKE CH2O SCC FAF TUE CALC MTRANS
## 1 Sometimes no 2 no 0 1 no Public_Transportation
## 2 Sometimes yes 3 yes 3 0 Sometimes Public_Transportation
## 3 Sometimes no 2 no 2 1 Frequently Public_Transportation
## 4 Sometimes no 2 no 2 0 Frequently Walking
## 5 Sometimes no 2 no 0 0 Sometimes Public_Transportation
## 6 Sometimes no 2 no 0 0 Sometimes Automobile
## NObeyesdad BMI
## 1 Normal_Weight 24.38653
## 2 Normal_Weight 24.23823
## 3 Normal_Weight 23.76543
## 4 Overweight_Level_I 26.85185
## 5 Overweight_Level_II 28.34238
## 6 Normal_Weight 20.19509
FAVC: frequent consumption of high-caloric food (yes/no)
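# Plot BMI vs. FAVC (frequent consumption of high-caloric food)
# FAVC is a yes/no column, so it is coded as 0/1 to let geom_smooth() fit a line
ggplot(obesity, aes(x = as.numeric(FAVC == "yes"), y = BMI)) +
  geom_point() +
  geom_smooth(method = "lm", col = "red") +
  labs(title = "Scatterplot of BMI vs. FAVC",
       x = "FAVC (0 = no, 1 = yes)",
       y = "BMI")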
## `geom_smooth()` using formula = 'y ~ x'
Scatterplot Shape: The plot uses geom_point(), so each individual appears as one point. Because FAVC takes only two values (yes/no), the points fall into two vertical columns rather than a cloud. The comparison of interest is whether one column sits higher on the BMI axis than the other; heavily overlapping columns suggest little association between BMI and FAVC.
Linear Trend: The red line added by geom_smooth() is a fitted linear model that summarizes the overall trend. When FAVC is coded 0/1, the fitted line passes through the mean BMI of each group. An upward slope implies a positive association, meaning frequent high-calorie food consumption tends to go with higher BMI. A downward slope would suggest the opposite.
Outliers: Look for data points that lie far from the red line. These are individuals with a high or low BMI relative to others with the same FAVC value, which could stem from unique health or lifestyle factors. Outliers can distort a linear regression, so it’s important to identify them and consider their impact.
Interpretation: If the regression line slopes upward, the conclusion could be that frequent high-calorie food consumption is associated with higher BMI. If the two groups overlap heavily with no clear difference, FAVC alone isn’t a strong predictor of BMI, and other factors need to be considered.
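Because FAVC has only two values, the regression line in the plot amounts to a comparison of group means. The sketch below illustrates this on the six rows printed by head() earlier, with BMI computed under the assumption that Height is in meters. Six rows are far too few to draw conclusions; the point is only the mechanics.

```r
# First six rows from head(obesity); BMI assumes height in meters
favc   <- c("no", "no", "no", "no", "no", "yes")
weight <- c(64.0, 56.0, 77.0, 87.0, 89.8, 53.0)
height <- c(1.62, 1.52, 1.80, 1.80, 1.78, 1.62)
bmi    <- weight / height^2

# Mean BMI within each FAVC group
group_means <- tapply(bmi, favc, mean)
group_means

# With a 0/1 predictor, the fitted slope equals the difference in group means
favc_num <- as.numeric(favc == "yes")
fit      <- lm(bmi ~ favc_num)
slope    <- unname(coef(fit)["favc_num"])
slope
```

On the full dataset, the same comparison could be followed by a two-sample test such as t.test(BMI ~ FAVC, data = obesity) to judge whether the difference is larger than sampling noise.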
# FAF (physical activity frequency) is already stored as numbers from 0 to 3,
# so matching it against text labels such as "None" or "Low" would turn every value into NA.
# Round to the nearest whole level (in case of fractional values) so FAF can serve as a grouping variable
obesity <- obesity %>%
  mutate(FAF = round(FAF))
# View the first few values of the physical activity column
head(obesity$FAF)
## [1] 0 3 2 2 0 0
FAF: physical activity frequency; NObeyesdad: obesity level
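# Plot NObeyesdad (as an ordered rank) across FAF levels; a boxplot needs a numeric y
# Assumption: the seven category names below match the labels in NObeyesdad
obesity_levels <- c("Insufficient_Weight", "Normal_Weight",
                    "Overweight_Level_I", "Overweight_Level_II",
                    "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III")
ggplot(obesity, aes(x = factor(FAF),
                    y = as.numeric(factor(NObeyesdad, levels = obesity_levels)))) +
  geom_boxplot() +
  labs(title = "Boxplot of NObeyesdad by FAF",
       x = "FAF",
       y = "NObeyesdad (rank, low to high)")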
Distribution Across FAF Levels: Each box represents the distribution of NObeyesdad levels for different FAF categories. Check how the boxes are arranged vertically to see if there is a trend in obesity levels across different FAF frequencies.
Median Lines: The horizontal line inside each box indicates the median NObeyesdad level for that specific FAF category. Compare the median values across the FAF categories. If the median values decrease as FAF increases, this suggests that higher physical activity frequency is associated with lower obesity levels.
Interquartile Range (IQR): The height of each box (the IQR) represents the range of obesity levels for the middle 50% of individuals in each FAF category. Wider boxes suggest greater variability in NObeyesdad levels, while narrower boxes indicate more consistency.
Outliers: Points beyond the whiskers of a box are outliers, i.e. individuals whose obesity level differs markedly from others in the same FAF category. Count the outliers in each FAF category: many outliers at a given FAF level indicate that obesity levels vary widely among individuals with similar physical activity frequency.
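The medians and interquartile ranges read off the boxplot can also be computed directly. The sketch below does this for the six rows printed by head() earlier. The vector `obesity_rank` is a hypothetical ordinal coding introduced only for this illustration (2 = Normal_Weight, 3 = Overweight_Level_I, 4 = Overweight_Level_II).

```r
# Values taken from the first six rows of head(obesity);
# obesity_rank is a hypothetical ordinal coding of NObeyesdad
faf          <- c(0, 3, 2, 2, 0, 0)
obesity_rank <- c(2, 2, 2, 3, 4, 2)

# Median and interquartile range of the obesity rank within each FAF level
medians <- tapply(obesity_rank, faf, median)
iqrs    <- tapply(obesity_rank, faf, IQR)

data.frame(FAF    = names(medians),
           median = as.vector(medians),
           IQR    = as.vector(iqrs))
```

A table like this makes it easier to compare groups than judging box positions by eye, especially when several FAF levels have similar medians.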
Because the FAVC column is categorical, we convert it to a numeric column by coding “yes” as 1 and “no” as 0.
# Convert FAVC (Frequent consumption of high-caloric food) to numeric: Yes = 1, No = 0
obesity$FAVC_numeric <- ifelse(obesity$FAVC == "yes", 1, 0)
# Pearson correlation for BMI and the numeric version of FAVC
cor_bmi_calories <- cor(obesity$BMI, obesity$FAVC_numeric, method = "pearson")
cor_bmi_calories
## [1] 0.2460967
Numeric Conversion of FAVC:
We converted the FAVC variable to a numeric format where “yes” is coded as 1 and “no” as 0, so 1 marks frequent consumption of high-caloric food and 0 marks infrequent consumption. With a 0/1 variable, the Pearson coefficient is the point-biserial correlation. The value of about 0.25 indicates a weak positive association between frequent high-calorie food consumption and BMI.
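To make the meaning of a Pearson correlation with a 0/1 variable concrete, the sketch below checks on the six rows printed by head() that cor() agrees with the point-biserial formula, which is built from the two group means. BMI is computed under the assumption that Height is in meters, and all variable names are illustrative.

```r
# First six rows from head(obesity); BMI assumes height in meters
weight   <- c(64.0, 56.0, 77.0, 87.0, 89.8, 53.0)
height   <- c(1.62, 1.52, 1.80, 1.80, 1.78, 1.62)
bmi      <- weight / height^2
favc_num <- c(0, 0, 0, 0, 0, 1)   # "yes" = 1, "no" = 0

# Pearson correlation as computed in the analysis
r_pearson <- cor(bmi, favc_num, method = "pearson")

# Point-biserial formula: scaled difference between the two group means
n   <- length(bmi)
n1  <- sum(favc_num == 1)
n0  <- n - n1
m1  <- mean(bmi[favc_num == 1])
m0  <- mean(bmi[favc_num == 0])
s_n <- sqrt(mean((bmi - mean(bmi))^2))   # standard deviation with divisor n
r_pb <- (m1 - m0) / s_n * sqrt(n1 * n0 / n^2)

all.equal(r_pearson, r_pb)
```

The sign of the coefficient therefore only says which group has the higher mean BMI, and its size reflects how large that gap is relative to the overall spread.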
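Here we convert the obesity column to numeric using factor() and as.numeric().
# Convert NObeyesdad (obesity level) to a numeric variable. factor() sorts levels alphabetically
# by default, so the order is given explicitly (assumption: these names match the dataset labels)
obesity_levels <- c("Insufficient_Weight", "Normal_Weight",
                    "Overweight_Level_I", "Overweight_Level_II",
                    "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III")
obesity$Obesity_numeric <- as.numeric(factor(obesity$NObeyesdad, levels = obesity_levels))
# Check if conversion worked
head(obesity$Obesity_numeric)
## [1] 2 2 2 3 4 2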
# Spearman correlation between Obesity Level (numeric) and Physical Activity Level (FAF)
# cor() returns NA if either input contains missing values, so FAF must be numeric and free of NA here
cor_obesity_activity <- cor(obesity$Obesity_numeric, obesity$FAF, method = "spearman")
# Print the result
cor_obesity_activity
Numeric Conversion of NObeyesdad:
We converted the NObeyesdad variable to a numeric format using as.numeric(factor(obesity$NObeyesdad)), which assigns each obesity category an integer code. By default factor() orders its levels alphabetically, so the codes follow the order of severity (from Normal_Weight up through the overweight levels) only when the levels are supplied explicitly. With correctly ordered codes, a rank-based measure such as the Spearman correlation is meaningful.
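The effect of level ordering is easy to see on a small vector. In the sketch below, `Obesity_Type_I` is an assumed category name, used only as an example of a label that sorts alphabetically before the overweight levels; the other three labels appear in the output above.

```r
# Obesity_Type_I is an assumed label used for illustration
x <- c("Normal_Weight", "Overweight_Level_I", "Overweight_Level_II", "Obesity_Type_I")

# Default: levels are sorted alphabetically, so Obesity_Type_I gets code 2
default_codes <- as.numeric(factor(x))

# Explicit levels in increasing order of severity
ordered_levels <- c("Normal_Weight", "Overweight_Level_I",
                    "Overweight_Level_II", "Obesity_Type_I")
explicit_codes <- as.numeric(factor(x, levels = ordered_levels))

rbind(default_codes, explicit_codes)
```

With the default ordering, the overweight levels receive higher codes than the obesity class, which would distort any rank-based statistic computed from them.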
We compute confidence intervals for one variable from each pair: BMI (from the first pair, BMI vs. FAVC) and Obesity Level (from the second pair, Obesity Level vs. Physical Activity Level). For a continuous variable like BMI, we use a confidence interval based on the sample mean. For an ordinal variable like Obesity Level, the interval is less straightforward, but we can still summarize the distribution with a confidence interval for a proportion, for example the proportion of individuals in a given obesity class.
# Height is already recorded in meters, so BMI is computed from it directly
obesity <- obesity %>%
  mutate(BMI = Weight / Height^2)
# View the first few BMI values
head(obesity$BMI)
## [1] 24.38653 24.23823 23.76543 26.85185 28.34238 20.19509
# Calculate the mean and standard error of BMI
mean_bmi <- mean(obesity$BMI, na.rm = TRUE)
sd_bmi <- sd(obesity$BMI, na.rm = TRUE)
n_bmi <- sum(!is.na(obesity$BMI))
se_bmi <- sd_bmi / sqrt(n_bmi)
# Calculate the 95% confidence interval for the mean BMI
ci_bmi_lower <- mean_bmi - 1.96 * se_bmi
ci_bmi_upper <- mean_bmi + 1.96 * se_bmi
ci_bmi <- c(ci_bmi_lower, ci_bmi_upper)
ci_bmi
## [1] 29.35840 30.04192
Conclusion Based on the Confidence Interval - Interpretation of the Confidence Interval: The output variable ci_bmi contains two values, ci_bmi_lower and ci_bmi_upper, which are the lower and upper bounds of the 95% confidence interval for the mean BMI.
The confidence interval gives a range of plausible values for the population mean BMI, based on the sample data. If the interval lies within the healthy range (generally considered to be 18.5 to 24.9), the population mean is likely in a healthy weight range. Conversely, if it lies in the overweight or obese categories (25.0 and above), it may point to a public health concern related to weight management and associated health risks.
Considerations: The width of the confidence interval reflects the precision of the mean estimate. A narrower interval suggests greater precision, while a wider interval results from more variability in the BMI data or a smaller sample. Larger sample sizes generally lead to more accurate estimates and narrower confidence intervals.
Further Investigations: If the confidence interval suggests a concerning mean BMI, further investigation into the dietary habits, physical activity levels, and other lifestyle factors of the population could provide additional insights. Comparing these results with national averages or historical data might also be beneficial for assessing trends in obesity and health risks.
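The interval above uses the normal critical value 1.96. A common alternative is the t-based interval, which is slightly wider for small samples and practically identical for large ones. The sketch below is one way to write it; the function name `mean_ci` is hypothetical, and the six BMI values come from the rows printed by head(), assuming Height is in meters.

```r
# Hypothetical helper: t-based confidence interval for a mean
mean_ci <- function(x, level = 0.95) {
  x      <- x[!is.na(x)]
  n      <- length(x)
  se     <- sd(x) / sqrt(n)
  t_crit <- qt(1 - (1 - level) / 2, df = n - 1)
  mean(x) + c(-1, 1) * t_crit * se
}

# First six rows from head(obesity); BMI assumes height in meters
weight <- c(64.0, 56.0, 77.0, 87.0, 89.8, 53.0)
height <- c(1.62, 1.52, 1.80, 1.80, 1.78, 1.62)
bmi    <- weight / height^2

ci_t       <- mean_ci(bmi)
ci_builtin <- as.numeric(t.test(bmi)$conf.int)   # same interval from base R

ci_t
all.equal(ci_t, ci_builtin)
```

With thousands of observations the t critical value is very close to 1.96, so the choice matters mainly for small subgroups.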
# Count the number of individuals in the two overweight categories (Overweight_Level_I and Overweight_Level_II)
overweight_count <- sum(obesity$NObeyesdad %in% c("Overweight_Level_I", "Overweight_Level_II"), na.rm = TRUE)
# Total number of individuals (excluding NAs)
total_count <- sum(!is.na(obesity$NObeyesdad))
# Print counts for debugging
print(paste("Overweight Count:", overweight_count))
## [1] "Overweight Count: 580"
print(paste("Total Count:", total_count))
## [1] "Total Count: 2111"
# Calculate the proportion of overweight individuals
if (total_count > 0) {
  prop_overweight <- overweight_count / total_count
  # Calculate the standard error for the proportion
  se_overweight <- sqrt((prop_overweight * (1 - prop_overweight)) / total_count)
  # Calculate the 95% confidence interval for the proportion of overweight individuals
  ci_overweight_lower <- prop_overweight - 1.96 * se_overweight # Lower bound
  ci_overweight_upper <- prop_overweight + 1.96 * se_overweight # Upper bound
  # Output the results
  ci_overweight <- c(ci_overweight_lower, ci_overweight_upper)
  print(ci_overweight)
} else {
  print("Total count is zero, cannot calculate proportion or confidence interval.")
}
## [1] 0.2557087 0.2937939
Conclusion Based on the Confidence Interval - Interpretation of the Confidence Interval: The output variable ci_overweight contains the lower and upper bounds of the 95% confidence interval for the proportion of overweight individuals. With 580 of 2111 individuals in the two overweight categories (about 27.5%), the interval runs from about 25.6% to 29.4%. This proportion counts only Overweight_Level_I and Overweight_Level_II, not any separate obesity classes.
Implications for the Population: This confidence interval quantifies how prevalent overweight is in the population the sample represents. A high proportion may indicate a public health issue related to weight management and associated health risks such as diabetes, cardiovascular diseases, and metabolic syndrome.
Considerations: The width of the confidence interval indicates the precision of the proportion estimate. A narrow interval suggests a more precise estimate, while a wider interval indicates greater uncertainty about the true proportion. The sample size plays a critical role; larger sample sizes typically yield more accurate estimates and narrower confidence intervals.
Further Investigations: If the confidence interval suggests a concerning level of overweight individuals, further investigation into the factors contributing to overweight status could be beneficial. It may also be helpful to compare these findings with national averages or historical data on obesity trends to assess whether this sample reflects broader societal trends.
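The interval computed above is the normal-approximation (Wald) interval. As a cross-check, the sketch below recomputes it from the printed counts (580 of 2111) and compares it with the Wilson score interval that base R's prop.test() returns when the continuity correction is switched off. The variable names are illustrative.

```r
# Counts printed earlier in the analysis
x <- 580    # individuals in Overweight_Level_I or Overweight_Level_II
n <- 2111   # total individuals

p_hat <- x / n

# Wald interval, as computed in the analysis (critical value 1.96)
se   <- sqrt(p_hat * (1 - p_hat) / n)
wald <- p_hat + c(-1, 1) * 1.96 * se

# Wilson score interval from base R (no continuity correction)
wilson <- as.numeric(prop.test(x, n, correct = FALSE)$conf.int)

rbind(wald, wilson)
```

For a sample of this size and a proportion far from 0 or 1, the two intervals should nearly coincide; the Wilson interval is generally the safer choice when the proportion is close to 0 or 1 or the sample is small.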