1. SUMMARY

The dataset used for this analysis is the UCI Obesity Dataset, available from the UCI Machine Learning Repository here. The dataset contains information on individual habits, conditions, and medical history, with the purpose of identifying risk factors associated with obesity. It includes features such as gender, age, family history, physical activity, and food consumption patterns. The documentation for the dataset is here.

Main goal

Seed question: What factors are most strongly associated with higher obesity levels?

Key variables

  • Age: Age of the participants
  • Gender: Gender of the participants
  • Family history: Family history of obesity
  • Physical activity level (FAF): light, moderate, or high activity level
  • Food consumption patterns: Types of food intake

Data inspection

  • Physical Activity Levels: This variable could help explore whether people with higher physical activity are less likely to get obese.
  • Family History of Obesity: This factor allows us to explore the hereditary component of obesity.
  • Daily Caloric Intake and Nutritional Habits: These fields could be highly predictive of obesity, as they track food consumption and diet quality.
  • Gender and Age: These demographic variables could also reveal patterns of obesity across different groups.

Visualization for further investigation

Two key aspects of the data that warrant further investigation are physical activity and family history of obesity. The data is loaded and prepared first.

library(ggplot2)

obesity <- read.csv("C:\\Users\\saisr\\Downloads\\statistics using R\\estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition\\obesity.csv"
)

# View the first few rows of the dataset
head(obesity)
##   Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
## 1 Female  21   1.62   64.0                            yes   no    2   3
## 2 Female  21   1.52   56.0                            yes   no    3   3
## 3   Male  23   1.80   77.0                            yes   no    2   3
## 4   Male  27   1.80   87.0                             no   no    3   3
## 5   Male  22   1.78   89.8                             no   no    2   1
## 6   Male  29   1.62   53.0                             no  yes    2   3
##        CAEC SMOKE CH2O SCC FAF TUE       CALC                MTRANS
## 1 Sometimes    no    2  no   0   1         no Public_Transportation
## 2 Sometimes   yes    3 yes   3   0  Sometimes Public_Transportation
## 3 Sometimes    no    2  no   2   1 Frequently Public_Transportation
## 4 Sometimes    no    2  no   2   0 Frequently               Walking
## 5 Sometimes    no    2  no   0   0  Sometimes Public_Transportation
## 6 Sometimes    no    2  no   0   0  Sometimes            Automobile
##            NObeyesdad
## 1       Normal_Weight
## 2       Normal_Weight
## 3       Normal_Weight
## 4  Overweight_Level_I
## 5 Overweight_Level_II
## 6       Normal_Weight

BMI is calculated and added as a new column.

# Check the column names
print(colnames(obesity))
##  [1] "Gender"                         "Age"                           
##  [3] "Height"                         "Weight"                        
##  [5] "family_history_with_overweight" "FAVC"                          
##  [7] "FCVC"                           "NCP"                           
##  [9] "CAEC"                           "SMOKE"                         
## [11] "CH2O"                           "SCC"                           
## [13] "FAF"                            "TUE"                           
## [15] "CALC"                           "MTRANS"                        
## [17] "NObeyesdad"
# Convert to numeric
obesity$Weight <- as.numeric(as.character(obesity$Weight))
obesity$Height <- as.numeric(as.character(obesity$Height))

# Check for NA values (the columns are named Weight and Height)
cat("NA in Weight:", sum(is.na(obesity$Weight)), "\n")
## NA in Weight: 0
cat("NA in Height:", sum(is.na(obesity$Height)), "\n")
## NA in Height: 0
# Remove rows with NA values
obesity <- na.omit(obesity)

# Calculate BMI: weight (kg) divided by height (m) squared
obesity$BMI <- obesity$Weight / (obesity$Height^2)

# Verify the calculation
head(obesity)
##   Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
## 1 Female  21   1.62   64.0                            yes   no    2   3
## 2 Female  21   1.52   56.0                            yes   no    3   3
## 3   Male  23   1.80   77.0                            yes   no    2   3
## 4   Male  27   1.80   87.0                             no   no    3   3
## 5   Male  22   1.78   89.8                             no   no    2   1
## 6   Male  29   1.62   53.0                             no  yes    2   3
##        CAEC SMOKE CH2O SCC FAF TUE       CALC                MTRANS
## 1 Sometimes    no    2  no   0   1         no Public_Transportation
## 2 Sometimes   yes    3 yes   3   0  Sometimes Public_Transportation
## 3 Sometimes    no    2  no   2   1 Frequently Public_Transportation
## 4 Sometimes    no    2  no   2   0 Frequently               Walking
## 5 Sometimes    no    2  no   0   0  Sometimes Public_Transportation
## 6 Sometimes    no    2  no   0   0  Sometimes            Automobile
##            NObeyesdad      BMI
## 1       Normal_Weight 24.38653
## 2       Normal_Weight 24.23823
## 3       Normal_Weight 23.76543
## 4  Overweight_Level_I 26.85185
## 5 Overweight_Level_II 28.34238
## 6       Normal_Weight 20.19509

1st visualization

Physical Activity and Obesity: Is there a noticeable trend where people with higher activity levels are less likely to be obese?

# Scatter plot of BMI vs. Physical Activity with a trend line
ggplot(obesity, aes(x = FAF, y = BMI)) +
  geom_point(alpha = 0.6, color = "blue") +  # Scatter points
  geom_smooth(method = "lm", se = FALSE, color = "red", linetype = "dashed") +  # Trend line
  labs(title = "Scatter Plot of BMI vs. Physical Activity", 
       x = "Physical Activity Frequency (FAF)", 
       y = "BMI") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Insights

  • The red dashed trend line, a linear fit drawn by geom_smooth(method = "lm"), slopes downward, suggesting a negative association between BMI and physical activity: as physical activity increases, BMI tends to decrease.

  • Outliers: Points far from the trend line correspond to individuals whose BMI is unusually high or low relative to their physical activity level.

  • Because this is an observational association, the plot is consistent with physical activity helping to reduce BMI, but it does not by itself show that it does.

correlation <- cor(obesity$FAF, obesity$BMI, method = "pearson")
print(correlation)
## [1] -0.1123485
  • The coefficient indicates a weak negative correlation.
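
A correlation coefficient on its own does not say whether the association could be due to chance. As a minimal sketch, cor.test() from base R reports the estimate together with a p-value and a confidence interval. The vectors below are hypothetical toy values standing in for obesity$FAF and obesity$BMI, not results from the dataset.

```r
# Hypothetical toy values standing in for obesity$FAF and obesity$BMI
faf <- c(0, 3, 2, 2, 0, 0, 1, 3, 1, 2)
bmi <- c(24.4, 24.2, 23.8, 26.9, 28.3, 20.2, 31.5, 22.1, 29.0, 25.3)

# Pearson correlation test; the null hypothesis is a true correlation of zero
result <- cor.test(faf, bmi, method = "pearson")

print(result$estimate)  # sample correlation coefficient
print(result$p.value)   # p-value for the null hypothesis
print(result$conf.int)  # 95% confidence interval for the correlation
```

With the real data, the same call would take obesity$FAF and obesity$BMI as its two arguments.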

2nd visualization

Family History and Obesity: Are individuals with a family history of obesity more likely to be obese themselves?

# Summarizing data for pie chart
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Store the summary in its own object so the original data frame is not overwritten
family_counts <- obesity %>%
  group_by(family_history_with_overweight, NObeyesdad) %>%
  summarise(Count = n(), .groups = "drop_last") %>%
  mutate(Share = Count / sum(Count)) %>%  # share within each family-history group
  ungroup()
# Creating the pie chart (shares make each facet a full pie)
ggplot(family_counts, aes(x = "", y = Share, fill = NObeyesdad)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +  # Convert bar chart to pie chart
  facet_wrap(~ family_history_with_overweight) +
  labs(title = "Obesity Levels by Family History of Obesity", fill = "Obesity Level") +
  theme_minimal()

Insights

  • Distribution of Obesity Levels: Each pie shows the proportions of the obesity levels (e.g., Normal, Overweight, Obese) within one category of family history of obesity (yes or no).
  • facet_wrap(~ family_history_with_overweight) splits the chart into one pie (facet) per family-history category.
  • This allows for direct visual comparison of obesity levels between those with a family history of obesity and those without.
  • Key Observations:
    • Family History Present: If the pie for individuals with a family history of obesity shows a larger “Obese” share than “Normal” or “Overweight”, it suggests that these individuals are more likely to be obese.
    • Family History Absent: Conversely, if the pie for individuals without a family history of obesity is dominated by the “Normal” category, that is consistent with family history, whether genetic or behavioral, playing a role in obesity risk.
  • Limitations:
    • Causation vs. Correlation: While the pie chart illustrates associations, it does not establish causation. Other factors (such as lifestyle, diet, socioeconomic status) may also contribute to obesity levels.
    • Sample Size: The reliability of the insights may depend on the sample size and how representative it is of the broader population.

Plan moving forward

Data Cleaning: Handle any missing or inconsistent data and ensure key variables are categorized correctly.

Exploratory Data Analysis (EDA): Perform EDA with detailed visualizations and summary statistics to confirm initial trends.

Hypothesis Testing: Conduct formal hypothesis tests to validate the relationships between obesity levels, physical activity, and family history.
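
One possible test for the family-history relationship is a chi-squared test of independence on the cross-tabulation of the two categorical variables. The sketch below uses a hypothetical table of counts with simplified level names. With the real data, the table would come from table(obesity$family_history_with_overweight, obesity$NObeyesdad).

```r
# Hypothetical counts: rows are family history, columns are obesity level
counts <- matrix(c(120,  80,  40,
                    60, 150, 300),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(family_history = c("no", "yes"),
                                 level = c("Normal", "Overweight", "Obese")))

# Chi-squared test of independence; the null hypothesis is that
# obesity level does not depend on family history
test <- chisq.test(counts)

print(test$statistic)           # chi-squared statistic
print(test$parameter)           # degrees of freedom: (rows - 1) * (cols - 1)
print(test$p.value)             # p-value for the null hypothesis
print(round(test$expected, 1))  # expected counts under independence
```

The expected counts are worth checking, because the chi-squared approximation becomes unreliable when some of them are small.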

Modelling: Consider building a logistic regression model to identify the key predictors of obesity within the dataset.
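
As a rough sketch of what such a model could look like, the example below fits a binary logistic regression with glm(). Several things here are assumptions rather than part of the analysis: the data are simulated, the outcome is_obese is a hypothetical 0/1 indicator (NObeyesdad has more than two levels, so it would first have to be collapsed, or a multinomial model used instead), and the coefficients in the simulation are arbitrary.

```r
set.seed(42)  # reproducible toy data
n <- 200

# Simulated predictors mirroring FAF and family history (hypothetical values)
toy <- data.frame(
  FAF = sample(0:3, n, replace = TRUE),
  family_history = factor(sample(c("no", "yes"), n, replace = TRUE))
)

# Arbitrary data-generating rule, used only so the example runs
log_odds <- -0.5 - 0.4 * toy$FAF + 1.2 * (toy$family_history == "yes")
toy$is_obese <- rbinom(n, size = 1, prob = plogis(log_odds))

# Logistic regression: log-odds of being obese as a linear function of the predictors
model <- glm(is_obese ~ FAF + family_history, data = toy, family = binomial)

print(summary(model)$coefficients)  # estimates, standard errors, z values, p-values
print(exp(coef(model)))             # coefficients expressed as odds ratios
```

An odds ratio below 1 for FAF would indicate lower odds of obesity at higher activity levels, and one above 1 for family_historyyes would indicate higher odds with a family history.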

2. INITIAL FINDINGS

Hypothesis 1: Higher levels of physical activity are associated with lower levels of obesity.

Rationale: Physical activity is a critical factor in energy balance and weight control. It’s commonly believed that individuals with higher physical activity levels are more likely to maintain a healthy weight, while those with sedentary lifestyles are at greater risk of obesity.

VISUALIZATION FOR HYPOTHESIS-1

library(ggplot2)
obesity <- read.csv("C:\\Users\\saisr\\Downloads\\statistics using R\\estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition\\obesity.csv"
) 

# Box plot of physical activity (FAF) for each obesity level
ggplot(obesity, aes(x=FAF, y=NObeyesdad, fill=NObeyesdad)) +
  geom_boxplot() +
  labs(title="Box Plot of Physical Activity by Obesity Level", 
       x="Physical Activity Level", y="Obesity levels") +
  theme_minimal() +
  theme(legend.position = "none")  # Hides the redundant legend

Insights

  • Box and Whisker Representation: Each box represents the interquartile range (IQR) of physical activity (FAF) for one obesity level. The box spans the 25th percentile (Q1) to the 75th percentile (Q3), and the line inside the box marks the median (Q2) physical activity for that obesity level.
  • The fill aesthetic is set to NObeyesdad, so each obesity level gets its own box color. Because the axis already identifies the obesity level, the legend is hidden.
  • The box plot allows quick comparison between groups: if the hypothesis holds, the boxes for higher obesity levels should sit at lower physical activity values.
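
The quantities a box plot draws can also be computed directly, which makes the comparison between groups explicit. The sketch below uses a small hypothetical data frame with the same column names as the dataset. The values are illustrative only.

```r
# Hypothetical toy data with the same column names as the dataset
toy <- data.frame(
  NObeyesdad = rep(c("Normal_Weight", "Overweight_Level_I", "Overweight_Level_II"),
                   each = 4),
  FAF = c(0, 3, 2, 3,   2, 1, 0, 2,   0, 0, 1, 1)
)

# Median and interquartile range of FAF within each obesity level
medians <- tapply(toy$FAF, toy$NObeyesdad, median)
iqrs    <- tapply(toy$FAF, toy$NObeyesdad, IQR)

faf_summary <- data.frame(level  = names(medians),
                          median = as.vector(medians),
                          IQR    = as.vector(iqrs))
print(faf_summary)
```

In this toy example the median FAF falls from 2.5 to 1.5 to 0.5 across the three levels, which is the pattern the hypothesis predicts.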

Hypothesis 2: Individuals with a family history of obesity are more likely to be obese themselves.

Rationale: Family history of obesity can be a genetic or behavioral factor contributing to an individual’s likelihood of developing obesity. It is commonly believed that those with obese family members have a higher chance of becoming obese due to genetics.

  • Family history of obesity variable: The family_history_with_overweight column indicates whether a participant has a family history of obesity. It is a binary variable (yes/no).
  • Categorize obesity levels: As in the first hypothesis, ensure that obesity levels are categorical and that any missing values are addressed.
  • Proportional analysis: Since the hypothesis involves comparing proportions, we’ll visualize the distribution of obesity levels across individuals with and without a family history.

VISUALIZATION FOR HYPOTHESIS-2

# Sample R code for visualization
ggplot(obesity, aes(x=family_history_with_overweight, fill=NObeyesdad)) +
  geom_bar(position="fill") +
  labs(title="Obesity Levels by Family History of Obesity", 
       x="Family History of Obesity", y="Proportion", fill="Obesity Level")

Insights

  • Relative Frequencies: The chart shows the proportion of different obesity levels (e.g., “Obese”, “Not Obese”) within each category of family history (e.g., “Yes” or “No”). This allows for easy comparison of how obesity levels vary based on family history.
  • Understanding Risks: If one category shows a higher proportion of “Obese” individuals compared to the other, it suggests that having a family history of obesity might be associated with a higher risk of being obese.
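
The proportions behind a filled bar chart can be tabulated with table() and prop.table(). The sketch below uses a small hypothetical data frame with the same column names as the dataset. The rows are made up for illustration.

```r
# Hypothetical toy data with the same column names as the dataset
toy <- data.frame(
  family_history_with_overweight = rep(c("yes", "no"), each = 5),
  NObeyesdad = c("Overweight_Level_II", "Overweight_Level_II", "Overweight_Level_I",
                 "Overweight_Level_I", "Normal_Weight",
                 "Normal_Weight", "Normal_Weight", "Normal_Weight",
                 "Overweight_Level_I", "Overweight_Level_II")
)

# Cross-tabulate counts, then convert each row to proportions
counts <- table(toy$family_history_with_overweight, toy$NObeyesdad)
shares <- prop.table(counts, margin = 1)  # margin = 1: each row sums to 1

print(counts)
print(round(shares, 2))
```

Each row of shares corresponds to one bar in the chart, so the numbers can be read alongside the plot.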