1. SUMMARY
The dataset used for this analysis is the UCI Obesity Dataset, which
is available from the UCI Machine Learning Repository.
The dataset contains information on individual habits, conditions, and
medical history with the purpose of identifying risk factors associated
with obesity. The dataset includes multiple features such as gender,
age, family history, physical activity, and food consumption patterns.
The documentation for the dataset is available on the same repository page.
Main goal
Seed question: What factors are most strongly
associated with higher obesity levels?
Key variables
- Age: Age of the participants
- Gender: Gender of the participants
- Family history: Family history of obesity
- Physical activity level (FAF): frequency of physical activity, stored
as a numeric score (values from 0 to 3 appear in the first rows)
- Food consumption patterns: Types of food intake
Data inspection
- Physical Activity Levels: This variable could help explore whether
people with higher physical activity are less likely to be obese.
- Family History of Obesity: This factor allows us to explore the
hereditary component of obesity.
- High-Calorie Food Consumption and Nutritional Habits (FAVC, FCVC,
NCP, CAEC): These fields could be highly predictive of obesity, as they
track food consumption and diet quality.
- Gender and Age: These demographic variables could also reveal
patterns of obesity across different groups.
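A quick structural check can support this inspection. The sketch below is illustrative only: `toy` is a hypothetical data frame whose values are copied from the `head(obesity)` output in this report, not the full dataset. On the real data the same calls would be applied to `obesity`.

```r
# Hypothetical stand-in built from the first six rows shown in this report
toy <- data.frame(
  Gender = c("Female", "Female", "Male", "Male", "Male", "Male"),
  Age    = c(21, 21, 23, 27, 22, 29),
  FAF    = c(0, 3, 2, 2, 0, 0),
  family_history_with_overweight = c("yes", "yes", "yes", "no", "no", "no"),
  NObeyesdad = c("Normal_Weight", "Normal_Weight", "Normal_Weight",
                 "Overweight_Level_I", "Overweight_Level_II", "Normal_Weight")
)

str(toy)            # column types: which fields are numeric, which are text
summary(toy$FAF)    # range of the physical activity score

# Counts per family-history category
fh_counts <- table(toy$family_history_with_overweight)
print(fh_counts)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)
```

Checking types first makes it clear which variables need conversion to factors before plotting or modelling.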
Visualization for further investigation
Two key aspects of the data that warrant further investigation
are physical activity and family history of obesity. The data is loaded first.
library(ggplot2)
obesity <- read.csv("C:\\Users\\saisr\\Downloads\\statistics using R\\estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition\\obesity.csv")
# View the first few rows of the dataset
head(obesity)
## Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
## 1 Female 21 1.62 64.0 yes no 2 3
## 2 Female 21 1.52 56.0 yes no 3 3
## 3 Male 23 1.80 77.0 yes no 2 3
## 4 Male 27 1.80 87.0 no no 3 3
## 5 Male 22 1.78 89.8 no no 2 1
## 6 Male 29 1.62 53.0 no yes 2 3
## CAEC SMOKE CH2O SCC FAF TUE CALC MTRANS
## 1 Sometimes no 2 no 0 1 no Public_Transportation
## 2 Sometimes yes 3 yes 3 0 Sometimes Public_Transportation
## 3 Sometimes no 2 no 2 1 Frequently Public_Transportation
## 4 Sometimes no 2 no 2 0 Frequently Walking
## 5 Sometimes no 2 no 0 0 Sometimes Public_Transportation
## 6 Sometimes no 2 no 0 0 Sometimes Automobile
## NObeyesdad
## 1 Normal_Weight
## 2 Normal_Weight
## 3 Normal_Weight
## 4 Overweight_Level_I
## 5 Overweight_Level_II
## 6 Normal_Weight
Calculate BMI and add it as a new column
# Check the column names
print(colnames(obesity))
## [1] "Gender" "Age"
## [3] "Height" "Weight"
## [5] "family_history_with_overweight" "FAVC"
## [7] "FCVC" "NCP"
## [9] "CAEC" "SMOKE"
## [11] "CH2O" "SCC"
## [13] "FAF" "TUE"
## [15] "CALC" "MTRANS"
## [17] "NObeyesdad"
# Convert to numeric
obesity$Weight <- as.numeric(as.character(obesity$Weight))
obesity$Height <- as.numeric(as.character(obesity$Height))
# Check for NA values
cat("NA in Weight:", sum(is.na(obesity$Weight)), "\n")
## NA in Weight: 0
cat("NA in Height:", sum(is.na(obesity$Height)), "\n")
## NA in Height: 0
# Remove rows with NA values
obesity <- na.omit(obesity)
# Calculate BMI: weight (kg) divided by height (m) squared
obesity$BMI <- obesity$Weight / (obesity$Height^2)
# Verify the calculation
head(obesity)
## Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
## 1 Female 21 1.62 64.0 yes no 2 3
## 2 Female 21 1.52 56.0 yes no 3 3
## 3 Male 23 1.80 77.0 yes no 2 3
## 4 Male 27 1.80 87.0 no no 3 3
## 5 Male 22 1.78 89.8 no no 2 1
## 6 Male 29 1.62 53.0 no yes 2 3
## CAEC SMOKE CH2O SCC FAF TUE CALC MTRANS
## 1 Sometimes no 2 no 0 1 no Public_Transportation
## 2 Sometimes yes 3 yes 3 0 Sometimes Public_Transportation
## 3 Sometimes no 2 no 2 1 Frequently Public_Transportation
## 4 Sometimes no 2 no 2 0 Frequently Walking
## 5 Sometimes no 2 no 0 0 Sometimes Public_Transportation
## 6 Sometimes no 2 no 0 0 Sometimes Automobile
## NObeyesdad BMI
## 1 Normal_Weight 24.38653
## 2 Normal_Weight 24.23823
## 3 Normal_Weight 23.76543
## 4 Overweight_Level_I 26.85185
## 5 Overweight_Level_II 28.34238
## 6 Normal_Weight 20.19509
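As a sanity check on any BMI column, the standard definition is weight in kilograms divided by the square of height in metres. The helper `bmi()` below is a hypothetical name introduced for illustration; it is applied to the first row of the data (64 kg, 1.62 m).

```r
# BMI = weight (kg) / height (m)^2
bmi <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}

# First row of the dataset: 64 kg, 1.62 m
first_row_bmi <- bmi(64, 1.62)
print(round(first_row_bmi, 2))
```

The result is about 24.4, which falls in the conventional normal range (18.5 to 25) and agrees with that row's Normal_Weight label.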
1st visualization
Physical Activity and Obesity: Is there a noticeable trend
where people with higher activity levels are less likely to be
obese?
# Scatter plot of BMI vs. physical activity (FAF) with a linear trend line
ggplot(obesity, aes(x = FAF, y = BMI)) +
  geom_point(alpha = 0.6, color = "blue") + # Scatter points
  geom_smooth(method = "lm", se = FALSE, color = "red", linetype = "dashed") + # Trend line
  labs(title = "Scatter Plot of BMI vs. Physical Activity",
       x = "Physical Activity Frequency (FAF)",
       y = "BMI") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Insights
Trend Line (Red Dashed Line): The red dashed line is the linear fit
estimated by geom_smooth(). It slopes downwards, suggesting a negative
association between BMI and physical activity: as FAF increases, BMI
tends to decrease.
Strength: The Pearson correlation reported below is small in magnitude,
so physical activity alone explains little of the variation in BMI.
Outliers: Points far from the trend line are individuals whose BMI is
unusually high or low relative to their physical activity level.
Interpretation: The data is observational, so the plot shows an
association only. It is consistent with, but does not establish, the
idea that increasing physical activity helps reduce BMI.
correlation <- cor(obesity$FAF, obesity$BMI, method = "pearson")
print(correlation)
## [1] -0.1123485
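A coefficient this close to zero indicates a weak linear association. One way to check whether it differs from zero is `cor.test()`. The sketch below runs on hypothetical vectors (`faf`, `bmi`) made up for illustration, not on the dataset, so its numbers say nothing about the real data; on the real data the inputs would be `obesity$FAF` and `obesity$BMI`.

```r
# Hypothetical illustrative values, not taken from the dataset
faf <- c(0, 3, 2, 2, 0, 0, 1, 3, 1, 2)
bmi <- c(24.4, 24.2, 23.8, 26.9, 28.3, 20.2, 27.5, 22.1, 25.0, 23.0)

# Pearson correlation with a significance test
result <- cor.test(faf, bmi, method = "pearson")

print(result$estimate)   # correlation coefficient
print(result$p.value)    # p-value for H0: true correlation is zero
print(result$conf.int)   # 95% confidence interval for the correlation
```

Reporting the confidence interval alongside the coefficient shows how precisely the association is estimated.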
Family History and Obesity: Are individuals with a family
history of obesity more likely to be obese themselves?
# Summarizing data for pie chart
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Store the counts in a new object so the full dataset is not overwritten
family_summary <- obesity %>%
  group_by(family_history_with_overweight, NObeyesdad) %>%
  summarise(Count = n())
## `summarise()` has grouped output by 'family_history_with_overweight'. You can
## override using the `.groups` argument.
# Creating the pie chart
ggplot(family_summary, aes(x = "", y = Count, fill = NObeyesdad)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") + # Convert bar chart to pie chart
  facet_wrap(~ family_history_with_overweight) +
  labs(title = "Obesity Levels by Family History of Obesity", fill = "Obesity Level") +
  theme_minimal()

Insights
- Distribution of Obesity Levels: The pie chart shows the breakdown of
obesity levels (the NObeyesdad categories, e.g., Normal_Weight,
Overweight_Level_I) within each category of family history (yes, no).
- Using facet_wrap(~ family_history_with_overweight), the chart is split
into two facets: individuals with a family history of overweight and
individuals without.
- This allows for direct visual comparison of obesity levels between
those with a family history of obesity and those without.
- Key Observations:
  - Family History Present: If the facet for individuals with a family
  history shows a larger share in the obesity categories than in the
  normal or overweight categories, it suggests that these individuals
  are more likely to be obese.
  - Family History Absent: Conversely, if the facet for individuals
  without a family history is dominated by the Normal_Weight category,
  the contrast between the two facets points to family history (genetic
  or shared household factors) playing a role in obesity risk.
- Limitations:
  - Causation vs. Correlation: The pie chart illustrates associations;
  it does not establish causation. Other factors (such as lifestyle,
  diet, socioeconomic status) may also contribute to obesity levels.
  - Sample Size: The reliability of the insights depends on the sample
  size and how representative it is of the broader population.
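The shares behind such a chart can also be tabulated directly, which is easier to compare than pie slices. A minimal sketch, assuming a small hypothetical data frame `toy` (values copied from the first rows shown in this report) in place of the full dataset:

```r
# Hypothetical stand-in for the full dataset
toy <- data.frame(
  family_history_with_overweight = c("yes", "yes", "yes", "no", "no", "no"),
  NObeyesdad = c("Normal_Weight", "Normal_Weight", "Normal_Weight",
                 "Overweight_Level_I", "Overweight_Level_II", "Normal_Weight")
)

# Cross-tabulate family history (rows) against obesity level (columns)
counts <- table(toy$family_history_with_overweight, toy$NObeyesdad)

# Convert counts to row proportions: each family-history group sums to 1
row_props <- prop.table(counts, margin = 1)
print(round(row_props, 2))
```

Using `margin = 1` normalizes within each family-history group, so groups of different sizes can be compared on the same footing.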
Plan moving forward
- Data Cleaning: Handle any missing or inconsistent data and ensure key
variables are encoded with the correct types.
- Exploratory Data Analysis (EDA): Perform EDA with detailed
visualizations and summary statistics to confirm initial trends.
- Hypothesis Testing: Conduct formal hypothesis tests to validate the
relationships between obesity levels, physical activity, and family
history.
- Modelling: Consider building a logistic regression model to identify
the key predictors of obesity within the dataset.
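The modelling step could look roughly like the sketch below. It is fitted to simulated data, not to the obesity dataset: the data frame `sim`, the binary outcome `obese`, and the coefficients used to generate it are all assumptions made for illustration. On the real data, a binary outcome would first have to be derived from NObeyesdad.

```r
set.seed(42)  # reproducible simulated data

# Simulated predictors (hypothetical, for illustration only)
n <- 200
sim <- data.frame(
  FAF = sample(0:3, n, replace = TRUE),
  family_history = factor(sample(c("no", "yes"), n, replace = TRUE))
)

# Assumed data-generating process: activity lowers the log-odds of
# obesity, family history raises them
log_odds <- -0.5 - 0.4 * sim$FAF + 1.2 * (sim$family_history == "yes")
sim$obese <- rbinom(n, size = 1, prob = plogis(log_odds))

# Logistic regression of the binary outcome on both predictors
fit <- glm(obese ~ FAF + family_history, data = sim, family = binomial)

print(summary(fit)$coefficients)  # estimates on the log-odds scale
print(exp(coef(fit)))             # the same estimates as odds ratios
```

An odds ratio below 1 for FAF and above 1 for family history would match the direction of the two hypotheses examined in this report.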
2. INITIAL FINDINGS
Hypothesis 1: Higher levels of physical activity are
associated with lower levels of obesity.
Rationale: Physical activity is a critical factor in energy
balance and weight control. It’s commonly believed that individuals with
higher physical activity levels are more likely to maintain a healthy
weight, while those with sedentary lifestyles are at greater risk of
obesity.
- Check the structure of the physical activity variable: FAF is stored
as a numeric score (values from 0 to 3 appear in the first rows), not as
a categorical label. Decide whether to treat it as numeric or to convert
it to an ordered factor before analysis.
- Ensure obesity levels are categorical: NObeyesdad already holds
category labels (e.g., Normal_Weight, Overweight_Level_I). If a
continuous measure such as BMI is used instead, it can be binned into
categories.
- Handle missing or undefined values: Clean up any missing data or
undefined physical activity values before proceeding with the
visualization.
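If BMI is used as the obesity measure, it can be binned with `cut()`. The sketch below assumes the conventional cut-offs of 18.5, 25 and 30; the dataset's own NObeyesdad labels use finer categories, and the vector `bmi_values` is hypothetical.

```r
# Hypothetical BMI values for illustration
bmi_values <- c(17.0, 24.4, 26.9, 28.3, 31.5)

# Bin into four classes; right = FALSE makes intervals [lower, upper)
bmi_class <- cut(
  bmi_values,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Normal", "Overweight", "Obese"),
  right = FALSE
)

print(table(bmi_class))  # number of values in each class
```

With `right = FALSE`, a BMI of exactly 25 is classed as Overweight rather than Normal, which matches the usual convention.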
VISUALIZATION FOR HYPOTHESIS-1
library(ggplot2)
obesity <- read.csv("C:\\Users\\saisr\\Downloads\\statistics using R\\estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition\\obesity.csv")
# Box plot of physical activity (FAF) within each obesity level
ggplot(obesity, aes(x=FAF, y=NObeyesdad, fill=NObeyesdad)) +
  geom_boxplot() +
  labs(title="Box Plot of Physical Activity by Obesity Level",
       x="Physical Activity Level", y="Obesity levels") +
  theme_minimal() +
  theme(legend.position = "none") # Hides the redundant legend

Insights
- Box and Whisker Representation: With FAF on the x-axis and NObeyesdad
on the y-axis, each horizontal box summarizes the distribution of
physical activity within one obesity level. The box spans the 25th
percentile (Q1) to the 75th percentile (Q3), and the line inside the box
marks the median (Q2) physical activity for that obesity level.
- The fill aesthetic is set to NObeyesdad, so each obesity level gets
its own color. Because the colors repeat the axis labels, the legend is
hidden.
- The box plot allows quick comparison of physical activity between
obesity levels. A lower median FAF in the higher obesity levels would
support the hypothesis.
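The visual comparison can be backed by group summaries and a rank-based test. The sketch below uses simulated values, not the dataset: the data frame `toy` and its FAF scores are hypothetical. The Kruskal-Wallis test is one reasonable choice here, since FAF is a coarse score rather than a normally distributed measurement.

```r
set.seed(1)  # reproducible simulated data

# Hypothetical data: 20 individuals in each of three obesity levels
toy <- data.frame(
  NObeyesdad = factor(rep(c("Normal_Weight", "Overweight_Level_I",
                            "Overweight_Level_II"), each = 20)),
  FAF = c(sample(0:3, 20, replace = TRUE),
          sample(0:3, 20, replace = TRUE),
          sample(0:2, 20, replace = TRUE))
)

# Median activity score within each obesity level
medians <- aggregate(FAF ~ NObeyesdad, data = toy, FUN = median)
print(medians)

# Kruskal-Wallis test: do the FAF distributions differ across levels?
kw <- kruskal.test(FAF ~ NObeyesdad, data = toy)
print(kw)
```

A small p-value would indicate that at least one obesity level differs in physical activity, though not which one.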
Hypothesis 2: Individuals with a family history of obesity
are more likely to be obese themselves.
Rationale: Family history of obesity can reflect genetic or
behavioral factors that contribute to an individual’s likelihood of
developing obesity. It is commonly believed that people with obese
family members have a higher chance of becoming obese because of
genetics.
- Family history of obesity variable: The column
family_history_with_overweight indicates whether a participant has a
family history of overweight. It is a binary variable (yes/no).
- Categorize obesity levels: As in the first hypothesis, ensure that
obesity levels are categorical and that any missing values are
addressed.
- Proportional analysis: Since the hypothesis involves comparing
proportions, we visualize the distribution of obesity levels across
individuals with and without a family history.
VISUALIZATION FOR HYPOTHESIS-2
# Stacked bar chart: share of each obesity level by family history
ggplot(obesity, aes(x=family_history_with_overweight, fill=NObeyesdad)) +
  geom_bar(position="fill") +
  labs(title="Obesity Levels by Family History of Obesity",
       x="Family History of Obesity", y="Proportion", fill="Obesity Level")

Insights
- Relative Frequencies: The chart shows the proportion of each obesity
level (e.g., Normal_Weight, Overweight_Level_I, Overweight_Level_II)
within each category of family history (yes or no). Because every bar is
scaled to 1, the two groups can be compared even if their sizes differ.
- Understanding Risks: If the bar for individuals with a family history
shows a larger share in the obesity categories than the bar for those
without, it suggests that a family history of obesity is associated with
a higher risk of being obese.
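A chi-squared test of independence is one way to check such a difference in proportions formally. The sketch below runs on simulated data: the vectors `family_history` and `status`, the two-level outcome (Obese / Not_Obese), and the probabilities used to generate it are assumptions for illustration, not properties of the dataset.

```r
set.seed(7)  # reproducible simulated data

# Hypothetical sample of 300 individuals
n <- 300
family_history <- sample(c("no", "yes"), n, replace = TRUE)

# Assumed for illustration: obesity is more common with a family history
p_obese <- ifelse(family_history == "yes", 0.5, 0.2)
status <- ifelse(runif(n) < p_obese, "Obese", "Not_Obese")

# Contingency table and row proportions
tab <- table(family_history, status)
print(prop.table(tab, margin = 1))

# Chi-squared test of independence between the two variables
test <- chisq.test(tab)
print(test)
```

On the real data the same call would take `table(obesity$family_history_with_overweight, obesity$NObeyesdad)`, giving a test across all obesity levels at once.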