# Load necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# ggplot2 and dplyr are already attached by library(tidyverse)
# Load dataset (assuming you've saved it locally)
obesity <- read.csv("C:\\Users\\saisr\\Downloads\\statistics using R\\obesity.csv")
# View the first few rows of the dataset
head(obesity)
## Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
## 1 Female 21 1.62 64.0 yes no 2 3
## 2 Female 21 1.52 56.0 yes no 3 3
## 3 Male 23 1.80 77.0 yes no 2 3
## 4 Male 27 1.80 87.0 no no 3 3
## 5 Male 22 1.78 89.8 no no 2 1
## 6 Male 29 1.62 53.0 no yes 2 3
## CAEC SMOKE CH2O SCC FAF TUE CALC MTRANS
## 1 Sometimes no 2 no 0 1 no Public_Transportation
## 2 Sometimes yes 3 yes 3 0 Sometimes Public_Transportation
## 3 Sometimes no 2 no 2 1 Frequently Public_Transportation
## 4 Sometimes no 2 no 2 0 Frequently Walking
## 5 Sometimes no 2 no 0 0 Sometimes Public_Transportation
## 6 Sometimes no 2 no 0 0 Sometimes Automobile
## NObeyesdad
## 1 Normal_Weight
## 2 Normal_Weight
## 3 Normal_Weight
## 4 Overweight_Level_I
## 5 Overweight_Level_II
## 6 Normal_Weight
# Cut age into ranges and group by Age Range and summarize Weight
obesity <- obesity %>%
  mutate(AgeRange = cut(Age, breaks = c(0, 18, 30, 45, 60, 100),
                        labels = c("0-18", "19-30", "31-45", "46-60", "61+")))
age_group <- obesity %>%
  group_by(AgeRange) %>%
  summarize(mean_weight = mean(Weight, na.rm = TRUE),
            sd_weight = sd(Weight, na.rm = TRUE),
            count = n())
# Display the result
age_group
## # A tibble: 5 × 4
## AgeRange mean_weight sd_weight count
## <fct> <dbl> <dbl> <int>
## 1 0-18 67.4 19.2 241
## 2 19-30 88.5 27.2 1514
## 3 31-45 92.1 19.6 342
## 4 46-60 80.8 9.84 13
## 5 61+ 66 NA 1
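The group counts depend on how `cut()` treats values that fall exactly on a break. By default its intervals are right-closed, so an age of exactly 18 lands in "0-18", while any age just above 18 lands in "19-30". The sketch below illustrates this with a few made-up ages (the `ages` vector is an assumption for illustration, not data from `obesity.csv`), using the same breaks and labels as above.

```r
# Illustrative ages only; these values are not taken from obesity.csv
ages <- c(18, 18.5, 19, 30, 30.1, 61)

# Same breaks and labels as in the analysis; cut() defaults to right = TRUE,
# so each interval has the form (lower, upper]
bins <- cut(ages,
            breaks = c(0, 18, 30, 45, 60, 100),
            labels = c("0-18", "19-30", "31-45", "46-60", "61+"))

# Show which label each age receives
data.frame(age = ages, AgeRange = bins)
```

If `Age` contains non-integer values, a label such as "19-30" therefore really means "above 18, up to and including 30".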
# Violin plot for Weight across Age Ranges
# (the 61+ group has a single observation, so no density can be drawn for it)
ggplot(obesity, aes(x = AgeRange, y = Weight)) +
  geom_violin(fill = "lightblue") +
  labs(title = "Weight Distribution by Age Range", x = "Age Range", y = "Weight") +
  theme_minimal()
### Insights

- 0-18 Age Range (Children & Teenagers): The width of the violin at a given weight shows how many individuals fall near that weight. A violin that is widest at lower weights indicates a concentration of lighter individuals, which is expected for children and teenagers and fits this group's mean weight of 67.4. A long, thin upper tail would indicate a smaller set of individuals with much higher weights, and a narrow violin overall would indicate low variability.
- 19-30 Age Range (Young Adults): This is by far the largest group (1514 of 2111 individuals) and has the largest spread (standard deviation 27.2). Its mean weight (88.5) is well above that of the 0-18 group, so the bulk of the violin should sit at higher weights. A single bulge that is roughly symmetric above and below its widest point indicates weights clustered around a central value. Two separate bulges at different weights would indicate a bimodal distribution, for example one cluster of lighter and one cluster of heavier individuals.
- 31-45 Age Range (Middle-Aged Adults): This group often experiences weight gain due to lifestyle changes, slowing metabolism, and other factors, and its mean weight (92.1) is the highest of all groups. A violin whose bulk sits at higher weights, with only a thin tail toward lower weights (a left-skewed shape), would indicate that most middle-aged adults in the sample are prone to overweight or obesity. The spread (standard deviation 19.6) is smaller than in the 19-30 group, so the violin should look more compact, not more variable.
- 46-60 Age Range (Older Adults): Only 13 individuals fall in this range, so the shape of the violin should be read with caution. The mean weight (80.8) is lower than in the 31-45 group and the spread is small (standard deviation 9.84), which suggests weights clustered around a common value, possibly due to health interventions or lifestyle adjustments.
- 61+ Age Range (Seniors): This group contains a single individual (weight 66), which is why the standard deviation is NA and no distribution can be drawn. With more data, a violin concentrated at lower weights would suggest weight loss due to aging, reduced appetite, or health conditions, while a violin with bulk at both low and high weights would suggest that some seniors maintain higher weights, possibly due to sedentary lifestyles, while others lose weight due to age-related factors.
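The shape features discussed above (spread and skew) can also be checked numerically instead of by eye. The sketch below is one possible approach, not part of the original analysis: `shape_summary()` is a hypothetical helper that reports the median, the interquartile range, and Bowley's quantile-based skewness, and the `toy` data frame holds made-up weights for illustration only.

```r
# Hypothetical helper: quantile-based shape summary for one numeric vector
shape_summary <- function(x) {
  x <- x[!is.na(x)]
  q <- unname(quantile(x, c(0.25, 0.5, 0.75)))
  iqr <- q[3] - q[1]
  # Bowley skewness: > 0 means a longer upper tail, < 0 a longer lower tail;
  # undefined when there are fewer than two values or no spread
  skew <- if (length(x) >= 2 && iqr > 0) (q[3] + q[1] - 2 * q[2]) / iqr else NA_real_
  c(n = length(x), median = q[2], iqr = iqr, bowley_skew = skew)
}

# Made-up weights for illustration only (not from obesity.csv)
toy <- data.frame(
  AgeRange = rep(c("0-18", "19-30"), each = 5),
  Weight   = c(50, 52, 55, 70, 90, 60, 75, 80, 85, 100)
)

# One row of summaries per age range
shape_by_group <- t(sapply(split(toy$Weight, toy$AgeRange), shape_summary))
shape_by_group
```

Applied to the real data, the same call would be `split(obesity$Weight, obesity$AgeRange)`. A group with a single observation, such as 61+, gets `NA` for the skewness, mirroring the `NA` standard deviation in the summary table.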
# Group by Family History with Overweight and summarize Physical Activity Level
family_history_group <- obesity %>%
  group_by(family_history_with_overweight) %>%
  summarize(mean_activity = mean(FAF, na.rm = TRUE),
            sd_activity = sd(FAF, na.rm = TRUE),
            count = n())
# Display the result
family_history_group
## # A tibble: 2 × 4
## family_history_with_overweight mean_activity sd_activity count
## <chr> <dbl> <dbl> <int>
## 1 no 1.11 0.928 385
## 2 yes 0.988 0.831 1726
# Violin plot for Physical Activity Levels based on Family History with Overweight
ggplot(obesity, aes(x = family_history_with_overweight, y = FAF)) +
  geom_violin(fill = "pink") +
  labs(title = "Physical Activity Level by Family History with Overweight",
       x = "Family History with Overweight",
       y = "Physical Activity Level") +
  theme_minimal()
### Insights

- Individuals with a Family History of Overweight (Yes): The violin for the “Yes” group shows how physical activity levels (FAF) are distributed among the 1726 individuals with a family history of overweight. Wider sections mark activity levels where more individuals are concentrated. A violin that is widest at low activity levels means that many people in this group engage in little physical activity, and a narrow section at high activity levels means that few are highly active. The group's mean activity level (0.988) is slightly below that of the “No” group, which could be due to lifestyle factors or genetic influences.
- Individuals without a Family History of Overweight (No): The violin for the “No” group shows the same distribution for the 385 individuals without a family history of overweight. Their mean activity level is somewhat higher (1.11) and their spread slightly larger (standard deviation 0.928 versus 0.831), so this violin should carry more of its width at higher activity levels and look a little more spread out. Narrower sections at low activity levels would mean that fewer individuals in this group are inactive, which could imply healthier overall lifestyle habits compared to those with a family history of overweight.

This violin plot compares the distribution of physical activity levels for individuals with and without a family history of overweight. In this dataset the difference in means is small (1.11 versus 0.988), so a family history is associated with only slightly lower activity. Understanding these trends can inform interventions aimed at promoting physical activity and preventing obesity among people with a family history of overweight.
# Group by Gender and summarize Weight
gender_group <- obesity %>%
  group_by(Gender) %>%
  summarize(mean_Weight = mean(Weight, na.rm = TRUE),
            sd_Weight = sd(Weight, na.rm = TRUE),
            count = n())
# Display the result
gender_group
## # A tibble: 2 × 4
## Gender mean_Weight sd_Weight count
## <chr> <dbl> <dbl> <int>
## 1 Female 82.3 29.7 1043
## 2 Male 90.8 21.4 1068
# Bar plot for average Weight by Gender
ggplot(gender_group, aes(x = Gender, y = mean_Weight, fill = Gender)) +
  geom_col(width = 0.6) +
  labs(title = "Average Weight by Gender", x = "Gender", y = "Average Weight") +
  theme_minimal()
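### Insights

- Males have a higher mean weight (90.8) than females (82.3).
- Females show more variability in weight (standard deviation 29.7 versus 21.4), which a bar chart of means alone does not reveal.
- The two groups are almost equal in size (1043 females, 1068 males), so neither mean rests on a small sample.
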
# Calculate probability of each group in Gender, AgeRange, and Family History groups
gender_prob <- gender_group %>%
  mutate(probability = count / sum(count))
age_prob <- age_group %>%
  mutate(probability = count / sum(count))
family_history_prob <- family_history_group %>%
  mutate(probability = count / sum(count))
# Tagging lowest probability groups
gender_prob <- gender_prob %>%
  mutate(tag = ifelse(probability == min(probability), "Low Probability", "Normal"))
age_prob <- age_prob %>%
  mutate(tag = ifelse(probability == min(probability), "Low Probability", "Normal"))
family_history_prob <- family_history_prob %>%
  mutate(tag = ifelse(probability == min(probability), "Low Probability", "Normal"))
# Display probability results
gender_prob
## # A tibble: 2 × 6
## Gender mean_Weight sd_Weight count probability tag
## <chr> <dbl> <dbl> <int> <dbl> <chr>
## 1 Female 82.3 29.7 1043 0.494 Low Probability
## 2 Male 90.8 21.4 1068 0.506 Normal
age_prob
## # A tibble: 5 × 6
## AgeRange mean_weight sd_weight count probability tag
## <fct> <dbl> <dbl> <int> <dbl> <chr>
## 1 0-18 67.4 19.2 241 0.114 Normal
## 2 19-30 88.5 27.2 1514 0.717 Normal
## 3 31-45 92.1 19.6 342 0.162 Normal
## 4 46-60 80.8 9.84 13 0.00616 Normal
## 5 61+ 66 NA 1 0.000474 Low Probability
family_history_prob
## # A tibble: 2 × 6
## family_history_with_overwe…¹ mean_activity sd_activity count probability tag
## <chr> <dbl> <dbl> <int> <dbl> <chr>
## 1 no 1.11 0.928 385 0.182 Low …
## 2 yes 0.988 0.831 1726 0.818 Norm…
## # ℹ abbreviated name: ¹family_history_with_overweight
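Tagging the minimum always flags exactly one group per table, even when the groups are nearly balanced (Female at 0.494 is tagged "Low Probability"), and it misses other rare groups (46-60 at 0.00616 is tagged "Normal"). One alternative, sketched below, is to tag every group whose share falls under a fixed cutoff. The `tag_rare()` helper and the 5% threshold are assumptions chosen for illustration, not part of the original analysis; the counts are copied from the tables above.

```r
# Hypothetical helper: tag groups whose share of the sample is below a cutoff.
# The 5% default is an arbitrary illustrative choice.
tag_rare <- function(count, threshold = 0.05) {
  probability <- count / sum(count)
  ifelse(probability < threshold, "Low Probability", "Normal")
}

# Counts copied from the summary tables above
gender_counts <- c(Female = 1043, Male = 1068)
age_counts <- c("0-18" = 241, "19-30" = 1514, "31-45" = 342,
                "46-60" = 13, "61+" = 1)

tag_rare(gender_counts)  # neither gender is rare at a 5% cutoff
tag_rare(age_counts)     # both 46-60 and 61+ fall under the cutoff
```

Inside the existing pipeline, the same rule would be `mutate(tag = ifelse(probability < 0.05, "Low Probability", "Normal"))`.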
# Hypothesis test: Comparing Physical Activity Levels for Family History vs. No Family History
t.test(FAF ~ family_history_with_overweight, data = obesity)
##
## Welch Two Sample t-test
##
## data: FAF by family_history_with_overweight
## t = 2.4315, df = 530.05, p-value = 0.01537
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
## 0.02397149 0.22563956
## sample estimates:
## mean in group no mean in group yes
## 1.1123414 0.9875359
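To make the output above easier to interpret, the sketch below recomputes the Welch t statistic and the Welch-Satterthwaite degrees of freedom by hand. `welch_from_summary()` is a hypothetical helper written for this illustration. The means are taken from the `t.test()` output and the standard deviations from the rounded `family_history_group` table, so the result should only approximately match the reported t = 2.4315 and df = 530.05.

```r
# Hypothetical helper: Welch two-sample t-test from group summaries
welch_from_summary <- function(m1, s1, n1, m2, s2, n2) {
  v1 <- s1^2 / n1                      # squared standard error, group 1
  v2 <- s2^2 / n2                      # squared standard error, group 2
  t_stat <- (m1 - m2) / sqrt(v1 + v2)  # difference in means over its standard error
  # Welch-Satterthwaite approximation to the degrees of freedom
  df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  p_value <- 2 * pt(-abs(t_stat), df)  # two-sided p-value
  list(t = t_stat, df = df, p_value = p_value)
}

# Group "no" first, then group "yes"; standard deviations are rounded values
res <- welch_from_summary(m1 = 1.1123414, s1 = 0.928, n1 = 385,
                          m2 = 0.9875359, s2 = 0.831, n2 = 1726)
res
```

The statistic is the difference in mean activity (about 0.125) divided by its standard error (about 0.051). The p-value of about 0.015 means the difference is statistically significant at the 5% level, although the confidence interval (0.024 to 0.226) shows the difference itself is small.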
# Create a data frame for unique combinations of Gender and Family History
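combination_df <- obesity %>%
  count(Gender, family_history_with_overweight) %>%
  arrange(desc(n))
# Find missing combinations: count() only returns combinations that occur,
# so compare against the full grid of Gender x Family History values
missing_combinations <- expand.grid(
  Gender = unique(obesity$Gender),
  family_history_with_overweight = unique(obesity$family_history_with_overweight),
  stringsAsFactors = FALSE
) %>%
  left_join(combination_df, by = c("Gender", "family_history_with_overweight")) %>%
  mutate(n = coalesce(n, 0L)) %>%
  filter(n == 0)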
# Display the most/least common combinations and any missing ones
combination_df
## Gender family_history_with_overweight n
## 1 Male yes 915
## 2 Female yes 811
## 3 Female no 232
## 4 Male no 153
missing_combinations
## [1] Gender family_history_with_overweight
## [3] n
## <0 rows> (or 0-length row.names)
Most common combinations: Male with a family history of overweight (915) and Female with a family history of overweight (811) are the two largest groups, together accounting for 1726 of the 2111 individuals. Knowing which groups dominate the dataset matters when generalizing findings or designing targeted interventions.
Least common combinations: Female without a family history of overweight (232) and Male without a family history of overweight (153) are the smallest groups. Estimates for these groups rest on fewer observations, so they are less precise and the groups may be underrepresented.
Missing combinations: combination_df contains all four combinations of Gender and family history, so none is missing. A missing combination would point either to a data collection limitation or to a genuine absence in the population, and would be worth checking as a potential source of bias or a gap for future data collection.
# Bar plot for combinations of Gender and Family History
ggplot(combination_df,
       aes(x = interaction(Gender, family_history_with_overweight),
           y = n, fill = family_history_with_overweight)) +
  geom_col() +
  labs(title = "Count of Gender and Family History Combinations",
       x = "Gender and Family History Combination", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
### Insights

a. Count of Combinations: The tallest bars are the most common groups in the dataset. Male with Family History of Overweight (915) and Female with Family History of Overweight (811) are well represented. The shortest bars, Female without Family History of Overweight (232) and Male without Family History of Overweight (153), are the least represented groups.

b. Distribution of Family History Across Genders: Within each gender, a family history of overweight is far more common than its absence: about 78% of females (811 of 1043) and about 86% of males (915 of 1068) report one. Across genders, males are therefore somewhat more likely than females to report a family history of overweight.

c. Visual Patterns and Trends: The fill color separates those with and without a family history of overweight, which makes it easy to compare the proportions within each gender and to see whether family history shifts the counts differently for males and females. Rotating the x-axis labels keeps the combination labels readable, especially when there are many categories.
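Whether the difference between genders is larger than chance would produce can be checked with a chi-squared test of independence. The sketch below is an addition for illustration, not part of the original analysis; it rebuilds the 2 x 2 table from the counts printed in `combination_df`, and the object names `counts` and `chi` are chosen here.

```r
# Counts copied from combination_df above
counts <- matrix(c(811, 232,
                   915, 153),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Gender = c("Female", "Male"),
                                 family_history = c("yes", "no")))

# Row-wise proportions: share of each gender with and without a family history
round(prop.table(counts, margin = 1), 3)

# Pearson chi-squared test of independence
# (chisq.test applies Yates' continuity correction to 2 x 2 tables by default)
chi <- chisq.test(counts)
chi
```

A small p-value here would indicate that family history and gender are not independent in this sample. It says nothing about why the two are associated.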