Data Dive: Group By and Probabilities

# Load necessary libraries
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(dplyr)

# Load dataset (assuming you've saved it locally)
obesity <- read.csv("C:\\Users\\saisr\\Downloads\\statistics using R\\obesity.csv")
# View the first few rows of the dataset
head(obesity)

##   Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
## 1 Female  21   1.62   64.0                            yes   no    2   3
## 2 Female  21   1.52   56.0                            yes   no    3   3
## 3   Male  23   1.80   77.0                            yes   no    2   3
## 4   Male  27   1.80   87.0                             no   no    3   3
## 5   Male  22   1.78   89.8                             no   no    2   1
## 6   Male  29   1.62   53.0                             no  yes    2   3
##        CAEC SMOKE CH2O SCC FAF TUE       CALC                MTRANS
## 1 Sometimes    no    2  no   0   1         no Public_Transportation
## 2 Sometimes   yes    3 yes   3   0  Sometimes Public_Transportation
## 3 Sometimes    no    2  no   2   1 Frequently Public_Transportation
## 4 Sometimes    no    2  no   2   0 Frequently               Walking
## 5 Sometimes    no    2  no   0   0  Sometimes Public_Transportation
## 6 Sometimes    no    2  no   0   0  Sometimes            Automobile
##            NObeyesdad
## 1       Normal_Weight
## 2       Normal_Weight
## 3       Normal_Weight
## 4  Overweight_Level_I
## 5 Overweight_Level_II
## 6       Normal_Weight

Group 1: Grouping by Age Range

# Cut age into ranges and group by Age Range and summarize Weight
obesity <- obesity %>%
  mutate(AgeRange = cut(Age, breaks = c(0, 18, 30, 45, 60, 100), 
                        labels = c("0-18", "19-30", "31-45", "46-60", "61+")))

age_group <- obesity %>%
  group_by(AgeRange) %>%
  summarize(mean_weight = mean(Weight, na.rm = TRUE),
            sd_weight = sd(Weight, na.rm = TRUE),
            count = n())

# Display the result
age_group

## # A tibble: 5 × 4
##   AgeRange mean_weight sd_weight count
##   <fct>          <dbl>     <dbl> <int>
## 1 0-18            67.4     19.2    241
## 2 19-30           88.5     27.2   1514
## 3 31-45           92.1     19.6    342
## 4 46-60           80.8      9.84    13
## 5 61+             66       NA        1

Insights

1. 0-18 (Children & Teenagers): Mean Weight: The average weight in this group is 67.4, well below the adult groups. Weight Variability (SD): The standard deviation of 19.2 is large relative to the mean, which is expected because this group includes children and teenagers growing at different rates. Count: With 241 individuals (about 11% of the dataset), this age range is reasonably well represented.
2. 19-30 (Young Adults): Mean Weight: The average weight is 88.5, roughly 21 higher than in the 0-18 group. Weight Variability: The standard deviation of 27.2 is the highest of any age range, suggesting that this group spans underweight, normal-weight, overweight, and obese individuals. Count: With 1,514 individuals (about 72% of the dataset), this group dominates the data, so overall statistics largely reflect young adults.
3. 31-45 (Middle-Aged Adults): Mean Weight: At 92.1, this group has the highest average weight, consistent with weight tending to increase with age and lifestyle changes. Weight Variability: The standard deviation of 19.6 is lower than for young adults, so weights are more tightly clustered around the higher mean. Count: 342 individuals (about 16%) is enough to describe the group, though far fewer than the 19-30 range.
4. 46-60 (Older Adults): Mean Weight: The average weight drops to 80.8, which may indicate that weight stabilizes or declines at this age. Weight Variability: The standard deviation of 9.84 is the lowest observed, but it is estimated from very few people. Count: Only 13 individuals fall in this range, so these figures are unreliable and should not be generalized.
5. 61+ (Seniors): Mean Weight: The value of 66 is the weight of a single individual, not a group average. Weight Variability: The standard deviation is NA because it cannot be computed from one observation. Count: With a count of 1, no conclusions about seniors can be drawn from this dataset.
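
The group counts depend on how `cut()` treats ages that fall exactly on a break. The sketch below checks this on a few made-up ages; the `ages` vector is hypothetical, while the breaks and labels are the ones used above. By default `cut()` builds right-closed intervals, so an age of exactly 18 is placed in "0-18" and a non-integer age such as 18.5 would be placed in "19-30".

```r
# Hypothetical ages chosen to sit on or near the bin boundaries
ages <- c(18, 18.5, 30, 31, 60, 61)

age_range <- cut(ages,
                 breaks = c(0, 18, 30, 45, 60, 100),
                 labels = c("0-18", "19-30", "31-45", "46-60", "61+"))

# Intervals are right-closed by default: (0,18], (18,30], (30,45], ...
data.frame(age = ages, age_range = age_range)
```

If left-closed bins were preferred, `cut()` also accepts `right = FALSE`, which would change the group counts reported above.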

Visualization 1: Weight Distribution Across Age Ranges

# Box plot of Weight across Age Ranges
# (x must be the categorical AgeRange; a continuous Age yields a single box and a warning)
ggplot(obesity, aes(x = AgeRange, y = Weight)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "Weight Distribution by Age Range", x = "Age Range", y = "Weight") +
  theme_minimal()

Insights

- 0-18 Age Range (Children & Teenagers): The line inside the box marks the median weight and the box spans the middle 50% of weights. A box that sits low on the weight axis is expected for children and teenagers. A median that lies off-center in the box, or one whisker much longer than the other, indicates a skewed distribution, for example a portion of individuals with higher weights. A short box suggests low variability in weights.
- 19-30 Age Range (Young Adults): The box for this group should sit higher than for the 0-18 group, indicating that people in this group generally weigh more. A median near the center of the box with whiskers of similar length indicates a balanced distribution around a central value. A tall box with long whiskers in both directions indicates that the group ranges from underweight to overweight individuals, which would agree with this group having the largest standard deviation (27.2) in the summary table. A box plot cannot show whether two distinct clusters are present.
- 31-45 Age Range (Middle-Aged Adults): This group often experiences weight gain due to lifestyle changes, slowing metabolism, and other factors, so its box may sit highest on the weight axis. A longer upper whisker, or points plotted above it, would indicate right skew and a tendency toward overweight or obesity. A taller box than for younger groups would suggest higher variability in weight, reflecting different lifestyles, health conditions, and activity levels.
- 46-60 Age Range (Older Adults): The box may sit lower than for the 31-45 group if weight stabilizes or declines with age. A short box would show that people in this range are clustered around a certain weight, possibly due to health interventions or lifestyle adjustments. Because only 13 individuals fall in this range, the box is drawn from very few points and should be read with caution.
- 61+ Age Range (Seniors): This range contains a single individual, so no distribution can be read from the plot. More data would be needed to see whether seniors show weight loss due to aging, reduced appetite, or health conditions, or maintain higher weights due to sedentary lifestyles.

Group 2: Grouping by Family History with Overweight

# Group by Family History with Overweight and summarize Physical Activity Level
family_history_group <- obesity %>%
  group_by(family_history_with_overweight) %>%
  summarize(mean_activity = mean(FAF, na.rm = TRUE),
            sd_activity = sd(FAF, na.rm = TRUE),
            count = n())

# Display the result
family_history_group

## # A tibble: 2 × 4
##   family_history_with_overweight mean_activity sd_activity count
##   <chr>                                  <dbl>       <dbl> <int>
## 1 no                                     1.11        0.928   385
## 2 yes                                    0.988       0.831  1726

Insights

Family History of Overweight (Yes)
Mean Activity: The average physical activity level (FAF) for individuals with a family history of overweight is 0.988, slightly lower than the 1.11 observed for those without a family history. This suggests that these individuals are somewhat less physically active on average, which could reflect a more sedentary lifestyle influenced by genetic or environmental factors, although the gap is small.
Activity Variability (SD): The standard deviation of 0.831 is almost as large as the mean, so activity levels range widely, with some individuals highly active and others very sedentary. This variability indicates that family history alone is not a consistent predictor of physical activity; lifestyle choices differ considerably among individuals even when a predisposition is present.
Count: 1,726 individuals (about 82% of the dataset) report a family history, so the statistics for this group rest on a large sample.
No Family History of Overweight (No)
Mean Activity: The average physical activity level for individuals without a family history of overweight is 1.11, higher than in the family history group. This group may be at lower risk of obesity because of both genetic and lifestyle factors.
Activity Variability (SD): The standard deviation of 0.928 is slightly larger than in the family history group, so individuals without a family history are not more consistent in their physical activity habits.
Count: 385 individuals (about 18% of the dataset) report no family history. The group is much smaller than the "yes" group but still large enough for a comparison.

Visualization 2: Physical Activity Levels by Family History with Overweight

# Box plot of Physical Activity Levels by Family History with Overweight
ggplot(obesity, aes(x = family_history_with_overweight, y = FAF)) +
  geom_boxplot(fill = "pink") +
  labs(title = "Physical Activity Level by Family History with Overweight", 
       x = "Family History with Overweight", 
       y = "Physical Activity Level") +
  theme_minimal()

Insights

- Physical Activity Distribution for Individuals with a Family History of Overweight (Yes): The box for the "Yes" group shows the median and the middle 50% of physical activity levels for individuals with a family history of overweight. A box that sits low on the axis indicates that many people in this group engage in low physical activity, and a short upper whisker suggests that few are highly active. A median below that of the "No" group would agree with the lower mean activity in the summary table (0.988 vs 1.11) and could reflect lifestyle factors or genetic influences.
- Physical Activity Distribution for Individuals without a Family History of Overweight (No): The box for the "No" group shows the same statistics for those without a family history of overweight. A box that sits higher than the "Yes" box suggests that individuals in this group are more active. A taller box indicates more varied activity habits, which would agree with the slightly larger standard deviation in this group (0.928 vs 0.831). A lower whisker that stops above zero would suggest that fewer individuals in this group are inactive.

Summary

This box plot compares the distribution of physical activity levels for individuals with and without a family history of overweight. It helps identify whether family history is associated with lower or higher levels of physical activity, potentially highlighting at-risk groups for inactivity. Understanding these trends can inform interventions aimed at promoting physical activity and preventing obesity in genetically predisposed populations.

Group 3: Grouping by Gender

gender_group <- obesity %>%
  group_by(Gender) %>%
  summarize(mean_Weight = mean(Weight, na.rm = TRUE),
            sd_Weight = sd(Weight, na.rm = TRUE),
            count = n())

# Display the result
gender_group

## # A tibble: 2 × 4
##   Gender mean_Weight sd_Weight count
##   <chr>        <dbl>     <dbl> <int>
## 1 Female        82.3      29.7  1043
## 2 Male          90.8      21.4  1068

Insights

Mean Weight: Males: The mean weight for males is 90.8, about 8.5 higher than for females. This may reflect physiological differences, lifestyle factors, or both; males typically have higher muscle mass and may engage in different physical activities. Females: The mean weight for females is 82.3. The lower value may be due to physiological differences such as lower muscle mass and hormonal differences, as well as gender-based differences in metabolism, activity level, and diet.
Weight Variability (Standard Deviation): Males: The standard deviation for males is 21.4, so male weights are more clustered around their mean than female weights are. Females: The standard deviation for females is 29.7, noticeably larger than for males. This indicates more variability in weight among women, with more individuals far below or far above the average, which could be due to a wider range of physical activity levels, diet, or genetic predispositions.
Count: Males: There are 1,068 male participants, a sample large enough for reliable group statistics. Females: There are 1,043 female participants. The two groups are nearly balanced, so comparisons between male and female weight trends are not distorted by unequal representation.

Visualization 3: Average Weight by Gender

# Bar plot for average Weight by Gender
ggplot(gender_group, aes(x = Gender, y = mean_Weight, fill = Gender)) +
  geom_bar(stat = "identity", width = 0.6) +
  labs(title = "Average Weight by Gender", x = "Gender", y = "Average Weight") +
  theme_minimal()

Insights

Average Weight by Gender: Height of Bars: The height of each bar represents the average weight for each gender group. Male: The bar for males (90.8) is taller than the bar for females, indicating that, on average, males in the dataset weigh more. This is consistent with general physiological trends where males typically have higher muscle mass and larger body frames. Female: The bar for females (82.3) is shorter, which might be influenced by generally lower muscle mass and different body composition compared to males.
Comparing Average Weights: Differences: The difference in bar height shows the magnitude of the gap between the genders. Here it is about 8.5, roughly 10% of the female mean, so the disparity is visible but moderate.
Understanding Context: Physiological Factors: Higher average weight in males can be attributed to physiological differences such as more muscle mass and bone density, while lower average weight in females can reflect differences in body fat distribution and overall body composition. Health Implications: A difference in average weight could have implications for health interventions, but bar heights alone do not establish whether the difference is statistically significant. Distribution of Data: The bar plot provides a straightforward comparison of average weights between genders but does not show the distribution or variability within each gender group. To understand the spread of weights and identify outliers, additional plots such as box plots or violin plots could be helpful.
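
To look at the spread that a bar plot of means hides, a box plot by gender is one option. The sketch below uses a small made-up data frame (`toy`, hypothetical values for illustration only) so that it runs on its own; in this report the same call would be pointed at `obesity` instead.

```r
library(ggplot2)

# Toy data for illustration only; replace `toy` with `obesity` in the report
toy <- data.frame(
  Gender = rep(c("Female", "Male"), each = 5),
  Weight = c(52, 60, 75, 98, 130, 70, 82, 90, 101, 112)
)

# One box per gender: median line, interquartile box, whiskers, outliers
p <- ggplot(toy, aes(x = Gender, y = Weight, fill = Gender)) +
  geom_boxplot(width = 0.5) +
  labs(title = "Weight Distribution by Gender", x = "Gender", y = "Weight") +
  theme_minimal()

p
```

Swapping `geom_boxplot()` for `geom_violin()` would show the shape of each distribution rather than its quartiles.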

Probability Calculation for Groups

# Calculate probability of each group in Gender, AgeRange, and Family History groups
gender_prob <- gender_group %>%
  mutate(probability = count / sum(count))

age_prob <- age_group %>%
  mutate(probability = count / sum(count))

family_history_prob <- family_history_group %>%
  mutate(probability = count / sum(count))

# Tagging lowest probability groups
gender_prob <- gender_prob %>%
  mutate(tag = ifelse(probability == min(probability), "Low Probability", "Normal"))

age_prob <- age_prob %>%
  mutate(tag = ifelse(probability == min(probability), "Low Probability", "Normal"))

family_history_prob <- family_history_prob %>%
  mutate(tag = ifelse(probability == min(probability), "Low Probability", "Normal"))

# Display probability results
gender_prob

## # A tibble: 2 × 6
##   Gender mean_Weight sd_Weight count probability tag            
##   <chr>        <dbl>     <dbl> <int>       <dbl> <chr>          
## 1 Female        82.3      29.7  1043       0.494 Low Probability
## 2 Male          90.8      21.4  1068       0.506 Normal

age_prob

## # A tibble: 5 × 6
##   AgeRange mean_weight sd_weight count probability tag            
##   <fct>          <dbl>     <dbl> <int>       <dbl> <chr>          
## 1 0-18            67.4     19.2    241    0.114    Normal         
## 2 19-30           88.5     27.2   1514    0.717    Normal         
## 3 31-45           92.1     19.6    342    0.162    Normal         
## 4 46-60           80.8      9.84    13    0.00616  Normal         
## 5 61+             66       NA        1    0.000474 Low Probability

family_history_prob

## # A tibble: 2 × 6
##   family_history_with_overwe…¹ mean_activity sd_activity count probability tag  
##   <chr>                                <dbl>       <dbl> <int>       <dbl> <chr>
## 1 no                                   1.11        0.928   385       0.182 Low …
## 2 yes                                  0.988       0.831  1726       0.818 Norm…
## # ℹ abbreviated name: ¹family_history_with_overweight

Insights

Gender Distribution: Most Common Gender: Male has the higher probability (0.506), so males are slightly more prevalent in the dataset. Low Probability Gender: Female has the lower probability (0.494) and is tagged "Low Probability", but the gap is only 25 individuals, so the tag marks the smaller group rather than a rare one.
Age Range Distribution: Most Common Age Range: "19-30" has the highest probability (0.717), so most individuals fall within this age range. Low Probability Age Range: "61+" has the lowest probability (0.000474, a single individual) and is the least represented. "46-60" is also rare (0.00616, 13 individuals) but is tagged "Normal" because the tag marks only the minimum.
Family History Distribution: Most Common Family History Group: "yes" has the highest probability (0.818), so most individuals report a family history of being overweight. Low Probability Family History Group: "no" has the lowest probability (0.182), so fewer than one in five individuals in the dataset have no family history of being overweight.
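
The probabilities above are simple proportions of the group counts. As a cross-check, the sketch below recomputes them with base R's `prop.table()` from counts copied out of the summary tables; the vector names are chosen here for illustration and are not objects from the analysis.

```r
# Group counts copied from the summary tables above
gender_counts <- c(Female = 1043, Male = 1068)
family_counts <- c(no = 385, yes = 1726)

# prop.table() divides each count by the total, matching count / sum(count)
gender_share <- prop.table(gender_counts)
family_share <- prop.table(family_counts)

round(gender_share, 3)
round(family_share, 3)

# Name of the least common group in each variable
names(which.min(gender_share))
names(which.min(family_share))
```

Note that `which.min()` returns only the first minimum, whereas the `probability == min(probability)` tag used above would flag every group tied for the lowest probability.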

Testable hypothesis

# Hypothesis test: Comparing Physical Activity Levels for Family History vs. No Family History
t.test(FAF ~ family_history_with_overweight, data = obesity)

## 
##  Welch Two Sample t-test
## 
## data:  FAF by family_history_with_overweight
## t = 2.4315, df = 530.05, p-value = 0.01537
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
##  0.02397149 0.22563956
## sample estimates:
##  mean in group no mean in group yes 
##         1.1123414         0.9875359

Insights

Comparison of Means: Mean Physical Activity Level for Each Group: The mean physical activity level is 1.11 for individuals without a family history of overweight and 0.99 for those with one, a difference of about 0.125 (95% confidence interval 0.024 to 0.226).
Statistical Significance: The p-value of 0.015 is below 0.05, so there is a statistically significant difference in physical activity levels between individuals with and without a family history of overweight. This shows an association between family history and activity level in this dataset; it does not by itself show that family history causes lower activity.
Effect Size and Practical Significance: Effect Size: The mean difference of about 0.125 is small compared with the standard deviations within each group (0.83 and 0.93), so the result may be statistically significant without being practically meaningful. Practical Implications: If the difference holds up in other data, it can inform interventions or health strategies tailored to individuals with a family history of overweight.
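
One common way to put a number on effect size is Cohen's d, the mean difference divided by the pooled standard deviation. Cohen's d is not computed in the original analysis; the sketch below is an illustrative addition that uses the group means, standard deviations, and counts printed above (the standard deviations are rounded as shown in the table, so the result is approximate).

```r
# Summary statistics copied from the outputs above
mean_no  <- 1.1123414; sd_no  <- 0.928; n_no  <- 385
mean_yes <- 0.9875359; sd_yes <- 0.831; n_yes <- 1726

# Pooled standard deviation across the two groups
sd_pooled <- sqrt(((n_no - 1) * sd_no^2 + (n_yes - 1) * sd_yes^2) /
                    (n_no + n_yes - 2))

# Cohen's d: mean difference expressed in pooled-SD units
cohens_d <- (mean_no - mean_yes) / sd_pooled
cohens_d
```

The value comes out at roughly 0.15, which is below the 0.2 often used as a rule-of-thumb threshold for a "small" effect.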

Combination of Two Categorical Variables

# Create a data frame for unique combinations of Gender and Family History
combination_df <- obesity %>%
  count(Gender, family_history_with_overweight) %>%
  arrange(desc(n))

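# Find missing combinations
# count() never returns zero-count rows, so add them with complete() before filtering
missing_combinations <- combination_df %>%
  complete(Gender, family_history_with_overweight, fill = list(n = 0L)) %>%
  filter(n == 0)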

# Display the most/least common combinations and any missing ones
combination_df

##   Gender family_history_with_overweight   n
## 1   Male                            yes 915
## 2 Female                            yes 811
## 3 Female                             no 232
## 4   Male                             no 153

missing_combinations

## [1] Gender                         family_history_with_overweight
## [3] n                             
## <0 rows> (or 0-length row.names)

Insights

Most Common Combinations: High Counts: The combinations with the highest counts are the most prevalent groups in the dataset:

Male with Family History of Overweight (915) is the most common combination, followed by Female with Family History of Overweight (811). Together they account for 1,726 of the 2,111 individuals, so findings from this dataset mostly describe people who report a family history of overweight. Knowing which groups are best represented is useful for generalizing findings or designing targeted interventions.

Least Common Combinations: Low Counts: The combinations with the lowest counts are the less prevalent groups:

Female without Family History of Overweight (232) and Male without Family History of Overweight (153) are the least common combinations. Males without a family history form the smallest group, so estimates for them are the least precise. Identifying these groups helps in understanding the dataset's distribution and highlights where data may be underrepresented.

Missing Combinations: Missing Data: Rows in missing_combinations would indicate combinations of Gender and family_history_with_overweight that never occur in the dataset:

Here missing_combinations has zero rows, and all four combinations appear in combination_df with nonzero counts, so none is missing. Had a combination been absent, it could point to data collection limitations or to a real absence in the population represented by the dataset, and it would need to be considered when assessing potential biases or planning future data collection.
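
One caveat when checking for missing combinations: `count()` only returns combinations that occur at least once, so filtering its result for `n == 0` cannot find absent ones. The sketch below shows, on a small made-up tibble (`toy`, hypothetical and for illustration only), how `tidyr::complete()` can add the absent combinations so that they become visible.

```r
library(dplyr)
library(tidyr)

# Toy data for illustration only: no Male / "no" rows exist
toy <- tibble(
  Gender = c("Female", "Female", "Male"),
  family_history_with_overweight = c("yes", "no", "yes")
)

# count() only returns combinations that occur at least once
observed <- toy %>%
  count(Gender, family_history_with_overweight)

# complete() adds the absent combinations, filled with n = 0
all_combinations <- observed %>%
  complete(Gender, family_history_with_overweight, fill = list(n = 0L))

# Now the filter can actually find the missing combination
missing_in_toy <- all_combinations %>%
  filter(n == 0)

missing_in_toy
```

In the toy data the only row returned is Male with "no", the one combination that was never observed.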

Visualization: Bar Plot of Gender and Family History Combinations

# Bar plot for combinations of Gender and Family History
ggplot(combination_df, aes(x = interaction(Gender, family_history_with_overweight), y = n, fill = family_history_with_overweight)) +
  geom_bar(stat = "identity") +
  labs(title = "Count of Gender and Family History Combinations", x = "Gender and Family History Combination", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Insights

a. Count of Combinations: High Counts: The tallest bars represent the most common groups in the dataset. Male with Family History of Overweight (915) and Female with Family History of Overweight (811) are the tallest, so both are well represented. Low Counts: The shortest bars represent less common combinations. Male without Family History of Overweight (153) is the shortest bar, followed by Female without Family History of Overweight (232).
b. Distribution of Family History Across Genders: Comparing Within Genders: Within each gender, those with a family history far outnumber those without: 811 vs 232 for females (about 78% with a family history) and 915 vs 153 for males (about 86%). Comparing Across Genders: A family history of overweight is somewhat more common among males than among females in this dataset, although it is the majority status for both.
c. Visual Patterns and Trends: Color Coding: Coloring the bars by family history makes it easy to compare the "yes" and "no" counts within each gender and to see whether the split differs between genders. Label Orientation: Rotating the x-axis labels makes it easier to read and compare combinations, especially when there are many categories.

Conclusion and Insights

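The lowest probability groups were Female (0.494) for gender, 61+ (0.000474) for age range, and no family history of overweight (0.182) for family history.
Only the oldest age ranges are genuinely rare (13 individuals aged 46-60 and 1 aged 61+), which may reflect how the data were collected as much as any genetic or environmental factors. The gender split is nearly even, and no combination of gender and family history is missing.
The t-test compared physical activity rather than BMI, so the hypothesis that family history contributes to higher BMI was not tested directly. The test does support a small difference in physical activity: individuals with a family history of overweight are slightly less active on average (0.99 vs 1.11, p = 0.015).
Further investigation into the influence of physical activity and diet on these relationships is recommended.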

Saisree mucharla

2024-09-17
