library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
df <- read.csv("~/Downloads/ObesityDataSet_raw_and_data_sinthetic.csv", header=TRUE)

Group_BY Dataframe 1 : Obesity level (NObeyesdad) and family history of overweight, with counts and mean Weight

df |> group_by(NObeyesdad,family_history_with_overweight) |> summarise(count = n(), Mean_Value = mean(Weight))
## `summarise()` has grouped output by 'NObeyesdad'. You can override using the
## `.groups` argument.
## # A tibble: 13 × 4
## # Groups:   NObeyesdad [7]
##    NObeyesdad          family_history_with_overweight count Mean_Value
##    <chr>               <chr>                          <int>      <dbl>
##  1 Insufficient_Weight no                               146       46.2
##  2 Insufficient_Weight yes                              126       54.2
##  3 Normal_Weight       no                               132       61.0
##  4 Normal_Weight       yes                              155       63.1
##  5 Obesity_Type_I      no                                 7       95.4
##  6 Obesity_Type_I      yes                              344       92.8
##  7 Obesity_Type_II     no                                 1       93  
##  8 Obesity_Type_II     yes                              296      115. 
##  9 Obesity_Type_III    yes                              324      121. 
## 10 Overweight_Level_I  no                                81       70.4
## 11 Overweight_Level_I  yes                              209       75.8
## 12 Overweight_Level_II no                                18       82.0
## 13 Overweight_Level_II yes                              272       82.1

1) What is the rarest occurrence in the grouped data frame?

  • There is no combination of ‘Obesity_Type_III’ and ‘no’ family history (zero rows).
  • Among the combinations that do occur, Obesity level = Obesity_Type_II with family_history_with_overweight = no has the lowest count (1) and therefore the lowest probability.

2) What does this mean in my data context and also in terms of probability?

  • There is no combination of ‘Obesity_Type_III’ and ‘no’ family history. In a real-world sense, this suggests that genetics plays a significant role in the Obesity_Type_III category.
  • P(Obesity level = Obesity_Type_II and family_history_with_overweight = no) = 1/2111 ≈ 0.00047.
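
The probabilities above can be reproduced from the printed counts alone. This is a minimal base-R sketch; the vector `counts` is typed in by hand from the table (with an explicit 0 added for the missing Obesity_Type_III/no cell), so the names used here are illustrative and not part of the original analysis.

```r
# Counts copied by hand from the grouped table (obesity level / family history).
# The Obesity_Type_III/no cell does not appear in the table, so it is entered as 0.
counts <- c(
  "Insufficient_Weight/no" = 146, "Insufficient_Weight/yes" = 126,
  "Normal_Weight/no"       = 132, "Normal_Weight/yes"       = 155,
  "Obesity_Type_I/no"      = 7,   "Obesity_Type_I/yes"      = 344,
  "Obesity_Type_II/no"     = 1,   "Obesity_Type_II/yes"     = 296,
  "Obesity_Type_III/no"    = 0,   "Obesity_Type_III/yes"    = 324,
  "Overweight_Level_I/no"  = 81,  "Overweight_Level_I/yes"  = 209,
  "Overweight_Level_II/no" = 18,  "Overweight_Level_II/yes" = 272
)

n_total <- sum(counts)        # total number of rows in the dataset
probs   <- counts / n_total   # empirical joint probability of each combination

# Rarest combination that actually occurs (ignoring the empty cell)
rarest_observed <- names(which.min(counts[counts > 0]))
rarest_observed
probs[rarest_observed]
```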

3) Drawing a testable hypothesis on why this group is rarer than others (i.e. something quantifiable)

  • Hypothesis: family history has more influence on the Obesity_Type_III category than on the Insufficient_Weight category.
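
One possible way to make this hypothesis quantifiable is to compare the share of people with a family history in the two categories. The sketch below uses Fisher's exact test on the counts from the table; the choice of test and the object names are assumptions for illustration, not part of the original analysis.

```r
# 2 x 2 table built by hand from the grouped counts above.
fh <- matrix(
  c(324,   0,    # Obesity_Type_III:    yes, no
    126, 146),   # Insufficient_Weight: yes, no
  nrow = 2, byrow = TRUE,
  dimnames = list(level = c("Obesity_Type_III", "Insufficient_Weight"),
                  family_history = c("yes", "no"))
)

# Share of each obesity level that reports a family history of overweight
share_yes <- fh[, "yes"] / rowSums(fh)
share_yes

# Fisher's exact test copes with the zero cell; a small p-value would indicate
# that the two shares differ by more than chance alone would explain.
test <- fisher.test(fh)
test$p.value
```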

4) Visualization

ggplot(df, aes(x = NObeyesdad, fill = family_history_with_overweight)) +
  geom_bar(position = 'dodge', color = 'black') +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Group_BY Dataframe 2 : Gender,FAVC (Do you eat high caloric food frequently?) and SCC (Do you monitor the calories you eat daily?)

gb <-(df |> group_by(Gender,FAVC,SCC) |> summarise(count = n()))
## `summarise()` has grouped output by 'Gender', 'FAVC'. You can override using
## the `.groups` argument.
gb
## # A tibble: 8 × 4
## # Groups:   Gender, FAVC [4]
##   Gender FAVC  SCC   count
##   <chr>  <chr> <chr> <int>
## 1 Female no    no      116
## 2 Female no    yes      27
## 3 Female yes   no      857
## 4 Female yes   yes      43
## 5 Male   no    no       91
## 6 Male   no    yes      11
## 7 Male   yes   no      951
## 8 Male   yes   yes      15

1) What is the rarest occurrence in the grouped data frame?

  • The rarest occurrence is Male, FAVC = no, SCC = yes with a count of 11.

2) What does this mean (in terms of probability)?

  • In a real-world sense, very few males both avoid frequent high-calorie food and track their daily calories.
  • The probability of Male, FAVC = no, SCC = yes is very low compared to the other combinations: P = 11/2111 ≈ 0.0052.

3) Drawing a testable hypothesis on why this group is rarer than others (i.e. something quantifiable)

  • Testable Hypothesis : Males who do not frequently eat high-calorie food are less likely to monitor their calories daily.
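
To make "less likely" concrete, the share of people who monitor calories (SCC = yes) can be computed within each Gender and FAVC group. The following base-R sketch rebuilds the counts by hand from the table above; `gb_counts` and `monitor_rate` are names introduced here for illustration only.

```r
# Counts copied by hand from the Gender / FAVC / SCC table above.
gb_counts <- data.frame(
  Gender = rep(c("Female", "Male"), each = 4),
  FAVC   = rep(rep(c("no", "yes"), each = 2), times = 2),
  SCC    = rep(c("no", "yes"), times = 4),
  count  = c(116, 27, 857, 43, 91, 11, 951, 15)
)

# 3-way contingency table: Gender x FAVC x SCC
tab <- xtabs(count ~ Gender + FAVC + SCC, data = gb_counts)

# P(SCC = yes | Gender, FAVC): share of each group that monitors calories
monitor_rate <- tab[, , "yes"] / (tab[, , "yes"] + tab[, , "no"])
round(monitor_rate, 3)
```

In these counts the rate for Male, FAVC = no is 11/102, which is higher than for Male, FAVC = yes (15/966) but lower than for Female, FAVC = no (27/143), so the outcome of the test depends on which comparison group is chosen.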

4) Visualization

ggplot(gb,aes(x = interaction(FAVC, SCC), y = count, fill = Gender)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal()

Group_BY Dataframe 3 : Group by Gender and NObeyesdad and summarize median Age

gas <- df |> group_by(Gender, NObeyesdad) |> summarize(Median_Age = median(Age), count = n())
## `summarise()` has grouped output by 'Gender'. You can override using the
## `.groups` argument.
gas
## # A tibble: 14 × 4
## # Groups:   Gender [2]
##    Gender NObeyesdad          Median_Age count
##    <chr>  <chr>                    <dbl> <int>
##  1 Female Insufficient_Weight       19.9   173
##  2 Female Normal_Weight             21     141
##  3 Female Obesity_Type_I            23     156
##  4 Female Obesity_Type_II           24.5     2
##  5 Female Obesity_Type_III          25.4   323
##  6 Female Overweight_Level_I        21.5   145
##  7 Female Overweight_Level_II       25.1   103
##  8 Male   Insufficient_Weight       18      99
##  9 Male   Normal_Weight             21     146
## 10 Male   Obesity_Type_I            22.7   195
## 11 Male   Obesity_Type_II           27.3   295
## 12 Male   Obesity_Type_III          18       1
## 13 Male   Overweight_Level_I        21.0   145
## 14 Male   Overweight_Level_II       23.9   187

1) What is the rarest occurrence in the grouped data frame?

  • Male, Obesity_Type_III is the rarest occurrence (count 1), and the next rarest is Female, Obesity_Type_II (count 2).

2) What does this mean (in terms of probability)?

  • It can be deduced that almost all Obesity_Type_III cases (323 of 324) are female.
  • The probability of the Male, Obesity_Type_III combination is very low: P = 1/2111 ≈ 0.00047.

3) Drawing a testable hypothesis on why this group is rarer than others (i.e. something quantifiable)

  • Testable Hypothesis: The proportion of females relative to males increases as the obesity level rises.
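
A first check of this hypothesis is to compute the female share at each obesity level. The sketch below uses the counts from the table above; the ordering of the levels from lightest to heaviest is an assumption made here, and `prop.trend.test()` from base R's stats package is only one possible choice of trend test.

```r
# Obesity levels in an assumed order from lightest to heaviest.
levels_ordered <- c("Insufficient_Weight", "Normal_Weight",
                    "Overweight_Level_I", "Overweight_Level_II",
                    "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III")

# Counts copied by hand from the Gender / NObeyesdad table, in that order.
female <- c(173, 141, 145, 103, 156,   2, 323)
male   <- c( 99, 146, 145, 187, 195, 295,   1)
names(female) <- names(male) <- levels_ordered

# Share of females within each obesity level
female_share <- female / (female + male)
round(female_share, 3)

# Chi-squared test for a linear trend in proportions across the ordered levels
trend <- prop.trend.test(x = female, n = female + male)
trend$p.value
```

Note that the shares are not monotone in these counts: Obesity_Type_II is almost entirely male (2 females out of 297) while Obesity_Type_III is almost entirely female (323 out of 324), so a single linear trend may describe the data poorly.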

4) Visualization

ggplot(gas, aes(x = NObeyesdad, y = count, fill = Gender)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Group_BY Dataframe 4 : Group by ‘SMOKE’ and ‘CALC’ (alcohol consumption), then count and summarize median Weight.

df |> group_by(SMOKE, CALC) |> summarise(count = n(), Median_Weight = median(Weight))
## `summarise()` has grouped output by 'SMOKE'. You can override using the
## `.groups` argument.
## # A tibble: 7 × 4
## # Groups:   SMOKE [2]
##   SMOKE CALC       count Median_Weight
##   <chr> <chr>      <int>         <dbl>
## 1 no    Always         1          65  
## 2 no    Frequently    63          78.4
## 3 no    Sometimes   1370          89.8
## 4 no    no           633          80  
## 5 yes   Frequently     7          84  
## 6 yes   Sometimes     31         102  
## 7 yes   no             6          77.5

1) What is the rarest occurrence in the grouped data frame?

  • ‘Always’ alcohol consumption is the rare occurrence.
  • There is only one row with SMOKE = no and CALC = Always.
  • There is NO row with SMOKE = yes and CALC = Always.

2) What does this mean (in terms of probability)?

  • In the context of this dataset, almost nobody reports ALWAYS consuming alcohol.
  • In probability terms, the combination CALC = Always, SMOKE = no is highly infrequent in the dataset: P(no, Always) = 1/2111 ≈ 0.00047.
  • Note: for SMOKE = yes and CALC = Always, the empirical probability is zero.

3) Drawing a testable hypothesis on why this group is rarer than others (i.e. something quantifiable)

  • Non-smokers are more likely to have a higher frequency of alcohol consumption compared to smokers. This hypothesis can be tested by comparing the corresponding combination counts.
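
One way to compare the counts is sketched below. It assumes that "higher frequency" means CALC is either Frequently or Always (that grouping is a choice made here, not something stated in the analysis) and uses a one-sided Fisher's exact test on the resulting 2 x 2 table.

```r
# Counts copied by hand from the SMOKE / CALC table, with CALC collapsed into
# two groups (assumption): Frequently + Always versus Sometimes + no.
drink <- matrix(
  c(1 + 63, 1370 + 633,   # SMOKE = no
    0 + 7,    31 + 6),    # SMOKE = yes
  nrow = 2, byrow = TRUE,
  dimnames = list(SMOKE = c("no", "yes"),
                  CALC  = c("Frequently_or_Always", "Sometimes_or_no"))
)

# Share of frequent/always drinkers among non-smokers and smokers
high_share <- drink[, "Frequently_or_Always"] / rowSums(drink)
round(high_share, 3)

# alternative = "greater" tests whether the odds of frequent drinking are
# higher for the first row (non-smokers) than for the second row (smokers).
test <- fisher.test(drink, alternative = "greater")
test$p.value
```

In these counts the share of frequent or always drinkers is 64/2067 among non-smokers and 7/44 among smokers, so the raw proportions point in the opposite direction to the hypothesis.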

5) Why do you think these are missing?

  • The absence of CALC = Always & SMOKE = yes can be explained by the marginal probabilities: P(CALC = Always) is already 1/2111 (≈ 0.00047), and P(SMOKE = yes) is also low at 44/2111 (≈ 0.021). If the two were independent, the expected count of the combination would be 2111 × (1/2111) × (44/2111) ≈ 0.02 rows, so observing none is unsurprising.
  • In a real-world sense, it is possible that people mostly lean towards one of the two habits rather than both.
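
The marginal-probability argument can be checked by computing the expected count of every SMOKE/CALC cell under independence. This is a small base-R sketch using the counts from the table above, with an explicit 0 for the missing cell; the object names are illustrative.

```r
# Observed counts copied by hand from the SMOKE / CALC table.
obs <- matrix(
  c(1, 63, 1370, 633,   # SMOKE = no
    0,  7,   31,   6),  # SMOKE = yes (no "Always" rows exist)
  nrow = 2, byrow = TRUE,
  dimnames = list(SMOKE = c("no", "yes"),
                  CALC  = c("Always", "Frequently", "Sometimes", "no"))
)

n <- sum(obs)

# Expected count per cell if SMOKE and CALC were independent:
# (row total * column total) / grand total
expected <- outer(rowSums(obs), colSums(obs)) / n
round(expected, 2)

# Expected number of rows for the missing combination
expected["yes", "Always"]
```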

6) Which combinations are the most/least common, and why might that be?

  • CALC = Sometimes + SMOKE = no is the most common combination (1370 rows).
  • CALC = Always + SMOKE = no is the least common observed combination (1 row).
  • CALC = Always + SMOKE = yes has 0 occurrences.

7) Find a way to visualize at least one of the combinations.

ggplot((df |> group_by(SMOKE, CALC) |> summarise(count = n(), Median_Weight = median(Weight))), aes(x = CALC, y = count, fill = SMOKE)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal()
## `summarise()` has grouped output by 'SMOKE'. You can override using the
## `.groups` argument.

Conclusion from Analysis

Further Investigation:

  • All of the testable hypotheses above still have to be evaluated.

Thank you.