STATS_WEEK

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

df <- read.csv("~/Downloads/ObesityDataSet_raw_and_data_sinthetic.csv", header=TRUE)

Group_BY Dataframe 1 : Group by obesity level (NObeyesdad) and family history of overweight, and count the occurrences

df |> group_by(NObeyesdad,family_history_with_overweight) |> summarise(count = n(), Mean_Value = mean(Weight))

## `summarise()` has grouped output by 'NObeyesdad'. You can override using the
## `.groups` argument.

## # A tibble: 13 × 4
## # Groups:   NObeyesdad [7]
##    NObeyesdad          family_history_with_overweight count Mean_Value
##    <chr>               <chr>                          <int>      <dbl>
##  1 Insufficient_Weight no                               146       46.2
##  2 Insufficient_Weight yes                              126       54.2
##  3 Normal_Weight       no                               132       61.0
##  4 Normal_Weight       yes                              155       63.1
##  5 Obesity_Type_I      no                                 7       95.4
##  6 Obesity_Type_I      yes                              344       92.8
##  7 Obesity_Type_II     no                                 1       93  
##  8 Obesity_Type_II     yes                              296      115. 
##  9 Obesity_Type_III    yes                              324      121. 
## 10 Overweight_Level_I  no                                81       70.4
## 11 Overweight_Level_I  yes                              209       75.8
## 12 Overweight_Level_II no                                18       82.0
## 13 Overweight_Level_II yes                              272       82.1

1) What is the rare occurrence in the group-by dataframe?

There is no combination of ‘Obesity_Type_III’ and ‘no’ family history.
Among the observed combinations, NObeyesdad = Obesity_Type_II with family_history_with_overweight = no is the rarest, with a count of 1.

2) What does this mean in my data context and also in terms of probability?

There is no combination of ‘Obesity_Type_III’ and ‘no’ family history. In real-world terms, this suggests that family history (and possibly genetics) plays a significant role in the Obesity_Type_III category.
P(NObeyesdad = Obesity_Type_II and family_history_with_overweight = no) = 1/2111 ≈ 0.00047.
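
The joint probabilities can be recomputed directly from the counts printed above. The following is a minimal sketch in base R, with the counts typed in by hand from the table; the names `counts`, `total` and `rarest` are illustrative and not part of the original analysis.

```r
# Counts copied by hand from the grouped table above
# (obesity level x family history of overweight).
counts <- data.frame(
  NObeyesdad = c("Insufficient_Weight", "Insufficient_Weight",
                 "Normal_Weight", "Normal_Weight",
                 "Obesity_Type_I", "Obesity_Type_I",
                 "Obesity_Type_II", "Obesity_Type_II",
                 "Obesity_Type_III",
                 "Overweight_Level_I", "Overweight_Level_I",
                 "Overweight_Level_II", "Overweight_Level_II"),
  family_history = c("no", "yes", "no", "yes", "no", "yes", "no", "yes",
                     "yes", "no", "yes", "no", "yes"),
  count = c(146, 126, 132, 155, 7, 344, 1, 296, 324, 81, 209, 18, 272)
)

total <- sum(counts$count)        # number of rows in the dataset
counts$p <- counts$count / total  # empirical joint probability of each cell

# Rarest observed combination (smallest non-zero count)
rarest <- counts[which.min(counts$count), ]
print(rarest)
```

Combinations that never occur (such as Obesity_Type_III with no family history) do not appear as rows here, so their empirical probability of zero has to be read off separately.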

3) Drawing a testable hypothesis on why this group is rarer than others (i.e. something quantifiable)

Hypothesis: Family history has more influence on the Obesity_Type_III category than on the Insufficient_Weight category.
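
One possible way to make this hypothesis quantifiable is to compare the share of people with a family history between the two categories. The sketch below assumes that Fisher's exact test is an acceptable choice (it is used here because one cell is zero); the 2 x 2 table `tab` is typed in from the counts above and is not part of the original analysis.

```r
# 2 x 2 table built by hand from the counts above.
# Rows: obesity category; columns: family history of overweight.
tab <- matrix(c(0,   324,
                146, 126),
              nrow = 2, byrow = TRUE,
              dimnames = list(
                category = c("Obesity_Type_III", "Insufficient_Weight"),
                family_history = c("no", "yes")))

# Fisher's exact test of independence between category and family history.
res <- fisher.test(tab)
res$p.value  # a small p-value would suggest the two categories differ
```

A small p-value here would only indicate an association in this dataset; it would not by itself establish a genetic cause.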

4). Visualization

ggplot(df, aes(x = NObeyesdad, fill = family_history_with_overweight)) +
  geom_bar(position = 'dodge', color = 'black') +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Group_BY Dataframe 2 : Gender,FAVC (Do you eat high caloric food frequently?) and SCC (Do you monitor the calories you eat daily?)

gb <-(df |> group_by(Gender,FAVC,SCC) |> summarise(count = n()))

## `summarise()` has grouped output by 'Gender', 'FAVC'. You can override using
## the `.groups` argument.

gb

## # A tibble: 8 × 4
## # Groups:   Gender, FAVC [4]
##   Gender FAVC  SCC   count
##   <chr>  <chr> <chr> <int>
## 1 Female no    no      116
## 2 Female no    yes      27
## 3 Female yes   no      857
## 4 Female yes   yes      43
## 5 Male   no    no       91
## 6 Male   no    yes      11
## 7 Male   yes   no      951
## 8 Male   yes   yes      15

1) What is the rare occurrence in the group-by dataframe?

The rarest occurrence is Male, FAVC = no, SCC = yes with a count of 11.

2) What does this mean (in terms of probability)?

- In real-world terms, very few males both avoid frequent high-calorie food and monitor their calorie intake.
- The probability of the combination Male, FAVC = no, SCC = yes is very low compared to other combinations: P = 11/2111 ≈ 0.0052.

3) Drawing a testable hypothesis on why this group is rarer than others (i.e. something quantifiable)

Testable hypothesis: Males who do not frequently eat high-calorie food are less likely to monitor their calories daily.
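
As a first check of this hypothesis, the share of males who monitor calories can be compared between the two FAVC groups. This is a sketch using the male counts typed in from the table above; the variable names and the choice of Fisher's exact test are assumptions for illustration.

```r
# Male counts copied by hand from the table above, split by FAVC.
scc_yes <- c(FAVC_no = 11, FAVC_yes = 15)            # males with SCC = yes
n_males <- c(FAVC_no = 91 + 11, FAVC_yes = 951 + 15) # all males per FAVC group

# Share of males who monitor calories within each FAVC group.
share_monitoring <- scc_yes / n_males
round(share_monitoring, 3)

# 2 x 2 table (rows: FAVC, columns: SCC) for a test of association.
tab <- matrix(c(11, 91,
                15, 951),
              nrow = 2, byrow = TRUE,
              dimnames = list(FAVC = c("no", "yes"), SCC = c("yes", "no")))
res <- fisher.test(tab)
res$p.value
```

In these counts the share is about 10.8% for FAVC = no versus about 1.6% for FAVC = yes, so the observed direction appears to run against the hypothesis as stated; the group is rare mainly because few males report FAVC = no at all.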

4) Visualization

ggplot(gb,aes(x = interaction(FAVC, SCC), y = count, fill = Gender)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal()

Group_BY Dataframe 3 : Group by Gender, NObeyesdad and summarize median Age

gas <- df |> group_by(Gender, NObeyesdad) |> summarize(Median_Age = median(Age), count = n())

## `summarise()` has grouped output by 'Gender'. You can override using the
## `.groups` argument.

gas

## # A tibble: 14 × 4
## # Groups:   Gender [2]
##    Gender NObeyesdad          Median_Age count
##    <chr>  <chr>                    <dbl> <int>
##  1 Female Insufficient_Weight       19.9   173
##  2 Female Normal_Weight             21     141
##  3 Female Obesity_Type_I            23     156
##  4 Female Obesity_Type_II           24.5     2
##  5 Female Obesity_Type_III          25.4   323
##  6 Female Overweight_Level_I        21.5   145
##  7 Female Overweight_Level_II       25.1   103
##  8 Male   Insufficient_Weight       18      99
##  9 Male   Normal_Weight             21     146
## 10 Male   Obesity_Type_I            22.7   195
## 11 Male   Obesity_Type_II           27.3   295
## 12 Male   Obesity_Type_III          18       1
## 13 Male   Overweight_Level_I        21.0   145
## 14 Male   Overweight_Level_II       23.9   187

1) What is the rare occurrence in the group-by dataframe?

Male, Obesity_Type_III is the rarest occurrence (count 1), and the next rarest is Female, Obesity_Type_II (count 2).

2) What does this mean (in terms of probability)?

- It can be deduced that almost all Obesity_Type_III cases (all except 1) are female.
- The probability of the Male, Obesity_Type_III combination is very low: P = 1/2111 ≈ 0.00047.

3) Drawing a testable hypothesis on why this group is rarer than others (i.e. something quantifiable)

Testable hypothesis: The share of females relative to males increases as the obesity level rises.
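
A simple way to look at this hypothesis is to compute the female share within each obesity level. The sketch below uses the counts typed in from the table above; the ordering of the levels from lightest to heaviest is an assumption made for illustration, as are the variable names.

```r
# Obesity levels in an assumed order from lightest to heaviest.
levels_ordered <- c("Insufficient_Weight", "Normal_Weight",
                    "Overweight_Level_I", "Overweight_Level_II",
                    "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III")

# Counts copied by hand from the table above, in the same order.
female <- c(173, 141, 145, 103, 156, 2, 323)
male   <- c(99, 146, 145, 187, 195, 295, 1)

# Share of females within each obesity level.
female_share <- setNames(female / (female + male), levels_ordered)
round(female_share, 3)
```

Note that Obesity_Type_II is almost entirely male in these counts while Obesity_Type_III is almost entirely female, so the female share does not rise steadily with obesity level.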

4) Visualization

ggplot(gas, aes(x = NObeyesdad, y = count, fill = Gender)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Group_BY Dataframe 4 : Group by ‘SMOKE’ and ‘CALC’ (alcohol consumption) and count.

df |> group_by(SMOKE, CALC) |> summarise(count = n(), Median_Weight = median(Weight))

## `summarise()` has grouped output by 'SMOKE'. You can override using the
## `.groups` argument.

## # A tibble: 7 × 4
## # Groups:   SMOKE [2]
##   SMOKE CALC       count Median_Weight
##   <chr> <chr>      <int>         <dbl>
## 1 no    Always         1          65  
## 2 no    Frequently    63          78.4
## 3 no    Sometimes   1370          89.8
## 4 no    no           633          80  
## 5 yes   Frequently     7          84  
## 6 yes   Sometimes     31         102  
## 7 yes   no             6          77.5

1) What is the rare occurrence in the group-by dataframe?

‘Always’ alcohol consumption is the rare occurrence.
There is only one observation with SMOKE = no and CALC = Always.
There is no observation with SMOKE = yes and CALC = Always.

2) What does this mean (in terms of probability)?

In the context of this dataset, people rarely report ALWAYS consuming alcohol.
In probability terms, the combination CALC = Always, SMOKE = no is highly infrequent in the dataset: P(no, Always) = 1/2111 ≈ 0.00047.
Note: for SMOKE = yes and CALC = Always, the empirical probability is zero.

3) Drawing a testable hypothesis on why this group is rarer than others (i.e. something quantifiable)

Non-smokers are more likely to have a higher frequency of alcohol consumption compared to smokers. This hypothesis can be tested by comparing the corresponding combination counts.
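
The comparison of counts can be sketched as follows. Treating CALC = Frequently or Always as "high frequency" is an assumption made here for illustration, and the variable names are not part of the original analysis.

```r
# Counts copied by hand from the SMOKE x CALC table above.
# "High frequency" is assumed to mean CALC = Frequently or Always.
high_freq <- c(no = 1 + 63, yes = 0 + 7)
n_group   <- c(no = 1 + 63 + 1370 + 633, yes = 7 + 31 + 6)

# Share of high-frequency drinkers among non-smokers and smokers.
share_high_freq <- high_freq / n_group
round(share_high_freq, 3)
```

In these counts the share is higher among smokers than among non-smokers, although the smoker group is small (44 people), so the hypothesis would need a formal test before drawing a conclusion.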

5) Why do you think these are missing?

The absence of CALC = Always & SMOKE = yes can be explained by the marginal probabilities: the probability of CALC = Always is already 1/2111 (≈ 0.00047), and the probability of SMOKE = yes is also low (44/2111 ≈ 0.021). If the two were independent, the combination would therefore also have a very low probability.
In real-world terms, it is possible that people mostly lean towards one of the two habits rather than both.
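
The reasoning about the missing cell can be made concrete by computing the count expected if SMOKE and CALC were independent. This is a minimal sketch using the totals from the table above; independence is an assumption of the calculation, not a finding.

```r
n_total  <- 2111  # all rows in the dataset
n_always <- 1     # rows with CALC = Always (both SMOKE levels combined)
n_smoke  <- 44    # rows with SMOKE = yes (7 + 31 + 6)

# Marginal probabilities
p_always <- n_always / n_total
p_smoke  <- n_smoke / n_total

# Expected number of rows with CALC = Always and SMOKE = yes,
# assuming the two variables are independent.
expected_count <- n_total * p_always * p_smoke
expected_count
```

The expected count is far below 1, so observing zero such rows is consistent with independence and does not by itself show that the two habits exclude each other.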

6) Which combinations are the most/least common, and why might that be?

- CALC = Sometimes + SMOKE = no is the most common combination (count 1370).
- CALC = Always + SMOKE = no is the least common observed combination (count 1).
- CALC = Always + SMOKE = yes has 0 occurrences.

7) Find a way to visualize at least one of the combinations.

ggplot((df |> group_by(SMOKE, CALC) |> summarise(count = n(), Median_Weight = median(Weight))), aes(x = CALC, y = count, fill = SMOKE)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal()

## `summarise()` has grouped output by 'SMOKE'. You can override using the
## `.groups` argument.

Conclusion from Analysis

Family history appears to play a significant role in the Obesity_Type_III category.
Very few males both avoid frequent high-calorie food and monitor their calorie intake.
Almost all Obesity_Type_III cases (all except 1) are female.
People rarely report ALWAYS consuming alcohol.

Further Investigation:

All the testable hypotheses have to be evaluated. Thank you.

STATS_WEEK_3

2024-09-17
