library(ggplot2)
library(readr)
df  <- read_csv("~/Downloads/ObesityDataSet_raw_and_data_sinthetic.csv")
## Rows: 2111 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): Gender, family_history_with_overweight, FAVC, CAEC, SMOKE, SCC, CAL...
## dbl (8): Age, Height, Weight, FCVC, NCP, CH2O, FAF, TUE
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Summary

For numerical columns:

summary(df[, c("Age", "Height", "Weight", "FCVC", "NCP", "CH2O", "FAF", "TUE")])
##       Age            Height          Weight            FCVC      
##  Min.   :14.00   Min.   :1.450   Min.   : 39.00   Min.   :1.000  
##  1st Qu.:19.95   1st Qu.:1.630   1st Qu.: 65.47   1st Qu.:2.000  
##  Median :22.78   Median :1.700   Median : 83.00   Median :2.386  
##  Mean   :24.31   Mean   :1.702   Mean   : 86.59   Mean   :2.419  
##  3rd Qu.:26.00   3rd Qu.:1.768   3rd Qu.:107.43   3rd Qu.:3.000  
##  Max.   :61.00   Max.   :1.980   Max.   :173.00   Max.   :3.000  
##       NCP             CH2O            FAF              TUE        
##  Min.   :1.000   Min.   :1.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:2.659   1st Qu.:1.585   1st Qu.:0.1245   1st Qu.:0.0000  
##  Median :3.000   Median :2.000   Median :1.0000   Median :0.6253  
##  Mean   :2.686   Mean   :2.008   Mean   :1.0103   Mean   :0.6579  
##  3rd Qu.:3.000   3rd Qu.:2.477   3rd Qu.:1.6667   3rd Qu.:1.0000  
##  Max.   :4.000   Max.   :3.000   Max.   :3.0000   Max.   :2.0000

For categorical columns: unique values and their counts

# Tabulate every column, then keep only the nine categorical (chr) columns by position
uniq_values_count <- lapply(df, table)
uniq_values_count[c(1, 5, 6, 9, 10, 12, 15, 16, 17)]
## $Gender
## 
## Female   Male 
##   1043   1068 
## 
## $family_history_with_overweight
## 
##   no  yes 
##  385 1726 
## 
## $FAVC
## 
##   no  yes 
##  245 1866 
## 
## $CAEC
## 
##     Always Frequently         no  Sometimes 
##         53        242         51       1765 
## 
## $SMOKE
## 
##   no  yes 
## 2067   44 
## 
## $SCC
## 
##   no  yes 
## 2015   96 
## 
## $CALC
## 
##     Always Frequently         no  Sometimes 
##          1         70        639       1401 
## 
## $MTRANS
## 
##            Automobile                  Bike             Motorbike 
##                   457                     7                    11 
## Public_Transportation               Walking 
##                  1580                    56 
## 
## $NObeyesdad
## 
## Insufficient_Weight       Normal_Weight      Obesity_Type_I     Obesity_Type_II 
##                 272                 287                 351                 297 
##    Obesity_Type_III  Overweight_Level_I Overweight_Level_II 
##                 324                 290                 290
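
The hard-coded column positions above could instead be derived from the column types. The following is a minimal sketch, assuming the categorical columns are exactly the character columns reported by `read_csv()`; `toy` is a hypothetical stand-in for `df` so that the snippet runs on its own.

```r
# Hypothetical stand-in for df; swap in the real data frame when running the report.
toy <- data.frame(
  Gender = c("Female", "Male", "Male", "Female"),
  Age    = c(21, 23, 27, 22),
  SMOKE  = c("no", "no", "yes", "no"),
  stringsAsFactors = FALSE
)

# Flag the character columns instead of listing their positions by hand.
is_categorical <- vapply(toy, is.character, logical(1))

# One frequency table per categorical column.
cat_counts <- lapply(toy[is_categorical], table)
cat_counts
```

Selecting by type keeps the summary correct if columns are later reordered or added.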

Novel Questions

What is the purpose of this project?

The purpose of this project is to analyze the variables that affect obesity levels in individuals, and the relationships among them, based on eating habits and physical condition.

What is the data documentation of this dataset?

“This dataset include data for the estimation of obesity levels in individuals from the countries of Mexico, Peru and Colombia, based on their eating habits and physical condition.

The data contains 17 attributes and 2111 records, the records are labeled with the class variable NObesity (Obesity Level), that allows classification of the data using the values of Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II and Obesity Type III. 77% of the data was generated synthetically using the Weka tool and the SMOTE filter, 23% of the data was collected directly from users through a web platform.” - From Estimation of Obesity Levels Based On Eating Habits and Physical Condition [Dataset]. (2019). UCI Machine Learning Repository. https://doi.org/10.24432/C5H31Z.

How do average weight and average height differ across obesity levels?

aver_weight_per_level <- aggregate(Weight~NObeyesdad,df,FUN = function(x) c(mean = mean(x),max = max(x),min = min(x) ))
aver_weight_per_level
##            NObeyesdad Weight.mean Weight.max Weight.min
## 1 Insufficient_Weight    49.90633   65.00000   39.00000
## 2       Normal_Weight    62.15505   87.00000   42.30000
## 3      Obesity_Type_I    92.87020  125.00000   75.00000
## 4     Obesity_Type_II   115.30531  130.00000   93.00000
## 5    Obesity_Type_III   120.94111  173.00000  102.00000
## 6  Overweight_Level_I    74.26683   91.00000   53.00000
## 7 Overweight_Level_II    82.08527  102.00000   60.00000
aver_height_per_level <- aggregate(Height~NObeyesdad,df,FUN = function(x) c(mean = mean(x),max = max(x),min = min(x) ))
print(aver_height_per_level)
##            NObeyesdad Height.mean Height.max Height.min
## 1 Insufficient_Weight    1.691117   1.900000   1.520000
## 2       Normal_Weight    1.676585   1.930000   1.500000
## 3      Obesity_Type_I    1.693804   1.980000   1.500000
## 4     Obesity_Type_II    1.771795   1.920000   1.600000
## 5    Obesity_Type_III    1.687559   1.870000   1.560000
## 6  Overweight_Level_I    1.687836   1.900000   1.450000
## 7 Overweight_Level_II    1.703748   1.930000   1.480000

This aggregation gives a good idea of how the values are distributed within each obesity level.

  1. The average weight of Insufficient_Weight is very low (about 50), although its average height (about 1.69) is not very different from that of the other categories.

  2. Average weight differs clearly between categories (from about 50 to about 121), whereas average height stays within a narrow band (about 1.68 to 1.77).
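
One way to read the two tables together is through body mass index, BMI = weight / height^2. The sketch below plugs the group means printed above into that formula. It rests on assumptions: that Weight is in kilograms and Height in metres, and that a ratio of group means is an acceptable rough stand-in for the mean of individual BMIs. The excerpted documentation does not state that the class labels were derived from BMI.

```r
# Group means copied from the aggregate() output above, listed lightest to heaviest.
level_means <- data.frame(
  level  = c("Insufficient_Weight", "Normal_Weight", "Overweight_Level_I",
             "Overweight_Level_II", "Obesity_Type_I", "Obesity_Type_II",
             "Obesity_Type_III"),
  weight = c(49.90633, 62.15505, 74.26683, 82.08527, 92.87020, 115.30531, 120.94111),
  height = c(1.691117, 1.676585, 1.687836, 1.703748, 1.693804, 1.771795, 1.687559)
)

# Rough BMI per level: mean weight divided by squared mean height
# (assumes kilograms and metres).
level_means$approx_bmi <- level_means$weight / level_means$height^2
level_means[order(level_means$approx_bmi), c("level", "approx_bmi")]
```

With these inputs the approximate BMI rises steadily from Insufficient_Weight to Obesity_Type_III, which is consistent with weight rather than height separating the categories.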

What are the observations from the column summaries above?

Raw observations from the summaries of the numeric and categorical variables:

  1. Obesity is divided into 7 levels, and every level is well represented in the sample (between 272 and 351 records each).

  2. The dataset is almost evenly split between the two genders (1043 female, 1068 male), so the observations are not dominated by one gender.

  3. The majority of the sample uses Public_Transportation (1580 records), followed by Automobile (457).

  4. Age ranges from 14 to 61, but the 75th percentile is just 26. Three quarters of the sample is therefore aged 26 or younger, and ages near 61 sit far out in the upper tail.
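
The outlier remark about Age can be checked against the common 1.5 × IQR rule, using only the quartiles printed in the summary above. This is a sketch of one conventional outlier definition, not the only possible one.

```r
# Quartiles of Age taken from the summary() output above.
q1 <- 19.95
q3 <- 26.00

# Tukey's rule: values more than 1.5 * IQR beyond the quartiles are flagged.
iqr         <- q3 - q1
upper_fence <- q3 + 1.5 * iqr
lower_fence <- q1 - 1.5 * iqr

c(lower = lower_fence, upper = upper_fence)
61 > upper_fence   # is the maximum age beyond the upper fence?
```

Under this rule any age above roughly 35 would be flagged, so 61 may be one of several high-end values rather than a lone outlier.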

Let's visualise Height, Weight and obesity level using a scatterplot

ggplot(data = df[,c('Height','Weight','NObeyesdad')], aes(x = Height, y = Weight , colour = NObeyesdad))+
  geom_point(alpha = 0.8) +
  facet_wrap(~  NObeyesdad)

This gives a clear picture of the different obesity levels.

  1. The slope of the Weight-Height relationship is roughly the same in every category; the deciding factor is the intercept (i.e. in y = m·x + c, the deciding factor is ‘c’).

  2. Obesity_Type_III has an outlier in the Weight variable, and Obesity_Type_I has outliers in the Height variable.
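
The slope and intercept claim can be tested by fitting a separate straight line per category. The sketch below uses hypothetical simulated data (`toy`, with a built-in common slope of 60 and different intercepts) purely so that it runs on its own; with the real data, `toy` would be replaced by `df`.

```r
set.seed(1)  # reproducible toy data

# Hypothetical data: two levels sharing a slope of 60 but with different intercepts.
toy <- data.frame(
  NObeyesdad = rep(c("Normal_Weight", "Obesity_Type_I"), each = 30),
  Height     = runif(60, min = 1.5, max = 1.9)
)
offset     <- ifelse(toy$NObeyesdad == "Normal_Weight", -40, -10)
toy$Weight <- offset + 60 * toy$Height + rnorm(60, sd = 0.5)

# Fit Weight ~ Height separately within each obesity level.
fits <- lapply(split(toy, toy$NObeyesdad),
               function(d) coef(lm(Weight ~ Height, data = d)))

# One row per level: intercept and slope.
coef_table <- do.call(rbind, fits)
coef_table
```

If the slopes in the `Height` column are close while the `(Intercept)` column differs, the visual impression from the facets is supported.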

Let's visualise family_history_with_overweight against obesity level using a bar plot

ggplot(data = df[, c('family_history_with_overweight', 'NObeyesdad')],
       aes(x = NObeyesdad, fill = family_history_with_overweight)) +
  geom_bar(position = 'dodge', color = 'black') +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This graph shows how family_history_with_overweight is related to obesity level.

  1. Most of the respondents with no family history of overweight fall in the Insufficient_Weight and Normal_Weight bars. This suggests some familial (possibly genetic) influence on obesity level. (Further analysis is required to justify the relation.)
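
The bar heights can be turned into shares with a cross-tabulation, which makes the comparison between the two family-history groups independent of their very different sizes. A minimal sketch on hypothetical rows (`toy` stands in for the two columns of `df`):

```r
# Hypothetical rows, shaped like df[, c("family_history_with_overweight", "NObeyesdad")].
toy <- data.frame(
  family_history_with_overweight = c("no", "no", "no", "yes", "yes", "yes", "yes", "no"),
  NObeyesdad = c("Normal_Weight", "Insufficient_Weight", "Normal_Weight",
                 "Obesity_Type_I", "Obesity_Type_I", "Normal_Weight",
                 "Obesity_Type_I", "Obesity_Type_I")
)

# Cross-tabulate family history (rows) against obesity level (columns).
counts <- table(toy$family_history_with_overweight, toy$NObeyesdad)

# Row proportions: within each family-history group, the share in each level.
row_share <- prop.table(counts, margin = 1)
round(row_share, 2)
```

Each row sums to 1, so the "no" and "yes" rows can be compared directly level by level.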

Let's visualise NCP (number of meals per day) and NObeyesdad (obesity level) using a scatterplot

ggplot(data = df[,c('NCP','NObeyesdad','family_history_with_overweight' )], aes(x = NCP , y = NObeyesdad , colour = family_history_with_overweight)) + geom_point(size = 0.5) + theme_minimal()

(This is better visualised with the boxplot below.)

Let's visualise NCP (number of meals per day) and NObeyesdad (obesity level) using a boxplot

ggplot(data = df[,c('NCP','NObeyesdad','family_history_with_overweight' )], aes(x = NObeyesdad, y = NCP  )) + geom_boxplot() + theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This plot helps to grasp the distribution of NCP within each obesity level.

  1. The median of NCP (number of meals per day) over the whole dataset is 3, which the boxplot represents with a thick horizontal line.

  2. Obesity_Type_III and Normal_Weight have similar distributions, but Obesity_Type_III has two outliers, one at 1 (the minimum) and one at 4 (the maximum).

  3. In the Obesity_Type_I category, the first quartile is below 2 meals per day, and from the third quartile upwards everyone has 3 meals per day.

  4. Obesity_Type_II has many outliers, i.e. points that fall well outside the range between the first and third quartiles.
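
The outliers drawn on a boxplot can also be listed numerically. The sketch below uses base R's `boxplot.stats()` on hypothetical values (`toy`); it applies a 1.5 × IQR rule similar to the default in `geom_boxplot()`, although the two compute the hinges slightly differently, so results can differ at the margins.

```r
# Hypothetical NCP values for two levels; replace with df$NCP and df$NObeyesdad.
toy <- data.frame(
  NObeyesdad = rep(c("Normal_Weight", "Obesity_Type_III"), each = 7),
  NCP        = c(1, 3, 3, 3, 3, 3, 4,    # Normal_Weight
                 3, 3, 3, 3, 3, 3, 3)    # Obesity_Type_III
)

# For each level, return the values lying beyond the 1.5 * IQR whiskers.
outliers <- tapply(toy$NCP, toy$NObeyesdad, function(x) boxplot.stats(x)$out)
outliers
```

When most values in a group equal 3, the interquartile range collapses to zero and every other value is flagged, which is one plausible reason a category can show many outliers.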

Further analysis is needed on:

  1. Numerical correlation between NCP and obesity level, and between family_history_with_overweight and obesity level.

  2. The influence of the other variables on obesity level.
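
As a starting point for the first item, the obesity level can be encoded as an ordered factor and compared with NCP by rank correlation, while the categorical pair can be examined with a chi-squared test. This is only a sketch on hypothetical rows (`toy`); the severity ordering in `severity` is an assumption, and other association measures would be equally defensible.

```r
set.seed(42)  # the simulated p-value below is random

# Hypothetical rows standing in for df.
toy <- data.frame(
  NObeyesdad = c("Insufficient_Weight", "Normal_Weight", "Overweight_Level_I",
                 "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III",
                 "Normal_Weight", "Obesity_Type_I"),
  NCP = c(4, 3, 3, 2, 3, 3, 1, 3),
  family_history_with_overweight = c("no", "no", "yes", "yes", "yes", "yes", "no", "yes")
)

# Encode the obesity level as an ordered factor, lightest to heaviest (assumed order).
severity <- c("Insufficient_Weight", "Normal_Weight", "Overweight_Level_I",
              "Overweight_Level_II", "Obesity_Type_I", "Obesity_Type_II",
              "Obesity_Type_III")
toy$level <- factor(toy$NObeyesdad, levels = severity, ordered = TRUE)

# Rank correlation between meals per day and the ordinal level.
rho <- cor(toy$NCP, as.integer(toy$level), method = "spearman")

# Association between family history and level; the p-value is simulated
# because the cells of this toy table are sparse.
chi <- chisq.test(table(toy$family_history_with_overweight, droplevels(toy$level)),
                  simulate.p.value = TRUE, B = 2000)
c(spearman_rho = rho, chi_sq_p = chi$p.value)
```

Ordering the factor this way would also make the x-axis of the earlier plots run from lightest to heaviest instead of alphabetically.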