library(ggplot2)
library(readr)
df <- read_csv("~/Downloads/ObesityDataSet_raw_and_data_sinthetic.csv")
## Rows: 2111 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): Gender, family_history_with_overweight, FAVC, CAEC, SMOKE, SCC, CAL...
## dbl (8): Age, Height, Weight, FCVC, NCP, CH2O, FAF, TUE
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Summary statistics for the numerical columns:
summary(df[, c("Age", "Height", "Weight", "FCVC", "NCP", "CH2O", "FAF", "TUE")])
## Age Height Weight FCVC
## Min. :14.00 Min. :1.450 Min. : 39.00 Min. :1.000
## 1st Qu.:19.95 1st Qu.:1.630 1st Qu.: 65.47 1st Qu.:2.000
## Median :22.78 Median :1.700 Median : 83.00 Median :2.386
## Mean :24.31 Mean :1.702 Mean : 86.59 Mean :2.419
## 3rd Qu.:26.00 3rd Qu.:1.768 3rd Qu.:107.43 3rd Qu.:3.000
## Max. :61.00 Max. :1.980 Max. :173.00 Max. :3.000
## NCP CH2O FAF TUE
## Min. :1.000 Min. :1.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:2.659 1st Qu.:1.585 1st Qu.:0.1245 1st Qu.:0.0000
## Median :3.000 Median :2.000 Median :1.0000 Median :0.6253
## Mean :2.686 Mean :2.008 Mean :1.0103 Mean :0.6579
## 3rd Qu.:3.000 3rd Qu.:2.477 3rd Qu.:1.6667 3rd Qu.:1.0000
## Max. :4.000 Max. :3.000 Max. :3.0000 Max. :2.0000
For the categorical columns, list the unique values and their counts:
uniq_values_count <- lapply(df, table)
uniq_values_count[c(1, 5, 6, 9, 10, 12, 15, 16, 17)]
## $Gender
##
## Female Male
## 1043 1068
##
## $family_history_with_overweight
##
## no yes
## 385 1726
##
## $FAVC
##
## no yes
## 245 1866
##
## $CAEC
##
## Always Frequently no Sometimes
## 53 242 51 1765
##
## $SMOKE
##
## no yes
## 2067 44
##
## $SCC
##
## no yes
## 2015 96
##
## $CALC
##
## Always Frequently no Sometimes
## 1 70 639 1401
##
## $MTRANS
##
## Automobile Bike Motorbike
## 457 7 11
## Public_Transportation Walking
## 1580 56
##
## $NObeyesdad
##
## Insufficient_Weight Normal_Weight Obesity_Type_I Obesity_Type_II
## 272 287 351 297
## Obesity_Type_III Overweight_Level_I Overweight_Level_II
## 324 290 290
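The positional index used above (columns 1, 5, 6, ...) depends on the column order of the file. As a minimal sketch, the categorical columns can instead be picked by type. The data frame `toy` below is a hypothetical stand-in, not the obesity data; with the real data one would pass `df` in its place.

```r
# Toy stand-in for df (hypothetical values, for illustration only).
toy <- data.frame(
  Gender = c("Female", "Male", "Male"),
  Age    = c(21, 23, 30),
  SMOKE  = c("no", "no", "yes"),
  stringsAsFactors = FALSE
)

# Flag the character columns instead of hard-coding their positions.
is_categorical <- vapply(toy, is.character, logical(1))

# Counts per unique value, then the same counts as shares of the total.
counts <- lapply(toy[is_categorical], table)
shares <- lapply(counts, prop.table)

print(counts)
print(shares)
```

Selecting by type keeps the code valid if columns are reordered, and `prop.table()` turns the raw counts into proportions that are easier to compare across variables.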
The purpose of this project is to analyze how different variables, and the relationships among them, affect obesity levels in individuals, based on their eating habits and physical condition.
“This dataset include data for the estimation of obesity levels in individuals from the countries of Mexico, Peru and Colombia, based on their eating habits and physical condition.
The data contains 17 attributes and 2111 records, the records are labeled with the class variable NObesity (Obesity Level), that allows classification of the data using the values of Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II and Obesity Type III. 77% of the data was generated synthetically using the Weka tool and the SMOTE filter, 23% of the data was collected directly from users through a web platform.” (From: Estimation of Obesity Levels Based On Eating Habits and Physical Condition [Dataset]. (2019). UCI Machine Learning Repository. https://doi.org/10.24432/C5H31Z.)
aver_weight_per_level <- aggregate(Weight~NObeyesdad,df,FUN = function(x) c(mean = mean(x),max = max(x),min = min(x) ))
aver_weight_per_level
## NObeyesdad Weight.mean Weight.max Weight.min
## 1 Insufficient_Weight 49.90633 65.00000 39.00000
## 2 Normal_Weight 62.15505 87.00000 42.30000
## 3 Obesity_Type_I 92.87020 125.00000 75.00000
## 4 Obesity_Type_II 115.30531 130.00000 93.00000
## 5 Obesity_Type_III 120.94111 173.00000 102.00000
## 6 Overweight_Level_I 74.26683 91.00000 53.00000
## 7 Overweight_Level_II 82.08527 102.00000 60.00000
aver_height_per_level <- aggregate(Height~NObeyesdad,df,FUN = function(x) c(mean = mean(x),max = max(x),min = min(x) ))
print(aver_height_per_level)
## NObeyesdad Height.mean Height.max Height.min
## 1 Insufficient_Weight 1.691117 1.900000 1.520000
## 2 Normal_Weight 1.676585 1.930000 1.500000
## 3 Obesity_Type_I 1.693804 1.980000 1.500000
## 4 Obesity_Type_II 1.771795 1.920000 1.600000
## 5 Obesity_Type_III 1.687559 1.870000 1.560000
## 6 Overweight_Level_I 1.687836 1.900000 1.450000
## 7 Overweight_Level_II 1.703748 1.930000 1.480000
This aggregation gives a good idea of how the values are distributed within each obesity level.
The mean weight of Insufficient_Weight is very low (about 49.9), although its mean height (about 1.69) differs little from that of the other categories.
The mean weights differ clearly between categories (from about 49.9 to about 120.9), whereas the mean heights do not (about 1.68 to 1.77).
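To put a number on this contrast, the sketch below compares the largest and smallest group means for each variable. The values are copied by hand from the two aggregate() outputs above; the ratio is only an illustrative measure, not part of the original analysis.

```r
# Group means copied from the aggregate() outputs (Weight.mean, Height.mean).
means <- data.frame(
  level  = c("Insufficient_Weight", "Normal_Weight", "Obesity_Type_I",
             "Obesity_Type_II", "Obesity_Type_III",
             "Overweight_Level_I", "Overweight_Level_II"),
  weight = c(49.90633, 62.15505, 92.87020, 115.30531,
             120.94111, 74.26683, 82.08527),
  height = c(1.691117, 1.676585, 1.693804, 1.771795,
             1.687559, 1.687836, 1.703748)
)

# Ratio of the largest to the smallest group mean for each variable.
weight_ratio <- max(means$weight) / min(means$weight)
height_ratio <- max(means$height) / min(means$height)

print(round(c(weight = weight_ratio, height = height_ratio), 2))
```

The heaviest group's mean weight is about 2.4 times that of the lightest group, while the tallest group's mean height is only about 6% above the shortest.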
Initial observations from the summaries of the numerical and categorical variables:
The obesity level is divided into 7 levels, and every level is well represented in the sample (between 272 and 351 records each).
The two genders are almost evenly split (1043 female, 1068 male), so the observations are not skewed toward either gender.
The majority of people in the sample use public transportation (1580), followed by automobiles (457).
Age ranges from 14 to 61, but the 75th percentile is just 26. Three quarters of the sample are therefore aged 26 or younger, and the maximum of 61 is an outlier.
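The outlier claim for Age can be checked with the common 1.5 × IQR rule. This is a minimal sketch that uses the rounded quartiles printed in the summary() output above; the choice of the 1.5 × IQR rule is an assumption, since the text does not say which outlier criterion was intended.

```r
# Quartiles and extremes of Age as printed by summary() (rounded values).
q1      <- 19.95
q3      <- 26.00
min_age <- 14
max_age <- 61

# Interquartile range and the usual 1.5 * IQR fences.
iqr         <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

print(c(lower = lower_fence, upper = upper_fence))
print(max_age > upper_fence)  # is the maximum beyond the upper fence?
print(min_age < lower_fence)  # is the minimum beyond the lower fence?
```

Under this rule the upper fence is 35.075, so an age of 61 lies far outside it, while the minimum of 14 stays inside the lower fence of 10.875.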
ggplot(data = df[,c('Height','Weight','NObeyesdad')], aes(x = Height, y = Weight , colour = NObeyesdad))+
geom_point(alpha = 0.8) +
facet_wrap(~ NObeyesdad)
Faceting by obesity level gives a clear picture of the Height-Weight relationship within each category.
The slopes are roughly the same across categories; what separates them is the intercept of the line (in y = m·x + c, the deciding factor is the constant c).
Obesity_Type_III has an outlier with respect to Weight, and Obesity_Type_I has outliers with respect to Height.
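The statement about slopes and intercepts is read off the plots by eye. One way to check it numerically is to fit a separate straight line per category and compare the coefficients. The sketch below is only an illustration: `fit_by_level()` is a hypothetical helper, and `toy` is made-up data constructed so that two groups share a slope but differ in intercept. With the real data the call would be `fit_by_level(df, "NObeyesdad")`.

```r
# Hypothetical helper: fit Weight ~ Height separately for each group and
# return one row of coefficients (intercept, slope) per group.
fit_by_level <- function(data, group_col) {
  fits <- lapply(split(data, data[[group_col]]), function(d) {
    coef(lm(Weight ~ Height, data = d))
  })
  do.call(rbind, fits)
}

# Toy data (NOT from the obesity dataset): same slope, different intercepts.
toy <- data.frame(
  Height = rep(c(1.5, 1.6, 1.7, 1.8), times = 2),
  level  = rep(c("A", "B"), each = 4)
)
toy$Weight <- ifelse(toy$level == "A", 10, 40) + 30 * toy$Height

coefs <- fit_by_level(toy, "level")
print(round(coefs, 3))
```

If the Height column of the resulting matrix is similar across categories while the (Intercept) column differs, the visual impression is supported; clearly different slopes would contradict it.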
ggplot(data = df[,c('family_history_with_overweight', 'NObeyesdad')] , aes(x = NObeyesdad, fill = family_history_with_overweight) ) + geom_bar(position = 'dodge',color = 'black') + theme_minimal()+theme(axis.text.x = element_text(angle = 45, hjust = 1))
This bar chart shows how family_history_with_overweight is related to obesity level.
ggplot(data = df[,c('NCP','NObeyesdad','family_history_with_overweight' )], aes(x = NCP , y = NObeyesdad , colour = family_history_with_overweight)) + geom_point(size = 0.5) + theme_minimal()
(This is better visualised with the boxplot below.)
ggplot(data = df[,c('NCP','NObeyesdad','family_history_with_overweight' )], aes(x = NObeyesdad, y = NCP )) + geom_boxplot() + theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
This plot helps to grasp the distribution of NCP within each obesity level.
The overall median of NCP (number of meals per day) is 3; in each box the median is drawn as the thick horizontal line.
The Obesity_Type_III and Normal_Weight categories have similar distributions, but Obesity_Type_III has two outliers, one at 1 (the minimum) and one at 4 (the maximum).
In the Obesity_Type_I category, the first quartile is below 2 meals per day, and from the third quartile upward the value is 3 meals per day.
Obesity_Type_II has many outliers, that is, points lying well outside the range between the first and third quartiles.
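The points a boxplot draws individually are those more than 1.5 × IQR away from the box, which is the default in both ggplot2's geom_boxplot() and base R's boxplot.stats(). The sketch below uses a made-up vector `toy_ncp` (not the real Obesity_Type_III values) in which almost every value is 3, so the box collapses to a single line and any other value is flagged.

```r
# Toy NCP values (hypothetical): nearly everyone reports 3 meals per day.
toy_ncp <- c(1, 3, 3, 3, 3, 3, 3, 3, 4)

bs <- boxplot.stats(toy_ncp)

# stats: lower whisker, lower hinge, median, upper hinge, upper whisker.
print(bs$stats)

# out: the values that would be drawn as individual outlier points.
print(bs$out)
```

When the first and third quartiles coincide, the IQR is 0, so even a value one meal away from the median is reported as an outlier. This is worth keeping in mind when reading the outliers in the NCP boxplots.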
Numerical correlation between NCP and obesity level, and between family_history_with_overweight and obesity level.
Analysis of the influence of the other variables on obesity level.
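A numerical correlation with obesity level first needs the levels placed in a meaningful order, because R sorts them alphabetically by default (as in the outputs and plot axes above). The sketch below is one possible approach, not the author's method: the severity order in `level_order` is an assumption inferred from the level names, the Spearman rank correlation is a swapped-in choice, and `toy` is made-up data.

```r
# Assumed severity order of the obesity levels (inferred from their names).
level_order <- c("Insufficient_Weight", "Normal_Weight",
                 "Overweight_Level_I", "Overweight_Level_II",
                 "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III")

# Toy data (hypothetical values, for illustration only).
toy <- data.frame(
  NObeyesdad = c("Obesity_Type_I", "Normal_Weight",
                 "Overweight_Level_II", "Insufficient_Weight"),
  NCP        = c(3, 3, 2, 1)
)

# Ordered factor, then its integer rank (1 = lowest level).
toy$level <- factor(toy$NObeyesdad, levels = level_order, ordered = TRUE)
toy$rank  <- as.integer(toy$level)

# Rank-based correlation between obesity level and NCP.
rho <- cor(toy$rank, toy$NCP, method = "spearman")
print(toy[, c("NObeyesdad", "rank", "NCP")])
print(rho)
```

Using the same ordered factor as the x variable in ggplot() would also arrange the bars and boxes by severity rather than alphabetically.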