Data Dive 4 - Sampling and Drawing Conclusions

Import the obesity data set for analysis

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.1.3

## Warning: package 'tibble' was built under R version 4.1.3

## Warning: package 'tidyr' was built under R version 4.1.3

## Warning: package 'readr' was built under R version 4.1.3

## Warning: package 'purrr' was built under R version 4.1.3

## Warning: package 'dplyr' was built under R version 4.1.3

## Warning: package 'forcats' was built under R version 4.1.3

## Warning: package 'lubridate' was built under R version 4.1.3

## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v dplyr     1.1.2     v readr     2.1.4
## v forcats   1.0.0     v stringr   1.5.1
## v ggplot2   3.3.5     v tibble    3.2.1
## v lubridate 1.9.2     v tidyr     1.3.0
## v purrr     1.0.1     
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

obesity <- read.csv(file.choose())

Create the first random sample: Selecting from Gender, Age, Height, Weight, FAVC (do you eat high calorie foods frequently?), SCC (do you monitor the calories you eat daily?) and NObeyesdad (obesity level)

n <- 1000

df_1 <- obesity |>
  select(Gender, Age, Height, Weight, FAVC, SCC, NObeyesdad) |>
  sample_n(n, replace = TRUE)
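
Because the sample is drawn at random, each run of this report produces different rows, so the figures quoted in the commentary may not match a later re-run. A minimal sketch of one way to make a draw repeatable, assuming a small made-up stand-in for the obesity data (`toy_obesity`, `n_toy`, and the seed value are hypothetical). It uses `slice_sample()`, which supersedes `sample_n()` in dplyr 1.0.0 and later:

```r
library(dplyr)

# Hypothetical stand-in for the obesity data, for illustration only
toy_obesity <- data.frame(
  Gender = c("Female", "Male", "Male", "Female"),
  Age    = c(21, 23, 27, 31),
  Height = c(1.62, 1.80, 1.75, 1.58),
  Weight = c(64, 77, 87, 56)
)

n_toy <- 10  # toy sample size (the report itself uses n = 1000)

set.seed(4)  # arbitrary seed; any fixed integer makes the draw repeatable
sample_a <- toy_obesity |>
  slice_sample(n = n_toy, replace = TRUE)

set.seed(4)  # same seed again, so the same rows are drawn
sample_b <- toy_obesity |>
  slice_sample(n = n_toy, replace = TRUE)

# TRUE when both draws contain exactly the same rows in the same order
identical(sample_a, sample_b)
```

Setting a different seed before each of the five samples would keep them distinct from one another while still being reproducible.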

Scrutinize df_1:

# Summarize the data: mean age, height, and weight per group
df1_summary <- df_1 |>
  group_by(FAVC, SCC, NObeyesdad) |>
  summarize(
    avg_age = mean(Age),
    avg_height = mean(Height),
    avg_weight = mean(Weight)
  )

## `summarise()` has grouped output by 'FAVC', 'SCC'. You can override using the
## `.groups` argument.

head(df1_summary)

## # A tibble: 6 x 6
## # Groups:   FAVC, SCC [1]
##   FAVC  SCC   NObeyesdad          avg_age avg_height avg_weight
##   <chr> <chr> <chr>                 <dbl>      <dbl>      <dbl>
## 1 no    no    Insufficient_Weight    20.7       1.66       48.1
## 2 no    no    Normal_Weight          23.1       1.68       63.2
## 3 no    no    Obesity_Type_I         25.5       1.74       95.5
## 4 no    no    Obesity_Type_II        33.3       1.76      116. 
## 5 no    no    Overweight_Level_I     27.0       1.75       81.5
## 6 no    no    Overweight_Level_II    21.5       1.62       74.1

To briefly scrutinize df_1: average ages are about the same for each group (yes/no to FAVC and yes/no to SCC, disregarding NObeyesdad), ranging from around 19 to the late 20s or early 30s. There is one outlier in age: for individuals who do not eat high calorie foods but do monitor their calories, the average age is 46.25, which is much higher than the other averages in this dataframe. There did not appear to be any outliers in the height variable. As for weight, groups in the “Insufficient Weight” category showed no apparent outliers. In the “Normal Weight” category, those who reported eating high calorie foods and monitoring their calories had a somewhat higher average weight than the other groups (69 kg compared to the low 60s). Obesity Type II varied, but I think that is because of the height difference. Those in the Overweight Level I category who reported eating high calorie foods and monitoring their calories actually had a lower average weight than those who did both but were in the Normal Weight category, which makes me wonder about the actual categorization of these individuals.
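
`head()` prints only the first six rows of the summary, and a group average can rest on very few sampled people, which may be why a value looks extreme. A sketch of one way to list every group together with its size, using a small hypothetical data frame (`toy_sample`) in place of df_1. The `n_people` column and `.groups = "drop"` are additions that are not part of the original summary:

```r
library(dplyr)

# Hypothetical rows standing in for df_1, for illustration only
toy_sample <- data.frame(
  FAVC       = c("no", "no", "yes", "yes", "yes", "yes"),
  SCC        = c("no", "yes", "no", "no", "no", "yes"),
  NObeyesdad = c("Normal_Weight", "Normal_Weight", "Normal_Weight",
                 "Normal_Weight", "Obesity_Type_I", "Normal_Weight"),
  Age        = c(23, 46, 21, 25, 30, 19),
  Height     = c(1.68, 1.60, 1.70, 1.74, 1.72, 1.65),
  Weight     = c(63, 60, 62, 66, 95, 69)
)

toy_summary <- toy_sample |>
  group_by(FAVC, SCC, NObeyesdad) |>
  summarize(
    n_people   = n(),          # how many sampled rows sit behind each average
    avg_age    = mean(Age),
    avg_height = mean(Height),
    avg_weight = mean(Weight),
    .groups    = "drop"        # return an ungrouped tibble, no grouping message
  )

# n = Inf prints every row of the tibble instead of a truncated view
print(toy_summary, n = Inf)
```

An average based on one or two people is better read as an individual value than as evidence about the whole group.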

Create the second random sample.

df_2 <- obesity |>
  select(Gender, Age, Height, Weight, FAVC, SCC, NObeyesdad) |>
  sample_n(n, replace = TRUE)


# Summarize the data: mean age, height, and weight per group
df2_summary <- df_2 |>
  group_by(FAVC, SCC, NObeyesdad) |>
  summarize(
    avg_age = mean(Age),
    avg_height = mean(Height),
    avg_weight = mean(Weight)
  )

## `summarise()` has grouped output by 'FAVC', 'SCC'. You can override using the
## `.groups` argument.

head(df2_summary)

## # A tibble: 6 x 6
## # Groups:   FAVC, SCC [1]
##   FAVC  SCC   NObeyesdad          avg_age avg_height avg_weight
##   <chr> <chr> <chr>                 <dbl>      <dbl>      <dbl>
## 1 no    no    Insufficient_Weight    22.1       1.63       46.6
## 2 no    no    Normal_Weight          24.3       1.68       65.1
## 3 no    no    Obesity_Type_I         25.8       1.70       93.5
## 4 no    no    Obesity_Type_II        26.7       1.86      124  
## 5 no    no    Overweight_Level_I     27.1       1.68       72.8
## 6 no    no    Overweight_Level_II    20.5       1.60       72.5

For df_2, a few average ages are a bit higher than the rest: those who report not eating high calorie foods, monitor their calories, and are in Overweight Level II have an average age of 33, and those who report neither eating high calorie foods nor monitoring their calories and are in Overweight Level I have an average age of 31. Heights are fairly uniform across the board, but individuals who report eating high calorie foods, report not monitoring their calories, and are in Obesity Type II are 1.76 m on average. For weight in the “Insufficient Weight” category, these groups are similar to df_1 overall, although individuals who report not eating high calorie foods and monitoring their calories have an average weight of 43 kg, which is a bit lower than “Insufficient Weight” in the other dataframes. For Obesity Type II, the average weight was 93 kg where FAVC = no and SCC = no, and 114 kg where FAVC = yes and SCC = no. As with df_1, there is a height difference here, so I think that plays into it. However, for Obesity Type III, FAVC = no and SCC = no has an average weight of 112 kg and FAVC = yes and SCC = no has an average weight of 119 kg at very similar heights, which is interesting as well. Also, for Overweight Level I with FAVC = yes and SCC = yes, the average weight is 62 kg, much lower than the 75 kg where FAVC = no and SCC = yes.

Create the third random sample.

df_3 <- obesity |>
  select(Gender, Age, Height, Weight, FAVC, SCC, NObeyesdad) |>
  sample_n(n, replace = TRUE)
  

# Summarize the data: mean age, height, and weight per group
df3_summary <- df_3 |>
  group_by(FAVC, SCC, NObeyesdad) |>
  summarize(
    avg_age = mean(Age),
    avg_height = mean(Height),
    avg_weight = mean(Weight)
  )

## `summarise()` has grouped output by 'FAVC', 'SCC'. You can override using the
## `.groups` argument.

head(df3_summary)

## # A tibble: 6 x 6
## # Groups:   FAVC, SCC [1]
##   FAVC  SCC   NObeyesdad          avg_age avg_height avg_weight
##   <chr> <chr> <chr>                 <dbl>      <dbl>      <dbl>
## 1 no    no    Insufficient_Weight    20.6       1.64       46.5
## 2 no    no    Normal_Weight          23.7       1.64       59.8
## 3 no    no    Obesity_Type_I         21.7       1.74       98  
## 4 no    no    Obesity_Type_II        25         1.83      121  
## 5 no    no    Overweight_Level_I     29.4       1.68       74.9
## 6 no    no    Overweight_Level_II    21.7       1.63       73.9

Looking at df_3, where FAVC = no and SCC = no, Obesity Type I, Obesity Type II, and Overweight Level I all have higher average ages (33.67, 34.13, and 36.24 respectively), as does Normal Weight where FAVC = no and SCC = yes, at 35.5 years. For height, the average is 1.77 m for Obesity Type II both where FAVC = yes and SCC = no and where FAVC = no and SCC = no, which is a bit higher than the general range of average heights. For weight in Overweight Level II, the average is 76 kg where FAVC = no and SCC = no, but 86 kg where FAVC = yes and SCC = no. The average heights for these groups are 1.63 m and 1.72 m, respectively, which might explain why they are in the same category with such different weights.

Create the fourth random sample.

df_4 <- obesity |>
  select(Gender, Age, Height, Weight, FAVC, SCC, NObeyesdad) |>
  sample_n(n, replace = TRUE)
  

# Summarize the data: mean age, height, and weight per group
df4_summary <- df_4 |>
  group_by(FAVC, SCC, NObeyesdad) |>
  summarize(
    avg_age = mean(Age),
    avg_height = mean(Height),
    avg_weight = mean(Weight)
  )

## `summarise()` has grouped output by 'FAVC', 'SCC'. You can override using the
## `.groups` argument.

head(df4_summary)

## # A tibble: 6 x 6
## # Groups:   FAVC, SCC [1]
##   FAVC  SCC   NObeyesdad          avg_age avg_height avg_weight
##   <chr> <chr> <chr>                 <dbl>      <dbl>      <dbl>
## 1 no    no    Insufficient_Weight    20.9       1.62       46.2
## 2 no    no    Normal_Weight          23.3       1.69       63.0
## 3 no    no    Obesity_Type_I         20         1.70       96  
## 4 no    no    Obesity_Type_II        33.0       1.80      120. 
## 5 no    no    Overweight_Level_I     33.7       1.64       74.8
## 6 no    no    Overweight_Level_II    23.5       1.60       72.2

Looking at df_4, the average age ranges from 18 to 34, but it increases in increments of 1-2 years at most, so I wouldn’t consider the ages of 33 or 34 to be outliers in this sample the way they are in other samples, because of the gradual progression up to those ages. There is one slight outlier in average height: for FAVC = no, SCC = no, and NObeyesdad = Obesity Type II, the average height is 1.79 m, which is a bit outside the range of 1.58 m to 1.76 m seen in the other groups. Finally, looking at weight in the Insufficient Weight category, the average was 50.22 kg for FAVC = yes and SCC = no and 54.11 kg for FAVC = yes and SCC = yes, both higher than the other groups in that category, which are around 45 kg. There is also a bit of an outlier in the Normal Weight category where FAVC and SCC both = yes: all the other groups are between 62 and 65 kg, but this one is lower, at 57 kg. There were also two data points in Obesity Type II that varied, one being 93.53 kg and the other 97.28 kg, with no distinct difference in height.
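
The outliers above were judged by eye. As a possible cross-check, the sketch below applies the common 1.5 x IQR rule (Tukey's fences) to a vector of group averages. This is a swapped-in technique, not the method used in this report, and the heights in `avg_height` are hypothetical values rather than output from df_4:

```r
# Hypothetical group-average heights in metres, for illustration only
avg_height <- c(1.58, 1.62, 1.64, 1.66, 1.68, 1.70, 1.72, 1.76, 1.95)

# First and third quartiles (R's default quantile method)
q <- unname(quantile(avg_height, probs = c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences: anything beyond 1.5 IQRs from the quartiles gets flagged
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

flagged <- avg_height[avg_height < lower_fence | avg_height > upper_fence]
flagged
```

A fixed rule like this makes the judgment consistent from sample to sample, although with only a handful of group averages it remains a rough guide.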

Create the fifth random sample.

df_5 <- obesity |>
  select(Gender, Age, Height, Weight, FAVC, SCC, NObeyesdad) |>
  sample_n(n, replace = TRUE)
  

# Summarize the data: mean age, height, and weight per group
df5_summary <- df_5 |>
  group_by(FAVC, SCC, NObeyesdad) |>
  summarize(
    avg_age = mean(Age),
    avg_height = mean(Height),
    avg_weight = mean(Weight)
  )

## `summarise()` has grouped output by 'FAVC', 'SCC'. You can override using the
## `.groups` argument.

head(df5_summary)

## # A tibble: 6 x 6
## # Groups:   FAVC, SCC [1]
##   FAVC  SCC   NObeyesdad          avg_age avg_height avg_weight
##   <chr> <chr> <chr>                 <dbl>      <dbl>      <dbl>
## 1 no    no    Insufficient_Weight    20.6       1.67       48.4
## 2 no    no    Normal_Weight          23.6       1.70       64.3
## 3 no    no    Obesity_Type_I         29.2       1.68       89  
## 4 no    no    Obesity_Type_II        29.0       1.73      113. 
## 5 no    no    Overweight_Level_I     27.0       1.69       75.4
## 6 no    no    Overweight_Level_II    22.3       1.62       74.8

Finally, looking at the summary for sample 5, there are no outliers in age. There is one outlier in height: where FAVC and SCC = no and NObeyesdad = Obesity Type II, the average height is 1.84 m. Moving on to weight, there is an outlier in the Normal Weight category where FAVC = no and SCC = yes: the average weight is 69 kg, higher than the other data points in that category. There is also an outlier in the Obesity Type I category where FAVC and SCC both = yes: the average weight is 83 kg, much lower than the other data points, which are around 92 kg.

Scrutinize the samples overall

Overall, I would say that these samples are mostly very similar, with the exception of a few outliers. That is to be expected, especially in a dataset concerning health data, since individuals can have very different health markers and lifestyle habits and still be in the same category as people vastly different from them. The main outlier I noticed was in df_1, where the average age in one group was 46 years. This would be considered an outlier in every dataframe, whereas the age outliers of 31-33 years seen in some dataframes would not be. The second main outlier I noticed was in df_5, where the average height for one of the groups was 1.84 m. This is another data point that would have been an outlier in any dataframe, not just its own sample. Anomalies that stood out in their own sample but not in others were already discussed in the summary of each dataframe.
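
The comparison above was made by reading five separate summaries. A sketch of one way to put the samples side by side in a single table, assuming a simulated stand-in for the obesity data. The names `toy_obesity`, `sample_id`, `per_sample`, and `across_samples`, the simulated weights, and the seed are all hypothetical:

```r
library(dplyr)

set.seed(23)  # arbitrary seed so the sketch is repeatable

# Simulated stand-in for the obesity data, for illustration only
toy_obesity <- data.frame(
  NObeyesdad = rep(c("Normal_Weight", "Obesity_Type_I"), each = 50),
  Weight     = c(rnorm(50, mean = 63, sd = 5), rnorm(50, mean = 93, sd = 8))
)

# Draw five samples with replacement and label each with a sample id
samples <- lapply(1:5, function(i) {
  toy_obesity |>
    slice_sample(n = 100, replace = TRUE) |>
    mutate(sample_id = i)
}) |>
  bind_rows()

# Average weight per category within each sample
per_sample <- samples |>
  group_by(sample_id, NObeyesdad) |>
  summarize(avg_weight = mean(Weight), n_people = n(), .groups = "drop")

# How much each category's average moves from sample to sample
across_samples <- per_sample |>
  group_by(NObeyesdad) |>
  summarize(
    lowest_avg  = min(avg_weight),
    highest_avg = max(avg_weight),
    sd_of_avgs  = sd(avg_weight),
    .groups     = "drop"
  )

across_samples
```

A category whose average swings widely between samples, especially one with a small `n_people`, is a likely source of the sample-specific anomalies described above.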

One big thing that this investigation has shown me is the need to use BMI as a way of comparing groups. I noticed that average weights in “more obese” categories were often lower than those in a “less obese” category, but because the heights were so varied, I wasn’t getting the full picture. It will be important to keep in mind that BMI is only a measure of weight relative to height, so it cannot stand in for other health markers in a true, fair analysis, but I think it will help even things out across the board.
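
BMI is weight in kilograms divided by the square of height in metres, which matches the units of the Weight and Height columns discussed above. A minimal sketch of adding it as a column, using three hypothetical people (`toy_people`) rather than rows from the obesity data:

```r
library(dplyr)

# Hypothetical individuals, for illustration only
toy_people <- data.frame(
  NObeyesdad = c("Normal_Weight", "Overweight_Level_I", "Obesity_Type_II"),
  Height     = c(1.75, 1.60, 1.85),  # metres
  Weight     = c(69, 66, 120)        # kilograms
)

# BMI = weight (kg) / height (m)^2
toy_bmi <- toy_people |>
  mutate(BMI = Weight / Height^2)

toy_bmi
```

In this toy data the second person weighs less than the first but is shorter, so their BMI is higher. That is the same pattern as a “more obese” category showing a lower average weight than a “less obese” one.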

Kylie Heagy

2024-09-23