library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

Load the Obesity Dataset

df <- read.csv("~/Downloads/ObesityDataSet_raw_and_data_sinthetic.csv", header=TRUE)

1) A list of at least 3 columns (or values) in your data which are unclear until you read the documentation.

E.g., this could be a column name, or just some value inside a cell of your data

In this dataset, 77% of the data was generated synthetically using the Weka tool and the SMOTE filter; the remaining 23% was collected directly from users through a web platform.

colnames(df)
##  [1] "Gender"                         "Age"                           
##  [3] "Height"                         "Weight"                        
##  [5] "family_history_with_overweight" "FAVC"                          
##  [7] "FCVC"                           "NCP"                           
##  [9] "CAEC"                           "SMOKE"                         
## [11] "CH2O"                           "SCC"                           
## [13] "FAF"                            "TUE"                           
## [15] "CALC"                           "MTRANS"                        
## [17] "NObeyesdad"
  1. NObeyesdad : Obesity level, based on the WHO-approved Body Mass Index (BMI) classification.
     1. The formula used to calculate Body Mass Index is weight/(height*height), with weight in kilograms and height in metres.
     2. The categories are:
| Obesity Level | Body Mass Index |
|---------------|-----------------|
| Underweight   | Less than 18.5  |
| Normal        | 18.5 to 24.9    |
| Overweight    | 25.0 to 29.9    |
| Obesity I     | 30.0 to 34.9    |
| Obesity II    | 35.0 to 39.9    |
| Obesity III   | Higher than 40  |
  2. FAVC : Frequent consumption of high-caloric food. The value is a yes/no answer to the question ‘Do you eat high caloric food frequently?’, NOT A FREQUENCY.
  3. FCVC : Frequency of consumption of vegetables. The answer to ‘Do you usually eat vegetables in your meals?’ is stored as a number, with ‘never’ as 1, ‘sometimes’ as 2 and ‘always’ as 3.
  4. The frequency columns are measured over different time intervals depending on the column.
     1. For example, FAF - Fitness Activity Frequency - is measured as the number of days per week a person engages in physical activity, not the number of hours.

     2. Another example is TUE - Time Using Electronics - which is measured in hours per day.
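
As a minimal sketch of how the FCVC codes map back to the documented answers, the snippet below uses a small made-up vector (`fcvc_example`, a hypothetical name) rather than the real column. It assumes that rounding to the nearest code is an acceptable way to handle non-integer values that synthetic rows may carry; that is a choice made for illustration, not something the documentation prescribes.

```r
# Hypothetical FCVC values; synthetic (SMOTE) rows may hold non-integers.
fcvc_example <- c(1, 2, 3, 2.4, 2.6)

# Round to the nearest documented code, then attach the documented labels.
fcvc_label <- factor(
  round(fcvc_example),
  levels = c(1, 2, 3),
  labels = c("never", "sometimes", "always")
)

# Tabulate how many values fall under each label.
table(fcvc_label)
```

Rounding discards information, so it is probably safer to keep the original numeric column alongside any labelled version.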

Why do you think they chose to encode the data the way they did?

  1. The encoding appears to follow from what respondents can realistically report. For example, for FCVC there is no good criterion for reporting vegetable consumption frequency, since people generally do not measure the ratio of vegetables in their meals. Thus the categories ‘never’, ‘sometimes’ and ‘always’ are used.

Regarding column names:

  1. CALC is a straightforward abbreviation of Consumption of Alcohol, but FAF does not follow as directly from Frequency of Physical Activity. Fitness Activity Frequency fits the abbreviation FAF better.

What could have happened if you didn’t read the documentation?

  1. I initially assumed that the obesity levels were based on a linear ratio (w/h); after reviewing the documentation, I learned that the classification uses BMI, which divides weight by the square of height (w/h²).

  2. This misunderstanding could have impacted my future assignments and resulted in incomplete analyses.
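
To make the difference between w/h and w/h² concrete, here is a small sketch that computes BMI and bins it with the cut-offs from the obesity-level table earlier in this document. The data frame `people` and its values are hypothetical, and the handling of boundary values (for example, a BMI of exactly 25 or 40) is an assumption, since the table leaves small gaps between categories.

```r
# Hypothetical rows; Weight in kilograms, Height in metres.
people <- data.frame(
  Weight = c(50, 70, 85, 100, 120, 140),
  Height = c(1.70, 1.75, 1.75, 1.75, 1.75, 1.75)
)

# BMI divides weight by the square of height, not by height itself.
people$BMI <- people$Weight / people$Height^2

# Bin BMI with the table's cut-offs. right = FALSE closes each interval
# on the left, so a BMI of exactly 25 is counted as Overweight (assumed).
people$Level <- cut(
  people$BMI,
  breaks = c(-Inf, 18.5, 25, 30, 35, 40, Inf),
  labels = c("Underweight", "Normal", "Overweight",
             "Obesity I", "Obesity II", "Obesity III"),
  right = FALSE
)

people
```

With these made-up rows, each person lands in a different category, which would not be the case if weight were divided by height alone.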

2) At least one element of your data that is unclear even after reading the documentation. You may need to do some digging, but is there anything about the data that your documentation does not explain?

  1. For the CAEC and CALC variables, the exact range or metric used to define the frequency categories - ‘Sometimes’, ‘Frequently’, ‘Always’ - is not explained. Because these terms mean different things to different individuals, a numerical range for each category would have made the data easier to interpret.
  2. Although the documentation explains every column’s data type and most of the classification criteria, it does not explain the abbreviated column names or why they were chosen.
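
A quick way to see which categories such a column actually contains is to tabulate it. The sketch below uses a hypothetical data frame `toy` as a stand-in for the CAEC column; the counts show how answers are distributed, but they still say nothing about what each respondent meant by a given label.

```r
# Hypothetical answers standing in for the CAEC column.
toy <- data.frame(
  CAEC = c("Sometimes", "Sometimes", "Frequently", "no", "Always", "Sometimes")
)

# Count how often each self-reported category appears, largest first.
caec_counts <- sort(table(toy$CAEC), decreasing = TRUE)
caec_counts
```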

3) Build a visualization which uses a column of data that is affected by the issue you brought up in bullet #2, above. In this visualization, find a way to highlight the issue, and explain what is unclear and why it might be unclear. You can use color or an annotation, but also make sure to explain your thoughts using Markdown. Do you notice any significant risks? If so, what could you do to reduce negative consequences?

# Dodged bar chart of obesity level, with one bar per CAEC answer.
ggplot(data = df, aes(x = NObeyesdad, fill = CAEC)) +
  geom_bar(position = 'dodge', color = 'black') +
  scale_fill_manual(values = c("white", "grey", "black", "red")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Notice the ‘red’ bars:

  1. Most of the entries are “Sometimes”, which reflects how subjective a self-reported frequency of eating between meals (what CAEC records) can be. For instance, a person of normal weight may describe eating between meals on 3-4 days as “frequently”, whereas an individual classified as Obesity Type III might reserve “frequently” for 5 days and call 3-4 days “sometimes.”

  2. Therefore, even a thorough analysis of this variable cannot give a complete picture, because of this inherent measurement bias. While there are no significant risks or consequences associated with the bias, it is crucial to acknowledge the subjectivity involved in measuring the CAEC variable.
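
One possible way to soften the effect of unequal group sizes when reading the chart is to compare the share of each CAEC answer within an obesity level instead of raw counts. The sketch below assumes a hypothetical data frame `toy` with made-up rows; it does not remove the subjectivity of the labels, it only puts the groups on a common scale.

```r
# Hypothetical rows standing in for the CAEC and NObeyesdad columns.
toy <- data.frame(
  NObeyesdad = c("Normal_Weight", "Normal_Weight", "Normal_Weight",
                 "Normal_Weight", "Obesity_Type_III", "Obesity_Type_III"),
  CAEC = c("Sometimes", "Sometimes", "Frequently", "no",
           "Sometimes", "Sometimes")
)

# Row-wise proportions: share of each CAEC answer within an obesity level.
caec_share <- prop.table(table(toy$NObeyesdad, toy$CAEC), margin = 1)
round(caec_share, 2)
```

In the plot itself, a similar view could be obtained by using `position = 'fill'` in `geom_bar()` instead of `position = 'dodge'`.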