Our data come from a website that published a survey on obesity: https://www.kaggle.com/datasets/itsrohithere/obesity-and-lifestyle-dataset?resource=download
The goal of the project is to find which aspect of our lives has the greatest influence on physical health.
The question we want to answer is: what has the greatest influence on a person's BMI (Body Mass Index)?
These data suit our question because they cover a range of habits and factors that can lead a person to develop obesity.
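library(tidyverse) # provides glimpse(), tibble() and ggplot(), used below
load("Obesity and Lifestyle Data.RData") # loads the data frame `datos`
dades <- datos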
The original format was .csv. We have not modified the dataset in any way.
dim(dades)
## [1] 1000 11
glimpse(dades)
## Rows: 1,000
## Columns: 11
## $ Age <int> 56, 46, 32, 60, 25, 38, 56, 36, 40, 28, 28…
## $ Gender <chr> "Male", "Female", "Female", "Female", "Mal…
## $ Height <dbl> 1.84, 1.72, 1.64, 2.00, 1.71, 1.73, 1.58, …
## $ Weight <dbl> 69.6, 110.1, 70.5, 113.8, 102.8, 106.3, 51…
## $ Family_History_Obesity <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "N…
## $ Physical_Activity_Frequency <dbl> 7.4, 2.8, 2.8, 9.6, 0.1, 7.2, 7.1, 6.2, 9.…
## $ Dietary_Habits <chr> "Balanced", "High-calorie", "Low-calorie",…
## $ Water_Intake <dbl> 4.4, 4.3, 1.1, 4.6, 1.1, 2.0, 1.1, 2.3, 3.…
## $ Smoking_Habits <chr> "No", "Yes", "Yes", "No", "No", "No", "Yes…
## $ Alcohol_Consumption <chr> "Occasionally", "Never", "Never", "Frequen…
## $ Obesity_Level <chr> "Obese Type I", "Overweight", "Normal Weig…
tibble(
variable = names(dades),
tipus = sapply(dades, class)
)
## # A tibble: 11 × 2
## variable tipus
## <chr> <chr>
## 1 Age integer
## 2 Gender character
## 3 Height numeric
## 4 Weight numeric
## 5 Family_History_Obesity character
## 6 Physical_Activity_Frequency numeric
## 7 Dietary_Habits character
## 8 Water_Intake numeric
## 9 Smoking_Habits character
## 10 Alcohol_Consumption character
## 11 Obesity_Level character
summary(dades)
## Age Gender Height Weight
## Min. :18.00 Length:1000 Min. :1.500 Min. : 50.00
## 1st Qu.:29.00 Class :character 1st Qu.:1.620 1st Qu.: 68.67
## Median :42.00 Mode :character Median :1.750 Median : 84.75
## Mean :40.99 Mean :1.751 Mean : 84.83
## 3rd Qu.:52.00 3rd Qu.:1.870 3rd Qu.:102.03
## Max. :64.00 Max. :2.000 Max. :119.80
## Family_History_Obesity Physical_Activity_Frequency Dietary_Habits
## Length:1000 Min. : 0.000 Length:1000
## Class :character 1st Qu.: 2.500 Class :character
## Mode :character Median : 5.000 Mode :character
## Mean : 4.941
## 3rd Qu.: 7.300
## Max. :10.000
## Water_Intake Smoking_Habits Alcohol_Consumption Obesity_Level
## Min. :1.000 Length:1000 Length:1000 Length:1000
## 1st Qu.:1.900 Class :character Class :character Class :character
## Median :2.950 Mode :character Mode :character Mode :character
## Mean :2.952
## 3rd Qu.:3.900
## Max. :5.000
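There appear to be no missing values: summary() reports an NA count for any numeric column that contains them, and none is shown. This reading covers only the numeric columns, because for character columns summary() prints just the length, class and mode. In principle, we do not need to transform any variable.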
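The absence of missing values can also be checked directly, column by column, instead of being read off the summary. The following is a minimal sketch, assuming a small made-up data frame `example` that stands in for `dades`; its values are illustrative and not taken from the dataset.

```r
# Hypothetical stand-in for `dades`; the values are made up for illustration.
example <- data.frame(
  Age = c(56L, 46L, NA),
  Gender = c("Male", NA, "Female"),
  Weight = c(69.6, 110.1, 70.5),
  stringsAsFactors = FALSE
)

# Count the missing values in each column, whatever its type.
na_per_column <- colSums(is.na(example))
print(na_per_column)

# TRUE only if every column has zero missing values.
no_missing <- all(na_per_column == 0)
print(no_missing)
```

On the real data, `colSums(is.na(dades))` returning zero for every column would confirm the claim for the character columns as well. Empty strings are not counted as NA, so they would need a separate check.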
ggplot(dades, aes(x = Age)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
ggplot(dades, aes(x = Weight)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
ggplot(dades, aes(x = Physical_Activity_Frequency)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
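The messages above appear because `geom_histogram()` falls back to 30 bins when no bin width is given. The following is a minimal sketch of setting the width explicitly, assuming a made-up `example` data frame in place of `dades` and a 5-year width chosen only for illustration.

```r
library(ggplot2)

# Hypothetical ages standing in for dades$Age; not taken from the dataset.
example <- data.frame(Age = c(18, 22, 25, 31, 36, 40, 42, 47, 52, 56, 60, 64))

# binwidth = 5 groups ages into 5-year bins; boundary = 15 aligns the bin
# edges on multiples of 5 (15, 20, 25, ...).
p <- ggplot(example, aes(x = Age)) +
  geom_histogram(binwidth = 5, boundary = 15)
print(p)
```

A suitable width depends on the variable: a value that works for Age in years would be too coarse for Physical_Activity_Frequency, which ranges from 0 to 10.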
The only external data we would need is the ideal BMI for each person. For this we would use a web page that gives, for each BMI range, the body-composition category the person falls into, such as: https://www.texasheart.org/heart-health/heart-information-center/topics/calculadora-del-indice-de-masa-corporal-imc/
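The dataset contains Height and Weight but no BMI column, so the BMI would have to be computed before it can be compared with such ranges. The following is a minimal sketch, assuming Height is in metres and Weight in kilograms (consistent with the summary values above) and using a made-up `example` data frame in place of `dades`. The cut-offs of 18.5, 25 and 30 are the commonly cited adult thresholds; they and the category labels are assumptions here and may not match the linked page or the dataset's `Obesity_Level` column.

```r
# Hypothetical rows standing in for `dades`; values are illustrative.
example <- data.frame(
  Height = c(1.80, 1.60, 1.70, 1.75),   # metres (assumed unit)
  Weight = c(55.0, 60.0, 80.0, 100.0)   # kilograms (assumed unit)
)

# BMI = weight (kg) / height (m)^2
example$BMI <- example$Weight / example$Height^2

# Assumed cut-offs: <18.5, [18.5, 25), [25, 30), >= 30.
# right = FALSE makes each interval closed on the left.
example$BMI_Category <- cut(
  example$BMI,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Normal weight", "Overweight", "Obese"),
  right = FALSE
)

print(example)
```

Applied to `dades`, the same two steps would produce a numeric BMI column to use as the response variable and a derived category that could be compared with the existing `Obesity_Level`.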