Midterm Session Practicum Assignment, STA1232 Exploratory Data Analysis, IPB University
library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.0.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
df <- read_csv("C:/Users/Muhammad Hafiz F/Documents/ali/2024-2025 (Sem 4)/STA1232-Analisis Eksplorasi Data/Tugas Praktikum Sesi UTS/Sleep_health_and_lifestyle_dataset.csv")
## Rows: 374 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Gender, Occupation, BMI Category, Blood Pressure, Sleep Disorder
## dbl (8): Person ID, Age, Sleep Duration, Quality of Sleep, Physical Activity...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 374 × 13
## `Person ID` Gender Age Occupation `Sleep Duration` `Quality of Sleep`
## <dbl> <chr> <dbl> <chr> <dbl> <dbl>
## 1 1 Male 27 Software Engine… 6.1 6
## 2 2 Male 28 Doctor 6.2 6
## 3 3 Male 28 Doctor 6.2 6
## 4 4 Male 28 Sales Represent… 5.9 4
## 5 5 Male 28 Sales Represent… 5.9 4
## 6 6 Male 28 Software Engine… 5.9 4
## 7 7 Male 29 Teacher 6.3 6
## 8 8 Male 29 Doctor 7.8 7
## 9 9 Male 29 Doctor 7.8 7
## 10 10 Male 29 Doctor 7.8 7
## # ℹ 364 more rows
## # ℹ 7 more variables: `Physical Activity Level` <dbl>, `Stress Level` <dbl>,
## # `BMI Category` <chr>, `Blood Pressure` <chr>, `Heart Rate` <dbl>,
## # `Daily Steps` <dbl>, `Sleep Disorder` <chr>
## Person ID Gender Age
## 0 0 0
## Occupation Sleep Duration Quality of Sleep
## 0 0 0
## Physical Activity Level Stress Level BMI Category
## 0 0 0
## Blood Pressure Heart Rate Daily Steps
## 0 0 0
## Sleep Disorder
## 0
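The all-zero table above has the shape of a per-column missing-value count, although the call that produced it is not shown in this report. One common way to get such a count is `colSums(is.na(...))`. The sketch below runs it on a small made-up data frame (`toy` and its values are assumptions for illustration, not part of the sleep dataset).

```r
# Toy data frame, invented for illustration only (not the sleep dataset).
toy <- data.frame(
  age    = c(27, 28, NA, 29),
  gender = c("Male", "Female", "Male", NA),
  steps  = c(4200, 10000, 3000, 3500)
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column.
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)
```

On the real data the same call would be applied to `df`; a result of zero in every column means there are no `NA` values to impute or drop.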
df %>%
  count(`Sleep Disorder`) %>%
  mutate(percentage = n / sum(n) * 100)
## # A tibble: 3 × 3
## `Sleep Disorder` n percentage
## <chr> <int> <dbl>
## 1 Insomnia 77 20.6
## 2 None 219 58.6
## 3 Sleep Apnea 78 20.9
ggplot(df, aes(x = `Sleep Disorder`)) +
  geom_bar(fill = "steelblue") +
  theme_minimal() +
  labs(title = "Distribution of Sleep Disorders", x = "Sleep Disorder", y = "Count")

df %>%
  select(Age, `Sleep Duration`, `Quality of Sleep`, `Physical Activity Level`,
         `Stress Level`, `Heart Rate`, `Daily Steps`) %>%
  summary()
## Age Sleep Duration Quality of Sleep Physical Activity Level
## Min. :27.00 Min. :5.800 Min. :4.000 Min. :30.00
## 1st Qu.:35.25 1st Qu.:6.400 1st Qu.:6.000 1st Qu.:45.00
## Median :43.00 Median :7.200 Median :7.000 Median :60.00
## Mean :42.18 Mean :7.132 Mean :7.313 Mean :59.17
## 3rd Qu.:50.00 3rd Qu.:7.800 3rd Qu.:8.000 3rd Qu.:75.00
## Max. :59.00 Max. :8.500 Max. :9.000 Max. :90.00
## Stress Level Heart Rate Daily Steps
## Min. :3.000 Min. :65.00 Min. : 3000
## 1st Qu.:4.000 1st Qu.:68.00 1st Qu.: 5600
## Median :5.000 Median :70.00 Median : 7000
## Mean :5.385 Mean :70.17 Mean : 6817
## 3rd Qu.:7.000 3rd Qu.:72.00 3rd Qu.: 8000
## Max. :8.000 Max. :86.00 Max. :10000
df %>%
  pivot_longer(cols = c(Age, `Sleep Duration`, `Quality of Sleep`, `Physical Activity Level`,
                        `Stress Level`, `Heart Rate`, `Daily Steps`),
               names_to = "Feature", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_histogram(fill = "darkblue", bins = 30, alpha = 0.7) +
  facet_wrap(~ Feature, scales = "free") +
  theme_minimal()

df %>%
  select(Gender, Occupation, `BMI Category`, `Sleep Disorder`) %>%
  map(~ table(.))
## $Gender
## .
## Female Male
## 185 189
##
## $Occupation
## .
## Accountant Doctor Engineer
## 37 71 63
## Lawyer Manager Nurse
## 47 1 73
## Sales Representative Salesperson Scientist
## 2 32 4
## Software Engineer Teacher
## 4 40
##
## $`BMI Category`
## .
## Normal Normal Weight Obese Overweight
## 195 21 10 148
##
## $`Sleep Disorder`
## .
## Insomnia None Sleep Apnea
## 77 219 78
df %>%
  count(Gender, `Sleep Disorder`) %>%
  ggplot(aes(x = Gender, y = n, fill = `Sleep Disorder`)) +
  geom_col(position = "dodge") +
  theme_minimal() +
  labs(title = "Gender vs Sleep Disorder")

library(GGally)
## Warning: package 'GGally' was built under R version 4.4.3
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggcorr(df %>% select(where(is.numeric)), label = TRUE, hjust = 0.8)

ggplot(df, aes(x = `Sleep Disorder`, y = `Stress Level`, fill = `Sleep Disorder`)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Stress Level by Sleep Disorder")

EDA in the context of building a classification model for “Sleep Disorder”
Class Imbalance
- “None” (no disorder) dominates at 58.6%, while Insomnia and Sleep Apnea each account for only about 20%.
- A classifier will likely predict “None” almost every time unless the data is processed to address the imbalance.
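The size of the imbalance can be made concrete with the class counts reported earlier (77, 219, 78). The sketch below computes the accuracy of a trivial always-“None” model and one possible set of inverse-frequency class weights. Class weighting is only one of several options (resampling is another), and the report does not say which one is intended, so treat the formula as an assumption.

```r
# Class counts taken from the count() output earlier in this report.
class_counts <- c(Insomnia = 77, None = 219, `Sleep Apnea` = 78)

# Accuracy of a trivial model that always predicts the largest class.
baseline_accuracy <- max(class_counts) / sum(class_counts)

# "Balanced" weights (an assumed scheme): n_total / (n_classes * n_class).
# Rare classes get a weight above 1, the dominant class a weight below 1.
class_weights <- sum(class_counts) / (length(class_counts) * class_counts)

print(round(baseline_accuracy, 3))
print(round(class_weights, 2))
```

Any real classifier should beat the baseline accuracy to be worth keeping. How the weights are passed on depends on the modelling function, which is not chosen in this report.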
Correlation
- Sleep Duration & Quality of Sleep (-0.9) → strong negative correlation.
  - Sleeping longer ≠ sleeping better.
- Stress Level & Quality of Sleep (-0.7)
  - The more stressed, the worse the sleep quality.
- Stress Level & Heart Rate (0.7)
  - High stress → higher heart rate.
- Physical Activity & Daily Steps (0.8)
  - Walking more naturally makes a respondent count as more active.
Conclusion?
- Some variables may be redundant.
  - Sleep Duration & Quality of Sleep: if one of them is already a strong enough predictor, the other can probably be dropped.
- Stress Level looks important.
  - It is related to both Heart Rate and Sleep Quality → a candidate key feature.
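Coefficients read off a correlation heatmap are rounded, and their signs are easy to misread, so it can help to list the strongly correlated pairs numerically before deciding what is redundant. The helper below is a hypothetical sketch (`high_cor_pairs` and the 0.7 threshold are assumptions, not part of the original analysis), demonstrated on made-up data rather than the sleep dataset.

```r
# Hypothetical helper: list variable pairs whose |Pearson r| meets a threshold.
high_cor_pairs <- function(data, threshold = 0.7) {
  # Keep numeric columns only.
  num <- data[vapply(data, is.numeric, logical(1))]
  cm <- cor(num, use = "pairwise.complete.obs")
  # Look at the upper triangle only so each pair appears once.
  idx <- which(upper.tri(cm) & abs(cm) >= threshold, arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
}

# Made-up data: y is an exact decreasing linear function of x, z is unrelated.
toy <- data.frame(
  x = 1:10,
  y = -2 * (1:10) + 3,
  z = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3)
)
pairs_found <- high_cor_pairs(toy, threshold = 0.7)
print(pairs_found)
```

Applied to the numeric columns of `df`, the `r` column would show both the sign and the magnitude of each strong pair, which is the information needed to decide which of two overlapping features to keep.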
Sleep Disorder vs Gender
- Men are more often “None” (no disorder).
- Women have more Insomnia.
- Sleep Apnea is fairly balanced, though slightly more common among men.
- This could reflect biological factors or simply reporting bias.
Important?
- If Gender turns out to be a strong predictor, keep it.
- If it is only noise, it can be dropped.
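One way to judge whether Gender carries signal or only noise is a chi-squared test of independence together with an effect size such as Cramér's V. The per-gender counts are not printed in this report, so the table below uses invented numbers purely to show the mechanics. They are hypothetical and must not be read as results for the sleep dataset.

```r
# HYPOTHETICAL counts, invented for illustration (not from the sleep dataset).
gender_by_disorder <- matrix(
  c(30, 40, 30,
    10, 80, 10),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Gender = c("Female", "Male"),
    `Sleep Disorder` = c("Insomnia", "None", "Sleep Apnea")
  )
)

# Chi-squared test of independence between the two categorical variables.
chi <- chisq.test(gender_by_disorder)

# Cramér's V: effect size between 0 (no association) and 1 (perfect association).
cramers_v <- sqrt(
  unname(chi$statistic) /
    (sum(gender_by_disorder) * (min(dim(gender_by_disorder)) - 1))
)

print(chi$p.value)
print(round(cramers_v, 3))
```

With the real data, the table would come from something like `table(df$Gender, df$`Sleep Disorder`)`. A small p-value with a very small V would suggest an association too weak to matter much for prediction.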
Stress vs Sleep Disorder
- Insomnia → high stress (~7)
- Sleep Apnea → higher than Insomnia.
- “None” → lower (~5-6).
Conclusion?
- Stress could be a key feature.
- But is it a cause or a consequence of the sleep disorder?
Data Distribution & Occupation
- Age: mostly 30-50 years.
- Daily Steps: wide range (3k-10k); the data is fairly noisy.
- Heart Rate: mostly 65-75 bpm.
- Physical Activity Level: the distribution looks rather odd for naturally occurring data (although this is, admittedly, a synthetic dataset).
- Occupation:
  - Many Nurses and Doctors → a bias toward healthcare workers?
  - Only 4 Scientists and 4 Software Engineers → probably too few to be informative.
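If Occupation is kept as a feature, one option for the very small groups is to pool them into a single level. The sketch below rebuilds the occupation column from the counts in the frequency table earlier in this report and lumps every level with fewer than 10 respondents using `forcats::fct_lump_min()`. The cut-off of 10 is an arbitrary assumption, not something the analysis prescribes.

```r
library(forcats)

# Occupation counts copied from the table() output earlier in this report.
occupation_counts <- c(
  Accountant = 37, Doctor = 71, Engineer = 63, Lawyer = 47,
  Manager = 1, Nurse = 73, `Sales Representative` = 2,
  Salesperson = 32, Scientist = 4, `Software Engineer` = 4, Teacher = 40
)

# Rebuild a factor with one entry per respondent.
occupation <- factor(rep(names(occupation_counts), times = occupation_counts))

# Pool every level that appears fewer than 10 times (arbitrary threshold).
occupation_lumped <- fct_lump_min(occupation, min = 10, other_level = "Other")

print(table(occupation_lumped))
```

With this threshold, Manager, Sales Representative, Scientist, and Software Engineer are pooled, so a model sees one small “Other” group instead of four levels with only a handful of rows each.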
Final Conclusions
- Imbalance is a problem → it must be handled so the model does not keep predicting “None”.
- Stress and Sleep Quality look like important features.
- Several features are correlated and may need to be dropped or merged.
- Gender and Occupation may be noise.