Practicum Assignment, Midterm Session, STA1232-Exploratory Data Analysis, IPB University
library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
df <- read_csv("C:/Users/Muhammad Hafiz F/Documents/ali/2024-2025 (Sem 4)/STA1232-Analisis Eksplorasi Data/Tugas Praktikum Sesi UTS/Sleep_health_and_lifestyle_dataset.csv")
## Rows: 374 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Gender, Occupation, BMI Category, Blood Pressure, Sleep Disorder
## dbl (8): Person ID, Age, Sleep Duration, Quality of Sleep, Physical Activity...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df
## # A tibble: 374 × 13
##    `Person ID` Gender   Age Occupation       `Sleep Duration` `Quality of Sleep`
##          <dbl> <chr>  <dbl> <chr>                       <dbl>              <dbl>
##  1           1 Male      27 Software Engine…              6.1                  6
##  2           2 Male      28 Doctor                        6.2                  6
##  3           3 Male      28 Doctor                        6.2                  6
##  4           4 Male      28 Sales Represent…              5.9                  4
##  5           5 Male      28 Sales Represent…              5.9                  4
##  6           6 Male      28 Software Engine…              5.9                  4
##  7           7 Male      29 Teacher                       6.3                  6
##  8           8 Male      29 Doctor                        7.8                  7
##  9           9 Male      29 Doctor                        7.8                  7
## 10          10 Male      29 Doctor                        7.8                  7
## # ℹ 364 more rows
## # ℹ 7 more variables: `Physical Activity Level` <dbl>, `Stress Level` <dbl>,
## #   `BMI Category` <chr>, `Blood Pressure` <chr>, `Heart Rate` <dbl>,
## #   `Daily Steps` <dbl>, `Sleep Disorder` <chr>
colSums(is.na(df))
##               Person ID                  Gender                     Age 
##                       0                       0                       0 
##              Occupation          Sleep Duration        Quality of Sleep 
##                       0                       0                       0 
## Physical Activity Level            Stress Level            BMI Category 
##                       0                       0                       0 
##          Blood Pressure              Heart Rate             Daily Steps 
##                       0                       0                       0 
##          Sleep Disorder 
##                       0
df %>%
  count(`Sleep Disorder`) %>%
  mutate(percentage = n / sum(n) * 100)
## # A tibble: 3 × 3
##   `Sleep Disorder`     n percentage
##   <chr>            <int>      <dbl>
## 1 Insomnia            77       20.6
## 2 None               219       58.6
## 3 Sleep Apnea         78       20.9
ggplot(df, aes(x = `Sleep Disorder`)) +
  geom_bar(fill = "steelblue") +
  theme_minimal() +
  labs(title = "Distribution of Sleep Disorders", x = "Sleep Disorder", y = "Count")

df %>%
  select(Age, `Sleep Duration`, `Quality of Sleep`, `Physical Activity Level`,
         `Stress Level`, `Heart Rate`, `Daily Steps`) %>%
  summary()
##       Age        Sleep Duration  Quality of Sleep Physical Activity Level
##  Min.   :27.00   Min.   :5.800   Min.   :4.000    Min.   :30.00          
##  1st Qu.:35.25   1st Qu.:6.400   1st Qu.:6.000    1st Qu.:45.00          
##  Median :43.00   Median :7.200   Median :7.000    Median :60.00          
##  Mean   :42.18   Mean   :7.132   Mean   :7.313    Mean   :59.17          
##  3rd Qu.:50.00   3rd Qu.:7.800   3rd Qu.:8.000    3rd Qu.:75.00          
##  Max.   :59.00   Max.   :8.500   Max.   :9.000    Max.   :90.00          
##   Stress Level     Heart Rate     Daily Steps   
##  Min.   :3.000   Min.   :65.00   Min.   : 3000  
##  1st Qu.:4.000   1st Qu.:68.00   1st Qu.: 5600  
##  Median :5.000   Median :70.00   Median : 7000  
##  Mean   :5.385   Mean   :70.17   Mean   : 6817  
##  3rd Qu.:7.000   3rd Qu.:72.00   3rd Qu.: 8000  
##  Max.   :8.000   Max.   :86.00   Max.   :10000
df %>%
  pivot_longer(cols = c(Age, `Sleep Duration`, `Quality of Sleep`, `Physical Activity Level`, `Stress Level`, `Heart Rate`, `Daily Steps`),
               names_to = "Feature", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_histogram(fill = "darkblue", bins = 30, alpha = 0.7) +
  facet_wrap(~ Feature, scales = "free") +
  theme_minimal()

df %>%
  select(Gender, Occupation, `BMI Category`, `Sleep Disorder`) %>%
  map(~ table(.))
## $Gender
## .
## Female   Male 
##    185    189 
## 
## $Occupation
## .
##           Accountant               Doctor             Engineer 
##                   37                   71                   63 
##               Lawyer              Manager                Nurse 
##                   47                    1                   73 
## Sales Representative          Salesperson            Scientist 
##                    2                   32                    4 
##    Software Engineer              Teacher 
##                    4                   40 
## 
## $`BMI Category`
## .
##        Normal Normal Weight         Obese    Overweight 
##           195            21            10           148 
## 
## $`Sleep Disorder`
## .
##    Insomnia        None Sleep Apnea 
##          77         219          78
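The `BMI Category` table above lists both "Normal" and "Normal Weight". If these are two spellings of the same category (an assumption that should be checked against the dataset's documentation), they can be merged before modelling. A minimal sketch, rebuilding the column from the counts printed above so it runs without the CSV:

```r
# BMI labels rebuilt from the counts in the table above (no CSV needed).
bmi <- rep(c("Normal", "Normal Weight", "Obese", "Overweight"),
           times = c(195, 21, 10, 148))

# Assumption: "Normal Weight" is a duplicate spelling of "Normal".
bmi_clean <- ifelse(bmi == "Normal Weight", "Normal", bmi)

print(table(bmi_clean))
```

On the real data the same recoding would be applied to the `BMI Category` column of `df`.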
df %>%
  count(Gender, `Sleep Disorder`) %>%
  ggplot(aes(x = Gender, y = n, fill = `Sleep Disorder`)) +
  geom_col(position = "dodge") +
  theme_minimal() +
  labs(title = "Gender vs Sleep Disorder")

library(GGally)
## Warning: package 'GGally' was built under R version 4.4.3
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
# select(where(is.numeric)) is the current replacement for the superseded select_if()
ggcorr(df %>% select(where(is.numeric)), label = TRUE, hjust = 0.8)

ggplot(df, aes(x = `Sleep Disorder`, y = `Stress Level`, fill = `Sleep Disorder`)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Stress Level by Sleep Disorder")

1 EDA in the context of building a classification model for “Sleep Disorder”

1.1 Class Imbalance

  • “None” (no disorder) dominates (58.6%), while Insomnia and Sleep Apnea each account for only about 20% (20.6% and 20.9%).
  • A classifier is likely to predict “None” most of the time unless the data are processed to address the imbalance.
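To make the imbalance concrete, the sketch below rebuilds the label vector from the class counts reported earlier (77 / 219 / 78) and computes two quantities: the accuracy of a trivial model that always predicts the majority class, and inverse-frequency class weights. The weighting formula n / (k · n_class) is one common convention and is an assumption here, not something this analysis prescribes.

```r
# Label vector rebuilt from the class counts reported above (no CSV needed).
labels <- rep(c("Insomnia", "None", "Sleep Apnea"), times = c(77, 219, 78))
counts <- table(labels)

# Accuracy of a trivial model that always predicts the majority class.
majority_class <- names(counts)[which.max(counts)]
baseline_accuracy <- max(counts) / sum(counts)

# Inverse-frequency class weights: n / (k * n_class).
# Assumption: this "balanced" convention is only one of several options.
class_weights <- as.numeric(sum(counts)) / (length(counts) * as.numeric(counts))
names(class_weights) <- names(counts)

print(majority_class)
print(round(baseline_accuracy, 3))
print(round(class_weights, 2))
```

Any real model should beat `baseline_accuracy`, and per-class recall is a more informative check than overall accuracy on data like this.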

1.2 Correlation

  • Sleep Duration & Quality of Sleep (-0.9): strong negative correlation.
    • Longer sleep ≠ better-quality sleep.
  • Stress Level & Quality of Sleep (-0.7)
    • Higher stress goes with poorer sleep quality.
  • Stress Level & Heart Rate (0.7)
    • Higher stress goes with a higher heart rate.
  • Physical Activity Level & Daily Steps (0.8)
    • More walking naturally makes a respondent count as more active.
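Labels on a `ggcorr` plot are rounded to one decimal, and the sign is easy to misread from the colour scale, so it can help to list the strongest pairs numerically. The sketch below does this with base `cor()`. The helper name `strong_pairs`, the 0.7 threshold, and the simulated `toy` data are all hypothetical; on the real data the call would be `strong_pairs(df)`.

```r
# Hypothetical helper: list variable pairs whose |Pearson r| >= threshold.
strong_pairs <- function(data, threshold = 0.7) {
  num <- data[vapply(data, is.numeric, logical(1))]  # numeric columns only
  r <- cor(num, use = "pairwise.complete.obs")
  # Upper triangle only, so each pair appears once and the diagonal is skipped.
  idx <- which(upper.tri(r) & abs(r) >= threshold, arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(r)[idx[, "row"]],
    var2 = colnames(r)[idx[, "col"]],
    r    = r[idx],
    stringsAsFactors = FALSE
  )
  out[order(-abs(out$r)), ]
}

# Simulated toy data (NOT the sleep dataset): b is built to move against a.
set.seed(1)
x <- rnorm(200)
toy <- data.frame(a = x, b = -x + rnorm(200, sd = 0.2), c = rnorm(200))

res <- strong_pairs(toy)
print(res)
```

Printing the signed value of `r` next to each pair makes it harder to confuse a strong positive with a strong negative relationship.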

1.2.1 Conclusion?

  • Some variables may be redundant.
    • Sleep Duration & Quality of Sleep: if one of them is already a strong enough predictor, the other could probably be dropped.
  • Stress Level looks important.
    • It is related to both Heart Rate and Quality of Sleep, which makes it a candidate key feature.

1.3 Sleep Disorder vs Gender

  • More men fall in “None” (no disorder).
  • More women have Insomnia.
  • Sleep Apnea is fairly balanced, though slightly more common among men.
  • This could reflect a biological factor or simply reporting bias.

1.3.1 Does it matter?

  • If Gender turns out to be a strong predictor, keep it.
  • If it is only noise, it can be dropped.
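One way to get a first indication of whether Gender carries signal is to look at the cross-tabulation with row shares and a chi-squared test of independence. The counts below are hypothetical and chosen only so the sketch runs without the CSV; on the real data the table would be `table(df$Gender, df$`Sleep Disorder`)`.

```r
# Hypothetical counts for illustration only; they are NOT the real data.
toy <- data.frame(
  gender = rep(c("Female", "Male"), times = c(100, 100)),
  sleep_disorder = c(
    rep(c("Insomnia", "None", "Sleep Apnea"), times = c(30, 50, 20)),
    rep(c("Insomnia", "None", "Sleep Apnea"), times = c(20, 60, 20))
  )
)

# Cross-tabulation and the share of each disorder within each gender.
tab <- table(toy$gender, toy$sleep_disorder)
row_shares <- prop.table(tab, margin = 1)

# Chi-squared test of independence between the two categorical variables.
chi <- chisq.test(tab)

print(tab)
print(round(row_shares, 2))
print(chi$p.value)
```

A small p-value only says the two variables are associated; whether Gender improves the classifier still has to be checked on held-out data.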

1.4 Stress vs Sleep Disorder

  • Insomnia → high stress (~7).
  • Sleep Apnea → higher than Insomnia.
  • “None” → lower (~5-6).
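Values read off a boxplot are approximate, so a per-group numeric summary is a useful companion. The sketch below computes group size, median, and quartiles, followed by a Kruskal-Wallis rank test. The `toy` values are hypothetical and carry no information about the real dataset; on the real data the inputs would be `df$`Stress Level`` and `df$`Sleep Disorder``.

```r
# Hypothetical toy values for illustration only; they are NOT the real data.
toy <- data.frame(
  sleep_disorder = rep(c("Insomnia", "None", "Sleep Apnea"), each = 5),
  stress_level   = c(7, 7, 6, 8, 7,   5, 4, 6, 5, 3,   6, 8, 5, 7, 6)
)

# Per-group size, median and quartiles of stress level.
groups <- split(toy$stress_level, toy$sleep_disorder)
stress_summary <- do.call(rbind, lapply(groups, function(x) {
  data.frame(
    n      = length(x),
    q1     = quantile(x, 0.25, names = FALSE),
    median = median(x),
    q3     = quantile(x, 0.75, names = FALSE)
  )
}))
print(stress_summary)

# Rank-based test of whether the groups differ in location.
kw <- kruskal.test(toy$stress_level, factor(toy$sleep_disorder))
print(kw$p.value)
```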

1.4.1 Conclusion?

  • Stress could be a key feature.
    • But is it a cause or a consequence of the sleep disorder?

1.5 Data Distribution & Occupation

  • Age: mostly 30-50 years.
  • Daily Steps: wide range (3k-10k); the data are fairly noisy.
  • Heart Rate: mostly 65-75 bpm.
  • Physical Activity Level: the distribution looks rather odd for naturally occurring data (although this is in fact a synthetic dataset).
  • Occupation:
    • Many Nurses & Doctors → possible bias toward healthcare workers?
    • Only 4 Scientists and 4 Software Engineers → probably too few to be informative.
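A common way to handle very small occupation groups is to pool them into a single "Other" level before modelling. The sketch below rebuilds the occupation column from the counts in the frequency table shown earlier. The helper `lump_rare` and the cutoff of 10 are hypothetical choices; `forcats::fct_lump_min()` offers similar behaviour in the tidyverse.

```r
# Occupation column rebuilt from the frequency table above (no CSV needed).
occupation_counts <- c(
  Accountant = 37, Doctor = 71, Engineer = 63, Lawyer = 47, Manager = 1,
  Nurse = 73, `Sales Representative` = 2, Salesperson = 32, Scientist = 4,
  `Software Engineer` = 4, Teacher = 40
)
occupation <- rep(names(occupation_counts), times = occupation_counts)

# Hypothetical helper: replace levels seen fewer than min_count times.
lump_rare <- function(x, min_count = 10, other = "Other") {
  counts <- table(x)
  rare <- names(counts)[counts < min_count]
  ifelse(x %in% rare, other, x)
}

lumped <- lump_rare(occupation)
print(sort(table(lumped), decreasing = TRUE))
```

With this cutoff, Manager, Sales Representative, Scientist, and Software Engineer are pooled, leaving eight levels in total.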

1.6 Final Conclusions

  • Imbalance = problem → it must be addressed so the model does not keep classifying everything as “None”.
  • Stress and sleep quality look like important features.
  • Several features are correlated; some may need to be dropped or merged.
  • Gender & Occupation may turn out to be noise.
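When the classifier is eventually evaluated, a stratified train/test split keeps the class shares roughly equal in both parts, so the minority classes are not under-represented in the test set by chance. This is a minimal base R sketch; the helper `stratified_split`, the 80/20 ratio, and the seed are assumptions made for illustration.

```r
# Hypothetical helper: sample train_frac of the row indices within each class.
stratified_split <- function(labels, train_frac = 0.8, seed = 42) {
  set.seed(seed)
  idx_by_class <- split(seq_along(labels), labels)
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    n_train <- floor(train_frac * length(idx))
    # Index through sample.int() to avoid sample()'s single-number behaviour.
    idx[sample.int(length(idx), size = n_train)]
  }), use.names = FALSE)
  sort(train_idx)
}

# Labels rebuilt from the class counts reported earlier (no CSV needed).
labels <- rep(c("Insomnia", "None", "Sleep Apnea"), times = c(77, 219, 78))

train_idx <- stratified_split(labels)
test_idx  <- setdiff(seq_along(labels), train_idx)

# Class shares should be close to each other in both parts.
print(round(prop.table(table(labels[train_idx])), 3))
print(round(prop.table(table(labels[test_idx])), 3))
```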