This assignment analyses the Sleep Health & Lifestyle Dataset to understand how lifestyle factors such as occupational role and psychological stress could shape two core sleep outcomes: sleep duration and self-rated sleep quality. Understanding these relationships is relevant to us both as individuals and as corporations designing workplace wellness programs to improve employee performance.
APPROACH
To analyse the dataset, I will implement a structured data science pipeline using the “tidyverse” framework, as follows:
Load the CSV dataset in R and commit it to an existing GitHub repository so that it remains accessible at any time.
Rename the variables so that they follow a consistent naming convention (snake_case).
Transform the dataset from wide to long format using the pivot_longer() function, so that sleep duration and sleep quality are stored in a single metric column for easier modelling and faceted plotting.
Analyse and visualise the correlation between sleep duration and sleep quality by occupation, then interpret the result.
Analyse and visualise the correlation between sleep duration and sleep quality by stress level, then interpret the result.
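The reshaping and correlation steps listed above can be sketched on a small toy table. This is a minimal illustration, not the analysis itself: the object names (`sleep_toy`, `sleep_long`, `cor_by_occupation`), the snake_case column names, and the six rows of data are assumptions made for the example.

```r
library(tidyverse)

# Hypothetical toy data; the column names assume the cleaned snake_case form
sleep_toy <- tibble(
  person_id        = 1:6,
  occupation       = c("Doctor", "Doctor", "Doctor",
                       "Teacher", "Teacher", "Teacher"),
  sleep_duration   = c(6.2, 7.8, 7.1, 6.3, 6.9, 7.4),
  quality_of_sleep = c(6, 7, 7, 6, 7, 8)
)

# Wide to long: one row per person and metric, ready for faceted plots
sleep_long <- sleep_toy |>
  pivot_longer(
    cols      = c(sleep_duration, quality_of_sleep),
    names_to  = "metric",
    values_to = "value"
  )

# Pearson correlation between duration and quality within each occupation;
# this uses the wide table because cor() needs the two measures side by side
cor_by_occupation <- sleep_toy |>
  group_by(occupation) |>
  summarise(
    n = n(),
    r = cor(sleep_duration, quality_of_sleep),
    .groups = "drop"
  )

print(sleep_long)
print(cor_by_occupation)
```

The long table suits `facet_wrap(~ metric)` in ggplot2, while the correlation is computed on the wide table. Replacing `occupation` with a stress-level column in `group_by()` would give the second comparison in the same way.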
Load Libraries
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.5.2
Warning: package 'tibble' was built under R version 4.5.2
Warning: package 'tidyr' was built under R version 4.5.2
Warning: package 'readr' was built under R version 4.5.2
Warning: package 'purrr' was built under R version 4.5.2
Warning: package 'dplyr' was built under R version 4.5.2
Warning: package 'stringr' was built under R version 4.5.2
Warning: package 'forcats' was built under R version 4.5.2
Warning: package 'lubridate' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor) # for cleaning column names
Warning: package 'janitor' was built under R version 4.5.2
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
Load and clean the data
I will read the file from my GitHub repository. For the cleaning, I will need the help of the LLM Claude, because some column names are inconsistent and some variables are compound (several values stored in one column). Here is the prompt: “Clean this dataset by renaming to snake_case and parsing compounds variables.”
Rows: 374 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Gender, Occupation, BMI Category, Blood Pressure, Sleep Disorder
dbl (8): Person ID, Age, Sleep Duration, Quality of Sleep, Physical Activity...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(Sleep_health_and_lifestyle,10)
# A tibble: 10 × 13
`Person ID` Gender Age Occupation `Sleep Duration` `Quality of Sleep`
<dbl> <chr> <dbl> <chr> <dbl> <dbl>
1 1 Male 27 Software Engine… 6.1 6
2 2 Male 28 Doctor 6.2 6
3 3 Male 28 Doctor 6.2 6
4 4 Male 28 Sales Represent… 5.9 4
5 5 Male 28 Sales Represent… 5.9 4
6 6 Male 28 Software Engine… 5.9 4
7 7 Male 29 Teacher 6.3 6
8 8 Male 29 Doctor 7.8 7
9 9 Male 29 Doctor 7.8 7
10 10 Male 29 Doctor 7.8 7
# ℹ 7 more variables: `Physical Activity Level` <dbl>, `Stress Level` <dbl>,
# `BMI Category` <chr>, `Blood Pressure` <chr>, `Heart Rate` <dbl>,
# `Daily Steps` <dbl>, `Sleep Disorder` <chr>
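The cleaning described in this section (snake_case names and splitting compound variables) might look like the sketch below. It runs on a hypothetical three-row sample rather than the real file. The sleep values are copied from the preview above, but the `Blood Pressure` values, the assumption that this column is stored as "systolic/diastolic" text, and the new column names `systolic_bp` and `diastolic_bp` are illustrative only.

```r
library(tidyverse)
library(janitor)

# Hypothetical sample using the raw column names shown in the preview;
# the Blood Pressure values are made up for illustration
raw_sample <- tibble(
  `Person ID`        = c(1, 2, 4),
  `Sleep Duration`   = c(6.1, 6.2, 5.9),
  `Quality of Sleep` = c(6, 6, 4),
  `Blood Pressure`   = c("126/83", "125/80", "140/90")
)

clean_sample <- raw_sample |>
  clean_names() |>                      # e.g. `Person ID` becomes person_id
  separate_wider_delim(                 # split the compound text column
    blood_pressure,
    delim = "/",
    names = c("systolic_bp", "diastolic_bp")
  ) |>
  mutate(across(c(systolic_bp, diastolic_bp), as.integer))

print(clean_sample)
```

`separate_wider_delim()` requires tidyr 1.3.0 or later, which is consistent with the tidyr 1.3.2 version attached in this report.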