
This report presents a detailed analysis of sleep health and lifestyle factors using data exploration techniques.

1. The Purpose of the Project

Sleep quality is an essential component of a healthy lifestyle. However, modern lifestyles have led to an increase in sleep-related issues such as insomnia and sleep apnea.

The purpose of this project is to analyze the relationship between lifestyle factors (such as stress, physical activity, and BMI) and sleep health using data analysis techniques.

2. Data Features

The dataset consists of the following features:
- Gender, Age, Occupation
- Sleep Duration, Quality of Sleep
- Physical Activity Level, Stress Level
- BMI Category, Blood Pressure
- Heart Rate, Daily Steps
- Sleep Disorder (Target variable)

3. Importing Libraries

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.5.3
## corrplot 0.95 loaded

4. Reading Data

knitr::opts_chunk$set(fig.width = 10, fig.height = 6)
data <- read.csv("Sleep_health_and_lifestyle_dataset.csv")
head(data)
##   Person.ID Gender Age           Occupation Sleep.Duration Quality.of.Sleep
## 1         1   Male  27    Software Engineer            6.1                6
## 2         2   Male  28               Doctor            6.2                6
## 3         3   Male  28               Doctor            6.2                6
## 4         4   Male  28 Sales Representative            5.9                4
## 5         5   Male  28 Sales Representative            5.9                4
## 6         6   Male  28    Software Engineer            5.9                4
##   Physical.Activity.Level Stress.Level BMI.Category Blood.Pressure Heart.Rate
## 1                      42            6   Overweight         126/83         77
## 2                      60            8       Normal         125/80         75
## 3                      60            8       Normal         125/80         75
## 4                      30            8        Obese         140/90         85
## 5                      30            8        Obese         140/90         85
## 6                      30            8        Obese         140/90         85
##   Daily.Steps Sleep.Disorder
## 1        4200           None
## 2       10000           None
## 3       10000           None
## 4        3000    Sleep Apnea
## 5        3000    Sleep Apnea
## 6        3000       Insomnia

5. Statistical Information

cat("The dimension of data is:", dim(data))
## The dimension of data is: 374 13
str(data)
## 'data.frame':    374 obs. of  13 variables:
##  $ Person.ID              : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Gender                 : chr  "Male" "Male" "Male" "Male" ...
##  $ Age                    : int  27 28 28 28 28 28 29 29 29 29 ...
##  $ Occupation             : chr  "Software Engineer" "Doctor" "Doctor" "Sales Representative" ...
##  $ Sleep.Duration         : num  6.1 6.2 6.2 5.9 5.9 5.9 6.3 7.8 7.8 7.8 ...
##  $ Quality.of.Sleep       : int  6 6 6 4 4 4 6 7 7 7 ...
##  $ Physical.Activity.Level: int  42 60 60 30 30 30 40 75 75 75 ...
##  $ Stress.Level           : int  6 8 8 8 8 8 7 6 6 6 ...
##  $ BMI.Category           : chr  "Overweight" "Normal" "Normal" "Obese" ...
##  $ Blood.Pressure         : chr  "126/83" "125/80" "125/80" "140/90" ...
##  $ Heart.Rate             : int  77 75 75 85 85 85 82 70 70 70 ...
##  $ Daily.Steps            : int  4200 10000 10000 3000 3000 3000 3500 8000 8000 8000 ...
##  $ Sleep.Disorder         : chr  "None" "None" "None" "Sleep Apnea" ...
summary(data)
##    Person.ID         Gender               Age         Occupation       
##  Min.   :  1.00   Length:374         Min.   :27.00   Length:374        
##  1st Qu.: 94.25   Class :character   1st Qu.:35.25   Class :character  
##  Median :187.50   Mode  :character   Median :43.00   Mode  :character  
##  Mean   :187.50                      Mean   :42.18                     
##  3rd Qu.:280.75                      3rd Qu.:50.00                     
##  Max.   :374.00                      Max.   :59.00                     
##  Sleep.Duration  Quality.of.Sleep Physical.Activity.Level  Stress.Level  
##  Min.   :5.800   Min.   :4.000    Min.   :30.00           Min.   :3.000  
##  1st Qu.:6.400   1st Qu.:6.000    1st Qu.:45.00           1st Qu.:4.000  
##  Median :7.200   Median :7.000    Median :60.00           Median :5.000  
##  Mean   :7.132   Mean   :7.313    Mean   :59.17           Mean   :5.385  
##  3rd Qu.:7.800   3rd Qu.:8.000    3rd Qu.:75.00           3rd Qu.:7.000  
##  Max.   :8.500   Max.   :9.000    Max.   :90.00           Max.   :8.000  
##  BMI.Category       Blood.Pressure       Heart.Rate     Daily.Steps   
##  Length:374         Length:374         Min.   :65.00   Min.   : 3000  
##  Class :character   Class :character   1st Qu.:68.00   1st Qu.: 5600  
##  Mode  :character   Mode  :character   Median :70.00   Median : 7000  
##                                        Mean   :70.17   Mean   : 6817  
##                                        3rd Qu.:72.00   3rd Qu.: 8000  
##                                        Max.   :86.00   Max.   :10000  
##  Sleep.Disorder    
##  Length:374        
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

Observations:
- The dataset contains 374 rows and 13 columns
- No missing values are present
- The variables are a mix of numerical and categorical types

6. Data Cleaning and Preprocessing

Standardizing column names

colnames(data) <- make.names(colnames(data))

Checking missing values

colSums(is.na(data))
##               Person.ID                  Gender                     Age 
##                       0                       0                       0 
##              Occupation          Sleep.Duration        Quality.of.Sleep 
##                       0                       0                       0 
## Physical.Activity.Level            Stress.Level            BMI.Category 
##                       0                       0                       0 
##          Blood.Pressure              Heart.Rate             Daily.Steps 
##                       0                       0                       0 
##          Sleep.Disorder 
##                       0

Checking duplicate rows

sum(duplicated(data))
## [1] 0

Splitting Blood Pressure into numeric columns

data <- data %>%
  separate(Blood.Pressure, into = c("Systolic", "Diastolic"), sep = "/", convert = TRUE)

Checking structure after transformation

str(data)
## 'data.frame':    374 obs. of  14 variables:
##  $ Person.ID              : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Gender                 : chr  "Male" "Male" "Male" "Male" ...
##  $ Age                    : int  27 28 28 28 28 28 29 29 29 29 ...
##  $ Occupation             : chr  "Software Engineer" "Doctor" "Doctor" "Sales Representative" ...
##  $ Sleep.Duration         : num  6.1 6.2 6.2 5.9 5.9 5.9 6.3 7.8 7.8 7.8 ...
##  $ Quality.of.Sleep       : int  6 6 6 4 4 4 6 7 7 7 ...
##  $ Physical.Activity.Level: int  42 60 60 30 30 30 40 75 75 75 ...
##  $ Stress.Level           : int  6 8 8 8 8 8 7 6 6 6 ...
##  $ BMI.Category           : chr  "Overweight" "Normal" "Normal" "Obese" ...
##  $ Systolic               : int  126 125 125 140 140 140 140 120 120 120 ...
##  $ Diastolic              : int  83 80 80 90 90 90 90 80 80 80 ...
##  $ Heart.Rate             : int  77 75 75 85 85 85 82 70 70 70 ...
##  $ Daily.Steps            : int  4200 10000 10000 3000 3000 3000 3500 8000 8000 8000 ...
##  $ Sleep.Disorder         : chr  "None" "None" "None" "Sleep Apnea" ...

Observations:
- The dataset was checked for missing values, and no null values were found, indicating a complete dataset.
- Duplicate records were examined, and no duplicates were detected.
- Column names were standardized using make.names() to ensure consistency and avoid errors during analysis.
- The Blood Pressure column, originally in string format (e.g., “120/80”), was split into two separate numerical variables, Systolic and Diastolic, enabling more detailed and meaningful analysis.
- Data types were verified and converted where necessary to ensure proper statistical analysis.
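The blood pressure split can also be checked in isolation. The following is a minimal base R sketch, not the method used in this report (which relies on tidyr::separate). It uses only the six rows printed by head(data) above. The helper name split_blood_pressure is hypothetical, and unlike separate() it appends the new columns at the end of the data frame.

```r
# First six rows as printed by head(data) in section 4 (only the columns needed here).
sample_rows <- data.frame(
  Person.ID      = 1:6,
  Blood.Pressure = c("126/83", "125/80", "125/80", "140/90", "140/90", "140/90"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split "systolic/diastolic" strings into two integer columns.
split_blood_pressure <- function(df, col = "Blood.Pressure") {
  parts <- strsplit(df[[col]], "/", fixed = TRUE)
  # Every entry should have exactly two parts; fail early otherwise.
  stopifnot(all(lengths(parts) == 2))
  df$Systolic  <- as.integer(vapply(parts, `[`, character(1), 1))
  df$Diastolic <- as.integer(vapply(parts, `[`, character(1), 2))
  df[[col]] <- NULL  # drop the original string column
  df
}

cleaned <- split_blood_pressure(sample_rows)

# Sanity checks: the conversion introduced no NAs, and systolic exceeds diastolic.
bp_ok <- sum(is.na(cleaned)) == 0 && all(cleaned$Systolic > cleaned$Diastolic)
print(cleaned)
print(bp_ok)
```

Running the same two checks on the full dataset after the split would be a quick way to catch malformed blood pressure strings.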

7. Exploratory Data Analysis (EDA)

EDA is performed to understand the dataset, identify patterns and relationships, and detect any anomalies.

Column Names and Unique Values

colnames(data)
##  [1] "Person.ID"               "Gender"                 
##  [3] "Age"                     "Occupation"             
##  [5] "Sleep.Duration"          "Quality.of.Sleep"       
##  [7] "Physical.Activity.Level" "Stress.Level"           
##  [9] "BMI.Category"            "Systolic"               
## [11] "Diastolic"               "Heart.Rate"             
## [13] "Daily.Steps"             "Sleep.Disorder"
cat("Number of unique values in each column:\n")
## Number of unique values in each column:
sapply(data, function(x) length(unique(x)))
##               Person.ID                  Gender                     Age 
##                     374                       2                      31 
##              Occupation          Sleep.Duration        Quality.of.Sleep 
##                      11                      27                       6 
## Physical.Activity.Level            Stress.Level            BMI.Category 
##                      16                       6                       4 
##                Systolic               Diastolic              Heart.Rate 
##                      18                      17                      19 
##             Daily.Steps          Sleep.Disorder 
##                      20                       3

Inference: The dataset contains a mix of categorical and numerical variables, with varying levels of diversity in each feature.

Sleep Disorder Distribution

ggplot(data, aes(x = Sleep.Disorder, fill = Sleep.Disorder)) + geom_bar()

Inference: The majority of individuals fall under “None”, indicating no disorder, while insomnia and sleep apnea form smaller but important groups.
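The bar chart can be backed by exact counts and percentages. The sketch below shows one way to do this with table() and prop.table(). It uses only the six Sleep.Disorder values printed by head(data), so its numbers illustrate the mechanics and do not describe the full 374-row dataset.

```r
# Sleep.Disorder values from the six rows printed by head(data); not the full dataset.
sleep_disorder <- c("None", "None", "None", "Sleep Apnea", "Sleep Apnea", "Insomnia")

# Absolute counts per category.
disorder_counts <- table(sleep_disorder)

# Share of each category, in percent, rounded to one decimal place.
disorder_share <- round(100 * prop.table(disorder_counts), 1)

print(disorder_counts)
print(disorder_share)
```

On the full data, the same two calls applied to data$Sleep.Disorder would give the figures behind the plot.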

Gender vs Sleep Disorder

ggplot(data, aes(x = Gender, fill = Sleep.Disorder)) + geom_bar(position = "dodge")

Inference: Males show higher insomnia cases, whereas females show higher sleep apnea cases, indicating possible biological or lifestyle influences.

Occupation vs Sleep Disorder

ggplot(data, aes(x = Occupation, fill = Sleep.Disorder)) +
  geom_bar(position = "dodge") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Inference: Occupations like sales and nursing show higher disorder rates, suggesting job-related stress and irregular schedules impact sleep.

Stress Level vs Sleep Quality

ggplot(data, aes(x = Stress.Level, y = Quality.of.Sleep)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", color = "blue") +
  theme_minimal(base_size = 16) +
  labs(title = "Stress vs Sleep Quality")
## `geom_smooth()` using formula = 'y ~ x'

Inference: There is a clear negative relationship: higher stress levels are associated with poorer sleep quality.
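The fitted line shows the direction of the trend, and a correlation coefficient is one way to put a number on it. Both variables are integer scores, so the sketch below assumes a rank-based (Spearman) coefficient is a reasonable choice; that choice is not part of the original analysis. The values are the six rows printed by head(data), used only to show the call.

```r
# Stress.Level and Quality.of.Sleep from the six rows printed by head(data).
stress_level     <- c(6, 8, 8, 8, 8, 8)
quality_of_sleep <- c(6, 6, 6, 4, 4, 4)

# Spearman's rank correlation; ties are handled with average ranks.
rho <- cor(stress_level, quality_of_sleep, method = "spearman")
print(rho)
```

On the full data, cor(data$Stress.Level, data$Quality.of.Sleep, method = "spearman") would give the corresponding value.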

Sleep Quality vs Sleep Duration

aggregate(Sleep.Duration ~ Quality.of.Sleep + Sleep.Disorder, data, mean)
##    Quality.of.Sleep Sleep.Disorder Sleep.Duration
## 1                 4       Insomnia       5.900000
## 2                 5       Insomnia       6.500000
## 3                 6       Insomnia       6.371875
## 4                 7       Insomnia       6.638235
## 5                 8       Insomnia       7.520000
## 6                 9       Insomnia       8.300000
## 7                 6           None       6.117500
## 8                 7           None       7.540000
## 9                 8           None       7.399010
## 10                9           None       8.365789
## 11                4    Sleep Apnea       5.850000
## 12                5    Sleep Apnea       6.500000
## 13                6    Sleep Apnea       6.118182
## 14                7    Sleep Apnea       7.500000
## 15                8    Sleep Apnea       7.366667
## 16                9    Sleep Apnea       8.096875

Inference: Higher sleep quality corresponds with longer sleep duration across all disorder categories.

Physical Activity vs Sleep Duration

ggplot(data, aes(x = Physical.Activity.Level, y = Sleep.Duration)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", color = "green") +
  theme_minimal(base_size = 16) +
  labs(title = "Physical Activity vs Sleep Duration")
## `geom_smooth()` using formula = 'y ~ x'

Inference: Higher physical activity levels are associated with longer sleep duration, suggesting that an active lifestyle contributes positively to sleep health.

Physical Activity vs Sleep Disorder

ggplot(data, aes(x = Sleep.Disorder, y = Physical.Activity.Level, fill = Sleep.Disorder)) + geom_violin()

Inference: Individuals with no sleep disorder show a wider and generally higher distribution of physical activity levels. People with insomnia are mostly concentrated around moderate activity levels (around 40–50), indicating comparatively lower activity. Those with sleep apnea show a bimodal spread: some have very low activity while others have high activity, suggesting inconsistency.

Age vs Sleep Disorder

ggplot(data, aes(x = Age, color = Sleep.Disorder)) + stat_ecdf()

Inference: Age distribution varies slightly across disorder types, but no extreme separation is observed.

Sleep Duration vs Sleep Disorder

ggplot(data, aes(x = Sleep.Disorder, y = Sleep.Duration, fill = Sleep.Disorder)) + geom_boxplot()

Inference: Individuals with disorders tend to have lower or more variable sleep duration.
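The phrase "lower or more variable" can be checked with a per-group mean and standard deviation. The sketch below is one way to tabulate them in base R. It uses the six rows printed by head(data), so the Insomnia group has a single observation and its standard deviation is NA.

```r
# Sleep.Duration and Sleep.Disorder from the six rows printed by head(data).
d <- data.frame(
  Sleep.Duration = c(6.1, 6.2, 6.2, 5.9, 5.9, 5.9),
  Sleep.Disorder = c("None", "None", "None", "Sleep Apnea", "Sleep Apnea", "Insomnia"),
  stringsAsFactors = FALSE
)

# Split durations by disorder, then compute mean, standard deviation, and group size.
by_group <- split(d$Sleep.Duration, d$Sleep.Disorder)
duration_summary <- t(sapply(by_group, function(v) {
  c(mean = mean(v), sd = sd(v), n = length(v))
}))

print(duration_summary)
```

Comparing the mean and sd columns across groups on the full data shows whether the difference is in the center of the distribution, its spread, or both.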

Stress Level vs Sleep Disorder

ggplot(data, aes(x = factor(Stress.Level), fill = Sleep.Disorder)) + geom_bar(position = "dodge")

Inference: Higher stress levels are strongly associated with increased sleep disorder cases.

BMI vs Sleep Disorder

ggplot(data, aes(x = BMI.Category, fill = Sleep.Disorder)) + geom_bar(position = "dodge")

Inference: Overweight individuals show significantly higher disorder prevalence.
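Raw bar heights depend on how many people fall into each BMI category, so prevalence is easier to compare as a within-category share. The sketch below computes row proportions from a two-way table. It is based on the six rows printed by head(data) and only shows the calculation.

```r
# BMI.Category and Sleep.Disorder from the six rows printed by head(data).
bmi_category   <- c("Overweight", "Normal", "Normal", "Obese", "Obese", "Obese")
sleep_disorder <- c("None", "None", "None", "Sleep Apnea", "Sleep Apnea", "Insomnia")

# Two-way contingency table of counts.
bmi_by_disorder <- table(BMI = bmi_category, Disorder = sleep_disorder)

# margin = 1 normalizes each row, giving the disorder mix within each BMI category.
row_share <- prop.table(bmi_by_disorder, margin = 1)

print(round(row_share, 2))
```

Each row sums to 1, so categories of very different sizes can be compared directly.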

Stress-Level Facet Analysis

ggplot(data, aes(x = Sleep.Disorder, fill = Sleep.Disorder)) + geom_bar() + facet_wrap(~Stress.Level)

Inference: As stress level increases, the proportion of sleep disorders also increases significantly.

Correlation Heatmap

library(corrplot)

numeric_data <- data[sapply(data, is.numeric)]
corrplot(cor(numeric_data), method = "color")

Inference: Stress and sleep quality show strong relationships with sleep patterns.
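The numeric columns selected above include Person.ID, which is a row identifier and not a measurement. The sketch below shows one possible refinement: drop the identifier before computing correlations, then list each variable's correlation with Quality.of.Sleep. This is a suggested variation and not the computation behind the heatmap. It uses the six rows printed by head(data) for illustration only.

```r
# Numeric columns from the six rows printed by head(data).
sample_rows <- data.frame(
  Person.ID               = 1:6,
  Age                     = c(27, 28, 28, 28, 28, 28),
  Sleep.Duration          = c(6.1, 6.2, 6.2, 5.9, 5.9, 5.9),
  Quality.of.Sleep        = c(6, 6, 6, 4, 4, 4),
  Physical.Activity.Level = c(42, 60, 60, 30, 30, 30),
  Stress.Level            = c(6, 8, 8, 8, 8, 8),
  Heart.Rate              = c(77, 75, 75, 85, 85, 85),
  Daily.Steps             = c(4200, 10000, 10000, 3000, 3000, 3000)
)

# Keep numeric columns but exclude the identifier, which carries no meaning.
is_measure <- sapply(sample_rows, is.numeric) & names(sample_rows) != "Person.ID"
cor_matrix <- cor(sample_rows[is_measure])

# Correlations of every variable with sleep quality, sorted from lowest to highest.
quality_cor <- sort(cor_matrix[, "Quality.of.Sleep"])
print(round(quality_cor, 2))
```

Passing cor_matrix to corrplot() would draw the same kind of heatmap without the identifier column.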

Detailed Analysis

This section provides an in-depth interpretation of the patterns observed during the exploratory data analysis.

1. Sleep Duration Patterns

The majority of individuals sleep between 6–8 hours, which is considered a healthy range. However, a small portion of individuals sleep less than 6 hours, which may indicate poor sleep habits and potential health risks. Similarly, very high sleep durations may also reflect underlying health issues.

2. Impact of Stress on Sleep

A strong negative relationship is observed between stress level and sleep quality. Individuals with higher stress levels consistently report lower sleep quality scores. This suggests that stress is one of the most critical factors affecting sleep health. Managing stress could significantly improve sleep outcomes.

3. Role of Physical Activity

Physical activity shows a positive relationship with sleep duration. Individuals who engage in higher levels of daily activity tend to sleep longer and more consistently. This indicates that an active lifestyle contributes to better sleep patterns.

4. Influence of BMI on Sleep Disorders

BMI category plays an important role in sleep disorders. Overweight and obese individuals show a higher occurrence of sleep apnea and insomnia. This may be due to physiological factors such as breathing difficulties and metabolic imbalance.

5. Gender-based Analysis

Gender-based comparisons reveal that males have a slightly higher tendency toward insomnia, whereas females show a higher occurrence of sleep apnea. However, the differences are not extreme, suggesting that gender alone is not a dominant factor.

6. Occupational Impact

Occupation has a noticeable effect on sleep health. Jobs that involve high stress, irregular schedules, or long working hours (such as sales roles and healthcare professions) show higher levels of sleep disorders. This highlights the impact of work-life balance on sleep quality.

7. Stress Category Analysis

When stress levels are grouped into categories (low, medium, high), it becomes evident that individuals in the high-stress category have a significantly higher proportion of sleep disorders. This reinforces the earlier observation that stress is a key predictor of sleep issues.
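The grouping into low, medium, and high can be sketched with cut(). The cut points below (3–4 low, 5–6 medium, 7–8 high) are an assumption chosen to cover the observed range of 3 to 8; the report does not state the thresholds it used. The data are the six rows printed by head(data), so the Low category is empty here.

```r
# Stress.Level and Sleep.Disorder from the six rows printed by head(data).
stress_level   <- c(6, 8, 8, 8, 8, 8)
sleep_disorder <- c("None", "None", "None", "Sleep Apnea", "Sleep Apnea", "Insomnia")

# Assumed thresholds: (-Inf, 4] = Low, (4, 6] = Medium, (6, Inf) = High.
stress_category <- cut(
  stress_level,
  breaks = c(-Inf, 4, 6, Inf),
  labels = c("Low", "Medium", "High")
)

# Share of individuals with any disorder (insomnia or sleep apnea) per category.
has_disorder  <- sleep_disorder != "None"
disorder_rate <- tapply(has_disorder, stress_category, mean)

print(table(stress_category))
print(disorder_rate)
```

Categories with no observations come back as NA from tapply(), which keeps empty groups visible in the result.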

8. Correlation Insights

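The correlation analysis shows that:
- Stress level is negatively correlated with sleep quality
- Physical activity is positively correlated with sleep duration
- Heart rate and BMI show moderate relationships with sleep disorders

Note that BMI category and sleep disorder are categorical variables and therefore do not appear in the numeric correlation matrix.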

These relationships help identify the most influential variables affecting sleep health.

Overall Interpretation

The detailed analysis indicates that sleep health is a multidimensional issue influenced by lifestyle, physiological, and occupational factors. Among all variables, stress emerges as the most dominant factor negatively affecting sleep, while physical activity acts as a protective factor. BMI and occupation further contribute to variations in sleep disorder prevalence.

Conclusion

The analysis successfully identifies key lifestyle factors affecting sleep health. The results highlight the importance of maintaining balanced habits to improve sleep quality.