This report presents a detailed analysis of sleep health and lifestyle factors using data exploration techniques.
Sleep quality is an essential component of a healthy lifestyle. However, modern lifestyles have led to an increase in sleep-related issues such as insomnia and sleep apnea.
The purpose of this project is to analyze the relationship between lifestyle factors (such as stress, physical activity, and BMI) and sleep health using data analysis techniques.
The dataset consists of the following features:
- Gender, Age, Occupation
- Sleep Duration, Quality of Sleep
- Physical Activity Level, Stress Level
- BMI Category, Blood Pressure
- Heart Rate, Daily Steps
- Sleep Disorder (Target variable)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.5.3
## corrplot 0.95 loaded
knitr::opts_chunk$set(fig.width = 10, fig.height = 6)
data <- read.csv("Sleep_health_and_lifestyle_dataset.csv")
head(data)
## Person.ID Gender Age Occupation Sleep.Duration Quality.of.Sleep
## 1 1 Male 27 Software Engineer 6.1 6
## 2 2 Male 28 Doctor 6.2 6
## 3 3 Male 28 Doctor 6.2 6
## 4 4 Male 28 Sales Representative 5.9 4
## 5 5 Male 28 Sales Representative 5.9 4
## 6 6 Male 28 Software Engineer 5.9 4
## Physical.Activity.Level Stress.Level BMI.Category Blood.Pressure Heart.Rate
## 1 42 6 Overweight 126/83 77
## 2 60 8 Normal 125/80 75
## 3 60 8 Normal 125/80 75
## 4 30 8 Obese 140/90 85
## 5 30 8 Obese 140/90 85
## 6 30 8 Obese 140/90 85
## Daily.Steps Sleep.Disorder
## 1 4200 None
## 2 10000 None
## 3 10000 None
## 4 3000 Sleep Apnea
## 5 3000 Sleep Apnea
## 6 3000 Insomnia
cat("The dimension of data is:", dim(data))
## The dimension of data is: 374 13
str(data)
## 'data.frame': 374 obs. of 13 variables:
## $ Person.ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Gender : chr "Male" "Male" "Male" "Male" ...
## $ Age : int 27 28 28 28 28 28 29 29 29 29 ...
## $ Occupation : chr "Software Engineer" "Doctor" "Doctor" "Sales Representative" ...
## $ Sleep.Duration : num 6.1 6.2 6.2 5.9 5.9 5.9 6.3 7.8 7.8 7.8 ...
## $ Quality.of.Sleep : int 6 6 6 4 4 4 6 7 7 7 ...
## $ Physical.Activity.Level: int 42 60 60 30 30 30 40 75 75 75 ...
## $ Stress.Level : int 6 8 8 8 8 8 7 6 6 6 ...
## $ BMI.Category : chr "Overweight" "Normal" "Normal" "Obese" ...
## $ Blood.Pressure : chr "126/83" "125/80" "125/80" "140/90" ...
## $ Heart.Rate : int 77 75 75 85 85 85 82 70 70 70 ...
## $ Daily.Steps : int 4200 10000 10000 3000 3000 3000 3500 8000 8000 8000 ...
## $ Sleep.Disorder : chr "None" "None" "None" "Sleep Apnea" ...
summary(data)
## Person.ID Gender Age Occupation
## Min. : 1.00 Length:374 Min. :27.00 Length:374
## 1st Qu.: 94.25 Class :character 1st Qu.:35.25 Class :character
## Median :187.50 Mode :character Median :43.00 Mode :character
## Mean :187.50 Mean :42.18
## 3rd Qu.:280.75 3rd Qu.:50.00
## Max. :374.00 Max. :59.00
## Sleep.Duration Quality.of.Sleep Physical.Activity.Level Stress.Level
## Min. :5.800 Min. :4.000 Min. :30.00 Min. :3.000
## 1st Qu.:6.400 1st Qu.:6.000 1st Qu.:45.00 1st Qu.:4.000
## Median :7.200 Median :7.000 Median :60.00 Median :5.000
## Mean :7.132 Mean :7.313 Mean :59.17 Mean :5.385
## 3rd Qu.:7.800 3rd Qu.:8.000 3rd Qu.:75.00 3rd Qu.:7.000
## Max. :8.500 Max. :9.000 Max. :90.00 Max. :8.000
## BMI.Category Blood.Pressure Heart.Rate Daily.Steps
## Length:374 Length:374 Min. :65.00 Min. : 3000
## Class :character Class :character 1st Qu.:68.00 1st Qu.: 5600
## Mode :character Mode :character Median :70.00 Median : 7000
## Mean :70.17 Mean : 6817
## 3rd Qu.:72.00 3rd Qu.: 8000
## Max. :86.00 Max. :10000
## Sleep.Disorder
## Length:374
## Class :character
## Mode :character
##
##
##
Observations:
- Dataset contains 374 rows and 13 columns
- No missing values present
- Mix of numerical and categorical variables
colnames(data) <- make.names(colnames(data))
colSums(is.na(data))
## Person.ID Gender Age
## 0 0 0
## Occupation Sleep.Duration Quality.of.Sleep
## 0 0 0
## Physical.Activity.Level Stress.Level BMI.Category
## 0 0 0
## Blood.Pressure Heart.Rate Daily.Steps
## 0 0 0
## Sleep.Disorder
## 0
sum(duplicated(data))
## [1] 0
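A whole-row duplicate check cannot fire when one column holds a unique identifier, since no two rows then match in full. The following is a minimal sketch of repeating the check with the identifier excluded. The data frame `toy` and its values are hypothetical and only mirror the report's column names; it is not the real dataset.

```r
# Hypothetical toy data: rows 1 and 2 differ only in Person.ID
toy <- data.frame(
  Person.ID      = 1:3,
  Gender         = c("Male", "Male", "Female"),
  Age            = c(28L, 28L, 35L),
  Sleep.Duration = c(6.2, 6.2, 7.1)
)

# Whole-row check: the unique ID hides the repeated measurements
n_dup_with_id <- sum(duplicated(toy))

# Same check with the identifier column dropped
n_dup_without_id <- sum(duplicated(toy[, setdiff(names(toy), "Person.ID")]))

cat("Duplicates with ID:", n_dup_with_id, "\n")
cat("Duplicates without ID:", n_dup_without_id, "\n")
```

Whether such repeats are true duplicates or different people with identical measurements is a judgment call that the data alone may not settle.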
data <- data %>%
separate(Blood.Pressure, into = c("Systolic", "Diastolic"), sep = "/", convert = TRUE)
str(data)
## 'data.frame': 374 obs. of 14 variables:
## $ Person.ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Gender : chr "Male" "Male" "Male" "Male" ...
## $ Age : int 27 28 28 28 28 28 29 29 29 29 ...
## $ Occupation : chr "Software Engineer" "Doctor" "Doctor" "Sales Representative" ...
## $ Sleep.Duration : num 6.1 6.2 6.2 5.9 5.9 5.9 6.3 7.8 7.8 7.8 ...
## $ Quality.of.Sleep : int 6 6 6 4 4 4 6 7 7 7 ...
## $ Physical.Activity.Level: int 42 60 60 30 30 30 40 75 75 75 ...
## $ Stress.Level : int 6 8 8 8 8 8 7 6 6 6 ...
## $ BMI.Category : chr "Overweight" "Normal" "Normal" "Obese" ...
## $ Systolic : int 126 125 125 140 140 140 140 120 120 120 ...
## $ Diastolic : int 83 80 80 90 90 90 90 80 80 80 ...
## $ Heart.Rate : int 77 75 75 85 85 85 82 70 70 70 ...
## $ Daily.Steps : int 4200 10000 10000 3000 3000 3000 3500 8000 8000 8000 ...
## $ Sleep.Disorder : chr "None" "None" "None" "Sleep Apnea" ...
Observations:
- The dataset was checked for missing values, and none were found, indicating a complete dataset.
- Duplicate records were examined, and no fully identical rows were detected. Because Person.ID is unique for every row, this check cannot flag records that repeat all other fields; rows 2 and 3 in the head() output above, for example, match on every column except Person.ID.
- Column names were standardized using make.names() to ensure consistency and avoid errors during analysis.
- The Blood Pressure column, originally in string format (e.g., “120/80”), was split into two separate numerical variables, Systolic and Diastolic, enabling more detailed and meaningful analysis.
- Data types were verified and converted where necessary to ensure proper statistical analysis.
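After a text column is split into numeric parts, a few quick checks can confirm the conversion behaved as intended. The sketch below applies the same `separate()` call to a small hypothetical data frame `bp` (the three readings are illustrative) and then asserts that both parts are integers, that nothing became NA, and that systolic exceeds diastolic in every row.

```r
library(tidyr)

# Hypothetical readings in the same "systolic/diastolic" text format
bp <- data.frame(Blood.Pressure = c("126/83", "125/80", "140/90"))

bp_split <- separate(bp, Blood.Pressure,
                     into = c("Systolic", "Diastolic"),
                     sep = "/", convert = TRUE)

# Checks: both parts parsed as integers, nothing became NA,
# and systolic exceeds diastolic in every row
stopifnot(
  is.integer(bp_split$Systolic),
  is.integer(bp_split$Diastolic),
  !any(is.na(bp_split)),
  all(bp_split$Systolic > bp_split$Diastolic)
)

print(bp_split)
```

If any reading were malformed (for example, missing the slash), `separate()` would produce NA values with a warning, and the `stopifnot()` call would stop the script.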
Exploratory data analysis (EDA) is performed to understand the dataset, identify patterns and relationships, and detect any anomalies.
colnames(data)
## [1] "Person.ID" "Gender"
## [3] "Age" "Occupation"
## [5] "Sleep.Duration" "Quality.of.Sleep"
## [7] "Physical.Activity.Level" "Stress.Level"
## [9] "BMI.Category" "Systolic"
## [11] "Diastolic" "Heart.Rate"
## [13] "Daily.Steps" "Sleep.Disorder"
cat("Number of unique values in each column:\n")
## Number of unique values in each column:
sapply(data, function(x) length(unique(x)))
## Person.ID Gender Age
## 374 2 31
## Occupation Sleep.Duration Quality.of.Sleep
## 11 27 6
## Physical.Activity.Level Stress.Level BMI.Category
## 16 6 4
## Systolic Diastolic Heart.Rate
## 18 17 19
## Daily.Steps Sleep.Disorder
## 20 3
Inference: The dataset contains a mix of categorical and numerical variables, with varying levels of diversity in each feature.
ggplot(data, aes(x = Sleep.Disorder, fill = Sleep.Disorder)) + geom_bar()
Inference: The majority of individuals fall under “None”, indicating no disorder, while insomnia and sleep apnea form smaller but important groups.
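The bar chart shows the class balance visually; the same information can be tabulated as counts and percentages. This is a minimal sketch on a hypothetical vector `disorder` standing in for `data$Sleep.Disorder`, so the numbers are illustrative and not results from the dataset.

```r
# Hypothetical labels standing in for data$Sleep.Disorder
disorder <- c("None", "None", "Insomnia", "Sleep Apnea", "None", "Insomnia")

# Absolute counts per category
counts <- table(disorder)

# Percentage share of each category, rounded to one decimal
shares <- round(100 * prop.table(counts), 1)

print(counts)
print(shares)
```

On the real data, replacing `disorder` with `data$Sleep.Disorder` should give the exact size of each group behind the bars.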
ggplot(data, aes(x = Gender, fill = Sleep.Disorder)) + geom_bar(position = "dodge")
Inference: Males show higher insomnia cases, whereas females show higher sleep apnea cases, indicating possible biological or lifestyle influences.
ggplot(data, aes(x = Occupation, fill = Sleep.Disorder)) +
geom_bar(position = "dodge") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Inference: Occupations like sales and nursing show higher disorder rates, suggesting job-related stress and irregular schedules impact sleep.
ggplot(data, aes(x = Stress.Level, y = Quality.of.Sleep)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", color = "blue") +
theme_minimal(base_size = 16) +
labs(title = "Stress vs Sleep Quality")
## `geom_smooth()` using formula = 'y ~ x'
Inference: There is a clear negative relationship: higher stress levels are associated with poorer sleep quality.
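The fitted line gives a visual impression of the trend; a correlation coefficient puts a number on it. Because both variables are integer ratings with many ties, a rank-based (Spearman) coefficient may be a reasonable companion to the default Pearson one. The vectors below are hypothetical ratings chosen for illustration, not values from the dataset.

```r
# Hypothetical integer ratings (not the real data)
stress  <- c(3, 4, 5, 6, 7, 8, 8, 6)
quality <- c(9, 8, 8, 7, 6, 4, 5, 6)

# Pearson measures linear association between the numeric values;
# Spearman uses ranks only, which suits ordinal ratings with ties
r_pearson  <- cor(stress, quality, method = "pearson")
r_spearman <- cor(stress, quality, method = "spearman")

cat("Pearson:", round(r_pearson, 2), " Spearman:", round(r_spearman, 2), "\n")
```

On the real data, the equivalent call would be `cor(data$Stress.Level, data$Quality.of.Sleep, method = "spearman")`.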
aggregate(Sleep.Duration ~ Quality.of.Sleep + Sleep.Disorder, data, mean)
## Quality.of.Sleep Sleep.Disorder Sleep.Duration
## 1 4 Insomnia 5.900000
## 2 5 Insomnia 6.500000
## 3 6 Insomnia 6.371875
## 4 7 Insomnia 6.638235
## 5 8 Insomnia 7.520000
## 6 9 Insomnia 8.300000
## 7 6 None 6.117500
## 8 7 None 7.540000
## 9 8 None 7.399010
## 10 9 None 8.365789
## 11 4 Sleep Apnea 5.850000
## 12 5 Sleep Apnea 6.500000
## 13 6 Sleep Apnea 6.118182
## 14 7 Sleep Apnea 7.500000
## 15 8 Sleep Apnea 7.366667
## 16 9 Sleep Apnea 8.096875
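Inference: Higher sleep quality generally corresponds to longer sleep duration in every disorder category, although the trend is not strictly monotonic: within Insomnia, for example, mean duration is 6.50 hours at quality 5 but 6.37 hours at quality 6, and within None it is 7.54 hours at quality 7 but 7.40 hours at quality 8.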
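Group means are easier to interpret when the number of rows behind each mean is visible, since a mean based on one or two people can easily break a trend. The sketch below reproduces the grouping with dplyr and adds a count column. The data frame `toy` is hypothetical and only borrows the report's column names.

```r
library(dplyr)

# Hypothetical rows with the same column names as the report's data
toy <- data.frame(
  Quality.of.Sleep = c(6, 6, 6, 7, 7, 8),
  Sleep.Disorder   = c("None", "None", "Insomnia", "None", "Insomnia", "None"),
  Sleep.Duration   = c(6.0, 6.2, 6.4, 7.5, 6.6, 7.4)
)

cell_summary <- toy %>%
  group_by(Sleep.Disorder, Quality.of.Sleep) %>%
  summarise(
    n             = n(),                  # rows behind each mean
    mean_duration = mean(Sleep.Duration), # average sleep duration in the cell
    .groups       = "drop"
  )

print(cell_summary)
```

Cells with a very small `n` are the ones most likely to deviate from the overall pattern.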
ggplot(data, aes(x = Physical.Activity.Level, y = Sleep.Duration)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", color = "green") +
theme_minimal(base_size = 16) +
labs(title = "Physical Activity vs Sleep Duration")
## `geom_smooth()` using formula = 'y ~ x'
Inference: Higher physical activity levels are associated with longer sleep duration, suggesting that an active lifestyle may contribute positively to sleep health.
ggplot(data, aes(x = Sleep.Disorder, y = Physical.Activity.Level, fill = Sleep.Disorder)) + geom_violin()
Inference: Individuals with no sleep disorder show a wider and generally higher distribution of physical activity levels. People with insomnia are mostly concentrated around moderate activity levels (around 40–50), indicating comparatively lower activity. Those with sleep apnea show a bimodal spread: some have very low activity while others have high activity, suggesting inconsistency.
ggplot(data, aes(x = Age, color = Sleep.Disorder)) + stat_ecdf()
Inference: Age distribution varies slightly across disorder types, but no extreme separation is observed.
ggplot(data, aes(x = Sleep.Disorder, y = Sleep.Duration, fill = Sleep.Disorder)) + geom_boxplot()
Inference: Individuals with disorders tend to have lower or more variable sleep duration.
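The "lower or more variable" reading of the boxplot can be backed by the numbers the boxes are built from: the median for level and the interquartile range (IQR) for spread. A minimal sketch on a hypothetical data frame `toy` (the values are illustrative, not taken from the dataset):

```r
# Hypothetical sleep durations for two groups
toy <- data.frame(
  Sleep.Disorder = c("None", "None", "None", "Insomnia", "Insomnia", "Insomnia"),
  Sleep.Duration = c(7.2, 7.8, 8.0, 5.9, 6.4, 6.5)
)

# Median per group: the centre line of each box
group_median <- tapply(toy$Sleep.Duration, toy$Sleep.Disorder, median)

# IQR per group: the height of each box
group_iqr <- tapply(toy$Sleep.Duration, toy$Sleep.Disorder, IQR)

print(group_median)
print(group_iqr)
```

Applied to `data$Sleep.Duration` and `data$Sleep.Disorder`, the same two calls would give the exact medians and spreads shown in the plot.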
ggplot(data, aes(x = factor(Stress.Level), fill = Sleep.Disorder)) + geom_bar(position = "dodge")
Inference: Higher stress levels are strongly associated with increased sleep disorder cases.
ggplot(data, aes(x = BMI.Category, fill = Sleep.Disorder)) + geom_bar(position = "dodge")
Inference: Overweight individuals show a noticeably higher prevalence of sleep disorders than individuals in the other BMI categories.
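A dodged bar chart shows raw counts, which depend on how many people are in each BMI category. To compare prevalence, the counts can be converted to within-category shares. The sketch below does this with `prop.table()` on a hypothetical data frame `toy`; the rows are illustrative only.

```r
# Hypothetical rows using the report's column names
toy <- data.frame(
  BMI.Category   = c("Normal", "Normal", "Normal", "Normal", "Overweight", "Overweight"),
  Sleep.Disorder = c("None", "None", "None", "Insomnia", "Insomnia", "Sleep Apnea")
)

# Cross-tabulate: rows are BMI categories, columns are disorder types
counts <- table(toy$BMI.Category, toy$Sleep.Disorder)

# margin = 1 divides each row by its own total, so every row sums to 1
row_share <- prop.table(counts, margin = 1)

print(round(row_share, 2))
```

In ggplot2, `geom_bar(position = "fill")` draws the same within-group shares as stacked bars.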
ggplot(data, aes(x = Sleep.Disorder, fill = Sleep.Disorder)) + geom_bar() + facet_wrap(~Stress.Level)
Inference: As stress level increases, the share of individuals with a sleep disorder also increases noticeably.
## Correlation Heatmap
library(corrplot)
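# Keep numeric columns only; drop Person.ID, which is a row identifier, not a measurement
numeric_data <- data[sapply(data, is.numeric)]
numeric_data$Person.ID <- NULL
corrplot(cor(numeric_data), method = "color")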
Inference: Stress level is negatively correlated with sleep quality, and both are among the variables most strongly related to the other sleep measures.
This section provides an in-depth interpretation of the patterns observed during the exploratory data analysis.
Sleep duration ranges from 5.8 to 8.5 hours, and the middle half of individuals sleep between 6.4 and 7.8 hours, so most fall within the 6–8 hour range that is considered healthy. A small portion of individuals sleep less than 6 hours, which may indicate poor sleep habits and potential health risks. No very long sleep durations appear in this dataset, although in general these may also reflect underlying health issues.
A strong negative relationship is observed between stress level and sleep quality. Individuals with higher stress levels consistently report lower sleep quality scores. This suggests that stress is one of the factors most closely linked to sleep health, and that managing stress may improve sleep outcomes.
Physical activity shows a positive relationship with sleep duration. Individuals who engage in higher levels of daily activity tend to sleep longer and more consistently. This indicates that an active lifestyle contributes to better sleep patterns.
BMI category plays an important role in sleep disorders. Overweight and obese individuals show a higher occurrence of sleep apnea and insomnia. This may be due to physiological factors such as breathing difficulties and metabolic imbalance.
Gender-based comparisons reveal that males have a slightly higher tendency toward insomnia, whereas females show a higher occurrence of sleep apnea. However, the differences are not extreme, suggesting that gender alone is not a dominant factor.
Occupation has a noticeable effect on sleep health. Jobs that involve high stress, irregular schedules, or long working hours (such as sales roles and healthcare professions) show higher levels of sleep disorders. This highlights the impact of work-life balance on sleep quality.
When stress levels are grouped into categories (low, medium, high), individuals in the high-stress category show a noticeably higher proportion of sleep disorders. The per-level charts (stress levels 3 to 8 in this dataset) show the same pattern, which reinforces the earlier observation that stress is closely associated with sleep issues.
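Grouping a numeric stress score into low, medium, and high bands can be done with `cut()`. The sketch below uses assumed cut points (3–4 low, 5–6 medium, 7–8 high), since the report does not state which boundaries were used, and hypothetical vectors `stress` and `disorder` in place of the real columns.

```r
# Hypothetical stress scores and disorder labels (not the real data)
stress   <- c(3, 4, 5, 6, 7, 8, 8, 7)
disorder <- c("None", "None", "None", "Insomnia",
              "Insomnia", "Sleep Apnea", "Insomnia", "None")

# Assumed cut points: (2,4] = Low, (4,6] = Medium, (6,8] = High
stress_group <- cut(stress,
                    breaks = c(2, 4, 6, 8),
                    labels = c("Low", "Medium", "High"))

# TRUE when the person has any disorder; the mean of a logical is a proportion
has_disorder   <- disorder != "None"
disorder_share <- tapply(has_disorder, stress_group, mean)

print(round(disorder_share, 2))
```

Different cut points would change the band sizes and possibly the shares, so the boundaries are worth stating alongside any result.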
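The correlation analysis, which covers only the numeric variables, shows that:
- Stress level is negatively correlated with sleep quality
- Physical activity is positively correlated with sleep duration
- Relationships involving BMI category or sleep disorder are not measured by this matrix, because both are character columns and are excluded from it; the moderate association of heart rate and BMI with sleep disorders therefore rests on other evidence, such as the grouped bar charts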
These relationships help identify the most influential variables affecting sleep health.
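A heatmap shows the overall pattern, but ranking the variable pairs by correlation strength makes the most influential relationships explicit. The following is a minimal sketch on a hypothetical data frame `toy`; the name `pair_table` and all values are illustrative assumptions, not output from the dataset.

```r
# Hypothetical numeric columns using the report's column names
toy <- data.frame(
  Stress.Level     = c(3, 4, 5, 6, 7, 8),
  Quality.of.Sleep = c(9, 8, 8, 7, 6, 4),
  Heart.Rate       = c(66, 68, 70, 70, 75, 82)
)

cm <- cor(toy)

# Keep each pair once by taking the upper triangle only
idx <- which(upper.tri(cm), arr.ind = TRUE)

pair_table <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)

# Sort by absolute strength, strongest first
pair_table <- pair_table[order(-abs(pair_table$r)), ]

print(pair_table, row.names = FALSE)
```

On the real data, `cm` would be the matrix passed to `corrplot()`, so the ranking and the heatmap would describe the same numbers.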
The detailed analysis indicates that sleep health is a multidimensional issue influenced by lifestyle, physiological, and occupational factors. Among all variables, stress shows the clearest negative association with sleep, while physical activity appears to act as a protective factor. BMI and occupation further contribute to variations in sleep disorder prevalence.
The analysis successfully identifies key lifestyle factors affecting sleep health. The results highlight the importance of maintaining balanced habits to improve sleep quality.