Data Source: Kaggle - Sleep Health and Lifestyle Dataset by Laksika Tharmalingam
Sleep is a process that allows your body to rest, repair and restore itself. Depending on a person’s constitution, we need on average six to eight hours of sleep every day. Sleeping noticeably more or less than that usually comes with negative effects on the body. In this section, we will learn more about the effect of sleep on blood pressure by analyzing the data below.
The Sleep Health and Lifestyle Dataset contains information about the sleep quality of several hundred individuals, along with their blood pressure and other health and lifestyle measures. Below are the variables used in the dataset:
The part below describes the thought process used to arrive at the conclusion. Note that this dataset was not collected through a formal procedure, may not reflect real-life scenarios, and can only be used for educational purposes (training, in this case).
First, we load the necessary package for the analysis: tidyverse.
Next, we load the dataset and perform an initial check on the data.
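The loading code itself is not shown in this write-up. The following is a minimal sketch of what it might look like, not the original code. The dotted column names in the summary output (for example `Person.ID`) suggest that base `read.csv()` was used, and the file name in the comment is an assumption. To keep the sketch runnable anywhere, it writes two made-up rows to a temporary file and reads them back.

```r
# library(tidyverse)  # needed for the rest of the analysis, not for this sketch

# Assumed file name; in practice this would point at the downloaded Kaggle CSV:
# data <- read.csv("Sleep_health_and_lifestyle_dataset.csv")

# Stand-in file with two made-up rows and a subset of the column headers
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Person ID,Gender,Age,Sleep Duration,Blood Pressure",
  "1,Male,27,6.1,126/83",
  "2,Female,35,7.2,120/80"
), csv_path)

# read.csv() replaces spaces in headers with dots (check.names = TRUE by default)
demo <- read.csv(csv_path)
names(demo)

# Same initial checks as in the analysis: missing values and dimensions
sum(is.na(demo))
dim(demo)
```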
sum(is.na(data)) #---Make sure that there is no observation with empty data
## [1] 0
dim(data) #---Data have a total of 374 rows and 13 variables
## [1] 374 13
summary(data)
## Person.ID Gender Age Occupation
## Min. : 1.00 Length:374 Min. :27.00 Length:374
## 1st Qu.: 94.25 Class :character 1st Qu.:35.25 Class :character
## Median :187.50 Mode :character Median :43.00 Mode :character
## Mean :187.50 Mean :42.18
## 3rd Qu.:280.75 3rd Qu.:50.00
## Max. :374.00 Max. :59.00
## Sleep.Duration Quality.of.Sleep Physical.Activity.Level Stress.Level
## Min. :5.800 Min. :4.000 Min. :30.00 Min. :3.000
## 1st Qu.:6.400 1st Qu.:6.000 1st Qu.:45.00 1st Qu.:4.000
## Median :7.200 Median :7.000 Median :60.00 Median :5.000
## Mean :7.132 Mean :7.313 Mean :59.17 Mean :5.385
## 3rd Qu.:7.800 3rd Qu.:8.000 3rd Qu.:75.00 3rd Qu.:7.000
## Max. :8.500 Max. :9.000 Max. :90.00 Max. :8.000
## BMI.Category Blood.Pressure Heart.Rate Daily.Steps
## Length:374 Length:374 Min. :65.00 Min. : 3000
## Class :character Class :character 1st Qu.:68.00 1st Qu.: 5600
## Mode :character Mode :character Median :70.00 Median : 7000
## Mean :70.17 Mean : 6817
## 3rd Qu.:72.00 3rd Qu.: 8000
## Max. :86.00 Max. :10000
## Sleep.Disorder
## Length:374
## Class :character
## Mode :character
##
##
##
To make the data easier to work with, we will now rename the variables.
names(data) <- c("id", "gender", "age", "occupation", "s_duration", "s_quality",
"physical_activity", "stress_level", "bmi", "blood_pres",
"heart_rate", "daily_steps", "s_disorder")
names(data)
## [1] "id" "gender" "age"
## [4] "occupation" "s_duration" "s_quality"
## [7] "physical_activity" "stress_level" "bmi"
## [10] "blood_pres" "heart_rate" "daily_steps"
## [13] "s_disorder"
In the dataset, there are variables that might be highly correlated with each other, for example, the quality of sleep and the duration of sleep. We shall now examine the correlations among the numeric variables.
cor(data[c(5,6,7,8,11,12)])
## s_duration s_quality physical_activity stress_level
## s_duration 1.00000000 0.88321300 0.21236031 -0.81102303
## s_quality 0.88321300 1.00000000 0.19289645 -0.89875203
## physical_activity 0.21236031 0.19289645 1.00000000 -0.03413446
## stress_level -0.81102303 -0.89875203 -0.03413446 1.00000000
## heart_rate -0.51645489 -0.65986473 0.13697098 0.67002646
## daily_steps -0.03953254 0.01679141 0.77272305 0.18682895
## heart_rate daily_steps
## s_duration -0.51645489 -0.03953254
## s_quality -0.65986473 0.01679141
## physical_activity 0.13697098 0.77272305
## stress_level 0.67002646 0.18682895
## heart_rate 1.00000000 -0.03030858
## daily_steps -0.03030858 1.00000000
As expected, s_duration and s_quality are highly correlated with each other (about 0.88), and so are physical_activity and daily_steps (about 0.77). Note that stress_level is also highly, albeit negatively, correlated with the quality of sleep (about -0.90). We will now begin to explore the dataset while using one variable of each pair.
We will begin by checking whether sleep duration has anything to do with each person’s gender and occupation.
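Reading highly correlated pairs off a printed matrix by eye gets harder as the number of variables grows. The sketch below shows one way to list them programmatically. It rebuilds part of the matrix above by hand (four variables, values rounded to three decimals) so that it runs on its own. The cutoff of 0.75 is an arbitrary assumption, not a value used in the analysis.

```r
vars <- c("s_duration", "s_quality", "physical_activity", "stress_level")

# Correlations copied from the output above, rounded to three decimals
cm <- matrix(c(
   1.000,  0.883,  0.212, -0.811,
   0.883,  1.000,  0.193, -0.899,
   0.212,  0.193,  1.000, -0.034,
  -0.811, -0.899, -0.034,  1.000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

threshold <- 0.75  # assumed cutoff for "highly correlated"

# Look only at the upper triangle so each pair is reported once
idx <- which(abs(cm) > threshold & upper.tri(cm), arr.ind = TRUE)

high <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
high
```

With the full dataset loaded, `cm` could instead be the result of the `cor()` call, in which case the physical_activity and daily_steps pair would also appear.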
data %>% group_by(gender) %>% summarise(avg = mean(s_duration))
## # A tibble: 2 × 2
## gender avg
## <chr> <dbl>
## 1 Female 7.23
## 2 Male 7.04
data %>% group_by(occupation) %>% summarise(avg = mean(s_duration)) %>% arrange(avg)
## # A tibble: 11 × 2
## occupation avg
## <chr> <dbl>
## 1 Sales Representative 5.9
## 2 Scientist 6
## 3 Salesperson 6.40
## 4 Teacher 6.69
## 5 Software Engineer 6.75
## 6 Manager 6.9
## 7 Doctor 6.97
## 8 Nurse 7.06
## 9 Accountant 7.11
## 10 Lawyer 7.41
## 11 Engineer 7.99
It appears that there is almost no difference in the average duration of sleep between males and females in this case (7.04 versus 7.23 hours). Across occupations, however, the averages range from 5.9 hours (Sales Representative) to 7.99 hours (Engineer), so occupation might make a considerable difference in sleep duration. This variation might be used in later analysis.
Next, we will analyze the dataset to check whether sleep is associated with blood pressure. Following the exploratory data analysis, we will use s_duration, physical_activity and stress_level, together with gender and occupation, in this process. Note that we might come back to s_quality if needed.
To work with the blood pressure data, we first need to modify it. Based on the values of the blood_pres variable, we will create systolic and diastolic variables. Systolic blood pressure is the force of the blood flow when blood is pumped out of the heart, while diastolic blood pressure is measured when the heart is filling with blood. Both of these values are used to diagnose whether someone has high blood pressure.
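Group averages can be misleading when some groups contain only a handful of people, so it may help to report the group size next to each mean. The sketch below does this in base R on a small made-up data frame (`toy` is hypothetical and not part of the dataset). With the real data, the same idea could be expressed by adding `n = n()` inside `summarise()`.

```r
# Made-up rows for illustration only
toy <- data.frame(
  occupation = c("Nurse", "Nurse", "Nurse", "Engineer", "Engineer", "Scientist"),
  s_duration = c(6.1, 6.5, 8.0, 7.9, 8.1, 6.0)
)

# Mean sleep duration and number of observations per occupation
avg <- tapply(toy$s_duration, toy$occupation, mean)
n   <- tapply(toy$s_duration, toy$occupation, length)

by_occ <- data.frame(
  occupation = names(avg),
  avg        = as.numeric(avg),
  n          = as.integer(n)
)

# Sort from shortest to longest average sleep, as in the table above
by_occ <- by_occ[order(by_occ$avg), ]
by_occ
```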
#---Split blood_pres into systolic and diastolic variables
data <- data %>% separate(blood_pres, into = c("systolic", "diastolic"),
sep = "/", remove = FALSE)
#---Change them into numeric values for further usage
data$systolic <- as.numeric(data$systolic)
data$diastolic <- as.numeric(data$diastolic)
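A quick check after the conversion can confirm that every reading was parsed, since `as.numeric()` silently turns anything it cannot parse into `NA`. The sketch below shows the same split in base R on three made-up readings in the dataset's "systolic/diastolic" format. It is an illustration under those assumptions, not a replacement for the `separate()` call above.

```r
# Made-up readings in the same "systolic/diastolic" text format
bp <- c("126/83", "140/90", "120/80")

# Split each string at the slash and stack the pieces into a two-column matrix
parts <- do.call(rbind, strsplit(bp, "/", fixed = TRUE))

systolic  <- as.numeric(parts[, 1])
diastolic <- as.numeric(parts[, 2])

# Fail early if any value could not be converted to a number
stopifnot(!anyNA(systolic), !anyNA(diastolic))

data.frame(bp, systolic, diastolic)
```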
We will now categorize the observations based on their blood pressure values. The categories are Normal, Elevated, Hypertension Stage 1, Hypertension Stage 2 and Crisis.
data$blood_pres_c <- rep("Other", nrow(data))
data$blood_pres_c <- with(data, ifelse(systolic > 180 | diastolic > 120, "Crisis",
ifelse(systolic >= 140 | diastolic >= 90, "Stage 2",
ifelse(systolic >= 130 | diastolic >= 80, "Stage 1",
ifelse(systolic >= 120 & diastolic < 80, "Elevated",
"Normal")))))
unique(data$blood_pres_c)
## [1] "Stage 1" "Stage 2" "Normal" "Elevated"
In our data, no observation falls into the “Crisis” category, which is consistent with the range of values in both the systolic and diastolic variables (the highest value of systolic is 142, while the highest value of diastolic is 95).
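The nested `ifelse()` calls above can be hard to read. One possible alternative, sketched below, wraps the same thresholds in a small helper that assigns categories from least to most severe, so that later rules override earlier ones. The function name `classify_bp` is hypothetical. Returning a factor with explicit levels is a design choice here: it keeps the categories in order of severity on a plot axis rather than in alphabetical order.

```r
# Hypothetical helper using the same thresholds as the ifelse() chain above
classify_bp <- function(systolic, diastolic) {
  out <- rep("Normal", length(systolic))
  # Rules run from least to most severe; each later rule overrides earlier ones
  out[systolic >= 120 & diastolic < 80]  <- "Elevated"
  out[systolic >= 130 | diastolic >= 80] <- "Stage 1"
  out[systolic >= 140 | diastolic >= 90] <- "Stage 2"
  out[systolic > 180  | diastolic > 120] <- "Crisis"
  factor(out, levels = c("Normal", "Elevated", "Stage 1", "Stage 2", "Crisis"))
}

# Made-up readings, one per category, plus a case where only diastolic is high
sys <- c(115, 125, 135, 142, 185, 118)
dia <- c( 75,  78,  85,  95, 100,  82)
result <- classify_bp(sys, dia)
data.frame(sys, dia, result)
```

With the real data this could be applied as `classify_bp(data$systolic, data$diastolic)`.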
data %>% ggplot(aes(x = blood_pres_c, fill = blood_pres_c)) + geom_bar()
Note that the number of individuals in each category is not equal. More than 250 people fall into the stage 1 category, while barely anyone belongs to the elevated group. We should keep this in mind, as any analysis of the elevated group might not be accurate. We will now see whether there is a relationship between sleep duration and blood pressure. The results are visualized below:
data %>% ggplot(aes(x = blood_pres_c, y = s_duration, fill = blood_pres_c)) +
geom_bar(stat = "summary", fun = mean)
data %>% ggplot(aes(x = blood_pres_c, y = s_duration, fill = blood_pres_c)) +
geom_boxplot()
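The two plots summarize the same data in different ways: the bar chart shows the mean of each group, while the centre line of a box plot is the median. Both can also be computed directly, which makes it easier to compare them. The sketch below uses a small made-up data frame (`toy` is hypothetical), so the numbers it produces say nothing about the real dataset.

```r
# Made-up rows for illustration only
toy <- data.frame(
  blood_pres_c = c("Normal", "Normal", "Normal", "Stage 2", "Stage 2", "Stage 2"),
  s_duration   = c(7.2, 7.4, 7.8, 6.1, 6.6, 8.1)
)

# Mean (bar height) and median (box plot centre line) per category
avg <- tapply(toy$s_duration, toy$blood_pres_c, mean)
med <- tapply(toy$s_duration, toy$blood_pres_c, median)

data.frame(
  category = names(avg),
  mean     = round(as.numeric(avg), 2),
  median   = as.numeric(med)
)
```

A single long night in a group pulls its mean up while leaving the median almost unchanged, which is one reason the two plots can tell different stories.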
As we can see, at first it seems that there is little correlation between blood pressure and the duration of sleep (on average, the amount of sleep in each category is around 7 hours). However, the box plot shows that the median is much lower for individuals with stage 2 hypertension (around 6.6 hours of sleep), about 0.7 hours less than in the other categories. The box plot also suggests that other factors might affect blood pressure, as a significant number of people in this group still get enough sleep every night. Now we want to check the quality of sleep instead of the duration.
data %>%
ggplot(aes(x = blood_pres_c, y = s_quality, fill = blood_pres_c)) +
geom_bar(stat = "summary", fun = mean)
data %>%
ggplot(aes(x = blood_pres_c, y = s_quality, fill = blood_pres_c)) +
geom_boxplot()
It appears that the quality of sleep is a more informative measure, as the stage 1 and stage 2 categories both have lower average and lower median values. Next, we want to see what effect stress has on blood pressure. Stress is a known risk factor for higher blood pressure.
data %>%
ggplot(aes(x = blood_pres_c, y = stress_level, fill = blood_pres_c)) +
geom_bar(stat = "summary", fun = mean)
data %>%
ggplot(aes(x = blood_pres_c, y = stress_level, fill = blood_pres_c)) +
geom_boxplot()
As we can see from the plots above, individuals with lower blood pressure tend to have lower levels of stress. However, the wide spread for stage 2 hypertension in the box plot indicates the presence of other factors. Lastly, we want to see if physical activity has any effect on blood pressure.
data %>%
ggplot(aes(x = blood_pres_c, y = physical_activity, fill = blood_pres_c)) +
geom_bar(stat = "summary", fun = mean)
data %>%
ggplot(aes(x = blood_pres_c, y = physical_activity, fill = blood_pres_c)) +
geom_boxplot()
Unexpectedly, people in the stage 2 category exercise the most out of the four groups. This result might suggest that we need a moderate amount of physical activity every day to stay healthy.
Now that we have seen the relationship between blood pressure and some key factors, we will look further into gender, occupation and age. First is gender.
data %>%
ggplot(aes(x = gender, fill = gender)) +
geom_bar() + facet_grid(.~blood_pres_c) +
theme(axis.text.x = element_blank(), axis.title.x = element_blank())
We can see that most of the male individuals in the dataset have stage 1 hypertension. Females, on the other hand, are more evenly distributed across the categories. However, a significant number of them belong to the stage 2 hypertension group, far more than among males. Next is occupation.
data %>% ggplot(aes(x = occupation, fill = occupation)) +
geom_bar() + facet_grid(.~blood_pres_c) +
theme(axis.text.x = element_blank(), axis.title.x = element_blank())
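Raw counts per facet depend on how many people each occupation has in the dataset, so a large occupation can dominate a category simply because of its size. A complementary view is the share of each occupation that falls into each blood pressure category. The sketch below computes row-wise proportions in base R on made-up rows (`toy` is hypothetical and not taken from the dataset).

```r
# Made-up rows for illustration only
toy <- data.frame(
  occupation   = c("Nurse", "Nurse", "Nurse", "Nurse", "Accountant", "Accountant"),
  blood_pres_c = c("Stage 2", "Stage 2", "Stage 2", "Normal", "Normal", "Normal")
)

# Cross-tabulate occupation against blood pressure category
tab <- table(toy$occupation, toy$blood_pres_c)

# margin = 1 divides each row by its total, so every occupation sums to 1
share <- prop.table(tab, margin = 1)
round(share, 2)
```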
The bar chart above shows interesting results. Accountants seem to have the easiest time out of all occupations, as most of them have normal blood pressure. A significant number of doctors, engineers, lawyers and salespeople belong to the stage 1 group. Surprisingly, most of those in the stage 2 category are nurses and teachers. The result might indicate that workload and responsibility also affect the blood pressure of each individual. Last is the age variable. Here, we will split the ages into bins that each span 5 years.
#---fill is a fixed colour, so it is set outside aes()
data %>% ggplot(aes(x = age)) +
geom_histogram(binwidth = 5, fill = "red") + facet_grid(.~blood_pres_c)
The result is as expected: the group with normal blood pressure tends to be younger than the group with stage 2 hypertension. Unexpectedly, however, the age distribution within the stage 1 category seems to lean slightly toward the younger side of the chart.
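The 5-year binning used by the histogram can also be made explicit with `cut()`, which is handy when the age groups need to be tabulated or reused elsewhere. The sketch below is a minimal illustration on made-up ages within the dataset's range of 27 to 59. The bin edges (25, 30, ..., 60) are an assumption and need not match the edges that ggplot2 chooses for the histogram.

```r
# Made-up ages within the range observed in the dataset (27 to 59)
age <- c(27, 31, 35, 44, 50, 59)

# Left-closed 5-year bins: [25,30), [30,35), ..., [55,60)
age_bin <- cut(age, breaks = seq(25, 60, by = 5), right = FALSE)

# Number of people per age bin
table(age_bin)
```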