Introduction

Data Source: Kaggle - Sleep Health and Lifestyle Dataset by Laksika Tharmalingam

Sleep is a process that allows your body to rest, repair and restore itself. Depending on a person’s constitution, we need on average six to eight hours of sleep every day. Sleeping more or less than that usually comes with negative effects on the body. In this section, we will learn more about the effect of sleep on blood pressure by analyzing the data below.

The Sleep Health and Lifestyle Dataset contains information about the quality of sleep of hundreds of individuals, as well as their blood pressure and other information. Below are the variables in the dataset:

  1. Person ID: An identifier for each individual.
  2. Gender: The gender of the person (Male/Female).
  3. Age: The age of the person in years.
  4. Occupation: The occupation or profession of the person.
  5. Sleep Duration (hours): The number of hours the person sleeps per day.
  6. Quality of Sleep (scale: 1-10): A subjective rating of the quality of sleep, ranging from 1 to 10.
  7. Physical Activity Level (minutes/day): The number of minutes the person engages in physical activity daily.
  8. Stress Level (scale: 1-10): A subjective rating of the stress level experienced by the person, ranging from 1 to 10.
  9. BMI Category: The BMI category of the person (e.g., Underweight, Normal, Overweight).
  10. Blood Pressure (systolic/diastolic): The blood pressure measurement of the person, indicated as systolic pressure over diastolic pressure.
  11. Heart Rate (bpm): The resting heart rate of the person in beats per minute.
  12. Daily Steps: The number of steps the person takes per day.
  13. Sleep Disorder: The presence or absence of a sleep disorder in the person (None, Insomnia, Sleep Apnea).

The part below describes the thought process used to arrive at the conclusion. Note that this dataset was not collected through a formal procedure, may not reflect real-life scenarios, and should only be used for educational purposes (training, in this case).

Analysis

First, we load the necessary package for the analysis: tidyverse.

Next, we load the dataset and perform an initial check on the data.
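
The loading step itself is not shown in this report. A minimal sketch of what it might look like, assuming the Kaggle CSV was saved locally under the file name used below and read with base `read.csv()` (which would explain dotted column names such as `Sleep.Duration` in the summary). The file name and the helper `load_sleep_data` are assumptions, not taken from the original analysis:

```r
library(tidyverse)  # provides %>%, dplyr, tidyr and ggplot2 used later on

# Hypothetical helper: read the CSV and fail early if it is empty.
load_sleep_data <- function(path) {
  df <- read.csv(path)  # spaces in headers become dots, e.g. "Sleep.Duration"
  stopifnot(nrow(df) > 0)
  df
}

# Assumed file name; adjust to wherever the Kaggle file was saved.
csv_path <- "Sleep_health_and_lifestyle_dataset.csv"
if (file.exists(csv_path)) {
  data <- load_sleep_data(csv_path)
}
```

With the default settings of `read.csv()`, the text "None" in the Sleep Disorder column is kept as an ordinary string rather than being treated as missing.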

sum(is.na(data)) #---Make sure that there is no observation with empty data
## [1] 0
dim(data)        #---Data have a total of 374 rows and 13 variables
## [1] 374  13
summary(data)
##    Person.ID         Gender               Age         Occupation       
##  Min.   :  1.00   Length:374         Min.   :27.00   Length:374        
##  1st Qu.: 94.25   Class :character   1st Qu.:35.25   Class :character  
##  Median :187.50   Mode  :character   Median :43.00   Mode  :character  
##  Mean   :187.50                      Mean   :42.18                     
##  3rd Qu.:280.75                      3rd Qu.:50.00                     
##  Max.   :374.00                      Max.   :59.00                     
##  Sleep.Duration  Quality.of.Sleep Physical.Activity.Level  Stress.Level  
##  Min.   :5.800   Min.   :4.000    Min.   :30.00           Min.   :3.000  
##  1st Qu.:6.400   1st Qu.:6.000    1st Qu.:45.00           1st Qu.:4.000  
##  Median :7.200   Median :7.000    Median :60.00           Median :5.000  
##  Mean   :7.132   Mean   :7.313    Mean   :59.17           Mean   :5.385  
##  3rd Qu.:7.800   3rd Qu.:8.000    3rd Qu.:75.00           3rd Qu.:7.000  
##  Max.   :8.500   Max.   :9.000    Max.   :90.00           Max.   :8.000  
##  BMI.Category       Blood.Pressure       Heart.Rate     Daily.Steps   
##  Length:374         Length:374         Min.   :65.00   Min.   : 3000  
##  Class :character   Class :character   1st Qu.:68.00   1st Qu.: 5600  
##  Mode  :character   Mode  :character   Median :70.00   Median : 7000  
##                                        Mean   :70.17   Mean   : 6817  
##                                        3rd Qu.:72.00   3rd Qu.: 8000  
##                                        Max.   :86.00   Max.   :10000  
##  Sleep.Disorder    
##  Length:374        
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

To make the data easier to work with, we now rename the variables.

names(data) <- c("id", "gender", "age", "occupation", "s_duration", "s_quality",
                 "physical_activity", "stress_level", "bmi", "blood_pres",
                 "heart_rate", "daily_steps", "s_disorder")
names(data)
##  [1] "id"                "gender"            "age"              
##  [4] "occupation"        "s_duration"        "s_quality"        
##  [7] "physical_activity" "stress_level"      "bmi"              
## [10] "blood_pres"        "heart_rate"        "daily_steps"      
## [13] "s_disorder"

Some variables in the dataset might be highly correlated with each other, for example the quality of sleep and the duration of sleep. We now check the correlations among the numeric variables.

cor(data[c(5,6,7,8,11,12)])
##                    s_duration   s_quality physical_activity stress_level
## s_duration         1.00000000  0.88321300        0.21236031  -0.81102303
## s_quality          0.88321300  1.00000000        0.19289645  -0.89875203
## physical_activity  0.21236031  0.19289645        1.00000000  -0.03413446
## stress_level      -0.81102303 -0.89875203       -0.03413446   1.00000000
## heart_rate        -0.51645489 -0.65986473        0.13697098   0.67002646
## daily_steps       -0.03953254  0.01679141        0.77272305   0.18682895
##                    heart_rate daily_steps
## s_duration        -0.51645489 -0.03953254
## s_quality         -0.65986473  0.01679141
## physical_activity  0.13697098  0.77272305
## stress_level       0.67002646  0.18682895
## heart_rate         1.00000000 -0.03030858
## daily_steps       -0.03030858  1.00000000

As expected, s_duration and s_quality are highly correlated with each other (0.88), and so are physical_activity and daily_steps (0.77). Note that stress_level is also highly, albeit negatively, correlated with the quality of sleep (-0.90) and with the duration of sleep (-0.81). We will now explore the dataset using one variable from each pair.
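
Picking out the highly correlated pairs can also be done programmatically rather than by eye. A minimal sketch in base R, assuming a cutoff of 0.7 on the absolute correlation. The cutoff, the helper name `high_cor_pairs` and the toy data are illustrative assumptions:

```r
# Hypothetical helper: list variable pairs whose absolute correlation
# exceeds a threshold, reporting each pair once.
high_cor_pairs <- function(df, threshold = 0.7) {
  cm <- cor(df)
  # upper.tri() keeps one copy of each pair and drops the diagonal
  idx <- which(abs(cm) > threshold & upper.tri(cm), arr.ind = TRUE)
  data.frame(var1 = rownames(cm)[idx[, "row"]],
             var2 = colnames(cm)[idx[, "col"]],
             r = cm[idx],
             stringsAsFactors = FALSE)
}

# Toy data (values are made up): a and b move together, c does not.
toy <- data.frame(a = c(1, 2, 3, 4, 5),
                  b = c(2, 4, 6, 8, 11),
                  c = c(5, 1, 4, 2, 3))
pairs <- high_cor_pairs(toy)
pairs
```

Judging from the printed matrix, a 0.7 cutoff on the six numeric columns would flag s_duration/s_quality, s_duration/stress_level, s_quality/stress_level and physical_activity/daily_steps.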

We begin by checking whether sleep duration is related to a person’s gender or occupation.

data %>% group_by(gender) %>% summarise(avg = mean(s_duration))
## # A tibble: 2 × 2
##   gender   avg
##   <chr>  <dbl>
## 1 Female  7.23
## 2 Male    7.04
data %>% group_by(occupation) %>% summarise(avg = mean(s_duration)) %>% arrange(avg)
## # A tibble: 11 × 2
##    occupation             avg
##    <chr>                <dbl>
##  1 Sales Representative  5.9 
##  2 Scientist             6   
##  3 Salesperson           6.40
##  4 Teacher               6.69
##  5 Software Engineer     6.75
##  6 Manager               6.9 
##  7 Doctor                6.97
##  8 Nurse                 7.06
##  9 Accountant            7.11
## 10 Lawyer                7.41
## 11 Engineer              7.99

There appears to be almost no difference in average sleep duration between males and females in this case (7.04 versus 7.23 hours). Across occupations, however, the averages range from 5.9 to 7.99 hours, so sleep duration might differ substantially by occupation. This variation might be used in later analysis.
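
Whether a gap between two group means is larger than chance alone would produce can be checked with a two-sample test. A minimal sketch using base R's `t.test()` (Welch's test by default) on made-up toy data. This test is not part of the original analysis and is shown only as one possible way to back up the comparison:

```r
# Toy data standing in for the real columns (values are made up).
toy <- data.frame(
  gender = rep(c("Female", "Male"), each = 5),
  s_duration = c(7.1, 7.4, 6.9, 7.6, 7.2, 6.8, 7.3, 7.0, 6.9, 7.1)
)

# Welch two-sample t-test: does mean sleep duration differ by gender?
res <- t.test(s_duration ~ gender, data = toy)
res$estimate  # the two group means
res$p.value   # small values suggest the gap is unlikely to be chance alone
```

On the real data, the call would be `t.test(s_duration ~ gender, data = data)`.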

Next, we analyze the dataset to check whether sleep is related to blood pressure. Following the exploratory analysis above, we will use s_duration, physical_activity and stress_level, together with gender and occupation, in this process. Note that we might come back to s_quality if needed.

To work with the blood pressure data, we first need to modify it. Based on the values of the blood_pres variable, we will create systolic and diastolic variables. Systolic blood pressure is the force of the blood flow when blood is pumped out of the heart, while diastolic blood pressure is measured when the heart is filling with blood. Both values are used to diagnose whether someone has high blood pressure.

#---Split blood_pres into systolic and diastolic variables
data <- data %>% separate(blood_pres, into = c("systolic", "diastolic"),
                  sep = "/", remove = FALSE)
#---Change them into numeric values for further usage
data$systolic <- as.numeric(data$systolic)
data$diastolic <- as.numeric(data$diastolic)
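
The split and the numeric conversion can also be done in a single step. A minimal sketch on made-up toy rows, using the `convert = TRUE` argument of `tidyr::separate()`. The toy values and the name `toy_split` are illustrative assumptions:

```r
library(tidyr)

# Toy rows in the same "systolic/diastolic" text format (values are made up).
toy <- data.frame(blood_pres = c("126/83", "140/90", "118/76"))

# convert = TRUE asks separate() to turn the new columns into numbers,
# so the two as.numeric() calls are not needed.
toy_split <- separate(toy, blood_pres, into = c("systolic", "diastolic"),
                      sep = "/", remove = FALSE, convert = TRUE)
str(toy_split)
```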

We will now categorize the observations based on their blood pressure values. The categories are Normal, Elevated, Hypertension Stage 1, Hypertension Stage 2 and Crisis.

#---Assign each observation to a category; the first matching rule wins
data <- data %>% mutate(blood_pres_c = case_when(
  systolic > 180 | diastolic > 120 ~ "Crisis",
  systolic >= 140 | diastolic >= 90 ~ "Stage 2",
  systolic >= 130 | diastolic >= 80 ~ "Stage 1",
  systolic >= 120 & diastolic < 80 ~ "Elevated",
  TRUE ~ "Normal"))
unique(data$blood_pres_c)
## [1] "Stage 1"  "Stage 2"  "Normal"   "Elevated"

In our data, no observation falls into the “Crisis” category, which is consistent with the range of values of the systolic and diastolic variables (the highest systolic value is 142, while the highest diastolic value is 95).
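
Boundary cases are the easiest place for a rule set like this to go wrong, so it can help to wrap the thresholds in a function and try a few hand-picked readings. A minimal sketch in base R using the same thresholds as above. The function name `classify_bp` and the sample readings are illustrative assumptions:

```r
# Hypothetical helper: start from "Normal" and overwrite with more severe
# categories, so the most severe matching rule is the one that sticks.
classify_bp <- function(systolic, diastolic) {
  out <- rep("Normal", length(systolic))
  out[systolic >= 120 & diastolic < 80] <- "Elevated"
  out[systolic >= 130 | diastolic >= 80] <- "Stage 1"
  out[systolic >= 140 | diastolic >= 90] <- "Stage 2"
  out[systolic > 180 | diastolic > 120] <- "Crisis"
  # Levels are listed from least to most severe
  factor(out, levels = c("Normal", "Elevated", "Stage 1", "Stage 2", "Crisis"))
}

# Hand-picked readings, one per category (values are made up).
checks <- classify_bp(systolic  = c(119, 125, 125, 142, 185),
                      diastolic = c(79, 75, 82, 95, 100))
checks
```

Returning a factor with levels in order of severity also means that plots would show the categories in that order rather than alphabetically.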

data %>% ggplot(aes(x = blood_pres_c, fill = blood_pres_c)) + geom_bar()

Note that the categories in our dataset are far from equal in size. More than 250 people fall into the stage 1 category, while barely anyone belongs to the elevated group. We should keep this in mind, as any analysis of the elevated group might not be accurate. We will now see whether there is a relationship between sleep duration and blood pressure. The results are visualized below:

data %>% ggplot(aes(x = blood_pres_c, y = s_duration, fill = blood_pres_c)) + 
  geom_bar(stat = "summary", fun = mean)

data %>% ggplot(aes(x = blood_pres_c, y = s_duration, fill = blood_pres_c)) + 
  geom_boxplot()
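
The means and medians read off charts like these can also be computed directly, together with the size of each group. A minimal sketch with dplyr on made-up toy data. The helper name `summarise_by_group` and the toy values are illustrative assumptions:

```r
library(dplyr)

# Hypothetical helper: group size, mean and median of one column per group.
summarise_by_group <- function(df, group, value) {
  df %>%
    group_by({{ group }}) %>%
    summarise(n = n(),
              mean = mean({{ value }}),
              median = median({{ value }}),
              .groups = "drop")
}

# Toy data (values are made up); on the real data the call would be
# summarise_by_group(data, blood_pres_c, s_duration).
toy <- data.frame(
  blood_pres_c = c("Normal", "Normal", "Normal", "Stage 2", "Stage 2", "Stage 2"),
  s_duration = c(7.0, 7.5, 8.0, 6.0, 6.2, 8.2)
)
toy_summary <- summarise_by_group(toy, blood_pres_c, s_duration)
toy_summary
```

The same helper could be reused for s_quality, stress_level and physical_activity by changing the last argument.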

At first, there seems to be little relationship between blood pressure and sleep duration: on average, each category sleeps around 7 hours. However, the box plot shows that the median is much lower for individuals with stage 2 hypertension (around 6.6 hours of sleep), 0.7 hours less than in the other categories. The box plot also suggests that other factors might affect blood pressure, as a substantial number of people in that group still get enough sleep every night. Now we check the quality of sleep instead of the duration.

data %>% 
  ggplot(aes(x = blood_pres_c, y = s_quality, fill = blood_pres_c)) + 
  geom_bar(stat = "summary", fun = mean)

data %>% 
  ggplot(aes(x = blood_pres_c, y = s_quality, fill = blood_pres_c)) + 
  geom_boxplot()

It appears that the quality of sleep is a more informative measure, as the stage 1 and stage 2 categories both have lower mean and median values. Next, we want to see what effect stress has on blood pressure. Stress is a known risk factor for high blood pressure.

data %>% 
  ggplot(aes(x = blood_pres_c, y = stress_level, fill = blood_pres_c)) + 
  geom_bar(stat = "summary", fun = mean)

data %>% 
  ggplot(aes(x = blood_pres_c, y = stress_level, fill = blood_pres_c)) + 
  geom_boxplot()

As the plots above show, individuals with lower blood pressure tend to have lower stress levels. However, the wide spread for stage 2 hypertension in the box plot indicates the presence of other factors. Lastly, we want to see whether physical activity has any effect on blood pressure.

data %>% 
  ggplot(aes(x = blood_pres_c, y = physical_activity, fill = blood_pres_c)) + 
  geom_bar(stat = "summary", fun = mean)

data %>% 
  ggplot(aes(x = blood_pres_c, y = physical_activity, fill = blood_pres_c)) + 
  geom_boxplot()

Unexpectedly, people in the stage 2 category exercise the most out of the four groups. This result might suggest that a moderate amount of physical activity every day is what we need to stay healthy.

Now that we have seen the relationship between blood pressure and some key factors, we will look further into gender, occupation and age. First is gender.

data %>% 
  ggplot(aes(x = gender, fill = gender)) +
  geom_bar() + facet_grid(.~blood_pres_c) + 
  theme(axis.text.x = element_blank(), axis.title.x = element_blank())
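
Bar heights show raw counts, which can be hard to compare when the groups differ in size. A minimal sketch in base R that turns a two-way table into within-gender proportions, using made-up toy data. The toy values and the names `counts` and `shares` are illustrative assumptions:

```r
# Toy data (values are made up).
toy <- data.frame(
  gender = c("Female", "Female", "Female", "Female", "Male", "Male"),
  blood_pres_c = c("Normal", "Stage 2", "Stage 2", "Stage 1", "Stage 1", "Stage 1")
)

# Rows are genders, columns are blood pressure categories.
counts <- table(toy$gender, toy$blood_pres_c)

# margin = 1 divides each row by its total, so every row sums to 1.
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```

On the real data, the call would be `prop.table(table(data$gender, data$blood_pres_c), margin = 1)`.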

We can see that most of the male individuals in the dataset have stage 1 hypertension. Females, on the other hand, are more evenly distributed across the categories. However, a substantial number of them belong to the stage 2 category, far more than among males. Next is occupation.

data %>% ggplot(aes(x = occupation, fill = occupation)) +
  geom_bar() + facet_grid(.~blood_pres_c) + 
  theme(axis.text.x = element_blank(), axis.title.x = element_blank())

The bar chart above shows interesting results. Accountants seem to have the easiest time of all occupations, as most of them have normal blood pressure. A large number of doctors, engineers, lawyers and salespeople belong to the stage 1 group. Surprisingly, most of those in the stage 2 category are nurses and teachers. This might indicate that workload and responsibility also affect an individual’s blood pressure. Last is the age variable. Here, we split age into bins of 5 years each.

#---fill is a fixed colour, so it is set outside aes()
data %>% ggplot(aes(x = age)) + 
  geom_histogram(binwidth = 5, fill = "red") + facet_grid(.~blood_pres_c)
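
The same 5-year binning can be reproduced as a table. A minimal sketch in base R using `cut()` on made-up toy data. The bin edges (25, 30, ..., 60) are an assumption and may not match the edges ggplot picks by default for the histogram:

```r
# Toy data (values are made up).
toy <- data.frame(
  age = c(27, 30, 34, 45, 52, 59),
  blood_pres_c = c("Normal", "Normal", "Stage 1", "Stage 1", "Stage 2", "Stage 2")
)

# Left-closed 5-year bins: [25,30), [30,35), ..., [55,60)
toy$age_bin <- cut(toy$age, breaks = seq(25, 60, by = 5), right = FALSE)

# Rows are age bins, columns are blood pressure categories.
table(toy$age_bin, toy$blood_pres_c)
```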

The result is as expected: the group with normal blood pressure tends to be younger than the group with stage 2 hypertension. Unexpectedly, however, the age distribution within the stage 1 category seems to lean slightly toward the younger side of the chart.

Conclusion

  1. Sleep seems to have an effect on blood pressure. However, the evidence is not clear enough, and more observations are required to draw a firmer conclusion. Note that the quality of sleep appears to be a better indicator than the duration.
  2. Stress level and physical activity both appear to have an effect on blood pressure.
  3. The evidence suggests that, while these factors affect blood pressure, there might be other reasons that place individuals in the stage 2 category.
  4. Most males belong to the stage 1 category. Females are more evenly distributed. However, a substantial number of people in the stage 2 category are female.
  5. Younger people tend to have lower blood pressure. In this dataset, people aged 30 or above have a high risk of being diagnosed with high blood pressure.
  6. People with certain occupations might have a higher chance of belonging to one of the four categories above.

Future Questions

  1. What factors might put people at higher risk of stage 2 hypertension?
  2. Why do certain occupations have higher risk of high blood pressure? Is it because of the workload or the responsibility that they have?
  3. Why is there a difference in the distribution of the gender variable? Is it because of culture, or other social factors?