Introduction

In today’s data-driven world, wearable health devices (such as the Apple Watch) are helping people better understand their lifestyles. They continuously record metrics like steps, calorie consumption, and sleep duration, providing valuable data for reflecting on our health.

I chose “Activity and Sleep Relationship” as my topic because it’s relevant to everyone’s life and reflects the balance between exercise and rest. Studying the link between activity levels and sleep not only reveals underlying patterns in healthy habits but also helps us understand how behavioral data can improve quality of life. Furthermore, this topic aligns perfectly with the concept of data storytelling. I hope that by integrating data from different sources, exploring relationships between variables, and presenting the results in an interactive visualization, viewers can intuitively understand the story behind the data. This project demonstrates how raw, open data can be transformed into a meaningful visual narrative to discover subtle yet valuable patterns in human behavior.

Ultimately, this research aims to explore whether more active days (higher step count, higher energy expenditure) are associated with longer or higher-quality sleep.

Data and Methods

The dataset used in this project is the FitBit Fitness Tracker Data, sourced from Kaggle’s open data platform. The data was originally collected through Fitabase from the devices of approximately 30 Fitbit users and covers approximately one month of continuous recording in 2016 (the files used here are labeled 12 March to 11 April 2016). All data has been anonymized, and participants voluntarily shared their activity and sleep information for teaching and research purposes. The dataset provides a detailed record of people’s daily exercise and sleep patterns, including metrics such as step count, distance walked, sedentary time, energy expenditure, heart rate, and minute-by-minute sleep records. Because the data has both a per-user and a time-series dimension, it is well suited to studying the balance between activity and rest.

For this project, I primarily used two core files:

dailyActivity_merged.csv: summarizes each user’s overall daily activity level, including total steps (TotalSteps), total distance walked (TotalDistance), calories burned (Calories), and activity time at different intensities. This file reflects the user’s overall daily activity status.

minuteSleep_merged.csv: records sleep logs at one-minute resolution. Each record covers one minute of a logged sleep period, with a sleep-state code in the value column (in the Fitabase data dictionary, 1 = asleep, 2 = restless, 3 = awake). By aggregating these records, we can calculate each user’s total sleep time (in hours) per day and analyze their nightly sleep patterns.

Combining these two files allows us to examine the relationship between daily activity and sleep duration for each user on each day. By merging on user ID and date, we can explore whether users tend to sleep longer on days with more steps and higher energy expenditure.
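The aggregation and merge described above can be sketched on a toy example before touching the real files. This is a minimal sketch rather than the code used in the analysis: the data frames `sleep_toy` and `daily_toy`, the column name `SleepHours`, and most of the values are invented for illustration, and the sleep-state coding (1 = asleep) is an assumption taken from the Fitabase data dictionary.

```r
# Toy stand-ins for the two files; names and values are illustrative only.
sleep_toy <- data.frame(
  Id    = c(1, 1, 1, 1, 2, 2),
  Date  = as.Date(c("2016-03-25", "2016-03-25", "2016-03-25",
                    "2016-03-26", "2016-03-25", "2016-03-25")),
  value = c(1, 1, 3, 1, 1, 2)   # assumed coding: 1 asleep, 2 restless, 3 awake
)
daily_toy <- data.frame(
  Id           = c(1, 1, 2),
  ActivityDate = as.Date(c("2016-03-25", "2016-03-26", "2016-03-25")),
  TotalSteps   = c(11004, 17609, 8000)
)

# Keep only the minutes coded as asleep, count them per user and date,
# then convert minutes to hours.
asleep <- sleep_toy[sleep_toy$value == 1, ]
sleep_hours <- aggregate(value ~ Id + Date, data = asleep, FUN = length)
names(sleep_hours)[names(sleep_hours) == "value"] <- "SleepHours"
sleep_hours$SleepHours <- sleep_hours$SleepHours / 60

# Inner join on user and calendar date.
merged_toy <- merge(daily_toy, sleep_hours,
                    by.x = c("Id", "ActivityDate"), by.y = c("Id", "Date"))
print(merged_toy)
```

Counting rows with `value == 1` and summing `value` give the same result only when every row is coded 1, so it is worth checking `table(sleep$value)` on the real data before choosing between them.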

Analysis & Visualisation

# Import data
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)

# Reading Data
daily <- read.csv("/Users/mac/Desktop/dailyActivity_merged.csv")
sleep <- read.csv("/Users/mac/Desktop/minuteSleep_merged.csv")
head(daily)
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366    3/25/2016      11004          7.11            7.11
## 2 1503960366    3/26/2016      17609         11.55           11.55
## 3 1503960366    3/27/2016      12736          8.53            8.53
## 4 1503960366    3/28/2016      13231          8.93            8.93
## 5 1503960366    3/29/2016      12041          7.85            7.85
## 6 1503960366    3/30/2016      10970          7.16            7.16
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               2.57                     0.46
## 2                        0               6.92                     0.73
## 3                        0               4.66                     0.16
## 4                        0               3.19                     0.79
## 5                        0               2.16                     1.09
## 6                        0               2.36                     0.51
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                4.07                       0                33
## 2                3.91                       0                89
## 3                3.71                       0                56
## 4                4.95                       0                39
## 5                4.61                       0                28
## 6                4.29                       0                30
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  12                  205              804     1819
## 2                  17                  274              588     2154
## 3                   5                  268              605     1944
## 4                  20                  224             1080     1932
## 5                  28                  243              763     1886
## 6                  13                  223             1174     1820
head(sleep)
##           Id                 date value       logId
## 1 1503960366 3/13/2016 2:39:30 AM     1 11114919637
## 2 1503960366 3/13/2016 2:40:30 AM     1 11114919637
## 3 1503960366 3/13/2016 2:41:30 AM     1 11114919637
## 4 1503960366 3/13/2016 2:42:30 AM     1 11114919637
## 5 1503960366 3/13/2016 2:43:30 AM     1 11114919637
## 6 1503960366 3/13/2016 2:44:30 AM     1 11114919637
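# Date conversion
daily$ActivityDate <- mdy(daily$ActivityDate)
# Fix the time zone at parse time so the extracted calendar date
# does not depend on the machine's local settings
sleep$date <- as.POSIXct(sleep$date, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Extract the date part
sleep$Date <- as.Date(sleep$date)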

As the output shows, the daily dataset has 15 columns (e.g., total steps, total distance, calories burned), while the sleep dataset has four columns (Id, date, value, logId). Both datasets share the Id field. After conversion, the activity date and the date extracted from each sleep timestamp are both stored as Date values (YYYY-MM-DD), so the two datasets can be merged accurately by Id and date.
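To make the date handling concrete, the sketch below parses one string in each of the two formats shown in the output above, using base R only. It assumes the timestamps should be read as written, with no time-zone shift; the variable names are hypothetical.

```r
# One string in each of the two formats from the head() output.
activity_chr <- "3/25/2016"
sleep_chr    <- "3/13/2016 2:39:30 AM"

# Date-only string: parse straight to a Date.
activity_date <- as.Date(activity_chr, format = "%m/%d/%Y")

# Timestamp: set the time zone when parsing and again when dropping the
# time of day, so the calendar date matches the one written in the file.
sleep_time <- as.POSIXct(sleep_chr, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")
sleep_date <- as.Date(sleep_time, tz = "UTC")

# Both results have class "Date", so they can serve as join keys.
print(class(activity_date))
print(class(sleep_date))
print(format(sleep_date))
```

If the two keys had different classes (for example, a character column on one side and a Date on the other), the join would either fail or silently match nothing, which is why the conversion comes before the merge.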

After importing and cleaning the datasets, the minute-level sleep table can be aggregated to one row per user per day and merged with the daily activity table. This enables us to analyze the relationship between daily physical activity (steps and calories) and total sleep duration for the same person on the same day.

# Aggregate minute-level sleep records to one row per user per day
library(dplyr)

# Note: despite its name, SleepMinutes holds hours (minutes / 60).
# sum(value) equals the number of logged minutes only if value is 1 on every row.
sleep_daily <- sleep %>%
  group_by(Id, Date) %>%
  summarise(SleepMinutes = sum(value) / 60, .groups = "drop")
# Combine daily activity and sleep, keeping only days that have a sleep record
merged <- daily %>%
  inner_join(sleep_daily, by = c("Id" = "Id", "ActivityDate" = "Date"))

head(merged)
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366   2016-03-25      11004          7.11            7.11
## 2 1503960366   2016-03-26      17609         11.55           11.55
## 3 1503960366   2016-03-27      12736          8.53            8.53
## 4 1503960366   2016-03-28      13231          8.93            8.93
## 5 1503960366   2016-03-30      10970          7.16            7.16
## 6 1503960366   2016-03-31      12256          7.86            7.86
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               2.57                     0.46
## 2                        0               6.92                     0.73
## 3                        0               4.66                     0.16
## 4                        0               3.19                     0.79
## 5                        0               2.36                     0.51
## 6                        0               2.29                     0.49
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                4.07                       0                33
## 2                3.91                       0                89
## 3                3.71                       0                56
## 4                4.95                       0                39
## 5                4.29                       0                30
## 6                5.04                       0                33
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  12                  205              804     1819
## 2                  17                  274              588     2154
## 3                   5                  268              605     1944
## 4                  20                  224             1080     1932
## 5                  13                  223             1174     1820
## 6                  12                  239              820     1889
##   SleepMinutes
## 1     8.183333
## 2     9.216667
## 3     1.350000
## 4     6.850000
## 5     6.083333
## 6     5.550000

After aggregation and merging, we obtain a tidy daily dataset. Each row shows one user’s activity level and sleep duration (in hours, despite the column name SleepMinutes) on the same day, laying the foundation for the subsequent analysis of whether activity levels are related to sleep.
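Because days without a sleep record are dropped, it helps to know how much of the activity data survives the merge. The following is a minimal sketch on invented data, assuming dplyr is installed; `daily_toy`, `sleep_toy`, `SleepHours`, and the summary column names are hypothetical.

```r
library(dplyr)

# Invented example: three activity days, but sleep logged on only two of them.
daily_toy <- tibble(
  Id           = c(1, 1, 2),
  ActivityDate = as.Date(c("2016-03-25", "2016-03-26", "2016-03-25")),
  TotalSteps   = c(11004, 17609, 8000)
)
sleep_toy <- tibble(
  Id         = c(1, 2),
  Date       = as.Date(c("2016-03-25", "2016-03-25")),
  SleepHours = c(7.5, 6.0)
)

# A left join keeps every activity day; unmatched days get NA for sleep.
joined <- left_join(daily_toy, sleep_toy,
                    by = c("Id" = "Id", "ActivityDate" = "Date"))

# Count how many days and users actually have a sleep value.
coverage <- joined %>%
  summarise(
    days_total       = n(),
    days_with_sleep  = sum(!is.na(SleepHours)),
    users_with_sleep = n_distinct(Id[!is.na(SleepHours)])
  )
print(coverage)
```

On the real data, a large gap between the total and matched counts would mean the plots describe only the subset of users and days with sleep tracking enabled.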

Scatter plot

library(ggplot2)

ggplot(merged, aes(x = TotalSteps, y = SleepMinutes)) +
  geom_point(color = "#2C7BB6", alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE, color = "#D7191C") +
  labs(title = "Relationship Between Steps and Sleep Duration",
       x = "Total Steps per Day",
       y = "Sleep Duration (hours)")
## `geom_smooth()` using formula = 'y ~ x'

The scatter plot shows the relationship between daily step count and sleep duration. The regression line slopes slightly downward, but the overall correlation is weak, indicating that higher activity levels are not clearly associated with longer sleep. While some days with high step counts are accompanied by moderate sleep duration, the spread of points is large, suggesting that sleep duration is also influenced by other factors, such as stress, sleep habits, or work schedules.
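The visual impression of a weak correlation can be backed by a number. The sketch below computes Pearson and Spearman correlations with `cor.test()` from base R. It runs on simulated data, not on the Fitbit files, so its output says nothing about the real result; on the actual data the two vectors would be `merged$TotalSteps` and `merged$SleepMinutes`.

```r
set.seed(1)

# Simulated stand-in for the merged data (50 invented user-days).
steps       <- round(runif(50, min = 2000, max = 18000))
sleep_hours <- 7 + rnorm(50, sd = 1.2)

# Pearson measures linear association; Spearman uses ranks and is less
# sensitive to outliers such as very short sleep records.
pearson  <- cor.test(steps, sleep_hours, method = "pearson")
spearman <- cor.test(steps, sleep_hours, method = "spearman", exact = FALSE)

cat("Pearson r:   ", unname(pearson$estimate), "\n")
cat("95% CI:      ", pearson$conf.int[1], "to", pearson$conf.int[2], "\n")
cat("Spearman rho:", unname(spearman$estimate), "\n")
```

One caveat applies to the real data: each user contributes many days, so the observations are not independent and the p-values from `cor.test()` should be read as rough guides only.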

Line chart

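# merged holds several users per date, so average across users before drawing each line
ggplot(merged, aes(x = ActivityDate)) +
  stat_summary(aes(y = TotalSteps, color = "Steps"), fun = mean, geom = "line") +
  stat_summary(aes(y = SleepMinutes * 1000, color = "Sleep (scaled)"),
               fun = mean, geom = "line") +
  scale_y_continuous(sec.axis = sec_axis(~ . / 1000, name = "Sleep Duration (hours)")) +
  labs(title = "Daily Activity and Sleep Trends",
       y = "Steps", x = "Date") +
  theme_minimal()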

This line chart shows the daily trends in steps (blue line) and sleep duration (red line, multiplied by 1,000 so that it shares the step axis) over a month. Both series fluctuate considerably, but their peaks and troughs are not always synchronized. On some days higher step counts coincide with slightly longer sleep, but the pattern is not consistent.
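The factor of 1,000 in the chart is a fixed choice. As an alternative sketch, the factor can be derived from the data so that both series span a similar height. The numbers below are the first four rows of the merged output (sleep rounded to two decimals); `scale_factor` is a hypothetical name.

```r
# First four rows of head(merged), sleep hours rounded to two decimals.
steps       <- c(11004, 17609, 12736, 13231)
sleep_hours <- c(8.18, 9.22, 1.35, 6.85)

# Choose the factor so the largest sleep value lines up with the largest step count.
scale_factor <- max(steps) / max(sleep_hours)

# Values drawn against the primary (steps) axis.
sleep_scaled <- sleep_hours * scale_factor

# The secondary axis must apply the inverse transform, as in
# sec_axis(~ . / scale_factor, name = "Sleep Duration (hours)").
sleep_back <- sleep_scaled / scale_factor

print(round(scale_factor, 1))
print(round(sleep_scaled))
```

Whatever factor is used, the same value has to appear in both the `geom_line()` mapping and the `sec_axis()` formula; otherwise the right-hand axis labels no longer match the plotted line.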

Box plot

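# wday() with week_start = 1 numbers Monday as 1, so 6 and 7 are Saturday and Sunday;
# unlike weekdays(), the result does not depend on the system locale
merged$Weekday <- wday(merged$ActivityDate, week_start = 1)
merged$WeekType <- ifelse(merged$Weekday >= 6, "Weekend", "Weekday")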

ggplot(merged, aes(x = WeekType, y = SleepMinutes, fill = WeekType)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "Sleep Duration: Weekday vs Weekend",
       y = "Sleep Duration (hours)", x = "") +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal()

The box plot illustrates the difference in sleep duration distribution between weekdays and weekends. It shows that the median sleep duration on weekends is slightly higher than on weekdays, and the distribution is wider. This indicates that subjects typically sleep longer on weekends, with greater fluctuations in sleep duration. Outliers in the plot represent extreme sleep durations (too short or too long), which may be related to irregular sleep patterns, staying up late, or catching up on sleep.
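The comparison read off the box plot can also be expressed numerically. The sketch below computes the median and interquartile range per group and runs a Wilcoxon rank-sum test, all in base R. The data are simulated, so the printed values are not the study's results; on the real data the columns would be `merged$WeekType` and `merged$SleepMinutes`.

```r
set.seed(42)

# Simulated stand-in: 40 weekday and 16 weekend observations (invented).
toy <- data.frame(
  WeekType   = rep(c("Weekday", "Weekend"), times = c(40, 16)),
  SleepHours = c(rnorm(40, mean = 6.8, sd = 1.0),
                 rnorm(16, mean = 7.3, sd = 1.4))
)

# Centre and spread per group: the numbers behind each box.
medians <- tapply(toy$SleepHours, toy$WeekType, median)
iqrs    <- tapply(toy$SleepHours, toy$WeekType, IQR)
print(round(rbind(median = medians, IQR = iqrs), 2))

# Rank-based test of whether the two distributions differ in location;
# it does not assume normally distributed sleep durations.
test <- wilcox.test(SleepHours ~ WeekType, data = toy)
print(test$p.value)
```

As with the correlation, repeated days from the same user are not independent, so a test like this is a descriptive aid rather than a formal confirmation.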

Conclusion

This study used open Fitbit health tracking data to analyze the relationship between daily activity levels and sleep duration. The three visualizations suggest that while activity and sleep are both important components of a healthy lifestyle, the direct relationship between them in this sample is weak and unstable.

The scatter plot showed that days with more steps did not necessarily correspond to longer sleep duration, exhibiting a slight negative correlation overall.

The line plot showed that activity levels and sleep duration fluctuated independently over time, with their peaks and troughs not clearly synchronized.

The box plot revealed that participants generally slept longer, and with greater variability, on weekends, suggesting that lifestyle rhythm (the difference between weekdays and weekends) is more clearly associated with sleep duration than step count is.

In summary, daily activity levels do not appear to be the primary determinant of sleep duration in this sample. Sleep is more likely influenced by multiple factors, including psychological stress, sleep patterns, and lifestyle rhythm, none of which this dataset measures directly.

Future work could integrate additional Fitbit metrics such as resting heart rate or sleep quality scores to deepen understanding of personal wellness patterns.

Reference

Arash, N. (2021). FitBit Fitness Tracker Data (Fitabase Data 3.12.16–4.11.16) [Dataset]. Kaggle. https://www.kaggle.com/datasets/arashnic/fitbit