Bellabeat case study using R

Introduction

This report is part of the capstone project for the Google Data Analytics Certificate. The analysis is conducted using the R programming language within RStudio Desktop.

Scenario

I am a junior data analyst working in the marketing team of Bellabeat, a high-tech company that manufactures health-focused smart products for women. Bellabeat’s co-founder and Chief Creative Officer, Urška Sršen, believes that analyzing smart device usage data could uncover valuable growth opportunities. She is particularly interested in understanding how consumers use non-Bellabeat smart devices, such as fitness trackers, to inform Bellabeat’s product and marketing strategies.

Ask

I have been tasked with analyzing publicly available Fitbit Fitness Tracker Data to better understand how consumers engage with their smart devices. The objective is to identify usage trends and behavioral patterns that can guide Bellabeat’s future marketing strategy, enhance product development, and improve alignment with user lifestyles.

This analysis will explore:
- Daily and hourly activity patterns
- Consistency of device usage over time
- Engagement with features such as step tracking and sleep monitoring

The final deliverable will be a data-driven report and presentation to the Bellabeat executive team, including strategic recommendations.

According to the project brief, my analysis will be guided by the following key questions:
- What are some trends in smart device usage?
- How could these trends apply to Bellabeat customers?
- How could these trends help influence Bellabeat’s marketing strategy?

Prepare

This project uses the publicly available Fitbit Fitness Tracker Data (https://www.kaggle.com/datasets/arashnic/fitbit), which contains anonymized fitness data collected from 34 users over a 32-day period (March 12 to April 12, 2016, inclusive). The dataset includes records on daily and hourly activity, sleep, calories burned, steps taken, and heart rate.

All files were downloaded from Kaggle and placed in the working directory for RStudio. The working directory was identified using the getwd() command.

All data files are in comma-separated values (.CSV) format.

Code: Install and Load Required Packages

library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(readr)
library(ggplot2)
library(ggrepel)

Note: If any packages were not yet installed, the install.packages() function was used before loading them.
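
The install-before-load step mentioned in the note could be sketched as follows. This is a minimal illustration, not the code used to produce this report; the helper name find_missing_packages is an assumption introduced here.

```r
# Hypothetical helper: return the packages from `pkgs` that cannot be loaded
find_missing_packages <- function(pkgs) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!available]
}

# Packages loaded in this report
required_packages <- c("tidyr", "dplyr", "lubridate", "readr", "ggplot2", "ggrepel")

# Install only in an interactive session, and only what is actually missing
if (interactive()) {
  missing_packages <- find_missing_packages(required_packages)
  if (length(missing_packages) > 0) install.packages(missing_packages)
}
```

Guarding the install with interactive() keeps a knitted or scripted run from downloading packages unexpectedly.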

Load Data Files into RStudio

Although this analysis focuses on step tracking and sleep monitoring, all Fitbit data files were initially loaded. Some datasets (e.g., for calories, METs, and heart rate) were later excluded from the analysis.

Code: read data files to RStudio

minute_steps <- read.csv("minuteStepsNarrow_merged.csv")
minute_calories <- read.csv("minuteCaloriesNarrow_merged.csv")
minute_intensities <- read.csv("minuteIntensitiesNarrow_merged.csv")
minute_mets <- read.csv("minuteMETsNarrow_merged.csv")
minute_sleep <- read.csv("minuteSleep_merged.csv")
daily_activity <- read.csv("dailyActivity_merged.csv")
hourly_calories <- read.csv("hourlyCalories_merged.csv")
hourly_intensities <- read.csv("hourlyIntensities_merged.csv")
hourly_steps <- read.csv("hourlySteps_merged.csv")
heartrate_seconds <- read.csv("heartrate_seconds_merged.csv")
weight_log <- read.csv("weightLogInfo_merged.csv")
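
Because read.csv() stops with an error when a file is not in the working directory, a quick presence check before loading can save time. The sketch below assumes the Kaggle file names listed above; the helper find_missing_files is hypothetical and not part of the original workflow.

```r
# Hypothetical helper: return the expected files that are not present in `dir`
find_missing_files <- function(files, dir = getwd()) {
  files[!file.exists(file.path(dir, files))]
}

# The two files this analysis ultimately relies on
expected_files <- c("hourlySteps_merged.csv", "minuteSleep_merged.csv")

missing_files <- find_missing_files(expected_files)
if (length(missing_files) > 0) {
  message("Not found in ", getwd(), ": ", paste(missing_files, collapse = ", "))
}
```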

Process

Clean and Prepare data for analysis.

All 11 data files provided in the Fitbit dataset on Kaggle were loaded into RStudio for initial inspection. Note: The Kaggle page seems to reference additional data files that are no longer available for download.

Minute-level data
- minuteStepsNarrow_merged.csv
- minuteCaloriesNarrow_merged.csv
- minuteIntensitiesNarrow_merged.csv
- minuteMETsNarrow_merged.csv
- minuteSleep_merged.csv
Hourly-level data
- hourlySteps_merged.csv
- hourlyCalories_merged.csv
- hourlyIntensities_merged.csv
Daily summary data
- dailyActivity_merged.csv
Other files
- heartrate_seconds_merged.csv
- weightLogInfo_merged.csv

The following are my comments about these files, grouped by categories:

Step tracking
Step data was available for all users at both minute and hourly levels. Since hourly data provides sufficient detail to assess usage patterns across time, minute-level step data was not used. Hourly steps were used to analyze user behavior and engagement with the Fitbit.
Intensity tracking
While intensity data captures light, moderate, and vigorous activity levels, it was highly correlated with step data and added little additional insight. For simplicity and clarity, intensity data was excluded from visualizations.
Sleep tracking
Minute-level sleep data was available only for a subset of users and not consistently across days. However, it was retained and used to explore engagement with sleep-tracking features.
Calorie and MET tracking
Both metrics are model-driven outputs based on user movement, not direct interactions with the Fitbit. They were excluded because they do not reflect how users actively used the device.
Daily summary data
The dailyActivity_merged.csv file appeared to be truncated, with most users having data only for April. Due to this inconsistency, this file was not used.
Weight tracking
Only 8 users had any weight entries, and even those were sparse. This file was excluded due to insufficient data for trend analysis.
Heart rate tracking
Although potentially useful, heart rate data was incomplete and unevenly distributed, with only 14 users represented. It was excluded for lack of broad user coverage.
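
The coverage judgments above (how many users appear in a file, and on how many days) come down to counting distinct values. The sketch below shows one way to do that on made-up data; toy_weight_log and its values are invented for illustration and do not reproduce the real weight file.

```r
library(dplyr)

# Toy stand-in for one of the Fitbit files (values are made up)
toy_weight_log <- data.frame(
  Id   = c(1, 1, 2, 3, 3, 3),
  Date = as.Date(c("2016-03-12", "2016-03-20", "2016-03-15",
                   "2016-04-01", "2016-04-02", "2016-04-03"))
)

# Number of distinct users represented in the file
n_users <- n_distinct(toy_weight_log$Id)

# Number of distinct days logged by each user
coverage <- toy_weight_log %>%
  group_by(Id) %>%
  summarise(days_logged = n_distinct(Date), .groups = "drop")

print(n_users)
print(coverage)
```

Applying the same two summaries to each loaded data frame gives a comparable basis for deciding which files to keep.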

Final Datasets Used for Analysis

The following two files were found to be the most complete and reflective of how users interacted with their Fitbit devices:

hourlySteps_merged.csv – for analyzing step activity across time

minuteSleep_merged.csv – for examining user engagement with sleep tracking

hourly_steps

# View structure and inspect for missing values
str(hourly_steps)

## 'data.frame':    24084 obs. of  3 variables:
##  $ Id          : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityHour: chr  "3/12/2016 12:00:00 AM" "3/12/2016 1:00:00 AM" "3/12/2016 2:00:00 AM" "3/12/2016 3:00:00 AM" ...
##  $ StepTotal   : int  0 0 0 0 0 0 0 0 0 8 ...

colSums(is.na(hourly_steps))

##           Id ActivityHour    StepTotal 
##            0            0            0

head(hourly_steps)

##           Id          ActivityHour StepTotal
## 1 1503960366 3/12/2016 12:00:00 AM         0
## 2 1503960366  3/12/2016 1:00:00 AM         0
## 3 1503960366  3/12/2016 2:00:00 AM         0
## 4 1503960366  3/12/2016 3:00:00 AM         0
## 5 1503960366  3/12/2016 4:00:00 AM         0
## 6 1503960366  3/12/2016 5:00:00 AM         0

# Check for duplicate records by user and timestamp
sum(duplicated(hourly_steps[, c("Id", "ActivityHour")]))

## [1] 0

# Convert ActivityHour from string to POSIXct format for date-time handling
hourly_steps$ActivityHour <- parse_date_time(hourly_steps$ActivityHour, orders = "mdy IMS p")

# Extract useful time-based components
hourly_steps <- hourly_steps %>%
  mutate(
    date = as.Date(ActivityHour),          # Just the date (YYYY-MM-DD)
    hour = as.numeric(hour(ActivityHour)),             # Extract hour (0–23)
    # Weekday name as an ordered factor (Monday first) so plots sort by day, not alphabetically
    day_of_week = wday(ActivityHour, label = TRUE, abbr = FALSE, week_start = 1)
  )
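
To check that the 12-hour timestamps are read correctly, the conversion can be tried on a couple of sample strings first. This is a small sketch using the same orders string as above; the two sample timestamps are chosen for illustration.

```r
library(lubridate)

# Two sample strings in the same format as the ActivityHour column
sample_ts <- c("3/12/2016 12:00:00 AM", "3/12/2016 1:00:00 PM")

# Same parsing call as used for hourly_steps
parsed <- parse_date_time(sample_ts, orders = "mdy IMS p")

# Midnight should map to hour 0 and 1 PM to hour 13
print(hour(parsed))
print(as.Date(parsed))
```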

minute_sleep

# Examine structure and check for missing values
str(minute_sleep)

## 'data.frame':    198559 obs. of  4 variables:
##  $ Id   : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ date : chr  "3/13/2016 2:39:30 AM" "3/13/2016 2:40:30 AM" "3/13/2016 2:41:30 AM" "3/13/2016 2:42:30 AM" ...
##  $ value: int  1 1 1 1 1 1 2 2 1 1 ...
##  $ logId: num  1.11e+10 1.11e+10 1.11e+10 1.11e+10 1.11e+10 ...

colSums(is.na(minute_sleep))

##    Id  date value logId 
##     0     0     0     0

head(minute_sleep)

##           Id                 date value       logId
## 1 1503960366 3/13/2016 2:39:30 AM     1 11114919637
## 2 1503960366 3/13/2016 2:40:30 AM     1 11114919637
## 3 1503960366 3/13/2016 2:41:30 AM     1 11114919637
## 4 1503960366 3/13/2016 2:42:30 AM     1 11114919637
## 5 1503960366 3/13/2016 2:43:30 AM     1 11114919637
## 6 1503960366 3/13/2016 2:44:30 AM     1 11114919637

minute_sleep <- minute_sleep %>%
  mutate(
    # Parse the original date-time string into a POSIXct datetime object.
    # Using tz = "UTC" avoids issues with daylight saving time gaps.
    datetime = mdy_hms(date, tz = "UTC"),

    # Calculate the number of minutes since midnight for each timestamp.
    # This is useful for plotting sleep patterns over the course of the night.
    time_in_minutes = hour(datetime) * 60 + minute(datetime) + second(datetime) / 60
  )

📌 Note on DST Handling

The date column includes local timestamps from the U.S. daylight saving time (DST) transition on March 13, 2016.
To avoid issues with nonexistent local times (e.g., 2:30 AM, which is skipped during the “spring forward” shift), we parsed these timestamps using UTC.

We then calculated time_in_minutes as the number of minutes past midnight, treating time as a continuous numeric variable.

This approach ensures a smooth analysis of sleep patterns, avoiding distortions caused by DST-related time gaps.
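
The effect described in this note can be illustrated on a single timestamp. The sketch below is illustrative only: the zone "America/New_York" is an assumption (the dataset does not state the users' time zones), and lubridate is expected to return NA for a wall-clock time that does not exist in that zone.

```r
library(lubridate)

# A wall-clock time inside the hour skipped on March 13, 2016 in U.S. zones
skipped <- "3/13/2016 2:30:00 AM"

# Parsed as UTC, the timestamp is kept as written
parsed_utc <- mdy_hms(skipped, tz = "UTC")
time_in_minutes <- hour(parsed_utc) * 60 + minute(parsed_utc) + second(parsed_utc) / 60
print(time_in_minutes)  # 2 * 60 + 30 = 150 minutes past midnight

# Parsed in an assumed U.S. zone, the same string has no valid instant
parsed_local <- suppressWarnings(mdy_hms(skipped, tz = "America/New_York"))
print(parsed_local)
```

The trade-off is that the UTC values are labels for local wall-clock time rather than true instants, which is acceptable here because only time of day is analyzed.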

# Step 1: Count duplicates based on key identifying columns
sum(duplicated(minute_sleep[, c("Id", "logId", "datetime")]))

## [1] 525

# Step 2: Remove duplicates on the same key columns, keeping only the first occurrence
minute_sleep <- minute_sleep %>%
  distinct(Id, logId, datetime, .keep_all = TRUE)
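
The count-then-remove logic can be verified on a toy table, where the number of rows dropped should equal the number of duplicates counted. The data below is invented for illustration.

```r
library(dplyr)

# Toy sleep records: rows 1 and 2 share the same user, log, and timestamp
toy_sleep <- data.frame(
  Id       = c(1, 1, 1, 2),
  logId    = c(10, 10, 10, 20),
  datetime = as.POSIXct(c("2016-03-13 02:39:30", "2016-03-13 02:39:30",
                          "2016-03-13 02:40:30", "2016-03-13 02:39:30"), tz = "UTC"),
  value    = c(1, 1, 1, 2)
)

# Count duplicates on the identifying columns
n_dupes <- sum(duplicated(toy_sleep[, c("Id", "logId", "datetime")]))

# Drop them, keeping the first occurrence and all other columns
deduped <- toy_sleep %>%
  distinct(Id, logId, datetime, .keep_all = TRUE)

print(n_dupes)
print(nrow(deduped))
```

Using the same key columns in both steps keeps the count and the removal consistent with each other.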

Analyze

Step Analysis

1. Daily Step Pattern

# Summarize average hourly steps for each day of the week
daily_step_pattern <- hourly_steps %>%
  group_by(day_of_week) %>%               # group data by weekday name
  summarise(avg_steps = mean(StepTotal),  # mean of the hourly step totals (the column has no missing values)
            .groups = "drop")

# Plot average hourly step count by weekday
ggplot(daily_step_pattern, aes(x = day_of_week, y = avg_steps)) +
  geom_col(fill = "steelblue") +
  labs(
    title = "Average Hourly Step Count by Day of Week",
    x = "Day of Week",
    y = "Average Steps per Hour"
  ) +
  theme_minimal()

Users maintained consistent Fitbit usage throughout the week, with only minor fluctuations in average step count by day. This indicates steady engagement across weekdays and weekends, suggesting that most users wore their devices regularly as part of their daily routines.

2. Hourly Step Pattern (All Days)

hourly_steps_pattern_total <- hourly_steps %>%
  group_by(hour) %>%                    # Group by hour of the day
  summarise(avg_steps = mean(StepTotal), .groups = "drop")  # Compute average

# Plot: Average hourly steps (aggregated over all days)
ggplot(hourly_steps_pattern_total, aes(x = hour, y = avg_steps)) +
  geom_col(fill = "steelblue") +        # Bar chart for each hour
  labs(title = "Average Step Count by Hour (All Days Combined)",
       x = "Hour of Day", y = "Average Steps") +
  theme_minimal()

Averaged across all users and days, step counts are concentrated in the daytime, roughly between 8 AM and 8 PM. Distinct morning and evening peaks suggest habitual routines such as commuting, work breaks, or exercise. Individual users can deviate from this aggregate picture, for example by wearing the device only during workouts or specific activities; the user-level plots in step 4 look at this.

3. Hourly Step Pattern by Day of Week

hourly_steps_pattern <- hourly_steps %>%
  group_by(day_of_week, hour) %>%               # Group by day and hour
  summarise(avg_steps = mean(StepTotal),        # Calculate average steps
            .groups = "drop")

# Plot: Hourly step pattern for each day of the week
ggplot(hourly_steps_pattern, aes(x = hour, y = avg_steps)) +
  geom_col(fill = "steelblue") +                # Use bars to show average steps
  facet_wrap(~ day_of_week, ncol = 2) +         # Separate chart for each weekday
  labs(title = "Average Step Count by Hour and Day of Week",
       x = "Hour of Day", y = "Average Steps") +
  theme_minimal()

Users’ walking patterns differ slightly between weekends and weekdays. On weekdays, many users show routine activity patterns (e.g., morning and evening peaks) that align with daily habits such as commuting or workouts.
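
One way to put a number on the weekend versus weekday difference is to collapse the seven days into two groups and compare means. The sketch below uses invented step counts; the day_type column is an assumption introduced for illustration and is not part of the original analysis.

```r
library(dplyr)
library(lubridate)

# Toy hourly records: March 12-13, 2016 fall on a weekend, March 14-15 on weekdays
toy_steps <- data.frame(
  ActivityHour = as.POSIXct(c("2016-03-12 09:00:00", "2016-03-13 09:00:00",
                              "2016-03-14 09:00:00", "2016-03-15 09:00:00"), tz = "UTC"),
  StepTotal    = c(100, 300, 500, 700)
)

# With week_start = 7, wday() returns 1 for Sunday and 7 for Saturday
day_type_summary <- toy_steps %>%
  mutate(day_type = if_else(wday(ActivityHour, week_start = 7) %in% c(1, 7),
                            "Weekend", "Weekday")) %>%
  group_by(day_type) %>%
  summarise(avg_steps = mean(StepTotal), .groups = "drop")

print(day_type_summary)
```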

4. User-level Hourly Averages

# Count unique active days per user
user_day_counts <- hourly_steps %>%
  group_by(Id) %>%
  summarise(n_days = n_distinct(date), .groups = "drop")

# Compute average hourly steps per user, and attach number of active days
hourly_avg_steps <- hourly_steps %>%
  group_by(Id, hour) %>%
  summarise(avg_steps = mean(StepTotal), .groups = "drop") %>%
  left_join(user_day_counts, by = "Id")  # Add active day counts

# Create custom labels for each user panel: "user_id\nn=days"
user_labels <- setNames(paste0(user_day_counts$Id, "\nn=", user_day_counts$n_days),
                        user_day_counts$Id)

# Plot: Average hourly steps per user with red line at 250 step threshold
ggplot(hourly_avg_steps, aes(x = hour, y = avg_steps)) +
  geom_point(color = "blue", size = 0.5) +
  geom_line(color = "darkgreen") +
  geom_hline(yintercept = 250, color = "red", linetype = "dashed") +
  facet_wrap(~Id, scales = "free_y", labeller = labeller(Id = user_labels)) +
  labs(
    title = "Hourly Average Steps by User Across All Days",
    subtitle = "Red dashed line shows 250 steps threshold",
    x = "Hour of Day", y = "Average Steps"
  ) +
  theme_minimal() +
  coord_cartesian(ylim = c(0, NA))  # Let y-scale grow as needed

Users exhibit diverse activity patterns, with wide variation in hourly step counts. While some users consistently exceed the 250-step/hour threshold during active periods (e.g., mornings or afternoons), others remain well below it throughout the day. This suggests differences in lifestyle, routines, and possibly Fitbit usage commitment, highlighting the value of personalized recommendations or interventions.
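
The comparison against the 250-step line can also be summarized numerically, as the number and share of hours in which a user's average reaches the threshold. The sketch below runs on invented averages for two users and is meant only to illustrate the calculation.

```r
library(dplyr)

# Toy per-user hourly averages (values are made up)
toy_hourly_avg <- data.frame(
  Id        = c(1, 1, 1, 1, 2, 2, 2, 2),
  hour      = rep(c(8, 12, 18, 22), times = 2),
  avg_steps = c(300, 400, 260, 50, 100, 240, 90, 10)
)

# For each user: how many hours reach 250 steps, and what share of hours that is
threshold_summary <- toy_hourly_avg %>%
  group_by(Id) %>%
  summarise(
    hours_at_or_above_250 = sum(avg_steps >= 250),
    share_of_hours        = mean(avg_steps >= 250),
    .groups = "drop"
  )

print(threshold_summary)
```

A per-user share like this could be used to rank users by activity level instead of reading each panel by eye.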

Sleep Analysis

We excluded any sleep sessions with fewer than 2 hours (120 minutes) recorded as asleep, since these are more likely naps or fragmented logs than a night's sleep.

1. Sleep pattern

# Step 1: Filter valid sleep logs with ≥ 2 hours asleep
valid_logs <- minute_sleep %>%
  group_by(logId) %>%
  summarise(asleep_minutes = sum(value == 1), .groups = "drop") %>%
  filter(asleep_minutes >= 120) %>%
  pull(logId)

# Step 2: Keep only valid sleep logs and calculate hours since sleep start
minute_sleep_filtered <- minute_sleep %>%
  filter(logId %in% valid_logs) %>%
  group_by(logId) %>%
  arrange(datetime) %>%
  mutate(
    minutes_since_start = as.numeric(difftime(datetime, min(datetime), units = "mins")),
    hours_since_start = minutes_since_start / 60
  ) %>%
  ungroup()

# Step 3: Recalculate average sleep intensity by time since sleep start (quarter-hour buckets)
avg_intensity <- minute_sleep_filtered %>%
  group_by(hours_since_start = floor(hours_since_start * 4) / 4) %>%
  summarise(avg_value = mean(value, na.rm = TRUE), .groups = "drop")

# Step 4: Plot average intensity
ggplot(avg_intensity, aes(x = hours_since_start, y = avg_value)) +
  geom_line(color = "purple") +
  labs(
    title = "Average Sleep Intensity by Time Since Sleep Start",
    subtitle = "Only sessions with ≥ 2 hours asleep (value == 1)",
    x = "Hours Since Sleep Start",
    y = "Average Sleep Intensity"
  ) +
  theme_minimal()

The average sleep intensity curve, based on valid sessions with at least 2 hours asleep (value == 1), follows a recognizable pattern. In Fitbit's minute-level sleep data the value column codes the sleep state (1 = asleep, 2 = restless, 3 = awake), so a lower average indicates calmer sleep. Intensity drops sharply at the start as users settle into sleep. This is followed by a stable mid-sleep plateau, then a gradual rise near the end, which is often associated with lighter sleep or restlessness before waking. The pattern is consistent with expected physiological rhythms and suggests that Fitbit’s sleep-tracking data is plausible when the device is consistently worn.
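
The quarter-hour bucketing used for this curve, floor(hours * 4) / 4, rounds each elapsed time down to the nearest 15 minutes. A short base R sketch on invented values shows the mapping and the per-bucket average.

```r
# Elapsed hours since sleep start for six toy minute records
hours_since_start <- c(0, 0.1, 0.25, 0.49, 0.5, 1.74)

# Toy sleep-state values for the same records (made up)
value <- c(3, 2, 2, 1, 1, 1)

# Round each time down to the start of its quarter-hour bucket
bucket <- floor(hours_since_start * 4) / 4
print(bucket)  # 0, 0, 0.25, 0.25, 0.5, 1.5

# Average value within each bucket
avg_by_bucket <- tapply(value, bucket, mean)
print(avg_by_bucket)
```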

2. Average Sleep Duration per User

Early-morning sleep logs were shifted to the prior night to better represent user sleep cycles.

# Assign early morning sleep (before noon) to the previous night
minute_sleep <- minute_sleep %>%
  mutate(sleep_day = as.Date(datetime - days(hour(datetime) < 12)))

# Step 1: Filter out short sessions (<2 hours asleep)
valid_logs <- minute_sleep %>%
  group_by(logId) %>%
  summarise(asleep_minutes = sum(value == 1)) %>%
  filter(asleep_minutes >= 120) %>%
  pull(logId)

# Keep only valid sessions
minute_sleep_filtered <- minute_sleep %>%
  filter(logId %in% valid_logs)

# Summarize daily sleep per user (filtered data)
daily_sleep_summary <- minute_sleep_filtered %>%
  group_by(Id, sleep_day) %>%
  summarise(
    TotalSleepRecords = n_distinct(logId),
    TotalMinutesAsleep = sum(value == 1),
    TotalTimeInBed = n(),
    .groups = "drop"
  ) %>%
  rename(date = sleep_day)

# Compute average sleep and mark outliers
sleep_summary <- daily_sleep_summary %>%
  group_by(Id) %>%
  summarise(
    avg_minutes_asleep = mean(TotalMinutesAsleep, na.rm = TRUE),
    recorded_days = n_distinct(date)
  ) %>%
  mutate(
    avg_hours_asleep = avg_minutes_asleep / 60,
    outlier = avg_minutes_asleep < 200 | avg_minutes_asleep > 500,
    outlier = factor(outlier, levels = c(FALSE, TRUE))
  )

# Plot average sleep duration per user
ggplot(sleep_summary, aes(x = reorder(as.factor(Id), avg_hours_asleep),
                          y = avg_hours_asleep, fill = outlier)) +
  geom_col() +
  geom_hline(yintercept = 8, linetype = "dashed", color = "darkgreen") +
  coord_flip(clip = "off") +
  geom_text(aes(label = paste0("n=", recorded_days)),
            hjust = -0.1, size = 3) +
  scale_fill_manual(values = c("FALSE" = "steelblue", "TRUE" = "red"),
                    name = "Outlier") +
  labs(title = "Average Sleep Duration per User",
       subtitle = "Dashed line = 8 hours (filtered for sleep logs ≥ 2 hours)",
       x = "User ID",
       y = "Avg Hours Asleep") +
  theme_minimal() +
  theme(plot.margin = margin(5, 30, 5, 5))

Most users have between 20 and 30 nights of sleep data, indicating regular nighttime usage of the Fitbit device. The majority average between 6.5 and 8 hours of sleep per night, suggesting consistent overnight wear. Red bars flag users whose average falls below 200 or above 500 minutes (about 3.3 and 8.3 hours) as potential outliers.
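
The rule that assigns early-morning records to the previous night can be checked on a few timestamps. This is a small sketch with invented times; it converts the before-noon test to an integer explicitly before passing it to days(), which is a defensive choice made here rather than something the original code does.

```r
library(lubridate)

# Three toy timestamps: late evening, early morning, and afternoon
datetime <- ymd_hms(c("2016-03-13 23:30:00",
                      "2016-03-14 02:15:00",
                      "2016-03-14 13:00:00"), tz = "UTC")

# Subtract one day from anything recorded before noon, then keep only the date
sleep_day <- as.Date(datetime - days(as.integer(hour(datetime) < 12)))

print(sleep_day)  # 2016-03-13, 2016-03-13, 2016-03-14
```

With this rule, a night that starts at 11:30 PM and continues past midnight is counted as a single sleep day.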

User segmentation

1. User engagement

User-level engagement summaries were built by counting, for each user, the days with step data and the valid sleep logs (used as a proxy for nights with sleep data), which supported our later segmentation.

# --- Step 1: Prepare Sleep Summary ---
sleep_summary <- minute_sleep %>%
  group_by(Id, logId) %>%                         # Each sleep episode (log) per user
  summarise(sleep_minutes = n(), .groups = "drop") %>%  # Recorded minutes in the log, all sleep states
  filter(sleep_minutes >= 120) %>%                # Only keep sleep episodes ≥ 2 hours
  group_by(Id) %>%
  summarise(nights_with_sleep_data = n(), .groups = "drop")  # Count valid sleep logs; a night with two logs counts twice

# --- Step 2: Prepare Steps Summary ---
step_summary <- hourly_steps %>%
  group_by(Id, date) %>%                          # Aggregate steps per user per day
  summarise(steps_per_day = sum(StepTotal), .groups = "drop") %>%
  group_by(Id) %>%
  summarise(days_with_steps_data = n(), .groups = "drop")    # Count days with steps

# --- Step 3: Combine Summaries ---
user_engagement <- full_join(sleep_summary, step_summary, by = "Id") %>%
  replace_na(list(nights_with_sleep_data = 0, days_with_steps_data = 0))  # Fill NAs with 0

# --- Step 4: View Result ---
print(user_engagement)

## # A tibble: 34 × 3
##            Id nights_with_sleep_data days_with_steps_data
##         <dbl>                  <int>                <int>
##  1 1503960366                     25                   31
##  2 1644430081                      2                   30
##  3 1844505072                      2                   32
##  4 1927972279                     35                   32
##  5 2022484408                      1                   32
##  6 2026352035                     31                   32
##  7 2347167796                     28                   32
##  8 3977333714                     16                   32
##  9 4020332650                     19                   32
## 10 4319703577                     25                   28
## # ℹ 24 more rows
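
The combination step relies on full_join() keeping users that appear in only one summary and on replace_na() turning the resulting gaps into zeros. A toy sketch with invented counts shows the behavior for a user who has step data but no valid sleep logs.

```r
library(dplyr)
library(tidyr)

# Toy summaries: user 3 has step data but no valid sleep logs
toy_sleep_summary <- data.frame(Id = c(1, 2),
                                nights_with_sleep_data = c(25L, 2L))
toy_step_summary  <- data.frame(Id = c(1, 2, 3),
                                days_with_steps_data = c(31L, 30L, 32L))

# Keep every user from either table, then fill missing counts with 0
toy_engagement <- full_join(toy_sleep_summary, toy_step_summary, by = "Id") %>%
  replace_na(list(nights_with_sleep_data = 0L, days_with_steps_data = 0L))

print(toy_engagement)
```

An inner join would silently drop user 3, which would hide exactly the low-engagement users the segmentation is meant to find.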

2. User segmentation

We selected 25 days as a meaningful threshold to reflect consistent usage during a typical 30-day month.

# --- Add user segments ---
user_engagement <- user_engagement %>%
  mutate(
    segment = case_when(
      nights_with_sleep_data >= 25 & days_with_steps_data >= 25 ~ "Highly Engaged",
      nights_with_sleep_data >= 25 & days_with_steps_data < 25  ~ "Sleep-Focused",
      nights_with_sleep_data < 25  & days_with_steps_data >= 25 ~ "Steps-Focused",
      TRUE                                                      ~ "Low Engagement"
    )
  )

# --- Plot engagement scatterplot ---
ggplot(user_engagement, aes(x = days_with_steps_data, y = nights_with_sleep_data, color = segment)) +
  geom_point(size = 3) +
  geom_vline(xintercept = 25, linetype = "dashed", color = "gray40") +
  geom_hline(yintercept = 25, linetype = "dashed", color = "gray40") +
  ggrepel::geom_text_repel(aes(label = Id),
                           size = 2.5,
                           max.overlaps = 50,
                           segment.color = "gray60",
                           segment.size = 0.2,
                           box.padding = 0.3,
                           point.padding = 0.2) +
  annotate("text", x = 10, y = 35, label = "Sleep-Focused", color = "gray50", size = 3) +
  annotate("text", x = 30, y = 5, label = "Steps-Focused", color = "gray50", size = 3) +
  annotate("text", x = 30, y = 35, label = "Highly Engaged", color = "gray50", size = 3) +
  annotate("text", x = 5, y = 5, label = "Low Engagement", color = "gray50", size = 3) +
  labs(
    title = "Fitbit User Segmentation by Engagement in Sleep and Activity",
    subtitle = "Dashed lines mark 25-day thresholds",
    x = "Days with Steps Data",
    y = "Nights with Sleep Data",
    color = "Segment"
  ) +
  theme_minimal()

Interpretation of User Segmentation Chart

This scatter plot reveals three distinct user segments based on their engagement with sleep and activity tracking:

Highly Engaged users (top-right) consistently log both sleep and steps data, indicating strong and balanced use of the Fitbit device. This segment is an ideal target for upselling premium features.

Steps-Focused users (bottom-right) mainly engage with step tracking, potentially ignoring sleep features or wearing the device only during the day. This segment might benefit from coaching or challenges.

Low Engagement users (bottom-left) have minimal recorded activity in both categories, suggesting sporadic or trial use. This segment offers an opportunity for re-engagement or education.

ggplot(user_engagement, aes(x = segment)) +
  geom_bar(fill = "steelblue") +
  labs(title = "User Count per Engagement Segment", x = "Segment", y = "User Count") +
  theme_minimal()

These segments can guide tailored marketing, product feature emphasis, or user education strategies.
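
Since the 25-day cutoff is a judgment call, it can help to wrap the segmentation rule in a function with the threshold as a parameter. The sketch below is an illustration on invented users; the function name assign_segment and its threshold argument are assumptions introduced here, not part of the original analysis.

```r
library(dplyr)

# Hypothetical helper: same rule as above, with an adjustable threshold
assign_segment <- function(nights, days, threshold = 25) {
  case_when(
    nights >= threshold & days >= threshold ~ "Highly Engaged",
    nights >= threshold & days <  threshold ~ "Sleep-Focused",
    nights <  threshold & days >= threshold ~ "Steps-Focused",
    TRUE                                    ~ "Low Engagement"
  )
}

# Four toy users, one per quadrant of the scatter plot (values are made up)
toy_users <- data.frame(
  nights_with_sleep_data = c(30, 26, 3, 5),
  days_with_steps_data   = c(31, 10, 32, 8)
)

toy_users$segment <- assign_segment(toy_users$nights_with_sleep_data,
                                    toy_users$days_with_steps_data)
print(toy_users)
```

Calling the function again with, say, threshold = 20 shows how sensitive the segment sizes are to the chosen cutoff.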

Act

📢 Strategic Marketing Recommendations

🔹 User Segmentation for Targeted Messaging
Use the behavioral segments identified in the analysis (Highly Engaged, Steps-Focused, Low Engagement) to tailor outreach. Promote advanced features to highly engaged users, and use nudges or tips to re-engage those with low or inconsistent usage.

🔹 Promote Sleep Tracking Strength
Position the smart device as more than a step counter by emphasizing reliable sleep tracking features, such as Sleep Score and quality trends, to appeal to wellness-focused users.

🔹 Encourage All-Day & Night Wear
Use smart reminders (e.g., during morning routines or before bed) and suggest comfort-focused accessories to encourage round-the-clock wear.

🔹 Leverage Routine Patterns
Customize goals or prompts based on individual activity rhythms (e.g., targeting morning walkers with sunrise challenges or evening users with wind-down reminders).

Mehran Hojati

2025-06-29
