As part of the Google Data Analytics Professional Certificate, I completed a capstone case study designed to demonstrate the full data analysis process. I chose the Bellabeat case study, How Can a Wellness Technology Company Play It Smart?, because I was especially interested in the intersection of health, fitness, and consumer behavior. Bellabeat is a wellness technology company that develops health-focused smart products for women. For this project, I analyzed smart device usage data from Fitbit users to identify behavioral patterns in activity and sleep that could help inform Bellabeat’s product strategy and marketing approach. Rather than simply describing user trends, this analysis focuses on generating data-driven recommendations that Bellabeat could use to better position its products, improve user engagement, and support healthier daily habits.
Bellabeat is a high-tech wellness company that develops smart products designed to help women monitor and improve their health. These products track key lifestyle metrics such as physical activity, sleep, and overall wellness through an integrated mobile app. As a junior data analyst on the Bellabeat marketing analytics team, I was tasked with analyzing smart device usage data to better understand how consumers engage with fitness trackers. The goal of this analysis is to uncover behavioral patterns that can help inform Bellabeat’s product strategy and marketing decisions.
Bellabeat’s key products include:
– Bellabeat App: an app that connects to the line of smart wellness products which provides users with health data related to activity, sleep, stress, menstrual cycle, and mindfulness habits.
– Bellabeat Membership: a subscription-based membership program that gives users 24/7 access to personalized guidance on health and wellness habits
– Leaf: a wellness tracker that can be worn and connects to the Bellabeat app (similar to a Fitbit)
– Time: a wellness watch that allows users to track activity, sleep, and stress and connects to the Bellabeat app
– Spring: a water bottle that tracks daily water intake and connects to the app
The objective of this analysis is to examine smart device usage data to identify behavioral patterns in user activity and sleep. These insights will be used to develop data-driven recommendations that can inform Bellabeat’s product strategy and marketing approach.
How can Bellabeat use smart device usage patterns to improve user engagement and inform product and marketing decisions?
What trends exist in users’ daily activity levels (steps, calories, sedentary behavior)?
How do user engagement patterns vary throughout the week?
What insights can be drawn from users’ sleep behavior?
Bellabeat Co-Founders: Urska Srsen and Sando Mur
Bellabeat Marketing Analytics Team
The dataset used in this analysis contains smart device usage data collected from Fitbit users. This dataset was provided as part of the Google Data Analytics Professional Certificate capstone project and is commonly used to explore patterns in health and wellness behavior.
The data includes daily metrics such as:
– Step Count
– Calories Burned
– Activity Intensity Levels (e.g., sedentary, lightly active, very active)
– Sleep Duration and Time in Bed
These datasets provide insight into how users engage with wearable fitness devices across both physical activity and sleep behaviors.
It is important to note that this dataset represents Fitbit users rather than Bellabeat customers directly. As a result, the findings may not fully reflect Bellabeat’s specific user base. Additionally, the dataset covers a limited time period and may not capture long-term behavioral trends.
To prepare the data for analysis, I cleaned and standardized the Fitbit dataset in R. The goal of this step was to ensure that the data was formatted consistently, remove records that were unlikely to reflect actual device usage, and create new variables that would support more meaningful analysis of user behavior.
# Load Packages
library(tidyverse)
library(lubridate)
library(janitor)
library(skimr)
library(ggplot2)
# Uploading and Storing Data
daily_activity <- read.csv("Bellabeat Fitbit Data/dailyActivity_merged.csv")
daily_calories <- read.csv("Bellabeat Fitbit Data/dailyCalories_merged.csv")
daily_steps <- read.csv("Bellabeat Fitbit Data/dailySteps_merged.csv")
daily_sleep <- read.csv("Bellabeat Fitbit Data/sleepDay_merged.csv")
I began by loading the daily activity, calories, steps, and sleep datasets that would be used throughout the analysis.
# Inspect Data
glimpse(daily_activity)
## Rows: 940
## Columns: 15
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ TotalSteps <int> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes <int> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
glimpse(daily_calories)
## Rows: 940
## Columns: 3
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960366…
## $ ActivityDay <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/2016", "4/16/…
## $ Calories <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2035, 1786, 1775…
glimpse(daily_steps)
## Rows: 940
## Columns: 3
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960366…
## $ ActivityDay <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/2016", "4/16/…
## $ StepTotal <int> 13162, 10735, 10460, 9762, 12669, 9705, 13019, 15506, 1054…
glimpse(daily_sleep)
## Rows: 413
## Columns: 5
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150…
## $ SleepDay <chr> "4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM", "…
## $ TotalSleepRecords <int> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ TotalMinutesAsleep <int> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2…
## $ TotalTimeInBed <int> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3…
# Search for Missing Values
colSums(is.na(daily_activity))
## Id ActivityDate TotalSteps
## 0 0 0
## TotalDistance TrackerDistance LoggedActivitiesDistance
## 0 0 0
## VeryActiveDistance ModeratelyActiveDistance LightActiveDistance
## 0 0 0
## SedentaryActiveDistance VeryActiveMinutes FairlyActiveMinutes
## 0 0 0
## LightlyActiveMinutes SedentaryMinutes Calories
## 0 0 0
colSums(is.na(daily_calories))
## Id ActivityDay Calories
## 0 0 0
colSums(is.na(daily_steps))
## Id ActivityDay StepTotal
## 0 0 0
colSums(is.na(daily_sleep))
## Id SleepDay TotalSleepRecords TotalMinutesAsleep
## 0 0 0 0
## TotalTimeInBed
## 0
I reviewed the structure of each dataset to confirm the available variables and check for formatting issues. I also checked for missing values to identify any obvious data quality concerns before cleaning.
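Missing values are only one aspect of data quality. Since the sleep table has far fewer rows (413) than the activity table (940), it can also help to confirm how many distinct users each table covers. The following is a minimal sketch on small made-up data frames; `activity_demo`, `sleep_demo`, and the helper `count_users` are illustrative names and are not part of the original analysis.

```r
# Illustrative stand-ins for two of the raw tables (values are made up)
activity_demo <- data.frame(
  Id = c(1, 1, 2, 2, 3),
  TotalSteps = c(5000, 7000, 12000, 300, 8000)
)
sleep_demo <- data.frame(
  Id = c(1, 1, 3),
  TotalMinutesAsleep = c(400, 380, 450)
)

# Hypothetical helper: number of distinct users in a table
count_users <- function(df) length(unique(df$Id))

user_counts <- c(
  activity = count_users(activity_demo),
  sleep = count_users(sleep_demo)
)
print(user_counts)

# Users present in the activity table but absent from the sleep table
missing_from_sleep <- setdiff(activity_demo$Id, sleep_demo$Id)
print(missing_from_sleep)
```

Applied to the real tables, the same two steps would show whether the sleep findings rest on a smaller group of users than the activity findings.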
# Clean column names first
daily_activity <- daily_activity %>% clean_names()
daily_calories <- daily_calories %>% clean_names()
daily_steps <- daily_steps %>% clean_names()
daily_sleep <- daily_sleep %>% clean_names()
# Check column names
names(daily_activity)
## [1] "id" "activity_date"
## [3] "total_steps" "total_distance"
## [5] "tracker_distance" "logged_activities_distance"
## [7] "very_active_distance" "moderately_active_distance"
## [9] "light_active_distance" "sedentary_active_distance"
## [11] "very_active_minutes" "fairly_active_minutes"
## [13] "lightly_active_minutes" "sedentary_minutes"
## [15] "calories"
names(daily_calories)
## [1] "id" "activity_day" "calories"
names(daily_steps)
## [1] "id" "activity_day" "step_total"
names(daily_sleep)
## [1] "id" "sleep_day" "total_sleep_records"
## [4] "total_minutes_asleep" "total_time_in_bed"
# Convert to proper date formats
daily_activity$activity_date <- as.Date(as.character(daily_activity$activity_date), format = "%m/%d/%Y")
daily_calories$activity_day <- as.Date(as.character(daily_calories$activity_day), format = "%m/%d/%Y")
daily_steps$activity_day <- as.Date(as.character(daily_steps$activity_day), format = "%m/%d/%Y")
daily_sleep$sleep_day <- as.POSIXct(
  as.character(daily_sleep$sleep_day),
  format = "%m/%d/%Y %I:%M:%S %p",
  tz = "UTC"  # fix the time zone so the as.Date() call below cannot shift the calendar date
)
daily_sleep$date <- as.Date(daily_sleep$sleep_day)
# Quick check
head(daily_activity$activity_date)
## [1] "2016-04-12" "2016-04-13" "2016-04-14" "2016-04-15" "2016-04-16"
## [6] "2016-04-17"
head(daily_calories$activity_day)
## [1] "2016-04-12" "2016-04-13" "2016-04-14" "2016-04-15" "2016-04-16"
## [6] "2016-04-17"
head(daily_steps$activity_day)
## [1] "2016-04-12" "2016-04-13" "2016-04-14" "2016-04-15" "2016-04-16"
## [6] "2016-04-17"
head(daily_sleep$date)
## [1] "2016-04-12" "2016-04-13" "2016-04-15" "2016-04-16" "2016-04-17"
## [6] "2016-04-19"
A key part of the cleaning process was converting date fields into proper date format. In the raw files, date values were stored as character strings, which would make time-based analysis more difficult. Standardizing these fields made it possible to analyze activity and sleep trends by day.
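One thing worth checking after a conversion like this is whether any values failed to parse, because `as.Date()` with a `format` argument returns `NA` for strings that do not match instead of raising an error. The sketch below uses a few sample strings (the third is deliberately malformed, and all variable names are illustrative); it also shows a date-time parsed with an explicit time zone, which is an assumption about how to keep the calendar date stable across machines.

```r
# Two sample strings in the raw format, plus one deliberately malformed value
raw_dates <- c("4/12/2016", "4/13/2016", "2016-04-14")
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

# Strings that do not match the format come back as NA rather than an error
n_failed <- sum(is.na(parsed))
print(n_failed)

# Date-times: fixing the time zone keeps the calendar date from shifting on conversion
raw_stamp <- "4/12/2016 12:00:00 AM"
stamp <- as.POSIXct(raw_stamp, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")
print(as.Date(stamp))
```

A count of zero failed values on the real columns would indicate that every date string matched the expected format.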
# Create Derived Variables
daily_activity$day_of_week <- weekdays(daily_activity$activity_date)
daily_calories$day_of_week <- weekdays(daily_calories$activity_day)
daily_steps$day_of_week <- weekdays(daily_steps$activity_day)
daily_sleep$day_of_week <- weekdays(daily_sleep$date)
daily_sleep$time_taken_to_sleep <- daily_sleep$total_time_in_bed - daily_sleep$total_minutes_asleep
Next, I created new variables that would help support the analysis. A ‘day_of_week’ field was added to each dataset so that usage patterns could be compared across the week. I also created a ‘time_taken_to_sleep’ variable by subtracting total minutes asleep from total time in bed, which provided a simple estimate of how long it took users to fall asleep. Because it counts all time in bed that was not spent asleep, it also includes any time spent awake during the night or after waking.
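Because `weekdays()` returns day names in the language of the current locale, the weekday labels can differ between machines. A small sketch of one locale-independent alternative is shown here; the vector `day_labels` and the sample dates are illustrative, and this is an optional variation and not what the analysis itself uses.

```r
# Fixed English labels, indexed by weekday number instead of relying on the locale
day_labels <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                "Thursday", "Friday", "Saturday")

dates <- as.Date(c("2016-04-12", "2016-04-16", "2016-04-17"))

# POSIXlt stores the weekday as 0 (Sunday) through 6 (Saturday)
day_index <- as.POSIXlt(dates)$wday + 1

# A factor with explicit levels keeps the days in calendar order in tables and plots
day_of_week <- factor(day_labels[day_index], levels = day_labels)
print(day_of_week)
```

Setting the factor levels at this stage would also keep the days in calendar order in any later summary or chart.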
# Remove Unusable Records
cleaned_daily_activity <- daily_activity %>%
  filter(calories > 0, total_distance > 0)
nrow(daily_activity)
## [1] 940
nrow(cleaned_daily_activity)
## [1] 862
I removed activity records where calories burned or total distance were recorded as zero, which dropped 78 of the 940 records and left 862. These records likely represent days when the device was not worn or was not used consistently, and including them could distort the analysis of actual user behavior.
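To see which rule accounts for the removed rows, the two conditions can be counted separately before filtering. This is a minimal sketch on made-up rows; `activity_demo` and `removal_summary` are illustrative names, and the column names simply mirror the cleaned activity table.

```r
# Made-up rows standing in for the cleaned-name activity table
activity_demo <- data.frame(
  calories = c(1985, 0, 1776, 1500, 2100),
  total_distance = c(8.50, 0, 6.74, 0, 5.10)
)

zero_calories <- activity_demo$calories == 0
zero_distance <- activity_demo$total_distance == 0

# Rows flagged by each rule, and by either rule
removal_summary <- c(
  zero_calories = sum(zero_calories),
  zero_distance = sum(zero_distance),
  either = sum(zero_calories | zero_distance)
)
print(removal_summary)

# Same rule as the filter above: keep rows where both values are positive
kept <- activity_demo[activity_demo$calories > 0 & activity_demo$total_distance > 0, ]
print(nrow(kept))
```

On the real data, the `either` count should equal the difference between the row counts before and after filtering.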
# Check for Duplicates
sum(duplicated(daily_activity))
## [1] 0
sum(duplicated(daily_calories))
## [1] 0
sum(duplicated(daily_steps))
## [1] 0
sum(duplicated(daily_sleep))
## [1] 3
# Remove Duplicates
daily_sleep <- distinct(daily_sleep)
Finally, I checked each dataset for duplicate rows. Only the sleep data contained any: three rows were exact duplicates, meaning those nightly observations had been recorded more than once. Because duplicate rows can artificially inflate summary statistics and distort trends, I removed them before proceeding with the analysis. This reduced the sleep table from 413 to 410 rows, so that each row represents a unique observation.
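A follow-up check after deduplication can be sketched as below. The rows are made up, `sleep_demo` and the other variable names are illustrative, and base R's `unique()` is used as a stand-in for `distinct()` so the snippet runs without extra packages. The extra check for repeated user-day pairs is an assumption about what "one observation per night" should mean.

```r
# Made-up sleep rows; the third row repeats the second exactly
sleep_demo <- data.frame(
  id = c(1, 1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-13", "2016-04-12")),
  total_minutes_asleep = c(327, 384, 384, 412)
)

n_before <- sum(duplicated(sleep_demo))

# unique() on a data frame drops fully duplicated rows
sleep_dedup <- unique(sleep_demo)

# Follow-up check: no full-row duplicates and no repeated user-day pairs
n_after <- sum(duplicated(sleep_dedup))
n_repeat_days <- sum(duplicated(sleep_dedup[, c("id", "date")]))
print(c(before = n_before, after = n_after, repeated_user_days = n_repeat_days))
```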
# Preview Cleaned Data
glimpse(cleaned_daily_activity)
## Rows: 862
## Columns: 16
## $ id <dbl> 1503960366, 1503960366, 1503960366, 1503960…
## $ activity_date <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-0…
## $ total_steps <int> 13162, 10735, 10460, 9762, 12669, 9705, 130…
## $ total_distance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9…
## $ tracker_distance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9…
## $ logged_activities_distance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_distance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3…
## $ moderately_active_distance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1…
## $ light_active_distance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5…
## $ sedentary_active_distance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_minutes <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66,…
## $ fairly_active_minutes <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, …
## $ lightly_active_minutes <int> 328, 217, 181, 209, 221, 164, 233, 264, 205…
## $ sedentary_minutes <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 8…
## $ calories <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2…
## $ day_of_week <chr> "Tuesday", "Wednesday", "Thursday", "Friday…
glimpse(daily_sleep)
## Rows: 410
## Columns: 8
## $ id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1…
## $ sleep_day <dttm> 2016-04-12, 2016-04-13, 2016-04-15, 2016-04-16, …
## $ total_sleep_records <int> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ total_minutes_asleep <int> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430,…
## $ total_time_in_bed <int> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449,…
## $ date <date> 2016-04-12, 2016-04-13, 2016-04-15, 2016-04-16, …
## $ day_of_week <chr> "Tuesday", "Wednesday", "Friday", "Saturday", "Su…
## $ time_taken_to_sleep <int> 19, 23, 30, 27, 12, 16, 17, 39, 23, 19, 46, 29, 2…
After cleaning, the datasets were ready for analysis. At this stage, the data had standardized date fields, added weekday variables, a derived sleep measure, and a cleaned activity table that better reflected actual user behavior.
To begin the analysis, I examined overall daily activity levels to better understand how users engage with their fitness devices. Looking at summary statistics first provides a general picture of step count, calorie burn, and sedentary behavior across the dataset.
cleaned_daily_activity %>%
  summarise(
    average_steps = round(mean(total_steps, na.rm = TRUE), 0),
    median_steps = round(median(total_steps, na.rm = TRUE), 0),
    average_calories = round(mean(calories, na.rm = TRUE), 0),
    average_sedentary_minutes = round(mean(sedentary_minutes, na.rm = TRUE), 0)
  )
## average_steps median_steps average_calories average_sedentary_minutes
## 1 8329 8054 2362 955
Users averaged roughly 8,300 steps per day (with a median of about 8,050), which falls below the commonly referenced 10,000-step benchmark. This suggests that even among smart device users, many individuals are not consistently reaching higher daily activity levels. At the same time, the data show a substantial amount of sedentary time, averaging 955 minutes (about 16 hours) per day, indicating that inactivity remains an important part of overall user behavior. Together, these findings suggest that Bellabeat has an opportunity to position its products not only as activity trackers, but as tools that encourage healthier daily routines and more consistent movement.
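An average below the benchmark does not show how often individual days reach it. One way to measure that is the share of days at or above 10,000 steps. The sketch below uses only the first seven `total_steps` values visible in the `glimpse()` output earlier (one user, one week), so its result is not representative of the full dataset; `step_goal` and `share_meeting_goal` are illustrative names.

```r
# First seven total_steps values from the glimpse() output (one user, one week)
steps <- c(13162, 10735, 10460, 9762, 12669, 9705, 13019)
step_goal <- 10000  # the benchmark referenced in the text

# The mean of a logical vector is the proportion of TRUE values
share_meeting_goal <- mean(steps >= step_goal)
print(round(share_meeting_goal, 2))
```

Running the same expression on `cleaned_daily_activity$total_steps` would give the dataset-wide share.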
To better understand how physical activity translates into measurable outcomes, I examined the relationship between total daily steps and calories burned.
ggplot(cleaned_daily_activity, aes(x = total_steps, y = calories)) +
  geom_point(alpha = 0.5, color = "blue") +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(
    title = "Relationship Between Daily Steps and Calories Burned",
    x = "Total Steps",
    y = "Calories Burned"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The plot shows a clear positive relationship between daily step count and calories burned, confirming that higher levels of physical activity are associated with greater energy expenditure. This reinforces the value of step count as a simple and effective metric for tracking overall activity. At the same time, the variation in calorie burn for similar step counts suggests differences in activity intensity, indicating that not all movement contributes equally to overall health outcomes. For Bellabeat, this presents an opportunity to emphasize step tracking as a core feature while also incorporating goal-setting, progress tracking, and activity-based incentives. Encouraging users to increase both their daily steps and activity intensity could help drive more meaningful improvements in overall wellbeing.
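The strength of this relationship can also be expressed numerically with a correlation coefficient and the slope of the fitted line. The sketch below again uses only the first seven rows shown in the `glimpse()` output (a single user), so the numbers it prints are illustrative and are not the dataset-wide result.

```r
# First seven rows of total_steps and calories from the glimpse() output
steps <- c(13162, 10735, 10460, 9762, 12669, 9705, 13019)
calories <- c(1985, 1797, 1776, 1745, 1863, 1728, 1921)

# Pearson correlation: strength of the linear association
r <- cor(steps, calories)

# Linear model: the slope estimates extra calories per additional step
fit <- lm(calories ~ steps)
slope <- unname(coef(fit)["steps"])

print(round(r, 2))
print(slope * 1000)  # estimated calories per additional 1,000 steps
```

The same two calls on the full `cleaned_daily_activity` columns would quantify the trend line drawn by `geom_smooth(method = "lm")`.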
To better understand how user engagement varies over time, I analyzed activity levels across different days of the week. Identifying these patterns can help reveal when users are most and least active.
weekday_activity <- cleaned_daily_activity %>%
  group_by(day_of_week) %>%
  summarise(
    avg_steps = mean(total_steps, na.rm = TRUE),
    avg_calories = mean(calories, na.rm = TRUE)
  )
# Order days properly
weekday_activity$day_of_week <- factor(
  weekday_activity$day_of_week,
  levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
)
ggplot(weekday_activity, aes(x = day_of_week, y = avg_steps)) +
  geom_col(fill = "blue") +
  labs(
    title = "Average Daily Steps by Day of Week",
    x = "Day of Week",
    y = "Average Steps"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )
The data show clear variation in activity levels across the week. Users tend to be the most active on Tuesdays and Saturdays, while activity levels are lowest on Fridays and Sundays. Interestingly, activity does not steadily increase toward the weekend, suggesting that user behavior is influenced by both weekday routines and weekend habits rather than a simple workweek vs. weekend pattern. For Bellabeat, this presents an opportunity to target lower-activity days, particularly Fridays and Sundays, with reminders, motivational prompts, or personalized activity goals. By encouraging engagement during these lower periods, Bellabeat could help users build more consistent activity habits across the entire week. Additionally, the relatively higher activity on certain days suggests that Bellabeat could reinforce positive behavior through milestone tracking or rewards on high-engagement days, helping users stay motivated and maintain momentum.
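The highest and lowest days can be read off the chart, but they can also be identified programmatically, together with the number of records behind each average. This is a minimal sketch on made-up records; `demo`, `avg_steps`, and `n_days` are illustrative names, and the step values are hypothetical.

```r
# Made-up daily records; in the analysis this would be cleaned_daily_activity
demo <- data.frame(
  day_of_week = c("Friday", "Friday", "Saturday", "Saturday", "Sunday", "Sunday"),
  total_steps = c(7000, 8000, 9500, 8500, 6000, 8000)
)

# Mean steps and number of records per weekday
avg_steps <- tapply(demo$total_steps, demo$day_of_week, mean)
n_days <- tapply(demo$total_steps, demo$day_of_week, length)
print(data.frame(
  day = names(avg_steps),
  avg_steps = as.vector(avg_steps),
  n_days = as.vector(n_days)
))

# Days with the lowest and highest average
lowest_day <- names(which.min(avg_steps))
highest_day <- names(which.max(avg_steps))
print(c(lowest = lowest_day, highest = highest_day))
```

Reporting the record count next to each average makes it easier to judge whether a gap between two days is based on enough data to act on.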
In addition to physical activity, sleep is a key component of overall wellness. I analyzed users’ sleep data to better understand patterns in sleep duration, efficiency, and behavior.
# Sleep Summary
daily_sleep %>%
  summarise(
    avg_minutes_asleep = mean(total_minutes_asleep, na.rm = TRUE),
    avg_time_in_bed = mean(total_time_in_bed, na.rm = TRUE),
    avg_time_to_sleep = mean(time_taken_to_sleep, na.rm = TRUE)
  )
## avg_minutes_asleep avg_time_in_bed avg_time_to_sleep
## 1 419.1732 458.4829 39.30976
Users averaged approximately 419 minutes of sleep per night (about 7 hours), which is slightly below the commonly recommended 8 hours for adults. This suggests that even among smart device users, many individuals may not be consistently achieving optimal sleep duration. On average, users spent about 458 minutes in bed, indicating a gap of roughly 39 minutes between time in bed and time asleep. This difference highlights variability in sleep efficiency, suggesting that users may spend a significant amount of time attempting to fall asleep rather than actually resting. For Bellabeat, this presents an opportunity to emphasize not only sleep duration, but also sleep quality. By providing insights into sleep efficiency and offering recommendations to improve bedtime routines, Bellabeat can help users develop healthier and more effective sleep habits.
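Sleep efficiency can be made explicit as the share of time in bed that is spent asleep. The sketch below computes it for the first seven sleep records shown in the `glimpse()` output, and then from the two reported averages. The second figure is a ratio of means, which is an approximation and not necessarily equal to the average of the nightly ratios; the variable names are illustrative.

```r
# First seven sleep records from the glimpse() output
total_minutes_asleep <- c(327, 384, 412, 340, 700, 304, 360)
total_time_in_bed <- c(346, 407, 442, 367, 712, 320, 377)

# Sleep efficiency: share of time in bed actually spent asleep
sleep_efficiency <- total_minutes_asleep / total_time_in_bed
print(round(sleep_efficiency, 3))

# The same ratio computed from the reported dataset-wide averages
overall_efficiency <- 419.1732 / 458.4829
print(round(overall_efficiency, 3))
```

Based on the reported averages, this works out to roughly 91% of time in bed spent asleep.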
ggplot(daily_sleep, aes(x = time_taken_to_sleep)) +
  geom_histogram(fill = "blue", bins = 30) +
  labs(
    title = "Distribution of Time Taken to Fall Asleep",
    x = "Minutes to Fall Asleep",
    y = "Count"
  ) +
  theme_minimal()
The distribution of time taken to fall asleep shows that most users fall asleep within approximately 10 to 40 minutes. However, there is a long tail of users who take significantly longer, with some taking over 100 minutes to fall asleep. This variation suggests that while many users experience relatively efficient sleep onset, a subset of users may struggle with falling asleep, indicating potential issues with sleep quality or nighttime routines. The presence of extreme outliers may also reflect inconsistent device usage or unusual sleep patterns. For Bellabeat, this presents an opportunity to provide more personalized insights and recommendations related to sleep behavior. Features such as bedtime reminders, relaxation guidance, or sleep habit tracking could help users improve both sleep efficiency and overall sleep quality. By addressing not only how long users sleep, but how easily they fall asleep, Bellabeat can offer a more comprehensive and differentiated approach to wellness tracking.
Based on the analysis of user activity and sleep behavior, several opportunities emerge for Bellabeat to improve user engagement and product positioning.
Since step count is strongly associated with calories burned and overall activity, Bellabeat should emphasize step tracking as a core feature. Personalized step goals, progress tracking, and milestone-based rewards can help motivate users to increase their daily activity levels.
User activity varies across the week, with noticeable drops on certain days such as Fridays and Sundays. Bellabeat can leverage this insight by implementing targeted notifications, reminders, or challenges on lower-activity days to encourage more consistent engagement.
Because activity levels fluctuate throughout the week, Bellabeat should focus on helping users build consistent routines. Features such as weekly activity summaries, streak tracking, or habit-based goals could reinforce long-term behavior change.
Users average slightly below recommended sleep levels and show variability in sleep efficiency. Bellabeat can enhance its value proposition by providing more detailed sleep insights, including sleep duration trends, time-to-fall-asleep tracking, and personalized recommendations for improving sleep quality.
The variation in both activity and sleep behavior suggests that users have different needs and habits. Bellabeat could differentiate itself by offering personalized insights and recommendations tailored to individual user patterns, helping users make more meaningful improvements in their overall wellness.
This analysis explored smart device usage data to better understand patterns in user activity and sleep behavior, with the goal of informing Bellabeat’s product and marketing strategy. By examining trends in daily steps, calorie expenditure, weekly activity patterns, and sleep habits, several key insights emerged about how users engage with wellness technology. The findings suggest that while users are actively engaging with fitness trackers, many are not consistently reaching optimal activity or sleep levels. Step count proved to be a strong and accessible indicator of overall activity, while variations in weekly behavior highlighted opportunities for more targeted engagement strategies. Additionally, sleep analysis revealed not only slightly below-recommended sleep duration, but also meaningful differences in sleep efficiency, suggesting that users may benefit from more comprehensive support around nighttime routines. These insights reinforce the idea that wearable devices can play a critical role in helping users build healthier habits—but only when paired with actionable feedback and personalized guidance. For Bellabeat, this presents an opportunity to move beyond passive tracking and position its products as tools that actively support behavior change. By focusing on step-based engagement, habit formation, and personalized wellness insights, Bellabeat can enhance user experience, improve long-term engagement, and better differentiate itself in the competitive wellness technology market. Ultimately, leveraging data-driven insights in this way allows Bellabeat to deliver more value to its users while supporting healthier, more sustainable lifestyles.