This case study examines how Bellabeat, a high-tech wellness company specializing in health-focused products for women, can leverage smart device usage data to inform marketing strategy and drive growth. As a junior data analyst on Bellabeat’s marketing team, I was tasked with analyzing non-Bellabeat smart device data to identify consumer behavior patterns and apply these insights to Bellabeat’s products.
Business Objective: Uncover trends in smart device usage that can inform Bellabeat’s marketing strategy and product development, ultimately positioning the company for expanded market share in the competitive wellness technology space.
Methodology: Following the data analysis lifecycle (Ask-Prepare-Process-Analyze-Share-Act), this project utilizes FitBit Fitness Tracker Data from 33 users to examine activity patterns, sleep behaviors, and device engagement. The analysis focuses on translating general fitness tracker insights into actionable recommendations specifically for Bellabeat’s product ecosystem.
Key Deliverables:
Bellabeat Product Ecosystem & Analysis Focus
Analysis Overview: 814 activity days from 33 Fitbit users and 382 sleep records from 19 users, analyzed through Bellabeat’s women’s wellness lens.
Top 3 Business Insights:
Primary Recommendations:
Data Source: FitBit Fitness Tracker Data (CC0: Public Domain, available on Kaggle). This dataset contains personal fitness tracker data from a maximum of 33 Fitbit users, with significant variation in user participation across different metrics.
Data Credibility (ROCCC Analysis):
Verdict: This dataset has significant limitations. It should be used to identify potential trends, but any conclusions must be framed with these limitations in mind.
The FitBit Fitness Tracker Data included two folders containing data from March 12, 2016 to April 11, 2016 and from April 12, 2016 to May 12, 2016. I decided to use the April 12, 2016 to May 12, 2016 dataset (mturkfitbit_export_4.12.16-5.12.16) as it represented the most recent data available and contained more files for my analysis. Following this selection, I proceeded to create a data dictionary by examining each file in the chosen folder and reviewing each column within those files. This process helped me develop a better understanding of the data in each table, how they relate to one another, and their relevance to the business task.
Primary Datasets Used for Analysis:
Supplementary Data:
Excluded Datasets:
Key Metrics Tracked:
For the complete data dictionary with detailed column descriptions and analysis rationale, view the full Google Sheets document here.
Image to the left is Bellabeat’s IVY+ and image to the right is Bellabeat’s Leaf Urban.
Data Limitations & Opportunities:
Recommended Features:
After loading the necessary packages (tidyverse,
ggplot2, janitor, lubridate, and
lm.beta), I loaded the datasets and counted the number of
unique users to understand the data coverage, alongside checking the
date range.
# Load datasets
daily_activity <- read.csv('dailyActivity_merged.csv')
sleep_day <- read.csv('sleepDay_merged.csv')
sleep_minutes <- read.csv('minuteSleep_merged.csv')
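# Count unique users across datasets
cat("Unique users in daily_activity:", n_distinct(daily_activity$Id), "\n")
cat("Unique users in sleep_day:", n_distinct(sleep_day$Id), "\n")
cat("Unique users in sleep_minutes:", n_distinct(sleep_minutes$Id), "\n")
## Unique users in daily_activity: 33
## Unique users in sleep_day: 24
## Unique users in sleep_minutes: 24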
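# Checking the date range
# Note: ActivityDate is still a character column here, so min() and max() compare text, not dates
cat("Date range:", min(daily_activity$ActivityDate), "to",
    max(daily_activity$ActivityDate), "\n")
## Date range: 4/12/2016 to 5/9/2016
Key Finding: The sleep datasets have only 24 unique users compared to 33 in the activity data, so I analyze sleep patterns separately. The daily activity data has 940 initial rows. The printed range of 4/12/2016 to 5/9/2016 comes from comparing date strings as text ("5/9/2016" sorts after "5/12/2016"), so the true end date is probably later; the export folder covers April 12 to May 12, 2016.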
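Because ActivityDate is read in as text in M/D/YYYY form, min() and max() on it are string comparisons. A minimal sketch of the difference, using a few made-up date strings (not rows from the dataset) and base R only:

```r
# Hypothetical sample of date strings in the same M/D/YYYY format
activity_dates <- c("4/12/2016", "5/9/2016", "5/12/2016", "4/30/2016")

# Comparing as text is lexicographic: "5/9/..." sorts after "5/12/..."
text_max <- max(activity_dates)

# Parsing first gives the chronological range
parsed_dates <- as.Date(activity_dates, format = "%m/%d/%Y")
date_range <- range(parsed_dates)

cat("Text max:", text_max, "\n")
cat("Date range:", format(date_range[1]), "to", format(date_range[2]), "\n")
```

Parsing the column before summarising it avoids this mismatch; the cleaning step later in the report does convert activity_date with as.Date().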
## Rows where distances sum perfectly: 737 out of 940
## Percentage of perfect matches: 78.4 %
Distance Column Integrity: After comprehensive
validation, 78.4% of distance calculations (sum of
VeryActiveDistance, ModeratelyActiveDistance,
LightActiveDistance, SedentaryActiveDistance
matching TotalDistance) matched perfectly, with only 2.6%
showing significant discrepancies. For
detailed validation code and visualizations, see Appendix
A.
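The validation code itself lives in Appendix A and is not reproduced here. As a rough sketch of the kind of check described, assuming a small rounding tolerance (the 0.01 value and the sample rows are illustrative assumptions, not taken from the dataset):

```r
# Hypothetical rows using the dailyActivity column names
daily <- data.frame(
  TotalDistance            = c(8.50, 6.97, 5.25),
  VeryActiveDistance       = c(1.88, 1.57, 0.00),
  ModeratelyActiveDistance = c(0.55, 0.69, 0.25),
  LightActiveDistance      = c(6.07, 4.71, 4.50),
  SedentaryActiveDistance  = c(0.00, 0.00, 0.00)
)

# Sum the four intensity components for each day
component_sum <- with(
  daily,
  VeryActiveDistance + ModeratelyActiveDistance +
    LightActiveDistance + SedentaryActiveDistance
)

# Assumed tolerance to absorb rounding in the exported values
tolerance <- 0.01
daily$distance_match <- abs(component_sum - daily$TotalDistance) <= tolerance

cat("Rows where distances match:", sum(daily$distance_match),
    "out of", nrow(daily), "\n")
```

The share of matching rows depends on the tolerance chosen, so the 78.4% figure should be read alongside whatever threshold Appendix A uses.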
Tracker Distance Comparison: Only 15 out of 940
rows showed discrepancies between TrackerDistance and
TotalDistance columns.
Activity Minutes Validation: Analysis revealed
52.6% of days have incomplete activity data (sum of
VeryActiveMinutes, FairlyActiveMinutes,
LightlyActiveMinutes and SedentaryMinutes not
equaling 1440 daily minutes), suggesting regular device removal. Complete validation methodology available in
Appendix A.
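A minimal sketch of the minutes check, assuming a day counts as complete only when the four minute columns sum to exactly 1,440 (the sample rows are made up; the full methodology is in Appendix A):

```r
# Hypothetical rows using the dailyActivity minute columns
daily <- data.frame(
  VeryActiveMinutes    = c(25, 30, 0),
  FairlyActiveMinutes  = c(13, 11, 0),
  LightlyActiveMinutes = c(328, 181, 0),
  SedentaryMinutes     = c(728, 1218, 1440)
)

minute_cols <- c("VeryActiveMinutes", "FairlyActiveMinutes",
                 "LightlyActiveMinutes", "SedentaryMinutes")

# Total minutes the device recorded for each day
daily$total_recorded_minutes <- rowSums(daily[, minute_cols])

# A day is treated as complete when all 1,440 minutes are accounted for
daily$complete_day <- daily$total_recorded_minutes == 1440

pct_incomplete <- mean(!daily$complete_day) * 100
cat("Incomplete days:", round(pct_incomplete, 1), "%\n")
```

Note that the third row is "complete" only because it is fully sedentary, which is why the report treats non-wear as a separate check.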
Non-Wear Period Identification: Only 0.85% of records indicated potential device non-wear, confirming strong user engagement. Detailed analysis in Appendix A.
Data Cleaning Decision: Based on the validation results, I removed the redundant distance columns (TrackerDistance and the individual distance components) to simplify the analysis, while documenting the data quality considerations.
After validating the distance columns, I proceeded with comprehensive data cleaning:
# Clean and filter the daily activity data
daily_clean <- daily_activity %>%
  clean_names() %>%
  mutate(
    activity_date = as.Date(activity_date, format = "%m/%d/%Y"),
    total_recorded_minutes = very_active_minutes + fairly_active_minutes +
      lightly_active_minutes + sedentary_minutes
  ) %>%
  select(-tracker_distance, -logged_activities_distance) %>%
  filter(
    total_steps > 1000,        # Remove days with minimal activity (potential non-wear)
    total_steps < 50000,       # Remove extreme outliers
    calories > 1300,           # Remove unrealistic low calorie burns
    calories < 5000,           # Remove extreme high calorie burns
    sedentary_minutes <= 1440  # Ensure valid minute counts
  )
cat("Cleaned dataset:", nrow(daily_clean), "rows from", n_distinct(daily_clean$id), "users\n")
## Cleaned dataset: 821 rows from 33 users
Cleaning Actions:
- Dropped columns (tracker_distance, logged_activities_distance)
- Created total_recorded_minutes for device wear time analysis
I implemented the final cleaning steps to create the analysis-ready dataset:
# Apply final cleaning transformations
daily_final <- daily_clean %>%
  # Remove redundant distance columns
  select(-sedentary_active_distance, -very_active_distance,
         -moderately_active_distance, -light_active_distance) %>%
  # Remove inconsistent records (a fully sedentary day that still logged >1,000 steps),
  # treated here as non-wear
  filter(!(sedentary_minutes == 1440 & total_steps > 1000)) %>%
  # Add data quality classification
  mutate(
    data_quality_flag = case_when(
      total_recorded_minutes < 1200 ~ "Low recording time",
      total_recorded_minutes < 1400 ~ "Moderate recording time",
      abs(total_recorded_minutes - 1440) <= 5 ~ "Full day recording",
      TRUE ~ "Partial day recording"
    )
  )
cat("Final cleaned dataset:", nrow(daily_final), "rows from",
    n_distinct(daily_final$id), "users\n")
## Final cleaned dataset: 814 rows from 33 users
Data Quality Assessment:
quality_summary <- daily_final %>%
count(data_quality_flag) %>%
mutate(percentage = n / sum(n) * 100)
print("Data Recording Completeness Summary:")
## [1] "Data Recording Completeness Summary:"
## data_quality_flag n percentage
## 1 Full day recording 382 46.9287469
## 2 Low recording time 382 46.9287469
## 3 Moderate recording time 47 5.7739558
## 4 Partial day recording 3 0.3685504
Visual Insight: The visualization reveals that 46.9% of days have full recording coverage, while an equal percentage (46.9%) have low recording time (<20 hours), highlighting the importance of accounting for device wear patterns in analysis.
final_validation <- daily_final %>%
summarise(
unique_users = n_distinct(id),
total_observation_days = n(),
avg_days_per_user = round(n() / n_distinct(id), 1),
avg_steps = round(mean(total_steps), 0),
pct_meeting_goal = round(mean(total_steps >= 8000) * 100, 1)
)
print("Final Cleaned Dataset Summary:")
## [1] "Final Cleaned Dataset Summary:"
## unique_users total_observation_days avg_days_per_user avg_steps
## 1 33 814 24.7 8708
## pct_meeting_goal
## 1 52.8
Final Dataset Characteristics:
Note: The step goal was set at 8,000 instead of the traditional 10,000 to reflect more attainable targets for general wellness.
The daily activity dataset has been cleaned and validated and is ready for analysis. Key cleaning decisions included:
- Removing redundant distance columns (TrackerDistance, individual distance components)
The final dataset maintains data integrity while reflecting real-world device usage patterns, providing a solid foundation for uncovering meaningful insights about user behavior.
Following the cleaning of the daily activity dataset, I proceeded to
clean and integrate the sleep data from both the SleepDay
and MinuteSleep datasets. This process involved date
standardization, duplicate removal, and data validation to ensure
high-quality sleep metrics for analysis.
I began by examining the date ranges and standardizing datetime formats across both sleep datasets:
# Clean and format SleepDay dataset
sleep_day_clean <- sleep_day %>%
clean_names() %>%
mutate(
sleep_datetime = as.POSIXct(sleep_day, format = "%m/%d/%Y %I:%M:%S %p"),
sleep_date = as.Date(sleep_datetime),
sleep_time = format(sleep_datetime, "%H:%M:%S")
) %>%
select(-sleep_day,-sleep_time,-sleep_datetime)
cat("Sleep Day date range:", as.character(min(sleep_day_clean$sleep_date)), "to",
    as.character(max(sleep_day_clean$sleep_date)), "\n")
## Sleep Day date range: 2016-04-12 to 2016-05-12
# Clean and format MinuteSleep dataset
sleep_minutes_clean <- sleep_minutes %>%
clean_names() %>%
mutate(
minute_datetime = as.POSIXct(date, format = "%m/%d/%Y %I:%M:%S %p"),
minute_date = as.Date(minute_datetime),
minute_time = format(minute_datetime, "%H:%M:%S")
) %>%
select(-date)
cat("Sleep Minutes date range:", as.character(min(sleep_minutes_clean$minute_date)), "to",
    as.character(max(sleep_minutes_clean$minute_date)), "\n")
## Sleep Minutes date range: 2016-04-12 to 2016-05-12
## Duplicate records in SleepDay data: 3
## Duplicate records in MinuteSleep data: 543
## Duplicate analysis - identical values: 543 out of 543 duplicate groups
## Sleep sessions after session-level aggregation: 466
## Sleep sessions after final aggregation: 412
## [1] "Final Sleep Data Integration Analysis:"
## total_days days_with_both_data days_with_only_daily days_with_only_minute
## 1 416 406 4 6
## pct_days_with_both pct_perfect_sleep_match
## 1 97.59615 95.07389
Dataset Aggregation: Consolidated minute-level sleep data into daily summaries, yielding 412 daily sleep records. That is 2 more than the daily SleepDay data, which covers 410 days in the integration table above (406 shared days plus 4 daily-only days). Aggregation methodology detailed in Appendix B.
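Appendix B holds the actual aggregation code. The sketch below only illustrates the general idea in base R, under two assumptions that are not confirmed by this report: that exact duplicate rows are dropped first, and that the minute-level value column uses 1 = asleep, 2 = restless, 3 = awake. The sample rows and the short column names (asleep, restless, awake, recorded) are hypothetical.

```r
# Hypothetical minute-level records; the third row duplicates the second
sleep_minutes <- data.frame(
  id          = c(1, 1, 1, 1, 1, 2, 2),
  minute_date = as.Date(rep("2016-04-12", 7)),
  minute_time = c("23:00:00", "23:01:00", "23:01:00", "23:02:00",
                  "23:03:00", "22:30:00", "22:31:00"),
  value       = c(1, 1, 1, 2, 3, 1, 1)
)

# Drop exact duplicate rows before counting minutes
sleep_minutes <- unique(sleep_minutes)

# One indicator column per sleep state (assumed coding: 1/2/3)
sleep_minutes$asleep   <- as.integer(sleep_minutes$value == 1)
sleep_minutes$restless <- as.integer(sleep_minutes$value == 2)
sleep_minutes$awake    <- as.integer(sleep_minutes$value == 3)
sleep_minutes$recorded <- 1L

# Sum the indicators per user and date to get daily totals
daily_sleep <- aggregate(
  cbind(asleep, restless, awake, recorded) ~ id + minute_date,
  data = sleep_minutes,
  FUN  = sum
)
print(daily_sleep)
```

Grouping by calendar date is a simplification: a night that crosses midnight would be split across two dates unless sessions are grouped first, which may be what the session-level step in the output above handles.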
Integration Success: Successfully merged daily and minute-level sleep datasets with strong alignment (97.6% data overlap) and high consistency (95.1% perfect match rate). Detailed integration analysis and comparison methodology available in Appendix C.
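The overlap and match figures can be reproduced in outline with a full outer join. This is a sketch with made-up rows, not the Appendix C code; it assumes both tables share id and sleep_date keys and a total_minutes_asleep column, and that the match rate is computed over days present in both sources (which is consistent with the printed 406 of 416 days and 95.07%).

```r
# Hypothetical daily and minute-aggregated tables
daily <- data.frame(
  id = c(1, 1, 2),
  sleep_date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  total_minutes_asleep = c(400, 380, 420)
)
minute <- data.frame(
  id = c(1, 2, 2),
  sleep_date = as.Date(c("2016-04-12", "2016-04-12", "2016-04-13")),
  total_minutes_asleep = c(400, 415, 390)
)

# Full outer join keeps days that appear in either source
combined <- merge(daily, minute, by = c("id", "sleep_date"),
                  all = TRUE, suffixes = c(".x", ".y"))

has_daily  <- !is.na(combined$total_minutes_asleep.x)
has_minute <- !is.na(combined$total_minutes_asleep.y)
both       <- has_daily & has_minute

integration_summary <- c(
  total_days            = nrow(combined),
  days_with_both_data   = sum(both),
  days_with_only_daily  = sum(has_daily & !has_minute),
  days_with_only_minute = sum(!has_daily & has_minute),
  pct_days_with_both    = mean(both) * 100,
  pct_perfect_match     = mean(combined$total_minutes_asleep.x[both] ==
                                 combined$total_minutes_asleep.y[both]) * 100
)
print(integration_summary)
```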
Strategic Value: The integration confirms both datasets contribute unique value, with minute-level data providing additional sleep records and daily data serving as a reliable baseline for analysis.
After preparing both sleep datasets, I combined them using an approach that picks the best data from each source. This ensures I have the most accurate sleep information possible.
sleep_final <- sleep_day_clean %>%
full_join(sleep_minutes_daily_aggregated, by = c("id", "sleep_date" = "minute_date")) %>%
mutate(
# Use the best available sleep minutes
final_minutes_asleep = case_when(
# When I have both and they're close, use daily data (more reliable)
!is.na(total_minutes_asleep.x) & !is.na(total_minutes_asleep.y) &
abs(total_minutes_asleep.x - total_minutes_asleep.y) <= 30 ~ total_minutes_asleep.x,
# When I only have daily data
!is.na(total_minutes_asleep.x) & is.na(total_minutes_asleep.y) ~ total_minutes_asleep.x,
# When I only have minute data
is.na(total_minutes_asleep.x) & !is.na(total_minutes_asleep.y) ~ total_minutes_asleep.y,
# When they differ significantly, use daily data (more conservative)
TRUE ~ total_minutes_asleep.x
),
# Use the best available time in bed
final_time_in_bed = case_when(
!is.na(total_time_in_bed) & !is.na(total_minutes_recorded) &
abs(total_time_in_bed - total_minutes_recorded) <= 30 ~ total_time_in_bed,
!is.na(total_time_in_bed) & is.na(total_minutes_recorded) ~ total_time_in_bed,
is.na(total_time_in_bed) & !is.na(total_minutes_recorded) ~ total_minutes_recorded,
TRUE ~ total_time_in_bed
),
# Enhanced sleep metrics
sleep_efficiency_final = (final_minutes_asleep / final_time_in_bed) * 100,
restless_percentage = ifelse(!is.na(total_minutes_restless),
(total_minutes_restless / final_time_in_bed) * 100, NA),
# Data source tracking
data_source = case_when(
is.na(total_minutes_asleep.y) ~ "Daily only", # No minute data available
is.na(total_minutes_asleep.x) ~ "Minute only", # No daily data available
abs(total_minutes_asleep.x - total_minutes_asleep.y) <= 10 ~ "Daily (verified)", # Both datasets agree (within 10 minutes)
TRUE ~ "Combined" # Both datasets disagree → used daily data as a fallback
),
# Sleep quality categories
sleep_duration_hours = final_minutes_asleep / 60,
sleep_quality = case_when(
sleep_efficiency_final >= 85 ~ "Excellent",
sleep_efficiency_final >= 70 ~ "Good",
sleep_efficiency_final >= 50 ~ "Fair",
TRUE ~ "Poor"
)
)
After combining the datasets, I looked at sleep quality to understand how well users sleep. Sleep quality measures how much time in bed is actually spent sleeping.
Key Findings: 91.6% of sleep records show excellent sleep quality, meaning 85% or more of the time in bed was spent asleep.
Sleep Quality/Efficiency Analysis Note: There are two ways to calculate sleep quality metrics:
We use the second method (record counting) from now on for consistency with other categorical analysis and because it more intuitively represents what portion of sleep sessions were high quality.
To finish preparing the sleep data, I removed unrealistic records and kept only the most useful columns for analysis.
sleep_final <- sleep_final %>%
  filter(
    # Reasonable sleep values
    final_minutes_asleep >= 180,    # At least 3 hours of sleep
    final_minutes_asleep <= 720,    # No more than 12 hours asleep
    final_time_in_bed <= 960,       # No more than 16 hours in bed
    sleep_efficiency_final <= 100   # Efficiency can't exceed 100%
  ) %>%
  # Select final columns
  select(
    id, sleep_date, first_sleep_start,
    # Core sleep metrics
    total_minutes_asleep = final_minutes_asleep,
    total_time_in_bed = final_time_in_bed,
    sleep_efficiency = sleep_efficiency_final,
    # Enhanced metrics from minute data
    total_minutes_restless,
    total_minutes_awake,
    restless_percentage,
    total_sleep_sessions,
    # Sleep categories
    sleep_duration_hours,
    sleep_quality,
    # Data quality
    data_source,
    total_sleep_records
  )
cat("Sleep Records After Final Cleaning:", nrow(sleep_final), "\n")
## Sleep Records After Final Cleaning: 382
## Unique Users After Cleaning: 19
Final Sleep Dataset Composition
This analysis addresses four key business questions to inform Bellabeat’s strategy:
Methodology: Using cleaned data (814 activity days from 33 users and 382 sleep records from 19 users) with correlation analysis, user segmentation, and pattern recognition applied through Bellabeat’s women’s wellness lens.
Daily_Final Dataset Analysis
I began by examining the daily activity patterns to establish baseline metrics for user behavior.
activity_summary <- daily_final %>%
summarise(
users = n_distinct(id),
avg_daily_steps = mean(total_steps),
avg_daily_calories = mean(calories),
avg_sedentary_hours = mean(sedentary_minutes) / 60,
pct_meeting_goal = mean(total_steps >= 8000) * 100
)
print("Bellabeat Key Metrics:")
## [1] "Bellabeat Key Metrics:"
## users avg_daily_steps avg_daily_calories avg_sedentary_hours pct_meeting_goal
## 1 33 8708.053 2405.07 15.83493 52.82555
This initial summary revealed several key patterns that guided my deeper analysis. I focused on answering these business-critical questions:
The following sections explore these questions to provide actionable insights for Bellabeat.
How Do Steps and Activity Minutes Correlate With Distance Covered?
Understanding how different activity metrics relate helps Bellabeat design better tracking features and user goals.
## Correlation between steps and distance: 0.981
## `geom_smooth()` using formula = 'y ~ x'
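Key Findings: The near-perfect correlation (0.981) shows that steps and distance move almost in lockstep. This is expected, since trackers generally estimate distance from step counts and stride length, so the two metrics carry largely the same information. Most users take fewer than 15,000 steps daily, with a small group showing significantly higher activity levels.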
Analyzing users' steps per kilometer helps Bellabeat set realistic distance-based goals.
daily_final <- daily_final %>%
mutate(steps_per_km = total_steps / total_distance)
# Summary of steps per km
steps_km_summary <- daily_final %>%
summarize(
avg_steps_per_km = mean(steps_per_km),
median_steps_per_km = median(steps_per_km),
sd_steps_per_km = sd(steps_per_km)
)
print(steps_km_summary)
## avg_steps_per_km median_steps_per_km sd_steps_per_km
## 1 1422.761 1465.387 114.5059
Business Application: The average of 1,423 steps per kilometer helps Bellabeat estimate that 8,000 steps equals approximately 5.6 km - an appropriate distance goal for users to cover daily.
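The conversion behind that estimate is a single division. A small sketch, where steps_to_km is a hypothetical helper name and the average comes from the summary printed above:

```r
# Average steps per kilometer, taken from steps_km_summary above
avg_steps_per_km <- 1422.761

# Hypothetical helper: convert a step goal into an approximate distance in km
steps_to_km <- function(steps, steps_per_km = avg_steps_per_km) {
  steps / steps_per_km
}

goal_km <- round(steps_to_km(c(8000, 10000)), 1)
print(goal_km)
```

With this average, 8,000 steps works out to about 5.6 km and 10,000 steps to about 7.0 km. Using the median (1,465 steps per km) instead would give slightly shorter distances.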
To understand how different activity levels contribute to distance covered, I analyzed the relationship between various activity intensities and total distance. This helps Bellabeat understand which types of movement most impact overall activity metrics.
To simplify the analysis, I combined the three active minute categories (very_active_minutes, fairly_active_minutes, lightly_active_minutes) into a single active_minutes metric. This approach provides a clearer picture of overall activity time, especially since sedentary_minutes naturally shows a negative relationship with distance covered.
daily_final <- daily_final %>%
mutate(active_minutes = very_active_minutes + fairly_active_minutes + lightly_active_minutes)
minutes_distance_cor <- cor(daily_final$active_minutes, daily_final$total_distance)
cat("Correlation between active minutes and distance:", round(minutes_distance_cor, 3), "\n")
## Correlation between active minutes and distance: 0.611
## `geom_smooth()` using formula = 'y ~ x'
Notes: The moderate positive correlation (0.611) shows that increased activity time generally leads to greater distance covered, though activity intensity also plays a role.
Strategic Recommendation: Bellabeat can motivate users by emphasizing that consistent daily activity - regardless of intensity - contributes significantly to overall distance goals.
What percentage of users consistently meet activity goals (8,000 steps)?
Understanding how many users meet activity targets (8,000 steps for this analysis) helps Bellabeat design appropriate goal structures and motivation systems.
goal_analysis <- daily_final %>%
summarize(
pct_meeting_8000_steps = mean(total_steps >= 8000) * 100,
pct_meeting_10000_steps = mean(total_steps >= 10000) * 100,
pct_less_than_8000_steps = mean(total_steps < 8000) * 100
)
print("Step Goal Achievement:")
## [1] "Step Goal Achievement:"
## pct_meeting_8000_steps pct_meeting_10000_steps pct_less_than_8000_steps
## 1 52.82555 36.97789 47.17445
Notes: While over half of all days (52.8%) meet the 8,000-step goal, there’s a significant drop to 37% for the traditional 10,000-step target, indicating the 8,000-step goal may be more achievable for most users.
Understanding how consistently users meet the 8,000-step daily goal helps Bellabeat target users with appropriate motivation strategies. By analyzing the percentage of days each user achieves this target, we can identify distinct behavioral patterns that inform personalized engagement approaches.
user_consistency <- daily_final %>%
group_by(id) %>%
summarize(
days_above_8000 = mean(total_steps >= 8000) * 100,
consistency_category = case_when(
days_above_8000 >= 80 ~ "Highly Consistent",
days_above_8000 >= 50 ~ "Moderately Consistent",
days_above_8000 >= 20 ~ "Occasionally Active",
TRUE ~ "Rarely Active"
)
)
consistency_summary <- user_consistency %>%
count(consistency_category) %>%
mutate(percentage = n / sum(n) * 100)
print("User Consistency Patterns:")
## [1] "User Consistency Patterns:"
## # A tibble: 4 × 3
## consistency_category n percentage
## <chr> <int> <dbl>
## 1 Highly Consistent 8 24.2
## 2 Moderately Consistent 9 27.3
## 3 Occasionally Active 8 24.2
## 4 Rarely Active 8 24.2
Key Finding: The data reveals four distinct user groups based on their consistency in achieving the 8,000-step daily target:
Evidence-Based Strategic Recommendations
Targeted User Engagement
Women-Specific Differentiators
What are the relationships between steps, distance, and calories burned?
Understanding how activity metrics translate to calories burned helps Bellabeat design effective fitness tracking and goal-setting features for users.
To quantify these relationships, I analyzed the correlation between different activity measures and calorie expenditure:
calorie_correlations <- daily_final %>%
summarize(
steps_calories_cor = cor(total_steps, calories),
distance_calories_cor = cor(total_distance, calories),
active_minutes_calories_cor = cor(active_minutes, calories)
)
print("Calorie Burn Correlations:")
## [1] "Calorie Burn Correlations:"
## steps_calories_cor distance_calories_cor active_minutes_calories_cor
## 1 0.5275046 0.6008586 0.3636684
Key Findings: Distance shows the strongest relationship with calorie burn (0.60), followed by steps (0.53). Since active minutes showed a weaker correlation (0.36), I focused the analysis on steps and distance for clearer insights.
To understand how step counts translate to energy expenditure, I categorized each day by step count and calculated average calories burned for each group:
steps_calories <- daily_final %>%
mutate(step_category = case_when(
total_steps < 5000 ~ "Low (<5,000)",
total_steps >= 5000 & total_steps < 8000 ~ "Medium (5,000-8,000)",
total_steps >= 8000 & total_steps < 10000 ~ "High (8,000-10,000)",
total_steps >= 10000 ~ "Very High (10,000+)"
)) %>%
group_by(step_category) %>%
summarize(
avg_calories = mean(calories),
n_days = n()
) %>%
mutate(step_category = factor(step_category,
levels = c("Low (<5,000)", "Medium (5,000-8,000)",
"High (8,000-10,000)", "Very High (10,000+)")))
print(steps_calories)
## # A tibble: 4 × 3
## step_category avg_calories n_days
## <fct> <dbl> <int>
## 1 High (8,000-10,000) 2490. 129
## 2 Low (<5,000) 1943. 182
## 3 Medium (5,000-8,000) 2266. 202
## 4 Very High (10,000+) 2741. 301
Note: Each step-level increase brings significant calorie burn improvements, with the most dramatic jump occurring when users progress from low to medium activity levels.
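The size of each jump can be checked directly from the group averages. This sketch uses the rounded averages printed in the table above, so the percentages are approximate:

```r
# Average calories per step category, as printed above (rounded)
avg_calories <- c(Low = 1943, Medium = 2266, High = 2490, "Very High" = 2741)

# Absolute increase when moving up one category
increase <- diff(avg_calories)

# Relative increase compared with the category below
pct_increase <- round(increase / head(avg_calories, -1) * 100, 1)

print(increase)
print(pct_increase)
```

The low-to-medium step is the largest in both absolute terms (about 323 calories) and relative terms (about 17%), with the next two steps adding roughly 224 and 251 calories.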
Since distance showed the strongest correlation with calories, I analyzed how different distance ranges impact energy expenditure:
distance_calories <- daily_final %>%
mutate(distance_category = case_when(
total_distance < 3.5 ~ "Short (<3.5 km)",
total_distance >= 3.5 & total_distance < 5.5 ~ "Medium (3.5-5.5 km)",
total_distance >= 5.5 & total_distance < 7.0 ~ "Long (5.5-7.0 km)",
total_distance >= 7.0 ~ "Very Long (7.0+ km)"
)) %>%
group_by(distance_category) %>%
summarize(
avg_calories = mean(calories),
n_days = n()
) %>%
mutate(distance_category = factor(distance_category,
levels = c("Short (<3.5 km)", "Medium (3.5-5.5 km)",
"Long (5.5-7.0 km)", "Very Long (7.0+ km)")))
print(distance_calories)
## # A tibble: 4 × 3
## distance_category avg_calories n_days
## <fct> <dbl> <int>
## 1 Long (5.5-7.0 km) 2410. 139
## 2 Medium (3.5-5.5 km) 2206. 186
## 3 Short (<3.5 km) 1919. 187
## 4 Very Long (7.0+ km) 2826. 302
Note: Distance is the strongest correlate of calorie expenditure among the metrics examined. Days with 7+ km average about 47% more calories burned (2,826 vs. 1,919) than days with less than 3.5 km.
Strategic Recommendations:
When and Why Do Users Stop Wearing Their Devices?
Analyzing device wear time provides crucial insights into user
engagement and helps Bellabeat optimize product design and marketing
strategies. I examined how long users typically wear their devices by
analyzing the total_recorded_minutes metric across all
observation days.
wear_time_analysis <- daily_final %>%
summarize(
avg_wear_hours = mean(total_recorded_minutes) / 60,
median_wear_hours = median(total_recorded_minutes) / 60,
pct_full_day = mean(total_recorded_minutes >= 1380) * 100, # 23+ hours
pct_most_of_day = mean(total_recorded_minutes >= 1080) * 100, # 18+ hours
pct_half_day = mean(total_recorded_minutes >= 720) * 100, # 12+ hours
pct_low_wear = mean(total_recorded_minutes < 720) * 100 # <12 hours
)
print("Device Wear Time Patterns:")
## [1] "Device Wear Time Patterns:"
## avg_wear_hours median_wear_hours pct_full_day pct_most_of_day pct_half_day
## 1 20.1794 22.35 48.0344 57.49386 98.5258
## pct_low_wear
## 1 1.474201
Initial Insight: The data shows strong device engagement. On nearly all observation days (98.5%) the device recorded at least 12 hours, and on over half (57.5%) it recorded at least 18 hours.
To provide clearer segmentation, I created distinct wear time categories that prevent double-counting and offer a comprehensive view of user behavior patterns:
wear_categories <- daily_final %>%
mutate(wear_category = case_when(
total_recorded_minutes >= 1380 ~ "Full Day (23+ hrs)",
total_recorded_minutes >= 1080 ~ "Most of Day (18-23 hrs)",
total_recorded_minutes >= 720 ~ "Half Day (12-18 hrs)",
total_recorded_minutes >= 360 ~ "Partial Day (6-12 hrs)",
TRUE ~ "Low Wear (<6 hrs)"
)) %>%
count(wear_category) %>%
mutate(percentage = n / sum(n) * 100) %>%
mutate(wear_category = factor(wear_category,
levels = c("Full Day (23+ hrs)", "Most of Day (18-23 hrs)",
"Half Day (12-18 hrs)", "Partial Day (6-12 hrs)",
"Low Wear (<6 hrs)")))
print(wear_categories)
## wear_category n percentage
## 1 Full Day (23+ hrs) 391 48.034398
## 2 Half Day (12-18 hrs) 334 41.031941
## 3 Most of Day (18-23 hrs) 77 9.459459
## 4 Partial Day (6-12 hrs) 12 1.474201
Key Findings: Device usage patterns reveal two dominant wear-time behaviors:
Business Implication: This segmentation suggests different charging habits and feature usage, informing battery optimization and notification strategies.
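Notes: No days fall in the low wear category (<6 hours), which suggests strong user commitment to device usage and that the fitness trackers integrate into daily routines without causing significant wear fatigue. This should be read with some caution, because the cleaning step already removed days with 1,000 or fewer steps, so some very short wear days may have been filtered out before this analysis.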
Strategic Implications: This high engagement level provides Bellabeat with reliable data collection and presents opportunities to enhance features for both casual and power users through differentiated battery optimization and notification strategies.
How Does User Activity Vary by Day of the Week?
This analysis examines how activity levels fluctuate throughout the week, helping Bellabeat identify optimal timing for challenges and motivational content. I analyzed weekly patterns across three key metrics: steps, calories burned, and distance covered.
# Adding the days of the week to the table
daily_final <- daily_final %>%
mutate(day_of_week = weekdays(activity_date),
day_of_week = factor(day_of_week,
levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday")))
# Analyze activity by the days of week
weekly_patterns <- daily_final %>%
group_by(day_of_week) %>%
summarize(
avg_steps = mean(total_steps),
avg_calories = mean(calories),
avg_distance = mean(total_distance),
n_days = n()
)
print(weekly_patterns)
## # A tibble: 7 × 5
## day_of_week avg_steps avg_calories avg_distance n_days
## <fct> <dbl> <dbl> <dbl> <int>
## 1 Monday 8844. 2403. 6.30 104
## 2 Tuesday 9191. 2458. 6.59 131
## 3 Wednesday 8480. 2380. 6.16 133
## 4 Thursday 8822. 2380. 6.33 121
## 5 Friday 8208. 2393. 5.85 114
## 6 Saturday 9206. 2453. 6.61 109
## 7 Sunday 8138. 2365. 5.90 102
Daily Step Pattern
Key Insight: Saturday and Tuesday show the highest step counts, indicating these are peak activity days for users. The variation in daily step counts reveals clear weekly rhythms that Bellabeat can leverage for targeted engagement.
Note: The number of observation days varies by weekday, which may influence these patterns. However, the consistent trends provide valuable insights for timing challenges and motivation strategies.
Daily Calories Pattern
Pattern Analysis: Calorie expenditure follows a similar weekly pattern to step counts, with Tuesday showing slightly higher energy expenditure than Saturday. This consistency across metrics strengthens the reliability of observed weekly trends.
Daily Distance Pattern
Consistent Weekly Rhythm: Steps and distance show the same pattern, with Wednesday, Friday, and Sunday recording the lowest activity levels and Monday and Thursday falling between the highest and lowest days. Calories mostly agree: Sunday is lowest, while Friday (2,393) sits slightly above Wednesday and Thursday (both about 2,380). Because steps, distance, and calories are strongly correlated with one another, this agreement supports a weekly rhythm but does not by itself rule out random fluctuation in a sample of 33 users.
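One way to compare the weekday ordering across the three metrics is to rank the days on each one. The sketch below uses the rounded averages from the weekly_patterns output above, so ties (such as Wednesday and Thursday on calories) may resolve differently at full precision:

```r
# Weekday averages as printed in weekly_patterns (rounded)
weekly <- data.frame(
  day = c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday"),
  avg_steps    = c(8844, 9191, 8480, 8822, 8208, 9206, 8138),
  avg_calories = c(2403, 2458, 2380, 2380, 2393, 2453, 2365),
  avg_distance = c(6.30, 6.59, 6.16, 6.33, 5.85, 6.61, 5.90)
)

# Rank 1 = most active day; tied values share an averaged rank
weekly$steps_rank    <- rank(-weekly$avg_steps)
weekly$calories_rank <- rank(-weekly$avg_calories)
weekly$distance_rank <- rank(-weekly$avg_distance)

print(weekly[order(weekly$steps_rank),
             c("day", "steps_rank", "calories_rank", "distance_rank")])
```

On these values Saturday and Tuesday hold the top two ranks on every metric, and Wednesday, Friday, and Sunday hold the bottom three on steps and distance, while Friday ranks fourth on calories.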
Strategic Implications:
Sleep_Final Dataset Analysis
Building on the activity analysis, I examined sleep patterns to understand user rest and recovery behaviors. This analysis provides crucial insights for Bellabeat’s sleep tracking features and wellness recommendations.
sleep_summary <- sleep_final %>%
summarise(
users = n_distinct(id),
avg_sleep_hours = mean(total_minutes_asleep) / 60,
avg_sleep_efficiency = mean(sleep_efficiency),
pct_excellent_sleep = mean(sleep_quality == "Excellent") * 100
)
print(sleep_summary)
## users avg_sleep_hours avg_sleep_efficiency pct_excellent_sleep
## 1 19 7.212871 91.86517 92.67016
This baseline summary revealed key sleep patterns that guided my deeper investigation. I focused on answering these critical questions about user sleep behaviors:
Understanding Sleep Adequacy and Quality
I began by analyzing whether users meet recommended sleep guidelines (7-8 hours of sleep) and how efficiently they sleep while in bed.
sleep_metrics <- sleep_final %>%
summarize(
# Sleep duration
avg_sleep_minutes = mean(total_minutes_asleep),
avg_sleep_hours = mean(sleep_duration_hours),
median_sleep_hours = median(sleep_duration_hours),
# Sleep efficiency (time asleep vs time in bed)
avg_sleep_efficiency = mean(sleep_efficiency),
median_sleep_efficiency = median(sleep_efficiency),
# Comparison to recommended ranges
pct_meeting_7_hours = mean(sleep_duration_hours >= 7) * 100, # 7+ hours
pct_meeting_8_hours = mean(sleep_duration_hours >= 8) * 100, # 8+ hours
pct_below_7_hours = mean(sleep_duration_hours < 7) * 100, # <7 hours
pct_above_8_hours = mean(sleep_duration_hours > 8) * 100 # >8 hours
)
print("Sleep Duration and Efficiency Metrics:")
## [1] "Sleep Duration and Efficiency Metrics:"
## avg_sleep_minutes avg_sleep_hours median_sleep_hours avg_sleep_efficiency
## 1 432.7723 7.212871 7.291667 91.86517
## median_sleep_efficiency pct_meeting_7_hours pct_meeting_8_hours
## 1 94.31843 58.37696 28.79581
## pct_below_7_hours pct_above_8_hours
## 1 41.62304 28.27225
Note: These percentages use inclusive thresholds (≥7 hours, ≥8 hours) and represent all 382 sleep records in our cleaned dataset.
We see that 58.38% of days meet or exceed the 7-hour minimum recommendation, while 28.80% achieve 8+ hours. Additionally, 41.62% fall below 7 hours and 28.27% exceed 8 hours of sleep.
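Important Context: These categories are not all mutually exclusive. The 7+ hour and 8+ hour groups overlap, because every day with 8 or more hours also counts toward the 7-hour figure. By contrast, the 7+ hour and below-7-hour groups are complementary and sum to 100%, and the small gap between 8+ hours (28.80%) and more than 8 hours (28.27%) reflects days with exactly 8 hours of sleep.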
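How the threshold groups relate to each other can be shown with a handful of made-up sleep durations (the values below are illustrative, not from the dataset):

```r
# Hypothetical sleep durations in hours
sleep_hours <- c(5.5, 6.8, 7.0, 7.4, 8.0, 8.6)

pct <- function(condition) mean(condition) * 100

pct_7_plus  <- pct(sleep_hours >= 7)
pct_8_plus  <- pct(sleep_hours >= 8)
pct_below_7 <- pct(sleep_hours < 7)
pct_7_to_8  <- pct(sleep_hours >= 7 & sleep_hours < 8)

# Complementary groups sum to 100%
stopifnot(isTRUE(all.equal(pct_7_plus + pct_below_7, 100)))

# Overlapping groups: the 7-8 hour share is the difference between them
stopifnot(isTRUE(all.equal(pct_7_plus - pct_8_plus, pct_7_to_8)))

# Same subtraction with the report's figures
report_7_to_8 <- round(58.37696 - 28.79581, 1)
print(report_7_to_8)
```

Applied to the report's figures, the subtraction gives 29.6% of records in the 7-8 hour range.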
To better visualize sleep patterns with non-overlapping categories, I classified sleep duration into relevant ranges:
sleep_duration_categories <- sleep_final %>%
mutate(sleep_category = case_when(
total_minutes_asleep < 360 ~ "Very Short (<6 hrs)",
total_minutes_asleep >= 360 & total_minutes_asleep < 420 ~ "Short (6-7 hrs)",
total_minutes_asleep >= 420 & total_minutes_asleep < 480 ~ "Recommended (7-8 hrs)",
total_minutes_asleep >= 480 & total_minutes_asleep < 540 ~ "Long (8-9 hrs)",
total_minutes_asleep >= 540 ~ "Very Long (9+ hrs)"
)) %>%
count(sleep_category) %>%
mutate(percentage = n / sum(n) * 100) %>%
mutate(sleep_category = factor(sleep_category,
levels = c("Very Short (<6 hrs)", "Short (6-7 hrs)",
"Recommended (7-8 hrs)", "Long (8-9 hrs)",
"Very Long (9+ hrs)")))
Sleep Duration Distribution:
Key Insight: While most users meet basic sleep requirements, only 29.6% achieve the optimal 7-8 hour range, indicating significant opportunity for sleep duration improvement.
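The non-overlapping ranges used above can also be expressed with base R's `cut()`. This is a minimal sketch rather than the code used in the analysis, and the `minutes_asleep` values are hypothetical, chosen to sit on the category boundaries.

```r
# Hypothetical minutes-asleep values placed on the category boundaries
minutes_asleep <- c(359, 360, 419, 420, 479, 480, 539, 540)

# right = FALSE makes each interval closed on the left: [360, 420), [420, 480), ...
sleep_category <- cut(
  minutes_asleep,
  breaks = c(-Inf, 360, 420, 480, 540, Inf),
  labels = c("Very Short (<6 hrs)", "Short (6-7 hrs)",
             "Recommended (7-8 hrs)", "Long (8-9 hrs)",
             "Very Long (9+ hrs)"),
  right = FALSE
)

# Share of records per category; shares sum to 100 because the bins do not overlap
category_pct <- prop.table(table(sleep_category)) * 100
print(category_pct)
```

Because `cut()` returns a factor with levels in the order given, the table prints in duration order without a separate `factor()` step.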
Sleep Efficiency Analysis
After analyzing sleep duration, I examined sleep efficiency because time in bed doesn’t equal quality sleep.
Sleep efficiency - the percentage of time in bed actually spent asleep - provides deeper insights into sleep quality beyond simple duration metrics. While sleep duration tells us how long users sleep, efficiency reveals how well they sleep during that time.
sleep_efficiency_analysis <- sleep_final %>%
mutate(sleep_efficiency = total_minutes_asleep / total_time_in_bed * 100) %>%
filter(sleep_efficiency <= 100) %>% # Remove unrealistic values >100%
mutate(efficiency_category = case_when(
sleep_efficiency >= 85 ~ "Excellent (85%+)",
sleep_efficiency >= 75 ~ "Good (75-85%)",
sleep_efficiency >= 65 ~ "Fair (65-75%)",
TRUE ~ "Poor (<65%)"
)) %>%
count(efficiency_category) %>%
mutate(percentage = n / sum(n) * 100) %>%
mutate(efficiency_category = factor(efficiency_category,
levels = c("Excellent (85%+)", "Good (75-85%)",
"Fair (65-75%)", "Poor (<65%)")))
print("Sleep Efficiency Categories:")
## [1] "Sleep Efficiency Categories:"
## efficiency_category n percentage
## 1 Excellent (85%+) 354 92.6701571
## 2 Fair (65-75%) 7 1.8324607
## 3 Good (75-85%) 3 0.7853403
## 4 Poor (<65%) 18 4.7120419
Key Insight: 92.7% of analyzed days achieve excellent sleep quality (≥85% efficiency), revealing a crucial distinction for Bellabeat’s product strategy:
Strategic Implication: Bellabeat should focus on helping users prioritize sufficient sleep time rather than improving sleep quality.
Strategic Sleep Recommendations:
Sleep Efficiency Analysis Note: There are two ways to calculate sleep quality metrics:
We use the second method (record counting) for consistency with other categorical analysis and because it more intuitively represents what portion of sleep sessions were high quality.
Understanding Sleep Disruption Patterns
Restless sleep provides insights into sleep quality beyond simple duration metrics, revealing potential issues with sleep maintenance and depth.
restlessness_analysis <- sleep_final %>%
mutate(
total_sleep_time = total_minutes_asleep,
restlessness_percentage = (total_minutes_restless / total_time_in_bed) * 100
) %>%
filter(restlessness_percentage <= 100) %>% # Remove unrealistic values
mutate(restlessness_category = case_when(
restlessness_percentage < 10 ~ "Excellent (<10%)",
restlessness_percentage >= 10 & restlessness_percentage < 20 ~ "Good (10-20%)",
restlessness_percentage >= 20 & restlessness_percentage < 30 ~ "Fair (20-30%)",
restlessness_percentage >= 30 ~ "Poor (30%+)"
))
# Summarize restlessness distribution
restlessness_summary <- restlessness_analysis %>%
count(restlessness_category) %>%
mutate(percentage = n / sum(n) * 100) %>%
mutate(restlessness_category = factor(restlessness_category,
levels = c("Excellent (<10%)", "Good (10-20%)",
"Fair (20-30%)", "Poor (30%+)")))
print("Sleep Restlessness Distribution:")
## [1] "Sleep Restlessness Distribution:"
## restlessness_category n percentage
## 1 Excellent (<10%) 335 87.696335
## 2 Fair (20-30%) 5 1.308901
## 3 Good (10-20%) 20 5.235602
## 4 Poor (30%+) 22 5.759162
Sleep Restlessness Insight:
Business Implication: While most users enjoy restful sleep, a small but significant segment needs targeted sleep quality improvement features.
Note: Sleep efficiency and restlessness use different categorization criteria, which explains the variation in excellent sleep percentages between analyses.
Understanding Individual Sleep Behavior Stability
To address the question of sleep consistency, I analyzed how consistently individual users maintain their sleep patterns night after night. This reveals whether users have stable sleep routines or experience significant night-to-night variations.
user_sleep_consistency <- sleep_final %>%
group_by(id) %>%
summarize(
avg_sleep_hours = mean(sleep_duration_hours),
sleep_std_dev = sd(sleep_duration_hours),
consistency_score = (1 - (sleep_std_dev / avg_sleep_hours)) * 100,
consistency_category = case_when(
consistency_score >= 90 ~ "Highly Consistent",
consistency_score >= 80 ~ "Moderately Consistent",
consistency_score >= 70 ~ "Somewhat Consistent",
TRUE ~ "Inconsistent"
)
)
# Summarize consistency patterns
consistency_summary <- user_sleep_consistency %>%
count(consistency_category) %>%
mutate(percentage = n / sum(n) * 100)
print("Sleep Duration Consistency Patterns:")
## [1] "Sleep Duration Consistency Patterns:"
## # A tibble: 3 × 3
## consistency_category n percentage
## <chr> <int> <dbl>
## 1 Highly Consistent 2 10.5
## 2 Moderately Consistent 11 57.9
## 3 Somewhat Consistent 6 31.6
Consistency Analysis Results: The data reveals three distinct user segments based on sleep pattern stability:
Strategic Implications:
This analysis shows that while most users (68.4%) maintain moderate to high sleep consistency, nearly one-third would benefit from Bellabeat’s sleep routine stabilization features.
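The consistency score above is 100 minus the coefficient of variation (standard deviation divided by mean) expressed as a percentage. A minimal worked sketch follows. The two sleepers and their nightly hours are hypothetical, and the helper names `consistency_score` and `consistency_label` are illustrative only.

```r
# Score = (1 - sd / mean) * 100, the same formula as in the summarize() call above
consistency_score <- function(hours) {
  (1 - sd(hours) / mean(hours)) * 100
}

# Same thresholds as the case_when() above
consistency_label <- function(score) {
  if (score >= 90) {
    "Highly Consistent"
  } else if (score >= 80) {
    "Moderately Consistent"
  } else if (score >= 70) {
    "Somewhat Consistent"
  } else {
    "Inconsistent"
  }
}

# Hypothetical nightly sleep durations in hours
steady_sleeper   <- c(7.0, 7.2, 6.9, 7.1)
variable_sleeper <- c(5.0, 9.0, 6.0, 8.0)

steady_score   <- consistency_score(steady_sleeper)
variable_score <- consistency_score(variable_sleeper)

print(data.frame(
  sleeper = c("steady", "variable"),
  score   = round(c(steady_score, variable_score), 1),
  label   = c(consistency_label(steady_score), consistency_label(variable_score))
))
```

Because the score is relative to the mean, the same night-to-night spread counts as less consistent for a user who sleeps fewer hours on average.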
Understanding How Bedtime Affects Sleep Quality
Analyzing sleep timing provides insights into optimal bedtime windows and helps Bellabeat develop personalized sleep schedule recommendations.
sleep_timing_analysis <- sleep_final %>%
filter(!is.na(first_sleep_start)) %>% # Remove records with missing start times
mutate(
# Extract hour from sleep start time
sleep_start_hour = hour(first_sleep_start),
sleep_timing_category = case_when(
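# Caveat: hour() returns 0-23, so after-midnight start times (e.g., 1 AM) also fall in this first bucket
sleep_start_hour < 21 ~ "Early (Before 9 PM)",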
sleep_start_hour < 23 ~ "Average (9-11 PM)",
sleep_start_hour >= 23 ~ "Late (After 11 PM)"
),
sleep_timing_category = factor(sleep_timing_category,
levels = c("Early (Before 9 PM)", "Average (9-11 PM)", "Late (After 11 PM)"))
) %>%
group_by(sleep_timing_category) %>%
summarize(
avg_sleep_duration = mean(sleep_duration_hours),
avg_efficiency = mean(sleep_efficiency),
n_records = n()
)
print("Sleep Timing Analysis:")
## [1] "Sleep Timing Analysis:"
## # A tibble: 3 × 4
## sleep_timing_category avg_sleep_duration avg_efficiency n_records
## <fct> <dbl> <dbl> <int>
## 1 Early (Before 9 PM) 6.70 89.3 162
## 2 Average (9-11 PM) 7.59 93.3 149
## 3 Late (After 11 PM) 7.59 94.7 71
Sleep Timing Insights: The analysis reveals clear patterns between bedtime and sleep outcomes:
Strategic Recommendations:
Note: The filter removes any records with missing start times. The three categories above total 382 records (162 + 149 + 71), which matches the full cleaned sleep dataset.
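One caveat about the timing categories: a rule of the form `hour < 21` also matches hours 0 through 20, so a sleep start at 1 AM lands in "Early (Before 9 PM)". The sketch below shows one possible alternative, assuming that start hours before noon represent after-midnight bedtimes. The `classify_bedtime` helper, its `morning_cutoff` parameter, and the sample timestamps are hypothetical and not part of the original analysis.

```r
# morning_cutoff is an assumption: start hours before it are treated as after-midnight bedtimes
classify_bedtime <- function(hour, morning_cutoff = 12) {
  ifelse(hour >= 23 | hour < morning_cutoff, "Late (After 11 PM)",
         ifelse(hour >= 21, "Average (9-11 PM)", "Early (Before 9 PM)"))
}

# Hypothetical sleep start times
start_times <- as.POSIXct(
  c("2016-04-12 20:30:00", "2016-04-12 22:15:00",
    "2016-04-12 23:40:00", "2016-04-13 01:05:00"),
  tz = "UTC"
)

# Extract the hour (0-23) with base R instead of lubridate::hour()
start_hour <- as.integer(format(start_times, "%H"))

bedtime_category <- classify_bedtime(start_hour)
print(data.frame(start_hour, bedtime_category))
```

Whether noon is the right cutoff depends on how daytime naps should be treated, so any such rule should be checked against the distribution of start hours in the real data.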
Identifying Rhythms in Sleep Behavior
Understanding how sleep patterns vary throughout the week helps Bellabeat time sleep-focused content and challenges effectively.
sleep_final <- sleep_final %>%
mutate(day_of_week = weekdays(sleep_date),
day_of_week = factor(day_of_week,
levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday")))
# Analyze sleep by day of the week
weekly_sleep_patterns <- sleep_final %>%
group_by(day_of_week) %>%
summarize(
avg_sleep_duration = mean(sleep_duration_hours),
avg_sleep_efficiency = mean(sleep_efficiency),
avg_restlessness = mean(restless_percentage),
n_days = n()
)
print(weekly_sleep_patterns)
## # A tibble: 7 × 5
## day_of_week avg_sleep_duration avg_sleep_efficiency avg_restlessness n_days
## <fct> <dbl> <dbl> <dbl> <int>
## 1 Monday 7.11 92.0 7.04 43
## 2 Tuesday 6.73 91.1 7.94 63
## 3 Wednesday 7.32 92.8 6.24 65
## 4 Thursday 7.02 92.2 7.15 60
## 5 Friday 7.05 92.2 6.85 52
## 6 Saturday 7.32 91.6 7.50 50
## 7 Sunday 8.09 91.0 8.08 49
Before analyzing daily patterns, I examined the relationships between key sleep metrics:
sleep_activity_correlation <- sleep_final %>%
summarize(
duration_quality_cor = cor(total_minutes_asleep, total_time_in_bed ),
duration_restless_cor = cor(total_time_in_bed, total_minutes_restless),
duration_efficiency_cor = cor(total_time_in_bed, sleep_efficiency)
)
print(sleep_activity_correlation)
## duration_quality_cor duration_restless_cor duration_efficiency_cor
## 1 0.8912864 0.134657 0.008248568
Note: There is a strong positive correlation (0.89) between total time asleep and total time in bed: as time in bed increases, time asleep increases. Total time in bed shows little correlation with total minutes restless (0.13) and essentially none with sleep efficiency (0.01).
Sleep Duration by Weekday
The analysis reveals moderate variation in sleep duration throughout the week:
Strategic Insight: While sleep duration fluctuates, the variation is relatively modest, suggesting users maintain somewhat consistent sleep schedules regardless of weekday.
Sleep Restlessness by Weekday
Restlessness shows the following weekly pattern:
Sleep Efficiency by Weekday
Sleep efficiency demonstrates remarkable consistency:
Strategic Advantage: This reliability allows Bellabeat to build sleep features that work consistently, while personalizing for individual cycle-related sleep pattern variations.
This analysis demonstrates that while working with third-party fitness data has limitations, valuable insights emerge when viewed through Bellabeat’s women’s wellness lens. The key differentiator lies not in the raw activity metrics, but in how Bellabeat can:
By translating these general fitness patterns into women-centric applications, Bellabeat can strengthen its market position through data-informed personalization that truly understands and adapts to female wellness needs across different life stages and consistency levels.
Purpose: Comprehensive data validation to ensure analysis reliability across distance calculations, activity minutes, and device wear patterns.
Key Findings:
Distance Validation
Methodology: Compared the sum of the individual distance components (VeryActiveDistance, ModeratelyActiveDistance, LightActiveDistance, SedentaryActiveDistance) against TotalDistance.
# Validate distance column sums
distance_check <- daily_activity %>%
mutate(
calculated_total = VeryActiveDistance + ModeratelyActiveDistance +
LightActiveDistance + SedentaryActiveDistance,
difference = TotalDistance - calculated_total,
matches_perfectly = abs(difference) < 0.01 # Allow tiny rounding differences
)
# Check matching accuracy
cat("Rows where distances sum perfectly:", sum(distance_check$matches_perfectly),
"out of", nrow(distance_check), "\n")
perfect_match_rate <- mean(distance_check$matches_perfectly) * 100
cat("Percentage of perfect matches:", round(perfect_match_rate, 2), "%\n")
Data Quality Finding: 78.4% of records (737/940) showed perfect alignment between distance components and total distance, indicating generally reliable tracking with minor inconsistencies requiring documentation.
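The validation above compares with a tolerance (`abs(difference) < 0.01`) rather than exact equality. A minimal sketch of why this matters for decimal values follows. The numbers are illustrative, not taken from the dataset.

```r
# Illustrative values: three "components" that should add up to the total
total_distance <- 0.3
component_sum  <- 0.1 + 0.2

# Exact comparison fails because 0.1 + 0.2 is not stored as exactly 0.3
exact_match <- total_distance == component_sum

# A tolerance absorbs floating-point error and small rounding in the stored values
tolerant_match <- abs(total_distance - component_sum) < 0.01

# all.equal() is the base R helper for approximate numeric equality
approx_match <- isTRUE(all.equal(total_distance, component_sum))

print(c(exact = exact_match, tolerant = tolerant_match, all_equal = approx_match))
```

The 0.01 threshold is a judgment call: it is far looser than floating-point error alone requires, so it also forgives rounding of the stored distances to two decimals.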
Discrepancy Analysis: A histogram of the calculation differences shows that most variances are minimal and clustered near zero, with a small number of larger outliers that warrant attention.
Significant Discrepancy Identification:
# Large discrepancies
large_diff <- distance_check %>% filter(abs(difference) > 0.1) %>% nrow()
cat("Large discrepancies (>0.1 km):", large_diff, "rows\n")
## Large discrepancies (>0.1 km): 24 rows
Key Insights:
Tracker Distance Validation
I compared the TrackerDistance and TotalDistance columns to assess data consistency:
# Compare tracker vs total distance
tracker_diff <- daily_activity %>%
mutate(diff = abs(TrackerDistance - TotalDistance) > 0.01) %>%
summarise(discrepant_rows = sum(diff))
cat("TrackerDistance discrepancies:", tracker_diff$discrepant_rows, "/ 940 rows\n")
## TrackerDistance discrepancies: 15 / 940 rows
Validation Outcome: Minimal discrepancies (15/940 rows, 1.6%) confirm high data consistency, supporting the decision to remove the redundant TrackerDistance column while documenting this minor data quality note.
Activity Minutes Validation
Verified completeness of daily activity recording by checking if sum of activity minutes equals 1440 (total minutes per day):
# Validate minute sums
minutes_validation <- daily_clean %>%
mutate(
total_active_minutes = very_active_minutes + fairly_active_minutes +
lightly_active_minutes + sedentary_minutes,
minutes_difference = 1440 - total_active_minutes,
minutes_match_perfectly = minutes_difference == 0,
minutes_match_closely = abs(minutes_difference) <= 5 # 5-minute tolerance
)
# Calculate validation results
minutes_difference <- minutes_validation %>%
filter(!minutes_match_perfectly, !minutes_match_closely)
validation_failed_pct <- round(nrow(minutes_difference) / nrow(daily_clean) * 100, 2)
cat("Rows failing minute validation:", nrow(minutes_difference), "/", nrow(daily_clean),
"(", validation_failed_pct, "%)\n")
## Rows failing minute validation: 432 / 821 ( 52.62 %)
Finding: 52.62% of rows have activity minutes that don’t sum to 1440, even with a 5-minute tolerance, indicating significant device non-wear time.
Analysis of Missing Minutes
To understand the pattern of missing data:
# Analyze missing minutes pattern
minute_pattern_analysis <- minutes_validation %>%
mutate(missing_minutes = 1440 - total_active_minutes) %>%
summarise(
avg_missing_minutes = mean(missing_minutes),
median_missing_minutes = median(missing_minutes),
max_missing_minutes = max(missing_minutes),
min_missing_minutes = min(missing_minutes),
pct_with_missing_minutes = mean(missing_minutes > 0) * 100
)
cat("Missing Minutes Analysis:\n")
## Missing Minutes Analysis:
## avg_missing_minutes median_missing_minutes max_missing_minutes
## 1 227.2814 82 1060
## min_missing_minutes pct_with_missing_minutes
## 1 0 52.86236
Key Insight: The average of 227 missing minutes corresponds to roughly 20 hours of recorded activity per day, although the median of 82 missing minutes shows that a typical day is closer to 22.6 hours. Overall, 52.86% of days have some missing activity data.
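The arithmetic behind the wear-time estimate, using the mean and median printed in the summary above:

```r
# Minutes in a full day
minutes_per_day <- 1440

# Mean and median missing minutes, copied from the summary output above
missing_minutes <- c(mean = 227.2814, median = 82)

# Recorded (tracked) hours per day implied by each statistic
recorded_hours <- (minutes_per_day - missing_minutes) / 60
print(round(recorded_hours, 1))
```

The gap between the two results reflects a right-skewed distribution: a minority of days with very large gaps (up to 1,060 minutes) pulls the mean well above the median.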
Visualizing Missing Minutes Distribution
Visual Analysis: The distribution shows most days have moderate amounts of missing data, with a concentration around 200-300 missing minutes, consistent with partial device wear throughout the day.
Data Quality Decision
Based on this analysis:
Decision: I will retain all records but document this data limitation, as it reflects real-world usage patterns where users don’t wear devices 24/7.
Identifying Non-Wear Periods
I conducted additional validation to identify potential device non-wear periods by examining records with 1440 sedentary minutes:
# Identify potential non-wear periods
sedentary_1440 <- daily_clean %>%
filter(sedentary_minutes == 1440)
cat("Potential non-wear records (1440 sedentary minutes):", nrow(sedentary_1440), "\n")
## Potential non-wear records (1440 sedentary minutes): 7
## Percentage of total data: 0.85 %
# Analyze these suspicious records
sedentary_1440_summary <- sedentary_1440 %>%
summarise(
avg_steps = mean(total_steps),
avg_calories = mean(calories),
zero_steps_count = sum(total_steps == 0),
zero_calories_count = sum(calories == 0)
)
print("Analysis of potential non-wear records:")
## [1] "Analysis of potential non-wear records:"
## avg_steps avg_calories zero_steps_count zero_calories_count
## 1 7182 2689.143 0 0
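Key Finding: Only 7 records (0.85% of data) showed 1440 sedentary minutes. These records still report activity (7,182 steps and 2,689 calories on average, with no zero-step or zero-calorie days), so a full day of sedentary time is internally inconsistent with the recorded steps. These records were removed to maintain data quality.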
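The summary above suggests separating two cases that a `sedentary_minutes == 1440` filter treats alike: days with no recorded activity at all, and days whose sedentary total conflicts with recorded steps. The sketch below uses a hypothetical toy data frame. The `likely_non_wear` and `inconsistent_record` flags are illustrative names, not columns from the analysis.

```r
# Hypothetical daily records using the cleaned column names from the analysis
daily_toy <- data.frame(
  id                = c(1, 1, 2, 2),
  sedentary_minutes = c(1440, 1440, 900, 1440),
  total_steps       = c(0, 7200, 10500, 0),
  calories          = c(1800, 2700, 2400, 0)
)

full_sedentary <- daily_toy$sedentary_minutes == 1440

# A full sedentary day with zero steps is consistent with the device not being worn
daily_toy$likely_non_wear <- full_sedentary & daily_toy$total_steps == 0

# A full sedentary day with steps recorded is an internal inconsistency
daily_toy$inconsistent_record <- full_sedentary & daily_toy$total_steps > 0

print(daily_toy)
```

Both kinds of record may still be worth excluding, but labeling them separately keeps the documented reason for removal accurate.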
Duplicate Detection and Removal
To ensure data integrity, I conducted duplicate checks and removed the duplicated rows.
# Check for duplicates in SleepDay data
sleep_day_duplicates <- sleep_day_clean %>%
group_by(id, sleep_date) %>%
summarise(n_records = n(), .groups = 'drop') %>%
filter(n_records > 1)
cat("Duplicate records in SleepDay data:", nrow(sleep_day_duplicates), "\n")
# Remove SleepDay duplicates
duplicate_rows <- sleep_day_clean %>%
group_by(id, sleep_date) %>%
filter(n() > 1) %>%
arrange(id, sleep_date) # The duplicates are a match so I will remove them
sleep_day_deduped <- sleep_day_clean %>%
distinct(id, sleep_date, .keep_all = TRUE)
sleep_day_clean <- sleep_day_deduped
# Check for duplicates in MinuteSleep data
sleep_minute_duplicate <- sleep_minutes_clean %>%
group_by(id,minute_datetime,log_id) %>%
summarise(n_records = n(), .groups = 'drop') %>%
filter(n_records > 1)
cat("Duplicate records in MinuteSleep data:", nrow(sleep_minute_duplicate), "\n")
# Analyze duplicate patterns in minute data
duplicate_rows_minute <- sleep_minutes_clean %>%
group_by(id, minute_datetime, log_id) %>%
filter(n() > 1) %>%
arrange(id, minute_datetime, log_id)
duplicate_analysis <- duplicate_rows_minute %>%
group_by(id, minute_datetime) %>%
summarise(
n_rows = n(),
same_value = n_distinct(value) == 1,
same_minute_datetime = n_distinct(minute_datetime) == 1,
.groups = 'drop'
)
cat("Duplicate analysis - identical values:", sum(duplicate_analysis$same_value),
"out of", nrow(duplicate_analysis), "duplicate groups\n")
# Remove MinuteSleep duplicates
sleep_minute_deduped <- sleep_minutes_clean %>%
distinct(id, minute_datetime, .keep_all = TRUE)
# 543 duplicated rows were successfully removed
sleep_minutes_clean <- sleep_minute_deduped
Data Quality Findings:
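One detail in the code above is worth verifying: the MinuteSleep duplicate check groups by `id`, `minute_datetime`, and `log_id`, while the removal step uses `distinct(id, minute_datetime)`. The sketch below, on hypothetical toy data with timestamps kept as strings for simplicity, shows how the two keys can disagree. Whether they do so in the real dataset is not shown in the source.

```r
# Hypothetical minute-level records; the last two rows share a minute but belong to different logs
sleep_minutes_toy <- data.frame(
  id              = c(1, 1, 1, 1),
  minute_datetime = c("2016-04-12 23:00:00", "2016-04-12 23:00:00",
                      "2016-04-12 23:01:00", "2016-04-12 23:01:00"),
  log_id          = c(101, 101, 101, 102),
  value           = c(1, 1, 1, 2)
)

check_key   <- c("id", "minute_datetime", "log_id")  # key used in the duplicate check
removal_key <- c("id", "minute_datetime")            # key used in the removal step

# duplicated() flags every row that repeats an earlier row on the given columns
dup_on_check_key   <- duplicated(sleep_minutes_toy[check_key])
dup_on_removal_key <- duplicated(sleep_minutes_toy[removal_key])

cat("Flagged by check key:", sum(dup_on_check_key), "\n")
cat("Dropped by removal key:", sum(dup_on_removal_key), "\n")

# Deduplicating on the same key that was checked keeps the row from the second log
deduped_toy <- sleep_minutes_toy[!dup_on_check_key, ]
```

Comparing the number of rows removed against the number flagged is a quick way to confirm the two keys agree on the real data.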
Aggregating MinuteSleep Data
After cleaning both sleep datasets, I proceeded to aggregate the minute-level sleep data to create daily summaries that could be effectively merged with the SleepDay dataset.
# Aggregate by sleep session to analyze individual sleep periods
sleep_minutes_daily <- sleep_minutes_clean %>%
group_by(id, minute_date, log_id) %>%
summarise(
minutes_asleep = sum(value == 1),
minutes_restless = sum(value == 2),
minutes_awake = sum(value == 3),
total_minutes_recorded = n(),
sleep_start = min(minute_datetime),
sleep_end = max(minute_datetime),
sleep_efficiency = (minutes_asleep / total_minutes_recorded) * 100,
.groups = 'drop'
) %>%
mutate(
hours_asleep = minutes_asleep / 60,
hours_restless = minutes_restless / 60,
hours_awake = minutes_awake / 60,
total_hours_recorded = total_minutes_recorded / 60
)
cat("Sleep sessions after session-level aggregation:", nrow(sleep_minutes_daily), "\n")
Initial Finding: The session-level aggregation revealed 466 sleep sessions compared to only 410 daily records in the SleepDay dataset. This discrepancy indicated that users often have multiple sleep sessions (e.g., naps) recorded throughout the day, requiring further aggregation.
To create comparable daily summaries, I aggregated all sleep sessions for each user on each day.
Sleep State Classification: Based on Fitbit’s sleep tracking methodology, and consistent with the aggregation code above, the sleep state values are defined as follows: 1 = asleep, 2 = restless, 3 = awake.
This classification system allows for precise analysis of sleep quality and patterns throughout the night.
# Aggregate all sleep sessions by user and day
sleep_minutes_daily_aggregated <- sleep_minutes_clean %>%
group_by(id, minute_date) %>%
summarise(
# Sum across all sleep sessions for the day
total_minutes_asleep = sum(value == 1),
total_minutes_restless = sum(value == 2),
total_minutes_awake = sum(value == 3),
# Count total sleep sessions and recording time
total_sleep_sessions = n_distinct(log_id),
total_minutes_recorded = n(),
# Sleep efficiency (daily overall)
sleep_efficiency = (total_minutes_asleep / total_minutes_recorded) * 100,
# Sleep timing (earliest start, latest end)
first_sleep_start = min(minute_datetime),
last_sleep_end = max(minute_datetime),
.groups = 'drop'
) %>%
mutate(
# Convert to hours for business reporting
total_hours_asleep = total_minutes_asleep / 60,
total_hours_restless = total_minutes_restless / 60,
total_hours_awake = total_minutes_awake / 60,
total_hours_recorded = total_minutes_recorded / 60
)
cat("Sleep sessions after final aggregation:", nrow(sleep_minutes_daily_aggregated), "\n")
Key Insight: The daily aggregation consolidated the 466 sleep sessions into 412 daily records, revealing 2 additional daily sleep records not present in the original SleepDay dataset (410 records).
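The step from session-level rows to one row per user and day can be sketched in base R with `aggregate()`. The toy sessions below are hypothetical, with dates kept as strings for simplicity, and `session_count` is an illustrative helper column.

```r
# Hypothetical session-level summary: user 1 has a main sleep and a nap on the same day
sessions_toy <- data.frame(
  id             = c(1, 1, 1, 2),
  minute_date    = c("2016-04-12", "2016-04-12", "2016-04-13", "2016-04-12"),
  log_id         = c(101, 102, 103, 201),
  minutes_asleep = c(380, 45, 410, 420)
)

# Each row is one session, so summing a column of ones counts sessions per day
sessions_toy$session_count <- 1

# Collapse to one row per user and day, summing minutes and sessions
daily_toy <- aggregate(
  cbind(minutes_asleep, session_count) ~ id + minute_date,
  data = sessions_toy,
  FUN = sum
)

cat("Sessions:", nrow(sessions_toy), "-> daily records:", nrow(daily_toy), "\n")
print(daily_toy)
```

Four sessions collapse to three daily records here, the same mechanism that reduces the session count to the daily record count in the real data.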
Sleep Data Comparison
To ensure data quality and determine the optimal integration strategy, I conducted a comparison between the two sleep data sources:
sleep_comparison <- sleep_day_clean %>%
full_join(sleep_minutes_daily_aggregated, by = c("id", "sleep_date" = "minute_date")) %>%
# 'x' being sleep_day and 'y' being minute_sleep
mutate(
# Check how well the data matches
has_daily_data = !is.na(total_minutes_asleep.x),
has_minute_data = !is.na(total_minutes_asleep.y),
# Compare sleep minutes between sources
sleep_minutes_diff = ifelse(has_daily_data & has_minute_data,
total_minutes_asleep.x - total_minutes_asleep.y, NA),
sleep_minutes_match = abs(sleep_minutes_diff) <= 10, # Allow 10-minute difference
# Compare total time in bed
total_time_diff = ifelse(has_daily_data & has_minute_data,
total_time_in_bed - total_minutes_recorded, NA),
total_time_match = abs(total_time_diff) <= 10
)
# Generate coverage and consistency summary
coverage_summary <- sleep_comparison %>%
summarise(
total_days = n(),
days_with_both_data = sum(has_daily_data & has_minute_data),
days_with_only_daily = sum(has_daily_data & !has_minute_data),
days_with_only_minute = sum(!has_daily_data & has_minute_data),
pct_days_with_both = mean(has_daily_data & has_minute_data) * 100 ,
pct_perfect_sleep_match = mean(sleep_minutes_match, na.rm = TRUE) * 100
)
print("Final Sleep Data Integration Analysis:")
print(coverage_summary)
Data Integration Assessment:
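The coverage logic above can be sketched in base R, where `merge(..., all = TRUE)` plays the role of `full_join()`. Both toy tables and the column names `day_minutes_asleep` and `minute_minutes_asleep` are hypothetical stand-ins for the `.x` and `.y` columns produced by the real join.

```r
# Hypothetical daily summaries from the two sources
sleep_day_toy <- data.frame(
  id                 = c(1, 1, 2),
  sleep_date         = c("2016-04-12", "2016-04-13", "2016-04-12"),
  day_minutes_asleep = c(420, 390, 450)
)
minute_daily_toy <- data.frame(
  id                    = c(1, 2, 2),
  sleep_date            = c("2016-04-12", "2016-04-12", "2016-04-13"),
  minute_minutes_asleep = c(415, 480, 400)
)

# Full outer join: keep days that appear in either source
joined <- merge(sleep_day_toy, minute_daily_toy,
                by = c("id", "sleep_date"), all = TRUE)

has_daily  <- !is.na(joined$day_minutes_asleep)
has_minute <- !is.na(joined$minute_minutes_asleep)

# Difference is NA when either source is missing, so na.rm = TRUE restricts to shared days
minutes_diff <- joined$day_minutes_asleep - joined$minute_minutes_asleep
within_10    <- abs(minutes_diff) <= 10

coverage <- c(
  total_days  = nrow(joined),
  both        = sum(has_daily & has_minute),
  only_daily  = sum(has_daily & !has_minute),
  only_minute = sum(!has_daily & has_minute),
  pct_match   = mean(within_10, na.rm = TRUE) * 100
)
print(coverage)
```

Note that the match rate is computed only over days present in both sources, and that "match" here means within a 10-minute tolerance rather than exact equality.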
The comprehensive data validation processes documented in this appendix ensure the analytical integrity of our findings. While the FitBit dataset exhibits some expected real-world usage inconsistencies, the thorough cleaning and validation procedures confirm data reliability for Bellabeat’s strategic decision-making.
All documented limitations have been considered in the analysis interpretation, and the final dataset provides a robust foundation for the business insights and recommendations presented in this report.
Fuller, D., Colwell, E., Low, J., Orychock, K., Tobin, M. A., Simango, B., … & Mean, A. (2020). Reliability and validity of commercially available wearable devices for measuring steps, energy expenditure, and heart rate: Systematic review. JMIR mHealth and uHealth, 8(9), e18694. https://doi.org/10.2196/18694
Santos-Longhurst, A. (2018, November 27). How many steps do people take per day on average? Healthline. https://www.healthline.com/health/average-steps-per-day
Kahn, J. (2021). Sleep doctor explains your restless sleep and how to fix it. Rise Science. https://www.risescience.com/blog/restless-sleep
Cleveland Clinic. (2023, June 9). Burn calories by reading this article! Health Essentials. https://health.clevelandclinic.org/calories-burned-in-a-day
The Munich Eye. (2025, April 13). Evaluating fitness tracker scores: Are they trustworthy? The Munich Eye. https://themunicheye.com/evaluating-fitness-tracker-scores-17289
FitBit Fitness Tracker Data. (2016). Kaggle Dataset. Mobius. https://www.kaggle.com/datasets/arashnic/fitbit/data
Analysis conducted: November 2025