1 Project Overview: Bellabeat Smart Device Analysis

This case study examines how Bellabeat, a high-tech wellness company specializing in health-focused products for women, can leverage smart device usage data to inform marketing strategy and drive growth. As a junior data analyst on Bellabeat’s marketing team, I was tasked with analyzing non-Bellabeat smart device data to identify consumer behavior patterns and apply these insights to Bellabeat’s products.

Business Objective: Uncover trends in smart device usage that can inform Bellabeat’s marketing strategy and product development, ultimately positioning the company for expanded market share in the competitive wellness technology space.

Methodology: Following the data analysis lifecycle (Ask-Prepare-Process-Analyze-Share-Act), this project utilizes FitBit Fitness Tracker Data from 33 users to examine activity patterns, sleep behaviors, and device engagement. The analysis focuses on translating general fitness tracker insights into actionable recommendations specifically for Bellabeat’s product ecosystem.

Key Deliverables:

Comprehensive analysis of activity and sleep patterns.
Data-driven insights into user consistency and engagement.
Strategic recommendations for Bellabeat’s marketing approach.
Product-specific application of findings.

Bellabeat Product Ecosystem & Analysis Focus

Ivy+ - Advanced health tracker with women-specific biometrics and cycle integration
Leaf - Discreet wellness jewelry for holistic activity and sleep tracking
Time - Elegant hybrid watch with core wellness metrics
App - Central platform for personalized coaching and insights

1.1 Executive Summary: Key Insights & Recommendations

Analysis Overview: 33 FitBit users, 814 activity days, and 382 sleep records (from 19 of those users), analyzed through Bellabeat’s women’s wellness lens.

Top 3 Business Insights:

Activity Segmentation: 4 distinct user groups (24% highly consistent, 24% rarely active) → enables targeted engagement
Sleep Paradox: 92.7% of sleep records rated excellent (≥85% efficiency) but 41.6% below recommended duration → opportunity for sleep duration features
Weekly Rhythms: Tuesday/Saturday activity peaks with stable sleep → optimal timing for challenges

Primary Recommendations:

Implement tiered goal system for different consistency levels
Launch “7-hour sleep challenge” targeting sleep duration gap
Develop cycle-aware features leveraging women’s health focus

1.3 The FitBit Fitness Tracker Data

Data Source: FitBit Fitness Tracker Data (CC0: Public Domain, available on Kaggle). This dataset contains personal fitness tracker data from a maximum of 33 Fitbit users, with significant variation in user participation across different metrics.

Data Credibility (ROCCC Analysis):

Reliable: Limited sample size with inconsistent user participation - maximum 33 users in activity data, but only 24 in sleep data and as few as 8 in weight tracking
Original: Data comes from consented users, but it’s aggregated by a third party (Mobius).
Comprehensive: The data is limited to Fitbit users and may not represent the broader population or Bellabeat’s target audience (women) specifically. Key metrics for Bellabeat, such as menstrual cycle data, are absent. The data also lacks demographic information about the sample.
Current: The data is from 2016, and fitness tracking habits may have evolved.
Cited: The source is cited via Kaggle.

Verdict: This dataset has significant limitations. It should be used to identify potential trends, but any conclusions must be framed with these limitations in mind.

1.3.1 Data Dictionary

The FitBit Fitness Tracker Data included two folders containing data from March 12, 2016 to April 11, 2016 and from April 12, 2016 to May 12, 2016. I decided to use the April 12, 2016 to May 12, 2016 dataset (mturkfitbit_export_4.12.16-5.12.16) as it represented the most recent data available and contained more files for my analysis. Following this selection, I proceeded to create a data dictionary by examining each file in the chosen folder and reviewing each column within those files. This process helped me develop a better understanding of the data in each table, how they relate to one another, and their relevance to the business task.

1.3.1.1 Data Dictionary Summary

Primary Datasets Used for Analysis:

dailyActivity_merged.csv - Main dataset containing steps, distance, activity minutes (intensities), and calories
sleepDay_merged.csv - Daily sleep duration and time in bed
minuteSleep_merged.csv - Minute-level sleep quality (asleep/restless/awake)

Supplementary Data:

hourlyIntensities_merged.csv - Hourly activity patterns
heartrate_seconds_merged.csv - Heart rate monitoring (limited users)
minuteMETsNarrow_merged.csv - Metabolic equivalent of task values

Excluded Datasets:

Daily intensities, steps and calories data sets as data was already included in the daily activity table
Weight data excluded due to insufficient sample size (8 users) and inconsistent logging
Minute-level activity data deemed too granular for high-level analysis
Duplicate datasets in wide format excluded in favor of narrow formats

Key Metrics Tracked:

Activity: Steps, distance, active minutes, calories burned
Sleep: Duration, efficiency, restlessness, time in bed
Intensity: Very/fairly/light/sedentary activity levels
Engagement: Device wear time, data consistency

For the complete data dictionary with detailed column descriptions and analysis rationale, view the full Google Sheets document here.

Image to the left is Bellabeat’s IVY+ and image to the right is Bellabeat’s Leaf Urban.

1.4 Women’s Wellness Considerations

Data Limitations & Opportunities:

Current dataset lacks menstrual cycle tracking → Bellabeat’s competitive advantage
Sleep patterns may vary with hormonal cycles → opportunity for cycle-aware sleep recommendations
Activity consistency challenges may relate to energy fluctuations → personalized goal adjustment

Recommended Features:

Cycle-synced activity recommendations
Hormonal phase-aware sleep optimization
Women-specific goal setting (accounting for energy variations)

1.5 Cleaning The Dataset

1.5.1 Daily Activity Dataset

After loading the necessary packages (tidyverse, ggplot2, janitor, lubridate, and lm.beta), I loaded the datasets and counted the number of unique users to understand the data coverage, alongside checking the date range.

# Load datasets
daily_activity <- read.csv('dailyActivity_merged.csv')
sleep_day <- read.csv('sleepDay_merged.csv')
sleep_minutes <- read.csv('minuteSleep_merged.csv')

# Count unique users across datasets
cat("Unique users in daily_activity:", n_distinct(daily_activity$Id), "\n")

## Unique users in daily_activity: 33

cat("Unique users in sleep_day:", n_distinct(sleep_day$Id), "\n")

## Unique users in sleep_day: 24

cat("Unique users in sleep_minutes:", n_distinct(sleep_minutes$Id), "\n")

## Unique users in sleep_minutes: 24

# Checking the date range (parse to Date first: min/max on the raw text would compare alphabetically)
activity_dates <- as.Date(daily_activity$ActivityDate, format = "%m/%d/%Y")
cat("Date range:", format(min(activity_dates)), "to", 
    format(max(activity_dates)), "\n")

## Date range: 2016-04-12 to 2016-05-12

Key Finding: The sleep datasets have only 24 unique users compared to 33 in the activity data, indicating I’ll need to analyze sleep patterns separately. It is also important to note that the daily activity data spans from April 12, 2016 to May 12, 2016, matching the export folder name, with 940 initial rows.
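One caveat when checking date ranges: ActivityDate is read in as text, and text in month/day/year form sorts character by character rather than chronologically. The base R sketch below uses a few made-up date strings (not rows from the dataset) to show how the maximum differs between the raw strings and the parsed dates.

```r
# Hypothetical date strings in the same m/d/Y text format as ActivityDate
raw_dates <- c("4/12/2016", "5/9/2016", "5/12/2016")

# Comparing raw strings is character-by-character, so "5/9..." sorts after "5/12..."
max_as_text <- max(raw_dates)

# Parsing to Date first gives a chronological comparison
parsed_dates <- as.Date(raw_dates, format = "%m/%d/%Y")
max_as_date <- max(parsed_dates)

cat("Max as text:", max_as_text, "\n")
cat("Max as date:", format(max_as_date), "\n")
```

Parsing dates before calling min() or max() avoids this trap; the same applies to sorting and filtering by date.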

1.5.1.1 Data Validation Summary

## Rows where distances sum perfectly: 737 out of 940

## Percentage of perfect matches: 78.4 %

Distance Column Integrity: After comprehensive validation, 78.4% of distance calculations (sum of VeryActiveDistance, ModeratelyActiveDistance, LightActiveDistance, SedentaryActiveDistance matching TotalDistance) matched perfectly, with only 2.6% showing significant discrepancies. For detailed validation code and visualizations, see Appendix A.
Tracker Distance Comparison: Only 15 out of 940 rows showed discrepancies between TrackerDistance and TotalDistance columns.
Activity Minutes Validation: Analysis revealed 52.6% of days have incomplete activity data (sum of VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes and SedentaryMinutes not equaling 1440 daily minutes), suggesting regular device removal. Complete validation methodology available in Appendix A.
Non-Wear Period Identification: Only 0.85% of records indicated potential device non-wear, confirming strong user engagement. Detailed analysis in Appendix A.
Data Cleaning Decision: Based on validation results, removed redundant distance columns (TrackerDistance, individual distance components) to simplify analysis while documenting data quality considerations.
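The distance and minutes checks summarized above both compare a row-wise sum of components against a total. A minimal base R sketch follows, assuming the original column names from dailyActivity_merged.csv and using two made-up rows. The 0.01 tolerance is an illustrative choice, not necessarily the threshold used in Appendix A.

```r
# Two made-up rows using the original dailyActivity column names
toy <- data.frame(
  TotalDistance            = c(8.50, 6.00),
  VeryActiveDistance       = c(1.88, 0.50),
  ModeratelyActiveDistance = c(0.55, 0.40),
  LightActiveDistance      = c(6.07, 4.90),
  SedentaryActiveDistance  = c(0.00, 0.00),
  VeryActiveMinutes        = c(25, 10),
  FairlyActiveMinutes      = c(13, 8),
  LightlyActiveMinutes     = c(328, 250),
  SedentaryMinutes         = c(1074, 700)
)

# Distance check: do the four components add up to the total?
component_sum <- with(toy, VeryActiveDistance + ModeratelyActiveDistance +
                             LightActiveDistance + SedentaryActiveDistance)
# Use a small tolerance rather than ==, since these are floating-point values
toy$distance_matches <- abs(component_sum - toy$TotalDistance) < 0.01

# Minutes check: do the four intensity categories cover all 1,440 minutes of the day?
minutes_sum <- with(toy, VeryActiveMinutes + FairlyActiveMinutes +
                           LightlyActiveMinutes + SedentaryMinutes)
toy$full_day_recorded <- minutes_sum == 1440

print(toy[, c("distance_matches", "full_day_recorded")])
```

In this toy table the first row passes both checks and the second fails both.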

1.5.1.2 Data Cleaning and Filtering

After validating the distance columns, I proceeded with comprehensive data cleaning:

# Clean and filter the daily activity data
daily_clean <- daily_activity %>% 
  clean_names() %>% 
  mutate(
    activity_date = as.Date(activity_date, format = "%m/%d/%Y"),
    total_recorded_minutes = very_active_minutes + fairly_active_minutes + 
                           lightly_active_minutes + sedentary_minutes
  ) %>% 
  select(-tracker_distance, -logged_activities_distance) %>%  
  filter(
    total_steps > 1000,    # Remove days with minimal activity (potential non-wear)
    total_steps < 50000,   # Remove extreme outliers
    calories > 1300,       # Remove unrealistic low calorie burns
    calories < 5000,       # Remove extreme high calorie burns
    sedentary_minutes <= 1440  # Ensure valid minute counts
  )

cat("Cleaned dataset:", nrow(daily_clean), "rows from", n_distinct(daily_clean$id), "users\n")

## Cleaned dataset: 821 rows from 33 users

Cleaning Actions:

Standardized column names and date formats
Removed redundant columns (tracker_distance, logged_activities_distance)
Filtered unrealistic values and outliers based on research
Added total_recorded_minutes for device wear time analysis

1.5.1.3 Final Data Cleaning Implementation

I implemented the final cleaning steps to create the analysis-ready dataset:

# Apply final cleaning transformations
daily_final <- daily_clean %>%
  # Remove redundant distance columns
  select(-sedentary_active_distance, -very_active_distance, 
         -moderately_active_distance, -light_active_distance) %>%
  # Remove confirmed non-wear records
  filter(!(sedentary_minutes == 1440 & total_steps > 1000)) %>%
  # Add data quality classification
  mutate(
    data_quality_flag = case_when(
      total_recorded_minutes < 1200 ~ "Low recording time",
      total_recorded_minutes < 1400 ~ "Moderate recording time", 
      abs(total_recorded_minutes - 1440) <= 5 ~ "Full day recording",
      TRUE ~ "Partial day recording"
    )
  )

cat("Final cleaned dataset:", nrow(daily_final), "rows from", 
    n_distinct(daily_final$id), "users\n")

## Final cleaned dataset: 814 rows from 33 users

Data Quality Assessment:

quality_summary <- daily_final %>%
  count(data_quality_flag) %>%
  mutate(percentage = n / sum(n) * 100)

print("Data Recording Completeness Summary:")

## [1] "Data Recording Completeness Summary:"

print(quality_summary)

##         data_quality_flag   n percentage
## 1      Full day recording 382 46.9287469
## 2      Low recording time 382 46.9287469
## 3 Moderate recording time  47  5.7739558
## 4   Partial day recording   3  0.3685504

Visual Insight: The visualization reveals that 46.9% of days have full recording coverage, while an equal percentage (46.9%) have low recording time (<20 hours), highlighting the importance of accounting for device wear patterns in analysis.

1.5.1.4 Final Dataset Summary

final_validation <- daily_final %>%
  summarise(
    unique_users = n_distinct(id),
    total_observation_days = n(),
    avg_days_per_user = round(n() / n_distinct(id), 1),
    avg_steps = round(mean(total_steps), 0),
    pct_meeting_goal = round(mean(total_steps >= 8000) * 100, 1)
  )

print("Final Cleaned Dataset Summary:")

## [1] "Final Cleaned Dataset Summary:"

print(final_validation)

##   unique_users total_observation_days avg_days_per_user avg_steps
## 1           33                    814              24.7      8708
##   pct_meeting_goal
## 1             52.8

Final Dataset Characteristics:

33 users with 814 observation days
Average of 24.7 days per user
8708 average daily steps
52.8% of days meeting the 8,000-step goal

Note: The step goal was set at 8,000 instead of the traditional 10,000 to reflect more attainable targets for general wellness.

1.5.1.5 Daily Activity Cleaning Conclusion

The daily activity dataset has been successfully cleaned and validated, resulting in a high-quality dataset ready for analysis. Key cleaning decisions included:

Removed redundant columns (TrackerDistance, individual distance components)
Filtered unrealistic values (extreme steps, calories, non-wear periods)
Added data quality flags to account for varying device wear times
Documented all limitations for transparent analysis

The final dataset maintains data integrity while realistically representing real-world device usage patterns, providing a solid foundation for uncovering meaningful insights about user behavior.

1.5.2 Sleep Data Cleaning (SleepDay & MinuteSleep)

Following the cleaning of the daily activity dataset, I proceeded to clean and integrate the sleep data from both the SleepDay and MinuteSleep datasets. This process involved date standardization, duplicate removal, and data validation to ensure high-quality sleep metrics for analysis.

1.5.2.1 Initial Data Assessment and Date Formatting

I began by examining the date ranges and standardizing datetime formats across both sleep datasets:

# Clean and format SleepDay dataset
sleep_day_clean <- sleep_day %>%
  clean_names() %>%
  mutate(
    sleep_datetime = as.POSIXct(sleep_day, format = "%m/%d/%Y %I:%M:%S %p"),
    sleep_date = as.Date(sleep_datetime)
  ) %>%
  # Keep only the parsed date; the raw string and intermediate datetime are no longer needed
  select(-sleep_day, -sleep_datetime)

cat("Sleep Day date range:", as.character(min(sleep_day_clean$sleep_date)), "to", 
    as.character(max(sleep_day_clean$sleep_date)), "\n")

## Sleep Day date range: 2016-04-12 to 2016-05-12

# Clean and format MinuteSleep dataset
sleep_minutes_clean <- sleep_minutes %>%
  clean_names() %>%
  mutate(
    minute_datetime = as.POSIXct(date, format = "%m/%d/%Y %I:%M:%S %p"),
    minute_date = as.Date(minute_datetime),
    minute_time = format(minute_datetime, "%H:%M:%S")
  ) %>%
  select(-date)

cat("Sleep Minutes date range:", as.character(min(sleep_minutes_clean$minute_date)), "to", 
    as.character(max(sleep_minutes_clean$minute_date)), "\n")

## Sleep Minutes date range: 2016-04-12 to 2016-05-12

1.5.2.2 Sleep Data Quality & Integration

## Duplicate records in SleepDay data: 3

## Duplicate records in MinuteSleep data: 543

## Duplicate analysis - identical values: 543 out of 543 duplicate groups

Duplicate Management: Removed 3 duplicate sleep records and 543 duplicate minute-level entries through comprehensive data validation. Detailed duplicate analysis and removal process available in Appendix B.
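Exact duplicates like these can be located with base R's duplicated(), which flags every repeat of a row after its first occurrence. The sketch below runs on a made-up three-row table with column names modeled on sleepDay_merged.csv. It is an illustration only; the procedure actually used is the one in Appendix B.

```r
# Made-up sleep records; row 3 repeats row 2 exactly
toy_sleep <- data.frame(
  Id                 = c(1001, 1002, 1002),
  SleepDay           = c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM",
                         "4/13/2016 12:00:00 AM"),
  TotalMinutesAsleep = c(420, 390, 390),
  TotalTimeInBed     = c(450, 410, 410)
)

# duplicated() marks the second and later copies of an identical row
n_duplicates <- sum(duplicated(toy_sleep))

# Keep only the first copy of each row
toy_sleep_dedup <- toy_sleep[!duplicated(toy_sleep), ]

cat("Duplicate records found:", n_duplicates, "\n")
cat("Rows after removal:", nrow(toy_sleep_dedup), "\n")
```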

1.5.2.3 Sleep Dataset Integration

## Sleep sessions after session-level aggregation: 466

## Sleep sessions after final aggregation: 412

## [1] "Final Sleep Data Integration Analysis:"

##   total_days days_with_both_data days_with_only_daily days_with_only_minute
## 1        416                 406                    4                     6
##   pct_days_with_both pct_perfect_sleep_match
## 1           97.59615                95.07389

Dataset Aggregation: Consolidated minute-level sleep data into daily summaries, yielding 412 sleep sessions, 2 more than the 410 days covered by the SleepDay dataset. Aggregation methodology detailed in Appendix B.
Integration Success: Successfully merged daily and minute-level sleep datasets with strong alignment (97.6% data overlap) and high consistency (95.1% perfect match rate). Detailed integration analysis and comparison methodology available in Appendix C.
Strategic Value: The integration confirms both datasets contribute unique value, with minute-level data providing additional sleep records and daily data serving as a reliable baseline for analysis.
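The aggregation step can be sketched as follows. This is a base R illustration rather than the Appendix B code. It assumes the minute-level value column codes each minute as 1 = asleep, 2 = restless, 3 = awake, and the rows are made up. The output column names mirror those used later in the merge.

```r
# Made-up minute-level rows: one user, two nights
toy_minutes <- data.frame(
  id          = c(1001, 1001, 1001, 1001, 1001, 1001),
  minute_date = as.Date(c("2016-04-12", "2016-04-12", "2016-04-12",
                          "2016-04-13", "2016-04-13", "2016-04-13")),
  value       = c(1, 1, 2, 1, 3, 1)  # assumed coding: 1 asleep, 2 restless, 3 awake
)

# Turn the state code into one 0/1 indicator column per state
toy_minutes$asleep   <- as.integer(toy_minutes$value == 1)
toy_minutes$restless <- as.integer(toy_minutes$value == 2)
toy_minutes$awake    <- as.integer(toy_minutes$value == 3)
toy_minutes$recorded <- 1L

# Sum the indicators per user and date to get daily minute totals
daily <- aggregate(cbind(asleep, restless, awake, recorded) ~ id + minute_date,
                   data = toy_minutes, FUN = sum)
names(daily)[3:6] <- c("total_minutes_asleep", "total_minutes_restless",
                       "total_minutes_awake", "total_minutes_recorded")
print(daily)
```

Each output row is one user-night, which is the grain needed to join against the daily sleep table.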

1.5.2.4 Merging The Datasets

After preparing both sleep datasets, I combined them using an approach that picks the best data from each source. This ensures I have the most accurate sleep information possible.

sleep_final <- sleep_day_clean %>%
  full_join(sleep_minutes_daily_aggregated, by = c("id", "sleep_date" = "minute_date")) %>%
  mutate(
    # Use the best available sleep minutes
    final_minutes_asleep = case_when(
      # When I have both and they're close, use daily data (more reliable)
      !is.na(total_minutes_asleep.x) & !is.na(total_minutes_asleep.y) & 
        abs(total_minutes_asleep.x - total_minutes_asleep.y) <= 30 ~ total_minutes_asleep.x,
      # When I only have daily data
      !is.na(total_minutes_asleep.x) & is.na(total_minutes_asleep.y) ~ total_minutes_asleep.x,
      # When I only have minute data
      is.na(total_minutes_asleep.x) & !is.na(total_minutes_asleep.y) ~ total_minutes_asleep.y,
      # When they differ significantly, use daily data (more conservative)
      TRUE ~ total_minutes_asleep.x
    ),
    
    # Use the best available time in bed
    final_time_in_bed = case_when(
      !is.na(total_time_in_bed) & !is.na(total_minutes_recorded) & 
        abs(total_time_in_bed - total_minutes_recorded) <= 30 ~ total_time_in_bed,
      !is.na(total_time_in_bed) & is.na(total_minutes_recorded) ~ total_time_in_bed,
      is.na(total_time_in_bed) & !is.na(total_minutes_recorded) ~ total_minutes_recorded,
      TRUE ~ total_time_in_bed
    ),
    
    # Enhanced sleep metrics
    sleep_efficiency_final = (final_minutes_asleep / final_time_in_bed) * 100,
    restless_percentage = ifelse(!is.na(total_minutes_restless),
                                 (total_minutes_restless / final_time_in_bed) * 100, NA),
    
    # Data source tracking
    data_source = case_when(
      is.na(total_minutes_asleep.y) ~ "Daily only", # No minute data available
      is.na(total_minutes_asleep.x) ~ "Minute only", # No daily data available
      abs(total_minutes_asleep.x - total_minutes_asleep.y) <= 10 ~ "Daily (verified)", # Both datasets agree (within 10 minutes)
      TRUE ~ "Combined" # Both datasets disagree → used daily data as a fallback
    ),
    
    # Sleep quality categories
    sleep_duration_hours = final_minutes_asleep / 60,
    sleep_quality = case_when(
      sleep_efficiency_final >= 85 ~ "Excellent",
      sleep_efficiency_final >= 70 ~ "Good",
      sleep_efficiency_final >= 50 ~ "Fair",
      TRUE ~ "Poor"
    )
  )
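As written, the selection rule for final_minutes_asleep returns the daily value whenever one exists and falls back to the minute-level value only when the daily value is missing. The 30-minute comparison therefore does not change which number is chosen. The base R sketch below checks this on made-up values covering the four cases; the variable names are illustrative.

```r
# Made-up daily (x) and minute-level (y) totals covering the four cases in the rule
daily_x  <- c(420, 400, NA, 300)
minute_y <- c(425, NA, 380, 390)

# Same branch order as the case_when() rule, written with nested ifelse()
both  <- !is.na(daily_x) & !is.na(minute_y)
close <- both & abs(daily_x - minute_y) <= 30
final_rule <- ifelse(close, daily_x,
              ifelse(!is.na(daily_x) & is.na(minute_y), daily_x,
              ifelse(is.na(daily_x) & !is.na(minute_y), minute_y, daily_x)))

# Compact equivalent: daily value if present, otherwise the minute-level value
final_simple <- ifelse(is.na(daily_x), minute_y, daily_x)

print(final_rule)
print(identical(final_rule, final_simple))
```

The same reading applies to final_time_in_bed, which follows the identical branch structure.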

1.5.2.5 Sleep Quality Overview

After combining the datasets, I looked at sleep quality to understand how well users sleep. Sleep quality measures how much time in bed is actually spent sleeping.

Key Findings: 91.6% of sleep records are rated excellent, meaning 85% or more of the time in bed was spent asleep.

Sleep Quality/Efficiency Analysis Note: There are two ways to calculate sleep quality metrics:

Average of sleep efficiency percentages: 91.87%
Percentage of records meeting “Excellent” threshold: 92.67%

We use the second method (record counting) from now on for consistency with other categorical analysis and because it more intuitively represents what portion of sleep sessions were high quality.
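The two figures answer different questions, which a small made-up example makes concrete. The efficiencies below are illustrative values, not records from the dataset.

```r
# Made-up sleep efficiencies (% of time in bed spent asleep) for five records
efficiency <- c(96, 94, 91, 88, 60)

# Method 1: average efficiency across records
mean_efficiency <- mean(efficiency)

# Method 2: share of records at or above the 85% "Excellent" threshold
pct_excellent <- mean(efficiency >= 85) * 100

cat("Average efficiency:", mean_efficiency, "%\n")
cat("Records rated Excellent:", pct_excellent, "%\n")
```

A single very poor night pulls the average down noticeably but only moves the record count by one record, so the two methods can diverge.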

1.5.2.6 Final Sleep Data Cleaning

To finish preparing the sleep data, I removed unrealistic records and kept only the most useful columns for analysis.

sleep_final <- sleep_final %>%
  filter(
    # Reasonable sleep values
    final_minutes_asleep >= 180,      # At least 3 hours of sleep
    final_minutes_asleep <= 720,      # No more than 12 hours asleep
    final_time_in_bed <= 960,         # No more than 16 hours in bed
    sleep_efficiency_final <= 100     # Efficiency can't exceed 100%
  ) %>%
  # Select final columns
  select(
    id, sleep_date, first_sleep_start,
    
    # Core sleep metrics
    total_minutes_asleep = final_minutes_asleep,
    total_time_in_bed = final_time_in_bed,
    sleep_efficiency = sleep_efficiency_final,
    
    # Enhanced metrics from minute data
    total_minutes_restless,
    total_minutes_awake,
    restless_percentage,
    total_sleep_sessions,
    
    # Sleep categories
    sleep_duration_hours,
    sleep_quality,
    
    # Data quality
    data_source,
    total_sleep_records
  )

cat("Sleep Records After Final Cleaning:", nrow(sleep_final), "\n")

## Sleep Records After Final Cleaning: 382

cat("Unique Users After Cleaning:", n_distinct(sleep_final$id), "\n")

## Unique Users After Cleaning: 19

Final Sleep Dataset Composition

382 sleep records after all cleaning and filtering
19 unique users with sleep data
Enhanced columns: sleep efficiency, restless percentage, sleep quality categories

1.6 Analyzing The Datasets: Business Questions Framework

This analysis addresses four key business questions to inform Bellabeat’s strategy:

Activity Optimization: How can we improve goal achievement and user consistency?
Sleep Enhancement: What opportunities exist for sleep duration and quality improvement?
Energy Efficiency: How do steps, distance, and calories relate for optimal tracking?
Engagement Patterns: When and how do users interact with their devices?

Methodology: Using cleaned data from 33 users (814 activity days; 382 sleep records from the 19 users with usable sleep data) with correlation analysis, user segmentation, and pattern recognition applied through Bellabeat’s women’s wellness lens.

1.6.1 `Daily_Final` Dataset Analysis

I began by examining the daily activity patterns to establish baseline metrics for user behavior.

activity_summary <- daily_final %>%
  summarise(
    users = n_distinct(id),
    avg_daily_steps = mean(total_steps),
    avg_daily_calories = mean(calories),
    avg_sedentary_hours = mean(sedentary_minutes) / 60,
    pct_meeting_goal = mean(total_steps >= 8000) * 100
  )

print("Bellabeat Key Metrics:")

## [1] "Bellabeat Key Metrics:"

print(activity_summary)

##   users avg_daily_steps avg_daily_calories avg_sedentary_hours pct_meeting_goal
## 1    33        8708.053            2405.07            15.83493         52.82555

This initial summary revealed several key patterns that guided my deeper analysis. I focused on answering these business-critical questions:

Activity Efficiency: How do steps and activity minutes relate to distance covered?
Goal Achievement: What percentage of users consistently meet activity goals?
Energy Expenditure: What are the relationships between steps, distance, and calories burned?
Device Usage: When and why do users stop wearing their devices?
Weekly Activity Patterns: How does user activity vary by day of the week?

The following sections explore these questions to provide actionable insights for Bellabeat.

1.6.1.1 Activity Efficiency: Steps, Time, and Distance Relationships

How Do Steps and Activity Minutes Correlate With Distance Covered?

Understanding how different activity metrics relate helps Bellabeat design better tracking features and user goals.

1.6.1.1.1 Total Steps and Total Distance Covered Relationship

## Correlation between steps and distance: 0.981

## `geom_smooth()` using formula = 'y ~ x'

Key Findings: The near-perfect correlation (0.981) shows that step count and recorded distance move almost in lockstep, which is expected if the tracker estimates distance from steps and stride length. Most users take fewer than 15,000 steps daily, with a small group showing significantly higher activity levels.
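The report shows the coefficient but not the call that produced it. One standard route is base R's cor(), which returns Pearson's r by default. The sketch below uses made-up values (not dataset rows) with the cleaned column names, so the exact figure it prints will differ from 0.981.

```r
# Made-up daily records using the cleaned column names
toy_activity <- data.frame(
  total_steps    = c(3000, 6000, 9000, 12000, 15000),
  total_distance = c(2.0, 4.3, 6.2, 8.6, 10.4)   # km
)

# Pearson correlation between steps and distance
steps_distance_cor <- cor(toy_activity$total_steps, toy_activity$total_distance)
cat("Correlation between steps and distance:", round(steps_distance_cor, 3), "\n")
```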

1.6.1.1.2 Walking Efficiency Analysis

Understanding user walking patterns by analyzing users’ steps per km helps Bellabeat set realistic distance-based goals.

daily_final <- daily_final %>%
  mutate(steps_per_km = total_steps / total_distance)

# Summary of steps per km
steps_km_summary <- daily_final %>%
  summarize(
    avg_steps_per_km = mean(steps_per_km),
    median_steps_per_km = median(steps_per_km),
    sd_steps_per_km = sd(steps_per_km)
  )

print(steps_km_summary)

##   avg_steps_per_km median_steps_per_km sd_steps_per_km
## 1         1422.761            1465.387        114.5059

Business Application: The average of 1,423 steps per kilometer helps Bellabeat estimate that 8,000 steps equals approximately 5.6 km - an appropriate distance goal for users to cover daily.
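The conversion is a single division, shown here with the mean and median from the summary above. Using the median as a sensitivity check is an addition for illustration.

```r
# Steps-per-km figures taken from the summary output above
avg_steps_per_km    <- 1422.761
median_steps_per_km <- 1465.387
step_goal <- 8000

# Distance equivalent of the step goal under each figure
goal_km_mean   <- step_goal / avg_steps_per_km
goal_km_median <- step_goal / median_steps_per_km

cat("8,000 steps using the mean:  ", round(goal_km_mean, 1), "km\n")
cat("8,000 steps using the median:", round(goal_km_median, 1), "km\n")
```

Both figures land between 5.4 and 5.7 km, so the distance goal is not sensitive to which average is used.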

1.6.1.1.3 Activity Minutes and Distance Covered Relationship

To understand how different activity levels contribute to distance covered, I analyzed the relationship between various activity intensities and total distance. This helps Bellabeat understand which types of movement most impact overall activity metrics.

To simplify the analysis, I combined the three active minute categories (very_active_minutes, fairly_active_minutes, lightly_active_minutes) into a single active_minutes metric. This approach provides a clearer picture of overall activity time, especially since sedentary_minutes naturally shows a negative relationship with distance covered.

daily_final <- daily_final %>% 
  mutate(active_minutes = very_active_minutes + fairly_active_minutes + lightly_active_minutes)

minutes_distance_cor <- cor(daily_final$active_minutes, daily_final$total_distance)
cat("Correlation between active minutes and distance:", round(minutes_distance_cor, 3), "\n")

## Correlation between active minutes and distance: 0.611

## `geom_smooth()` using formula = 'y ~ x'

Notes: The moderate positive correlation (0.611) shows that more active time is generally associated with greater distance covered, though activity intensity also plays a role.

Strategic Recommendation: Bellabeat can motivate users by emphasizing that consistent daily activity - regardless of intensity - contributes significantly to overall distance goals.

1.6.1.2 Goal Achievement & User Consistency

What percentage of users consistently meet activity goals (8,000 steps)?

Understanding how many users meet activity targets (8,000 steps for this analysis) helps Bellabeat design appropriate goal structures and motivation systems.

1.6.1.2.1 Steps Goal Analysis & Visualization

goal_analysis <- daily_final %>%
  summarize(
    pct_meeting_8000_steps = mean(total_steps >= 8000) * 100,
    pct_meeting_10000_steps = mean(total_steps >= 10000) * 100,
    pct_less_than_8000_steps = mean(total_steps < 8000) * 100
  )

print("Step Goal Achievement:")

## [1] "Step Goal Achievement:"

print(goal_analysis)

##   pct_meeting_8000_steps pct_meeting_10000_steps pct_less_than_8000_steps
## 1               52.82555                36.97789                 47.17445

Notes: While over half of all days (52.8%) meet the 8,000-step goal, there’s a significant drop to 37% for the traditional 10,000-step target, indicating the 8,000-step goal may be more achievable for most users.

1.6.1.3 Long Term User Consistency

Understanding how consistently users meet the 8,000-step daily goal helps Bellabeat target users with appropriate motivation strategies. By analyzing the percentage of days each user achieves this target, we can identify distinct behavioral patterns that inform personalized engagement approaches.

user_consistency <- daily_final %>%
  group_by(id) %>%
  summarize(
    days_above_8000 = mean(total_steps >= 8000) * 100,
    consistency_category = case_when(
      days_above_8000 >= 80 ~ "Highly Consistent",
      days_above_8000 >= 50 ~ "Moderately Consistent", 
      days_above_8000 >= 20 ~ "Occasionally Active",
      TRUE ~ "Rarely Active"
    )
  )

consistency_summary <- user_consistency %>%
  count(consistency_category) %>%
  mutate(percentage = n / sum(n) * 100)

print("User Consistency Patterns:")

## [1] "User Consistency Patterns:"

print(consistency_summary)

## # A tibble: 4 × 3
##   consistency_category      n percentage
##   <chr>                 <int>      <dbl>
## 1 Highly Consistent         8       24.2
## 2 Moderately Consistent     9       27.3
## 3 Occasionally Active       8       24.2
## 4 Rarely Active             8       24.2

Key Finding: The data reveals four distinct user groups based on their consistency in achieving the 8,000-step daily target:

8 users (24.2%) - Highly Consistent (≥80% of days)
9 users (27.3%) - Moderately Consistent (50-79% of days)
8 users (24.2%) - Occasionally Active (20-49% of days)
8 users (24.2%) - Rarely Active (<20% of days)

Evidence-Based Strategic Recommendations

Targeted User Engagement

Struggling Users (48.5%, the Occasionally Active and Rarely Active groups combined): Implement gentle, progressive goal systems with cycle-aware adjustments for energy fluctuations
Consistent Performers (24.2%): Develop advanced challenges (10,000+ steps) with women-focused reward systems
Moderate Users (27.3%): Create consistency-building features with social support elements

Women-Specific Differentiators

Cycle Integration: Leverage Bellabeat’s unique position to address hormonal energy and sleep pattern variations
Progressive Milestones: Implement distance-based goals (3.5 km → 7+ km), since distance shows the strongest correlation with calories burned among the metrics examined (r = 0.60, a moderate relationship)
Community Building: Create female-focused challenge groups aligned with natural Tuesday/Saturday activity peaks

1.6.1.4 Energy Expenditure Efficiency: Steps, Distance and Calories Burned Relationship

What are the relationships between steps, distance, and calories burned?

Understanding how activity metrics translate to calories burned helps Bellabeat design effective fitness tracking and goal-setting features for users.

To quantify these relationships, I analyzed the correlation between different activity measures and calorie expenditure:

calorie_correlations <- daily_final %>%
  summarize(
    steps_calories_cor = cor(total_steps, calories),
    distance_calories_cor = cor(total_distance, calories),
    active_minutes_calories_cor = cor(active_minutes, calories)
  )

print("Calorie Burn Correlations:")

## [1] "Calorie Burn Correlations:"

print(calorie_correlations)

##   steps_calories_cor distance_calories_cor active_minutes_calories_cor
## 1          0.5275046             0.6008586                   0.3636684

Key Findings: Distance shows the strongest relationship with calorie burn (0.60), followed by steps (0.53). Since active minutes showed a weaker correlation (0.36), I focused the analysis on steps and distance for clearer insights.

1.6.1.4.1 Total Steps and Calories Burned Relationship

To understand how step counts translate to energy expenditure, I categorized users by activity level and calculated average calories burned for each group:

steps_calories <- daily_final %>%
  mutate(step_category = case_when(
    total_steps < 5000 ~ "Low (<5,000)",
    total_steps >= 5000 & total_steps < 8000 ~ "Medium (5,000-8,000)",
    total_steps >= 8000 & total_steps < 10000 ~ "High (8,000-10,000)",
    total_steps >= 10000 ~ "Very High (10,000+)"
  )) %>%
  group_by(step_category) %>%
  summarize(
    avg_calories = mean(calories),
    n_days = n()
  ) %>%
  mutate(step_category = factor(step_category, 
                                levels = c("Low (<5,000)", "Medium (5,000-8,000)", 
                                           "High (8,000-10,000)", "Very High (10,000+)")))
print(steps_calories)

## # A tibble: 4 × 3
##   step_category        avg_calories n_days
##   <fct>                       <dbl>  <int>
## 1 High (8,000-10,000)         2490.    129
## 2 Low (<5,000)                1943.    182
## 3 Medium (5,000-8,000)        2266.    202
## 4 Very High (10,000+)         2741.    301

Note: Average calorie burn rises with each step category, with the largest jump (roughly 320 calories) between the low and medium categories.
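The size of each step-up can be read directly off the table. The sketch below uses the rounded averages as printed, so the differences are approximate.

```r
# Rounded average calories per step category, in ascending order of activity
avg_calories <- c(Low = 1943, Medium = 2266, High = 2490, `Very High` = 2741)

# Gain in average calories when moving up one category
step_up_gain <- diff(avg_calories)
print(step_up_gain)

# Category reached by the largest single jump
cat("Largest gain on reaching:", names(which.max(step_up_gain)), "\n")
```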

1.6.1.4.2 Total Distance and Calories Burned Relationship

Since distance showed the strongest correlation with calories, I analyzed how different distance ranges impact energy expenditure:

distance_calories <- daily_final %>%
  mutate(distance_category = case_when(
    total_distance < 3.5 ~ "Short (<3.5 km)",
    total_distance >= 3.5 & total_distance < 5.5 ~ "Medium (3.5-5.5 km)",
    total_distance >= 5.5 & total_distance < 7.0 ~ "Long (5.5-7.0 km)",
    total_distance >= 7.0 ~ "Very Long (7.0+ km)"
  )) %>%
  group_by(distance_category) %>%
  summarize(
    avg_calories = mean(calories),
    n_days = n()
  ) %>%
  mutate(distance_category = factor(distance_category,
                                    levels = c("Short (<3.5 km)", "Medium (3.5-5.5 km)",
                                               "Long (5.5-7.0 km)", "Very Long (7.0+ km)")))
print(distance_calories)

## # A tibble: 4 × 3
##   distance_category   avg_calories n_days
##   <fct>                      <dbl>  <int>
## 1 Long (5.5-7.0 km)          2410.    139
## 2 Medium (3.5-5.5 km)        2206.    186
## 3 Short (<3.5 km)            1919.    187
## 4 Very Long (7.0+ km)        2826.    302

Note: Of the three metrics compared, distance has the strongest correlation with calories (0.60). Days with 7+ km average about 47% more calories than days under 3.5 km (2,826 vs. 1,919).
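The relative difference between distance categories can be checked directly from the printed averages. This is a small sketch using the rounded values from the table above, so results may differ slightly from a calculation on the unrounded data:

```r
# Average calories per distance category, copied from the output above
avg_calories <- c(short = 1919, medium = 2206, long = 2410, very_long = 2826)

# Percent increase of each category relative to the "Short" baseline
pct_vs_short <- (avg_calories / avg_calories[["short"]] - 1) * 100
print(round(pct_vs_short, 1))
```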

Strategic Recommendations:

Beginner Focus: Target users in the “Short” distance category with “5,000-step first goals” to help them experience quick calorie-burn improvements
Progressive Challenges: Implement distance-based milestone challenges (3.5km → 5.5km → 7+ km) rather than step-count only goals
Marketing Emphasis: Highlight distance tracking as the primary metric for calorie burn estimation in Bellabeat’s app interface
Advanced Engagement: Create exclusive “7+ km club” recognition for high-performing users seeking maximum results

1.6.1.5 Device Engagement

When and Why Do Users Stop Wearing Their Devices?

Analyzing device wear time provides crucial insights into user engagement and helps Bellabeat optimize product design and marketing strategies. I examined how long users typically wear their devices by analyzing the total_recorded_minutes metric across all observation days.

wear_time_analysis <- daily_final %>%
  summarize(
    avg_wear_hours = mean(total_recorded_minutes) / 60,
    median_wear_hours = median(total_recorded_minutes) / 60,
    pct_full_day = mean(total_recorded_minutes >= 1380) * 100,  # 23+ hours
    pct_most_of_day = mean(total_recorded_minutes >= 1080) * 100,  # 18+ hours
    pct_half_day = mean(total_recorded_minutes >= 720) * 100,  # 12+ hours
    pct_low_wear = mean(total_recorded_minutes < 720) * 100  # <12 hours
  )

print("Device Wear Time Patterns:")

## [1] "Device Wear Time Patterns:"

print(wear_time_analysis)

##   avg_wear_hours median_wear_hours pct_full_day pct_most_of_day pct_half_day
## 1        20.1794             22.35      48.0344        57.49386      98.5258
##   pct_low_wear
## 1     1.474201

Initial Insight: The data shows strong device engagement. Devices were worn for at least 12 hours on nearly all recorded days (98.5%), and for 18 hours or more on over half of them (57.5%).

To provide clearer segmentation, I created distinct wear time categories that prevent double-counting and offer a comprehensive view of user behavior patterns:

wear_categories <- daily_final %>%
  mutate(wear_category = case_when(
    total_recorded_minutes >= 1380 ~ "Full Day (23+ hrs)",
    total_recorded_minutes >= 1080 ~ "Most of Day (18-23 hrs)",
    total_recorded_minutes >= 720 ~ "Half Day (12-18 hrs)", 
    total_recorded_minutes >= 360 ~ "Partial Day (6-12 hrs)",
    TRUE ~ "Low Wear (<6 hrs)"
  )) %>%
  count(wear_category) %>%
  mutate(percentage = n / sum(n) * 100) %>%
  mutate(wear_category = factor(wear_category,
                                levels = c("Full Day (23+ hrs)", "Most of Day (18-23 hrs)",
                                           "Half Day (12-18 hrs)", "Partial Day (6-12 hrs)", 
                                           "Low Wear (<6 hrs)")))

print(wear_categories)

##             wear_category   n percentage
## 1      Full Day (23+ hrs) 391  48.034398
## 2    Half Day (12-18 hrs) 334  41.031941
## 3 Most of Day (18-23 hrs)  77   9.459459
## 4  Partial Day (6-12 hrs)  12   1.474201

Key Findings: Device usage patterns reveal two dominant wear-time behaviors:

41.0% of days show half-day wear (12-18 hours), possibly because devices are removed overnight for charging
48.0% of days show near-continuous wear (23+ hours), consistent with overnight sleep tracking

Business Implication: This segmentation suggests different charging habits and feature usage, informing battery optimization and notification strategies.

Notes: No days fall in the low-wear category (<6 hours). Because non-wear records were removed during cleaning (see Appendix A), this partly reflects data preparation. Even so, it suggests the fitness trackers integrate into daily routines without causing significant wear fatigue.

Strategic Implications: This high engagement level provides Bellabeat with reliable data collection and presents opportunities to enhance features for both casual and power users through differentiated battery optimization and notification strategies.
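The cumulative thresholds in the first summary and the exclusive categories in the second table describe the same days. As a quick consistency check, each exclusive share should equal the difference between two consecutive cumulative shares. The sketch below uses the percentages printed above:

```r
# Cumulative shares of days at or above each wear-time threshold (first summary)
cumulative_pct <- c(full_day = 48.0344, most_of_day = 57.49386, half_day = 98.5258)

# Exclusive bins are differences between consecutive cumulative shares
exclusive_pct <- c(
  full_day    = cumulative_pct[["full_day"]],
  most_of_day = cumulative_pct[["most_of_day"]] - cumulative_pct[["full_day"]],
  half_day    = cumulative_pct[["half_day"]] - cumulative_pct[["most_of_day"]],
  partial_day = 100 - cumulative_pct[["half_day"]]
)
print(round(exclusive_pct, 2))
```

The results line up with the category table (48.03%, 9.46%, 41.03%, 1.47%), which confirms that the two views are consistent.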

1.6.1.6 Weekly Activity Patterns

How Does User Activity Vary by Day of the Week?

This analysis examines how activity levels fluctuate throughout the week, helping Bellabeat identify optimal timing for challenges and motivational content. I analyzed weekly patterns across three key metrics: steps, calories burned, and distance covered.

# Adding the days of the week to the table
daily_final <- daily_final %>%
  mutate(day_of_week = weekdays(activity_date),
         day_of_week = factor(day_of_week, 
                              levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                                         "Friday", "Saturday", "Sunday")))

# Analyze activity by the days of week
weekly_patterns <- daily_final %>%
  group_by(day_of_week) %>%
  summarize(
    avg_steps = mean(total_steps),
    avg_calories = mean(calories),
    avg_distance = mean(total_distance),
    n_days = n()
  )

print(weekly_patterns)

## # A tibble: 7 × 5
##   day_of_week avg_steps avg_calories avg_distance n_days
##   <fct>           <dbl>        <dbl>        <dbl>  <int>
## 1 Monday          8844.        2403.         6.30    104
## 2 Tuesday         9191.        2458.         6.59    131
## 3 Wednesday       8480.        2380.         6.16    133
## 4 Thursday        8822.        2380.         6.33    121
## 5 Friday          8208.        2393.         5.85    114
## 6 Saturday        9206.        2453.         6.61    109
## 7 Sunday          8138.        2365.         5.90    102

Daily Step Pattern

Key Insight: Saturday and Tuesday show the highest step counts, indicating these are peak activity days for users. The variation in daily step counts reveals clear weekly rhythms that Bellabeat can leverage for targeted engagement.

Note: The number of observation days varies by weekday, which may influence these patterns. However, the consistent trends provide valuable insights for timing challenges and motivation strategies.

Daily Calories Pattern

Pattern Analysis: Calorie expenditure follows a similar weekly pattern to step counts, with Tuesday showing slightly higher energy expenditure than Saturday. This consistency across metrics strengthens the reliability of observed weekly trends.

Daily Distance Pattern

Consistent Weekly Rhythm: Steps and distance share the same pattern, with Wednesday, Friday, and Sunday showing the lowest averages and Monday and Thursday falling between the highest and lowest days. Calories follow the same direction less sharply: Sunday is lowest (2,365), but Wednesday and Thursday tie at 2,380, and the full weekly range is under 100 calories. The agreement across metrics suggests a genuine behavioral pattern, although the differences between days are small relative to the averages.
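Since the number of observation days differs by weekday and the gaps between days are small, it can help to report a standard error next to each weekday mean. The sketch below is a minimal illustration on simulated data (the dates and step counts are hypothetical). It also derives the weekday from the `"%u"` format code, which avoids the locale dependence of `weekdays()`:

```r
# Hypothetical data: 28 consecutive days starting on a Monday, simulated steps
set.seed(42)
activity <- data.frame(
  activity_date = as.Date("2016-04-11") + 0:27,
  total_steps   = round(rnorm(28, mean = 8500, sd = 3000))
)

# "%u" gives the weekday number (1 = Monday) regardless of the system locale
day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")
activity$day_of_week <- factor(
  day_names[as.integer(format(activity$activity_date, "%u"))],
  levels = day_names
)

# Mean, count, and standard error of the mean for each weekday
weekday_stats <- do.call(rbind, lapply(
  split(activity$total_steps, activity$day_of_week),
  function(x) data.frame(avg_steps = mean(x), n_days = length(x),
                         std_error = sd(x) / sqrt(length(x)))
))
print(weekday_stats)
```

If two weekday means differ by less than roughly two standard errors, the gap is within the range that day-to-day noise alone could produce.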

Strategic Implications:

Peak Engagement Days: Leverage Tuesday and Saturday’s natural activity peaks for launching new challenges and features
Mid-Week Support: Target Wednesday with gentle motivational boosts to counteract the mid-week activity dip, accounting for potential hormonal energy fluctuations
Weekend Bookends: Use Friday to build momentum for weekend activities and Sunday for recovery planning
Program Timing: Schedule intensive fitness programs to align with natural activity rhythms, with optional cycle-aware adjustments for energy variations
Moderate Day Optimization: Position Monday and Thursday as ideal days for maintenance activities and skill-building content

1.6.2 `Sleep_Final` Dataset Analysis

Building on the activity analysis, I examined sleep patterns to understand user rest and recovery behaviors. This analysis provides crucial insights for Bellabeat’s sleep tracking features and wellness recommendations.

sleep_summary <- sleep_final %>%
  summarise(
    users = n_distinct(id),
    avg_sleep_hours = mean(total_minutes_asleep) / 60,
    avg_sleep_efficiency = mean(sleep_efficiency),
    pct_excellent_sleep = mean(sleep_quality == "Excellent") * 100
  )

print(sleep_summary)

##   users avg_sleep_hours avg_sleep_efficiency pct_excellent_sleep
## 1    19        7.212871             91.86517            92.67016

This baseline summary revealed key sleep patterns that guided my deeper investigation. I focused on answering these critical questions about user sleep behaviors:

Sleep Adequacy: What is the average sleep duration and efficiency compared to recommended 7-8 hours?
Sleep Quality: How much restless sleep do users experience?
Sleep Excellence: What percentage of users achieve “good quality” sleep?
Behavioral Patterns: Are there consistent weekly sleep rhythms?

1.6.2.1 Sleep Insights: Sleep Duration & Efficiency Analysis

Understanding Sleep Adequacy and Quality

I began by analyzing whether users meet recommended sleep guidelines (7-8 hours of sleep) and how efficiently they sleep while in bed.

sleep_metrics <- sleep_final %>%
  summarize(
    # Sleep duration
    avg_sleep_minutes = mean(total_minutes_asleep),
    avg_sleep_hours = mean(sleep_duration_hours),
    median_sleep_hours = median(sleep_duration_hours),
    
    # Sleep efficiency (time asleep vs time in bed)
    avg_sleep_efficiency = mean(sleep_efficiency),
    median_sleep_efficiency = median(sleep_efficiency),
    
    # Comparison to recommended ranges
    pct_meeting_7_hours = mean(sleep_duration_hours >= 7) * 100,  # 7+ hours
    pct_meeting_8_hours = mean(sleep_duration_hours >= 8) * 100,  # 8+ hours
    pct_below_7_hours = mean(sleep_duration_hours < 7) * 100,     # <7 hours
    pct_above_8_hours = mean(sleep_duration_hours > 8) * 100      # >8 hours
  )

print("Sleep Duration and Efficiency Metrics:")

## [1] "Sleep Duration and Efficiency Metrics:"

print(sleep_metrics)

##   avg_sleep_minutes avg_sleep_hours median_sleep_hours avg_sleep_efficiency
## 1          432.7723        7.212871           7.291667             91.86517
##   median_sleep_efficiency pct_meeting_7_hours pct_meeting_8_hours
## 1                94.31843            58.37696            28.79581
##   pct_below_7_hours pct_above_8_hours
## 1          41.62304          28.27225

Note: These percentages use inclusive thresholds (≥7 hours, ≥8 hours) and represent all 382 sleep records in our cleaned dataset.

We see that 58.38% of days meet or exceed the 7-hour minimum recommendation, while 28.80% achieve 8+ hours. Additionally, 41.62% fall below 7 hours and 28.27% exceed 8 hours of sleep.

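Important Context: These four percentages overlap rather than forming separate categories. The “7+ hours” and “below 7 hours” groups are complements (58.38% + 41.62% = 100%), while days with 8+ hours are also counted in the “7+ hours” group, and days above 8 hours are a subset of the 8+ group.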

To better visualize sleep patterns with non-overlapping categories, I classified sleep duration into relevant ranges:

sleep_duration_categories <- sleep_final %>%
  mutate(sleep_category = case_when(
    total_minutes_asleep < 360 ~ "Very Short (<6 hrs)",
    total_minutes_asleep >= 360 & total_minutes_asleep < 420 ~ "Short (6-7 hrs)",
    total_minutes_asleep >= 420 & total_minutes_asleep < 480 ~ "Recommended (7-8 hrs)",
    total_minutes_asleep >= 480 & total_minutes_asleep < 540 ~ "Long (8-9 hrs)",
    total_minutes_asleep >= 540 ~ "Very Long (9+ hrs)"
  )) %>%
  count(sleep_category) %>%
  mutate(percentage = n / sum(n) * 100) %>%
  mutate(sleep_category = factor(sleep_category,
                                 levels = c("Very Short (<6 hrs)", "Short (6-7 hrs)", 
                                            "Recommended (7-8 hrs)", "Long (8-9 hrs)", 
                                            "Very Long (9+ hrs)")))

Sleep Duration Distribution:

58.4% of days meet the minimum 7-hour recommendation
28.8% of days reach 8+ hours
41.6% of days fall below the recommended 7 hours

Key Insight: While most days meet the 7-hour minimum, only 29.6% fall within the recommended 7-8 hour range (58.4% minus 28.8%), and 41.6% fall short of it, indicating significant opportunity for sleep duration improvement.
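The 29.6% figure follows from the threshold percentages reported earlier. The short check below uses the values printed in the `sleep_metrics` output:

```r
# Shares of days from the sleep_metrics output above
pct_at_least_7 <- 58.37696
pct_at_least_8 <- 28.79581
pct_below_7    <- 41.62304

# ">= 7" and "< 7" are complements, so they sum to 100
total_check <- pct_at_least_7 + pct_below_7

# Days in the 7-8 hour band are those at >= 7 but not at >= 8
pct_7_to_8 <- pct_at_least_7 - pct_at_least_8
print(round(c(total = total_check, band_7_to_8 = pct_7_to_8), 1))
```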

Sleep Efficiency Analysis

After analyzing sleep duration, I examined sleep efficiency because time in bed doesn’t equal quality sleep.

Sleep efficiency - the percentage of time in bed actually spent asleep - provides deeper insights into sleep quality beyond simple duration metrics. While sleep duration tells us how long users sleep, efficiency reveals how well they sleep during that time.

sleep_efficiency_analysis <- sleep_final %>%
  mutate(sleep_efficiency = total_minutes_asleep / total_time_in_bed * 100) %>%
  filter(sleep_efficiency <= 100) %>%  # Remove unrealistic values >100%
  mutate(efficiency_category = case_when(
    sleep_efficiency >= 85 ~ "Excellent (85%+)",
    sleep_efficiency >= 75 ~ "Good (75-85%)",
    sleep_efficiency >= 65 ~ "Fair (65-75%)",
    TRUE ~ "Poor (<65%)"
  )) %>%
  count(efficiency_category) %>%
  mutate(percentage = n / sum(n) * 100) %>%
  mutate(efficiency_category = factor(efficiency_category,
                                      levels = c("Excellent (85%+)", "Good (75-85%)",
                                                 "Fair (65-75%)", "Poor (<65%)")))

print("Sleep Efficiency Categories:")

## [1] "Sleep Efficiency Categories:"

print(sleep_efficiency_analysis)

##   efficiency_category   n percentage
## 1    Excellent (85%+) 354 92.6701571
## 2       Fair (65-75%)   7  1.8324607
## 3       Good (75-85%)   3  0.7853403
## 4         Poor (<65%)  18  4.7120419

Key Insight: 92.7% of analyzed days achieve excellent sleep efficiency (≥85%), revealing a crucial distinction for Bellabeat’s product strategy:

Sleep Quality: Excellent (92.7% of days at ≥85% efficiency) - not the primary challenge
Sleep Duration: Needs improvement (41.6% below 7 hours) - the key opportunity

Strategic Implication: Bellabeat should focus on helping users prioritize sufficient sleep time rather than improving sleep quality.

Strategic Sleep Recommendations:

Implement “7-hour sleep challenge” programs for users below recommended duration
Develop bedtime reminder features to improve sleep consistency
Create sleep education content on efficiency improvement for the small share of nights (6.5%) rated Fair or Poor
Introduce wind-down routines for users with high restlessness

Sleep Efficiency Analysis Note: There are two ways to calculate sleep quality metrics:

Average of sleep efficiency percentages: 91.87%
Percentage of records meeting “Excellent” threshold: 92.67%

We use the second method (record counting) for consistency with other categorical analysis and because it more intuitively represents what portion of sleep sessions were high quality.
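The two figures answer different questions: the first is an average level, the second is the share of records above a cutoff. A small sketch with hypothetical nightly efficiencies (not the FitBit data) shows how they can diverge when a few poor nights pull the mean down:

```r
# Hypothetical nightly efficiencies (percent); not from the FitBit data
sleep_efficiency <- c(96, 94, 93, 91, 90, 88, 87, 86, 70, 55)

mean_efficiency <- mean(sleep_efficiency)              # average level
pct_excellent   <- mean(sleep_efficiency >= 85) * 100  # share of nights at 85%+

print(c(mean_efficiency = mean_efficiency, pct_excellent = pct_excellent))
```

Here the mean sits exactly at the 85% cutoff while 80% of nights exceed it, so the two numbers should not be used interchangeably.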

1.6.2.2 Sleep Restlessness Analysis

Understanding Sleep Disruption Patterns

Restless sleep provides insights into sleep quality beyond simple duration metrics, revealing potential issues with sleep maintenance and depth.

restlessness_analysis <- sleep_final %>%
  mutate(
    total_sleep_time = total_minutes_asleep,
    restlessness_percentage = (total_minutes_restless / total_time_in_bed) * 100
  ) %>%
  filter(restlessness_percentage <= 100) %>%  # Remove unrealistic values
  mutate(restlessness_category = case_when(
    restlessness_percentage < 10 ~ "Excellent (<10%)",
    restlessness_percentage >= 10 & restlessness_percentage < 20 ~ "Good (10-20%)",
    restlessness_percentage >= 20 & restlessness_percentage < 30 ~ "Fair (20-30%)",
    restlessness_percentage >= 30 ~ "Poor (30%+)"
  ))

# Summarize restlessness distribution
restlessness_summary <- restlessness_analysis %>%
  count(restlessness_category) %>%
  mutate(percentage = n / sum(n) * 100) %>%
  mutate(restlessness_category = factor(restlessness_category,
                                        levels = c("Excellent (<10%)", "Good (10-20%)", 
                                                   "Fair (20-30%)", "Poor (30%+)")))

print("Sleep Restlessness Distribution:")

## [1] "Sleep Restlessness Distribution:"

print(restlessness_summary)

##   restlessness_category   n percentage
## 1      Excellent (<10%) 335  87.696335
## 2         Fair (20-30%)   5   1.308901
## 3         Good (10-20%)  20   5.235602
## 4           Poor (30%+)  22   5.759162

Sleep Restlessness Insight:

87.7% of nights show excellent sleep (<10% restless time)
5.8% of nights show poor sleep (30%+ restless time)

Business Implication: While most users enjoy restful sleep, a small but significant segment needs targeted sleep quality improvement features.

Note: Sleep efficiency and restlessness use different categorization criteria, which explains the variation in excellent sleep percentages between analyses.

1.6.2.3 User Sleep Consistency Patterns

Understanding Individual Sleep Behavior Stability

To address the question of sleep consistency, I analyzed how consistently individual users maintain their sleep patterns night after night. This reveals whether users have stable sleep routines or experience significant night-to-night variations.

user_sleep_consistency <- sleep_final %>%
  group_by(id) %>%
  summarize(
    avg_sleep_hours = mean(sleep_duration_hours),
    sleep_std_dev = sd(sleep_duration_hours),
    consistency_score = (1 - (sleep_std_dev / avg_sleep_hours)) * 100,
    consistency_category = case_when(
      consistency_score >= 90 ~ "Highly Consistent",
      consistency_score >= 80 ~ "Moderately Consistent",
      consistency_score >= 70 ~ "Somewhat Consistent", 
      TRUE ~ "Inconsistent"
    )
  )

# Summarize consistency patterns
consistency_summary <- user_sleep_consistency %>%
  count(consistency_category) %>%
  mutate(percentage = n / sum(n) * 100)

print("Sleep Duration Consistency Patterns:")

## [1] "Sleep Duration Consistency Patterns:"

print(consistency_summary)

## # A tibble: 3 × 3
##   consistency_category      n percentage
##   <chr>                 <int>      <dbl>
## 1 Highly Consistent         2       10.5
## 2 Moderately Consistent    11       57.9
## 3 Somewhat Consistent       6       31.6

Consistency Analysis Results: The data reveals three distinct user segments based on sleep pattern stability:

Highly Consistent (10.5%): 2 users maintain very stable sleep durations with minimal night-to-night variation
Moderately Consistent (57.9%): 11 users show reasonable sleep routine stability
Somewhat Consistent (31.6%): 6 users experience noticeable variations in sleep duration

Strategic Implications:

For Highly Consistent Users: Leverage their stable patterns for advanced sleep optimization features, including cycle-aware sleep recommendations during hormonal transitions
For Moderately Consistent Users: Provide routine-building tools and consistency challenges that account for monthly energy fluctuations
For Somewhat Consistent Users: Focus on sleep schedule stabilization and habit formation with flexible adjustments for cycle-related sleep changes

This analysis shows that while most users (68.4%) maintain moderate to high sleep consistency, nearly one-third would benefit from Bellabeat’s sleep routine stabilization features.
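The consistency score used above is 100 × (1 − standard deviation / mean), so a lower night-to-night spread relative to the average gives a higher score. The sketch below applies that formula to two hypothetical users whose values are made up for illustration:

```r
# Consistency score as defined above: 100 * (1 - sd / mean)
consistency_score <- function(hours) {
  (1 - sd(hours) / mean(hours)) * 100
}

# Hypothetical nightly sleep durations (hours) for two users
steady_user   <- c(7, 7.5, 7, 7.5, 7, 7.5)
variable_user <- c(5, 9, 6, 8.5, 4.5, 9)

scores <- c(steady   = consistency_score(steady_user),
            variable = consistency_score(variable_user))
print(round(scores, 1))
```

Both users average roughly seven hours, yet the steady user lands in the "Highly Consistent" band and the variable user in "Somewhat Consistent". Note that `sd()` returns `NA` for a user with a single night, so such users would fall through to the default category.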

1.6.2.4 Sleep Timing Pattern

Understanding How Bedtime Affects Sleep Quality

Analyzing sleep timing provides insights into optimal bedtime windows and helps Bellabeat develop personalized sleep schedule recommendations.

sleep_timing_analysis <- sleep_final %>%
  filter(!is.na(first_sleep_start)) %>%  # Remove records with missing start times
  mutate(
    # Extract hour from sleep start time
    sleep_start_hour = hour(first_sleep_start),
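    # Caveat: hours 0-20 all satisfy `< 21`, so after-midnight start times
    # (for example 1 AM) are grouped with "Early (Before 9 PM)"
    sleep_timing_category = case_when(
      sleep_start_hour < 21 ~ "Early (Before 9 PM)",
      sleep_start_hour < 23 ~ "Average (9-11 PM)", 
      sleep_start_hour >= 23 ~ "Late (After 11 PM)"
    ),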
    sleep_timing_category = factor(sleep_timing_category, 
                                   levels = c("Early (Before 9 PM)", "Average (9-11 PM)", "Late (After 11 PM)"))
  ) %>%
  group_by(sleep_timing_category) %>%
  summarize(
    avg_sleep_duration = mean(sleep_duration_hours),
    avg_efficiency = mean(sleep_efficiency),
    n_records = n()
  )

print("Sleep Timing Analysis:")

## [1] "Sleep Timing Analysis:"

print(sleep_timing_analysis)

## # A tibble: 3 × 4
##   sleep_timing_category avg_sleep_duration avg_efficiency n_records
##   <fct>                              <dbl>          <dbl>     <int>
## 1 Early (Before 9 PM)                 6.70           89.3       162
## 2 Average (9-11 PM)                   7.59           93.3       149
## 3 Late (After 11 PM)                  7.59           94.7        71

Sleep Timing Insights: The analysis shows differences in sleep outcomes across bedtime groups. Because the categories are assigned by clock hour, the “Early” group (hour below 21) also captures sleep that starts after midnight, so its results may mix genuinely early and very late bedtimes:

Early Sleepers (Before 9 PM): Experience shorter sleep duration (6.70 hours) with moderate efficiency (89.3%)
Average Sleepers (9-11 PM): Average 7.59 hours of sleep with high efficiency (93.3%)
Late Sleepers (After 11 PM): Match that duration (7.59 hours) with the highest efficiency (94.7%)

Strategic Recommendations:

For Early Sleepers: Develop “sleep maintenance” features to help users stay asleep longer
For Average Sleepers: Reinforce these optimal sleep timing habits through positive feedback
For Late Sleepers: Leverage their high efficiency patterns while ensuring adequate total sleep time

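Note: Records with missing start times are filtered out before grouping. The three groups above total 382 records, the size of the cleaned sleep dataset, so any exclusions for incomplete timing data took place before this step.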
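Since the bedtime categories are assigned from the clock hour alone, a start time of 1 AM has an hour value of 1 and is grouped as "Early". The sketch below shows one possible way to treat after-midnight starts as late. The function name `classify_bedtime` and the 5 AM cutoff are illustrative assumptions, not part of the original analysis:

```r
# Classify a bedtime hour (0-23), treating after-midnight starts as late.
# The late_until cutoff (5 AM) is an illustrative assumption.
classify_bedtime <- function(hour, late_until = 5) {
  ifelse(hour >= 23 | hour < late_until, "Late (After 11 PM)",
         ifelse(hour >= 21, "Average (9-11 PM)", "Early (Before 9 PM)"))
}

example_hours <- c(20, 21, 22, 23, 0, 2)
print(data.frame(hour = example_hours,
                 category = classify_bedtime(example_hours)))
```

Re-running the grouped summary with a rule like this would show whether the shorter duration in the "Early" group reflects early bedtimes or very late ones.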

1.6.2.5 Weekly Sleep Pattern Analysis

Identifying Rhythms in Sleep Behavior

Understanding how sleep patterns vary throughout the week helps Bellabeat time sleep-focused content and challenges effectively.

sleep_final <- sleep_final %>%
  mutate(day_of_week = weekdays(sleep_date),
         day_of_week = factor(day_of_week, 
                              levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                                         "Friday", "Saturday", "Sunday")))

# Analyze sleep by day of the week
weekly_sleep_patterns <- sleep_final %>%
  group_by(day_of_week) %>%
  summarize(
    avg_sleep_duration = mean(sleep_duration_hours),
    avg_sleep_efficiency = mean(sleep_efficiency),
    avg_restlessness = mean(restless_percentage),
    n_days = n()
  )

print(weekly_sleep_patterns)

## # A tibble: 7 × 5
##   day_of_week avg_sleep_duration avg_sleep_efficiency avg_restlessness n_days
##   <fct>                    <dbl>                <dbl>            <dbl>  <int>
## 1 Monday                    7.11                 92.0             7.04     43
## 2 Tuesday                   6.73                 91.1             7.94     63
## 3 Wednesday                 7.32                 92.8             6.24     65
## 4 Thursday                  7.02                 92.2             7.15     60
## 5 Friday                    7.05                 92.2             6.85     52
## 6 Saturday                  7.32                 91.6             7.50     50
## 7 Sunday                    8.09                 91.0             8.08     49

Before analyzing daily patterns, I examined the relationships between key sleep metrics:

sleep_activity_correlation <- sleep_final %>%
  summarize(
    # Time asleep vs. time in bed
    duration_quality_cor = cor(total_minutes_asleep, total_time_in_bed),
    # Time in bed vs. restless minutes
    duration_restless_cor = cor(total_time_in_bed, total_minutes_restless),
    # Time in bed vs. sleep efficiency
    duration_efficiency_cor = cor(total_time_in_bed, sleep_efficiency)
  )

print(sleep_activity_correlation)

##   duration_quality_cor duration_restless_cor duration_efficiency_cor
## 1            0.8912864              0.134657             0.008248568

Note: There is a strong positive correlation (0.89) between total time asleep and total time in bed: as time in bed increases, time asleep increases. This is partly expected, since time asleep is a component of time in bed. Time in bed shows little correlation with restless minutes (0.13) or with sleep efficiency (0.01).
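A high correlation between a whole and one of its parts can arise even when nothing else links them. The simulation below is a sketch on hypothetical data: efficiency is drawn independently of time in bed, and time asleep is computed as their product.

```r
# Hypothetical nights: time asleep is simulated as a share of time in bed
set.seed(1)
time_in_bed    <- round(runif(200, min = 300, max = 600))  # minutes
efficiency     <- runif(200, min = 0.80, max = 0.98)       # drawn independently
minutes_asleep <- time_in_bed * efficiency

cor_asleep_in_bed     <- cor(minutes_asleep, time_in_bed)
cor_in_bed_efficiency <- cor(time_in_bed, efficiency * 100)

print(round(c(asleep_vs_in_bed = cor_asleep_in_bed,
              in_bed_vs_efficiency = cor_in_bed_efficiency), 2))
```

The first correlation comes out high and the second close to zero, the same shape as the results above. This suggests the 0.89 figure mostly reflects how the two variables are constructed rather than a separate behavioral finding.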

Sleep Duration by Weekday

The analysis reveals moderate variation in sleep duration throughout the week:

Peak sleep: Sunday (8.09 hours), followed by Wednesday and Saturday (7.32 hours each)
Lowest sleep: Tuesday (6.73 hours)
Weekly range: 1.36 hours difference between highest and lowest days

Strategic Insight: While sleep duration fluctuates, the variation is relatively modest, suggesting users maintain somewhat consistent sleep schedules regardless of weekday.

Sleep Restlessness by Weekday

Restlessness shows the following weekly pattern:

Highest disruption: Sunday (8.08%) and Tuesday (7.94%)
Lowest disruption: Wednesday (6.24%)
Pattern interpretation: No clear systematic weekly pattern, suggesting restlessness may be influenced more by individual lifestyle factors than day of week

Sleep Efficiency by Weekday

Sleep efficiency demonstrates remarkable consistency:

All days maintain excellent levels: Above 90% efficiency
Narrow range: Only 1.8 percentage points between Wednesday’s peak (92.8%) and Sunday’s low (91.0%)
Key finding: Unlike activity patterns, sleep quality remains consistently high throughout the week

Strategic Advantage: This reliability allows Bellabeat to build sleep features that work consistently, while personalizing for individual cycle-related sleep pattern variations.

1.7 Strategic Recommendations & Product Roadmap

1.7.1 Immediate Actions (0-3 Months) - Q1 Implementation

1.7.1.1 Tiered Goal System for Bellabeat App

Insight: 4 distinct user consistency segments (24% highly consistent, 24% rarely active)
Product Implication: One-size-fits-all goals miss engagement opportunities
Action:
- Bellabeat App: Implement adaptive goal setting
  - Struggling users: Gentle progressive goals (5,000 → 7,000 steps)
  - Consistent performers: Advanced challenges (10,000+ steps, 7km+ distance)
- Bellabeat Ivy+: Display consistency badges on device screen

1.7.1.2 Sleep Duration Optimization for Bellabeat Leaf

Insight: 92.7% of nights reach excellent sleep efficiency (≥85%), but 41.6% fall below the recommended 7 hours
Product Implication: Users need help prioritizing sleep time, not quality
Action:
- Bellabeat Leaf: “7-hour sleep challenge” with gentle bedtime reminders
- Bellabeat App: Sleep duration tracking with “consistency score”

1.7.2 Medium-term Initiatives (3-6 Months) - Q2-Q3 Development

1.7.2.1 Cycle-Aware Features - Bellabeat’s Competitive Edge

Insight: Current dataset lacks women’s health metrics - Bellabeat’s strategic advantage
Product Implication: Generic fitness tracking misses women’s unique patterns
Action:
- Bellabeat App: Hormonal phase-aware activity recommendations
- Bellabeat Ivy+: Cycle-synced goal adjustment based on energy fluctuations
- All Products: Menstrual cycle integration with sleep/activity patterns

1.7.2.2 Distance-Based Milestone System

Insight: Distance shows strongest calorie correlation (r=0.60) vs steps (r=0.53)
Product Implication: Distance tracking provides more accurate energy estimation
Action:
- Bellabeat App: Prominently feature distance milestones (3.5km, 5.5km, 7km)
- Bellabeat Ivy+: Create “5K Club” challenges with distance-based rewards

1.7.3 Long-term Strategy (6-12 Months) - Strategic Differentiation

1.7.3.1 Weekly Rhythm Integration

Insight: Clear Tuesday/Saturday activity peaks with stable sleep throughout week
Product Implication: Timing matters for engagement and habit formation
Action:
- Bellabeat App: Automated challenge scheduling aligned with natural rhythms
- Marketing: “Tuesday Trek” and “Saturday Stroll” campaign launches

1.7.3.2 Advanced Sleep Personalization

Insight: 31.6% of users have variable sleep patterns requiring stabilization
Product Implication: Sleep consistency drives overall wellness
Action:
- Bellabeat Leaf: Smart wake-up alarms based on sleep cycle detection
- Bellabeat App: Sleep schedule coaching for inconsistent users

1.8 Strategic Conclusion

This analysis demonstrates that while working with third-party fitness data has limitations, valuable insights emerge when viewed through Bellabeat’s women’s wellness lens. The key differentiator lies not in the raw activity metrics, but in how Bellabeat can:

Contextualize General Insights within women’s unique health journeys and hormonal cycles
Leverage High Engagement (devices worn 12+ hours on 98.5% of days) for continuous, personalized wellness monitoring
Address Sleep Duration Gaps while maintaining excellent sleep quality (92.7% of nights at ≥85% efficiency) messaging
Implement Tiered Personalization based on clear user consistency patterns (four distinct segments)

By translating these general fitness patterns into women-centric applications, Bellabeat can strengthen its market position through data-informed personalization that truly understands and adapts to female wellness needs across different life stages and consistency levels.

1.9 Appendix

1.9.1 Appendix A: Distance Validation

Purpose: Comprehensive data validation to ensure analysis reliability across distance calculations, activity minutes, and device wear patterns.

Key Findings:

Distance Integrity: 78.4% perfect match rate between distance components
Activity Recording: 52.6% of days have incomplete activity minutes (227 missing minutes per day on average across all days)
Data Quality: Removed 0.85% of records for non-wear periods
Decision: Retained realistic usage patterns while documenting limitations

Distance Validation

Methodology: Compared sum of individual distance components (VeryActiveDistance, ModeratelyActiveDistance, LightActiveDistance, SedentaryActiveDistance) against TotalDistance.

# Validate distance column sums
distance_check <- daily_activity %>%
  mutate(
    calculated_total = VeryActiveDistance + ModeratelyActiveDistance + 
                      LightActiveDistance + SedentaryActiveDistance,
    difference = TotalDistance - calculated_total,
    matches_perfectly = abs(difference) < 0.01  # Allow tiny rounding differences
  )

# Check matching accuracy
cat("Rows where distances sum perfectly:", sum(distance_check$matches_perfectly), 
    "out of", nrow(distance_check), "\n")

perfect_match_rate <- mean(distance_check$matches_perfectly) * 100
cat("Percentage of perfect matches:", round(perfect_match_rate, 2), "%\n")

Data Quality Finding: 78.4% of records (737/940) showed perfect alignment between distance components and total distance, indicating generally reliable tracking with minor inconsistencies requiring documentation.
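The tolerance in `matches_perfectly` matters because sums of decimal values rarely compare exactly equal in floating-point arithmetic. A minimal illustration with hypothetical distance components:

```r
# Hypothetical distance components (km) whose sum should equal the total
very_active <- 0.1
light       <- 0.2
total       <- 0.3

# Exact comparison fails because 0.1 + 0.2 is not stored as exactly 0.3
exact_match <- (very_active + light) == total

# Comparison within the 0.01 km tolerance used above succeeds
tolerance_match <- abs(total - (very_active + light)) < 0.01

print(c(exact = exact_match, tolerance = tolerance_match))
```

An exact `==` check would therefore flag rows as mismatches that differ only by rounding error, which is why a small tolerance is the safer test.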

Discrepancy Analysis: A histogram of the calculation differences shows that most variances are minimal and clustered near zero, with a small number of significant outliers requiring attention.

Significant Discrepancy Identification:

# Large discrepancies
large_diff <- distance_check %>% filter(abs(difference) > 0.1) %>% nrow()
cat("Large discrepancies (>0.1 km):", large_diff, "rows\n")

## Large discrepancies (>0.1 km): 24 rows

Key Insights:

Minimal Data Impact: Only 2.6% of records (24 rows) show significant discrepancies (>0.1 km)
Extreme Cases Noted: Some variances represent 100% differences from total distance
High Overall Quality: Majority of differences are negligible (<0.1 km)

Tracker Distance Validation

I compared the TrackerDistance and TotalDistance columns to assess data consistency:

# Compare tracker vs total distance
tracker_diff <- daily_activity %>%
  mutate(diff = abs(TrackerDistance - TotalDistance) > 0.01) %>%
  summarise(discrepant_rows = sum(diff))

cat("TrackerDistance discrepancies:", tracker_diff$discrepant_rows, "/ 940 rows\n")

## TrackerDistance discrepancies: 15 / 940 rows

Validation Outcome: Minimal discrepancies (15/940 rows, 1.6%) confirm high data consistency, supporting the decision to remove the redundant TrackerDistance column while documenting this minor data quality note.

Activity Minutes Validation

Verified completeness of daily activity recording by checking if sum of activity minutes equals 1440 (total minutes per day):

# Validate minute sums
minutes_validation <- daily_clean %>% 
  mutate(
    total_active_minutes = very_active_minutes + fairly_active_minutes + 
                          lightly_active_minutes + sedentary_minutes,
    minutes_difference = 1440 - total_active_minutes,
    minutes_match_perfectly = minutes_difference == 0,
    minutes_match_closely = abs(minutes_difference) <= 5  # 5-minute tolerance
  )

# Calculate validation results
# A row fails when its gap exceeds the 5-minute tolerance (exact matches always pass)
minutes_failed <- minutes_validation %>% 
  filter(!minutes_match_closely)

validation_failed_pct <- round(nrow(minutes_failed) / nrow(daily_clean) * 100, 2)

cat("Rows failing minute validation:", nrow(minutes_failed), "/", nrow(daily_clean), 
    "(", validation_failed_pct, "%)\n")

## Rows failing minute validation: 432 / 821 ( 52.62 %)

Finding: 52.62% of rows have activity minutes that don’t sum to 1440, even with a 5-minute tolerance, indicating significant device non-wear time.

Analysis of Missing Minutes

To understand the pattern of missing data:

# Analyze missing minutes pattern
minute_pattern_analysis <- minutes_validation %>%
  mutate(missing_minutes = 1440 - total_active_minutes) %>%
  summarise(
    avg_missing_minutes = mean(missing_minutes),
    median_missing_minutes = median(missing_minutes),
    max_missing_minutes = max(missing_minutes),
    min_missing_minutes = min(missing_minutes),
    pct_with_missing_minutes = mean(missing_minutes > 0) * 100
  )

cat("Missing Minutes Analysis:\n")

## Missing Minutes Analysis:

print(minute_pattern_analysis)

##   avg_missing_minutes median_missing_minutes max_missing_minutes
## 1            227.2814                     82                1060
##   min_missing_minutes pct_with_missing_minutes
## 1                   0                 52.86236

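Key Insight: The mean of 227 missing minutes corresponds to about 20.2 recorded hours per day when averaged across all days, but the gaps are unevenly distributed. Only 52.86% of days have any missing minutes, and because the minimum is 0, those days average roughly 430 missing minutes (about 7.2 hours), while the remaining 47.14% of days are fully recorded. The median of 82 minutes, well below the mean, reflects this skew.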
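
The arithmetic behind these figures can be reproduced directly from the summary output above. The variable names in this sketch are illustrative only. Because the minimum of `missing_minutes` is 0, days without a gap contribute nothing to the total, so the mean over days with a gap equals the overall mean divided by the share of such days.

```r
# Values copied from the summary output above
avg_missing_all <- 227.2814    # mean missing minutes across all days
share_with_gap  <- 0.5286236   # fraction of days with missing_minutes > 0

# Mean missing minutes on days that have a gap
avg_missing_gap_days <- avg_missing_all / share_with_gap

# Convert to hours
avg_recorded_hours_all <- (1440 - avg_missing_all) / 60
avg_missing_hours_gap  <- avg_missing_gap_days / 60

round(avg_missing_gap_days)        # 430
round(avg_recorded_hours_all, 1)   # 20.2
round(avg_missing_hours_gap, 1)    # 7.2
```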

Visualizing Missing Minutes Distribution

Visual Analysis: The distribution shows most days have moderate amounts of missing data, with a concentration around 200-300 missing minutes, consistent with partial device wear throughout the day.

Data Quality Decision

Based on this analysis:

52.62% of days have significant missing activity data (>5 minutes)
Average of 227 missing minutes per day
Pattern suggests regular device removal (sleep, charging, etc.)

Decision: I will retain all records but document this data limitation, as it reflects real-world usage patterns where users don’t wear devices 24/7.
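
One way to document this limitation in the data itself is to carry a flag on each record instead of filtering. The sketch below uses toy values, and `recorded_minutes`, `missing_minutes`, and `full_day_recorded` are hypothetical column names, not columns from the final dataset.

```r
library(dplyr)

# Toy rows, not project data
toy_daily <- tibble(
  id                     = c(1, 1, 2),
  very_active_minutes    = c(30, 10, 0),
  fairly_active_minutes  = c(20, 15, 5),
  lightly_active_minutes = c(200, 180, 120),
  sedentary_minutes      = c(1190, 700, 1315)
)

toy_flagged <- toy_daily %>%
  mutate(
    recorded_minutes  = very_active_minutes + fairly_active_minutes +
                        lightly_active_minutes + sedentary_minutes,
    missing_minutes   = 1440 - recorded_minutes,
    full_day_recorded = abs(missing_minutes) <= 5   # same 5-minute tolerance as above
  )

print(toy_flagged)
```

Later summaries can then be run with and without partially recorded days to see whether the conclusions change.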

Identifying Non-Wear Periods

I conducted additional validation to identify potential device non-wear periods by examining records with 1440 sedentary minutes:

# Identify potential non-wear periods
sedentary_1440 <- daily_clean %>%
  filter(sedentary_minutes == 1440) 

cat("Potential non-wear records (1440 sedentary minutes):", nrow(sedentary_1440), "\n")

## Potential non-wear records (1440 sedentary minutes): 7

cat("Percentage of total data:", round(nrow(sedentary_1440) / nrow(daily_clean) * 100, 2), "%\n")

## Percentage of total data: 0.85 %

# Analyze these suspicious records
sedentary_1440_summary <- sedentary_1440 %>%
  summarise(
    avg_steps = mean(total_steps),
    avg_calories = mean(calories),
    zero_steps_count = sum(total_steps == 0),
    zero_calories_count = sum(calories == 0)
  )

print("Analysis of potential non-wear records:")

## [1] "Analysis of potential non-wear records:"

print(sedentary_1440_summary)

##   avg_steps avg_calories zero_steps_count zero_calories_count
## 1      7182     2689.143                0                   0

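Key Finding: Only 7 records (0.85% of data) reported 1440 sedentary minutes. None of them had zero steps or zero calories; they averaged 7,182 steps and about 2,689 calories, which contradicts a fully sedentary day and points to a recording inconsistency rather than simple non-wear. These records were removed to maintain data quality.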
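
Since the summary output above shows no zero-step rows among these seven records, it can help to separate two cases: a full sedentary day with no steps (consistent with non-wear) and a full sedentary day with steps (internally inconsistent). This is a sketch on toy values, and the `record_type` labels are hypothetical.

```r
library(dplyr)

# Toy rows, not project data
toy_daily <- tibble(
  id                = c(1, 2, 3, 4),
  sedentary_minutes = c(1440, 1440, 900, 1440),
  total_steps       = c(0, 7200, 10500, 3100)
)

toy_classified <- toy_daily %>%
  mutate(
    record_type = case_when(
      sedentary_minutes == 1440 & total_steps == 0 ~ "likely_non_wear",
      sedentary_minutes == 1440 & total_steps > 0  ~ "inconsistent",
      TRUE                                         ~ "usable"
    )
  )

# Keep only records that pass both checks
toy_kept <- toy_classified %>%
  filter(record_type == "usable")

toy_classified %>% count(record_type)
```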

1.9.2 Appendix B: Sleep Data Duplicate Analysis & Aggregation

Duplicate Detection and Removal

To ensure data integrity, I conducted duplicate checks and removed the duplicated rows.

# Check for duplicates in SleepDay data
sleep_day_duplicates <- sleep_day_clean %>%
  group_by(id, sleep_date) %>%
  summarise(n_records = n(), .groups = 'drop') %>%
  filter(n_records > 1)

cat("Duplicate records in SleepDay data:", nrow(sleep_day_duplicates), "\n")

# Remove SleepDay duplicates
duplicate_rows <- sleep_day_clean %>%
  group_by(id, sleep_date) %>%
  filter(n() > 1) %>%
  arrange(id, sleep_date) # The duplicates are a match so I will remove them

sleep_day_deduped <- sleep_day_clean %>%
  distinct(id, sleep_date, .keep_all = TRUE)

sleep_day_clean <- sleep_day_deduped


# Check for duplicates in MinuteSleep data
sleep_minute_duplicate <- sleep_minutes_clean %>% 
  group_by(id, minute_datetime, log_id) %>% 
  summarise(n_records = n(), .groups = 'drop') %>% 
  filter(n_records > 1)

cat("Duplicate records in MinuteSleep data:", nrow(sleep_minute_duplicate), "\n")

# Analyze duplicate patterns in minute data
duplicate_rows_minute <- sleep_minutes_clean %>%
  group_by(id, minute_datetime, log_id) %>%
  filter(n() > 1) %>%
  arrange(id, minute_datetime, log_id)

duplicate_analysis <- duplicate_rows_minute %>%
  group_by(id, minute_datetime, log_id) %>%  # same key as the duplicate check above
  summarise(
    n_rows = n(),
    same_value = n_distinct(value) == 1,  # TRUE when every copy records the same sleep state
    .groups = 'drop'
  )

cat("Duplicate analysis - identical values:", sum(duplicate_analysis$same_value), 
    "out of", nrow(duplicate_analysis), "duplicate groups\n")

# Remove MinuteSleep duplicates (same key as the duplicate check: id, minute_datetime, log_id)
sleep_minute_deduped <- sleep_minutes_clean %>%
  distinct(id, minute_datetime, log_id, .keep_all = TRUE)
# 543 duplicated rows were successfully removed

sleep_minutes_clean <- sleep_minute_deduped

Data Quality Findings:

SleepDay Data: 3 duplicate records identified and removed (identical entries)
MinuteSleep Data: 543 duplicate rows removed from 1,086 duplicate observations
Duplicate Nature: All duplicates represented identical values, confirming safe removal without data loss.
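
The "safe removal" claim rests on the copies being identical. With `.keep_all = TRUE`, `distinct()` on key columns keeps the first row per key and silently drops any later row, even one with a different value. The sketch below uses toy values (not project data) to show how conflicting copies could be detected before key-based deduplication.

```r
library(dplyr)

# Toy rows: the second pair shares a key but disagrees on value
toy_minutes <- tibble(
  id = c(1, 1, 1, 1),
  minute_datetime = as.POSIXct(
    c("2016-04-12 02:00:00", "2016-04-12 02:00:00",
      "2016-04-12 02:01:00", "2016-04-12 02:01:00"),
    tz = "UTC"
  ),
  log_id = c(100, 100, 100, 100),
  value  = c(1, 1, 1, 2)
)

# Drops exact copies only (all columns must match)
exact_deduped <- toy_minutes %>% distinct()

# Keeps the first row per key, regardless of value
key_deduped <- toy_minutes %>%
  distinct(id, minute_datetime, log_id, .keep_all = TRUE)

# Keys that still repeat after removing exact copies carry conflicting values
conflicting_keys <- exact_deduped %>%
  count(id, minute_datetime, log_id) %>%
  filter(n > 1)
```

When `conflicting_keys` is empty, as the findings above indicate for this dataset, the two deduplication approaches return the same rows.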

Aggregating MinuteSleep Data

After cleaning both sleep datasets, I proceeded to aggregate the minute-level sleep data to create daily summaries that could be effectively merged with the SleepDay dataset.

# Aggregate by sleep session to analyze individual sleep periods
sleep_minutes_daily <- sleep_minutes_clean %>%
  group_by(id, minute_date, log_id) %>%
  summarise(
    minutes_asleep = sum(value == 1),
    minutes_restless = sum(value == 2),
    minutes_awake = sum(value == 3),
    total_minutes_recorded = n(),
    sleep_start = min(minute_datetime),
    sleep_end = max(minute_datetime),
    sleep_efficiency = (minutes_asleep / total_minutes_recorded) * 100,
    .groups = 'drop'
  ) %>%
  mutate(
    hours_asleep = minutes_asleep / 60,
    hours_restless = minutes_restless / 60,
    hours_awake = minutes_awake / 60,
    total_hours_recorded = total_minutes_recorded / 60
  )

cat("Sleep sessions after session-level aggregation:", nrow(sleep_minutes_daily), "\n")

Initial Finding: The session-level aggregation revealed 466 sleep sessions compared to only 410 daily records in the SleepDay dataset. This discrepancy indicated that users often have multiple sleep sessions (e.g., naps) recorded throughout the day, requiring further aggregation.

To create comparable daily summaries, I aggregated all sleep sessions for each user on each day.

Sleep State Classification: Based on Fitbit’s sleep tracking methodology, the sleep state values are defined as follows:

1 indicates asleep,
2 represents restless sleep, and
3 denotes awake periods.

This classification system allows for precise analysis of sleep quality and patterns throughout the night.
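
As a small illustration of this coding, the numeric states can be mapped to labeled factor levels so that summaries read without a lookup. The `sleep_state` column and the toy values are hypothetical and not part of the original pipeline.

```r
library(dplyr)

# Toy minute-level values, not project data
toy_minutes <- tibble(value = c(1, 1, 2, 3, 1))

toy_labeled <- toy_minutes %>%
  mutate(
    sleep_state = factor(value,
                         levels = c(1, 2, 3),
                         labels = c("asleep", "restless", "awake"))
  )

# .drop = FALSE keeps states that have zero minutes in the output
state_counts <- toy_labeled %>%
  count(sleep_state, .drop = FALSE)

print(state_counts)
```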

# Aggregate all sleep sessions by user and day
sleep_minutes_daily_aggregated <- sleep_minutes_clean %>%
  group_by(id, minute_date) %>%
  summarise(
    # Sum across all sleep sessions for the day
    total_minutes_asleep = sum(value == 1),
    total_minutes_restless = sum(value == 2),
    total_minutes_awake = sum(value == 3),
    
    # Count total sleep sessions and recording time
    total_sleep_sessions = n_distinct(log_id),
    total_minutes_recorded = n(),
    
    # Sleep efficiency (daily overall)
    sleep_efficiency = (total_minutes_asleep / total_minutes_recorded) * 100,
    
    # Sleep timing (earliest start, latest end)
    first_sleep_start = min(minute_datetime),
    last_sleep_end = max(minute_datetime),
    
    .groups = 'drop'
  ) %>%
  mutate(
    # Convert to hours for business reporting
    total_hours_asleep = total_minutes_asleep / 60,
    total_hours_restless = total_minutes_restless / 60,
    total_hours_awake = total_minutes_awake / 60,
    total_hours_recorded = total_minutes_recorded / 60
  )

cat("Daily sleep records after final aggregation:", nrow(sleep_minutes_daily_aggregated), "\n")

Key Insight: The daily aggregation consolidated the 466 sleep sessions into 412 daily records, revealing 2 additional daily sleep records not present in the SleepDay dataset (410 records).
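
To see which user-days appear only in the minute-level aggregation, an anti-join against the SleepDay keys is one option. This is a sketch on toy values, and the `toy_*` objects and `only_in_minute` are hypothetical names.

```r
library(dplyr)

# Toy rows, not project data
toy_sleep_day <- tibble(
  id         = c(1, 1),
  sleep_date = as.Date(c("2016-04-12", "2016-04-13"))
)

toy_minute_daily <- tibble(
  id                   = c(1, 1, 1),
  minute_date          = as.Date(c("2016-04-12", "2016-04-13", "2016-04-14")),
  total_minutes_asleep = c(410, 385, 45)
)

# Rows of the minute-level table with no matching (id, date) in SleepDay
only_in_minute <- toy_minute_daily %>%
  anti_join(toy_sleep_day, by = c("id", "minute_date" = "sleep_date"))

print(only_in_minute)
```

Inspecting these rows shows whether the extra records are short sessions, such as naps, or full nights missing from the daily file.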

1.9.3 Appendix C: Sleep Data Integration & Comparison

Sleep Data Comparison

To ensure data quality and determine the optimal integration strategy, I conducted a comparison between the two sleep data sources:

sleep_comparison <- sleep_day_clean %>%
  full_join(sleep_minutes_daily_aggregated, by = c("id", "sleep_date" = "minute_date")) %>%
  # 'x' being sleep_day and 'y' being minute_sleep
  mutate(
    # Check how well the data matches
    has_daily_data = !is.na(total_minutes_asleep.x),
    has_minute_data = !is.na(total_minutes_asleep.y),
    
    # Compare sleep minutes between sources
    sleep_minutes_diff = ifelse(has_daily_data & has_minute_data, 
                                total_minutes_asleep.x - total_minutes_asleep.y, NA),
    sleep_minutes_match = abs(sleep_minutes_diff) <= 10,  # Allow 10-minute difference
    
    # Compare total time in bed
    total_time_diff = ifelse(has_daily_data & has_minute_data,
                             total_time_in_bed - total_minutes_recorded, NA),
    total_time_match = abs(total_time_diff) <= 10
  )

# Generate coverage and consistency summary
coverage_summary <- sleep_comparison %>%
  summarise(
    total_days = n(),
    days_with_both_data = sum(has_daily_data & has_minute_data),
    days_with_only_daily = sum(has_daily_data & !has_minute_data),
    days_with_only_minute = sum(!has_daily_data & has_minute_data),
    
    pct_days_with_both = mean(has_daily_data & has_minute_data) * 100,
    pct_perfect_sleep_match = mean(sleep_minutes_match, na.rm = TRUE) * 100
  )
print("Final Sleep Data Integration Analysis:")
print(coverage_summary)

Data Integration Assessment:

97.59% data overlap: Strong alignment between datasets with 410 shared daily records
39 unique records: Additional sleep data available only in the minute-level dataset
95.07% match rate: High consistency in sleep duration between sources, where a match means the two measurements differ by 10 minutes or less
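
The `.x` and `.y` suffixes in the comparison code arise because both tables contain a `total_minutes_asleep` column. As a sketch of an alternative, the `suffix` argument of `full_join()` can name the sources explicitly. It uses toy values, and the `source` and `within_10_min` columns are hypothetical.

```r
library(dplyr)

# Toy rows, not project data
toy_daily <- tibble(
  id                   = c(1, 1),
  sleep_date           = as.Date(c("2016-04-12", "2016-04-13")),
  total_minutes_asleep = c(400, 380)
)

toy_minute <- tibble(
  id                   = c(1, 1),
  minute_date          = as.Date(c("2016-04-13", "2016-04-14")),
  total_minutes_asleep = c(385, 50)
)

toy_comparison <- toy_daily %>%
  full_join(toy_minute,
            by = c("id", "sleep_date" = "minute_date"),
            suffix = c("_daily", "_minute")) %>%
  mutate(
    source = case_when(
      !is.na(total_minutes_asleep_daily) &
        !is.na(total_minutes_asleep_minute) ~ "both",
      !is.na(total_minutes_asleep_daily)    ~ "daily_only",
      TRUE                                  ~ "minute_only"
    ),
    # NA when either source is missing, so unmatched days do not count as mismatches
    within_10_min = abs(total_minutes_asleep_daily - total_minutes_asleep_minute) <= 10
  )

toy_comparison %>% count(source)
```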

1.9.4 Appendix Conclusion

The comprehensive data validation processes documented in this appendix ensure the analytical integrity of our findings. While the FitBit dataset exhibits some expected real-world usage inconsistencies, the thorough cleaning and validation procedures confirm data reliability for Bellabeat’s strategic decision-making.

All documented limitations have been considered in the analysis interpretation, and the final dataset provides a robust foundation for the business insights and recommendations presented in this report.

2 References

2.1 Data & Research Sources

Fuller, D., Colwell, E., Low, J., Orychock, K., Tobin, M. A., Simango, B., … Taylor, N. G. A. (2020). Reliability and validity of commercially available wearable devices for measuring steps, energy expenditure, and heart rate: Systematic review. JMIR mHealth and uHealth, 8(9), e18694. https://doi.org/10.2196/18694

Santos-Longhurst, A. (2018, November 27). How many steps do people take per day on average? Healthline. https://www.healthline.com/health/average-steps-per-day

Kahn, J. (2021). Sleep doctor explains your restless sleep and how to fix it. Rise Science. https://www.risescience.com/blog/restless-sleep

Cleveland Clinic. (2023, June 9). Burn calories by reading this article! Health Essentials. https://health.clevelandclinic.org/calories-burned-in-a-day

The Munich Eye. (2025, April 13). Evaluating fitness tracker scores: Are they trustworthy? The Munich Eye. https://themunicheye.com/evaluating-fitness-tracker-scores-17289

2.2 Data Source

FitBit Fitness Tracker Data. (2016). Kaggle Dataset. Mobius. https://www.kaggle.com/datasets/arashnic/fitbit/data

2.3 Technical Tools

R version 4.3.1 (2023-06-16)
RStudio 2023.06.0+421
Tidyverse package ecosystem (Wickham et al., 2019)

Analysis conducted: November 2025

Bellabeat Data Analysis: How Can a Wellness Technology Company Play It Smart?

Yerowo Mey

2025-11-07