1. Project Overview & Business Task

1.1 Project Introduction

Bellabeat is a high-tech company that designs smart health devices for women. Its products include smart jewelry and an app that tracks health metrics such as activity, sleep, stress, and reproductive health. This case study analyzes public fitness tracker data to uncover trends in users' daily and hourly activity. These insights will help Bellabeat refine its marketing strategy and identify new opportunities for growth and product development.

1.2 Business Questions

This analysis aims to answer the following key questions:

What are the typical daily activity levels (steps, active minutes, calories burned) of fitness tracker users?

How does the proportion of sedentary time compare to active time in a user’s day?

Are there observable differences in daily activity patterns across different days of the week?

What are the typical hourly step patterns throughout a 24-hour period?

Do hourly activity patterns differ between weekdays and weekends?

What actionable recommendations can be derived from these insights for Bellabeat’s marketing and product teams?

2. Data Source & Limitations

2.1 Data Description

The dataset used for this analysis is the “FitBit Fitness Tracker Data” available on Kaggle. This dataset was collected from participants of a survey via Amazon Mechanical Turk between March 12, 2016, and May 12, 2016.

For this analysis, two primary files were utilized:

dailyActivity_merged.csv: Provides daily summaries including TotalSteps, Calories, SedentaryMinutes, LightlyActiveMinutes, FairlyActiveMinutes, and VeryActiveMinutes for each user.

hourlySteps_merged.csv: Contains Steps data for each hour of the day for users.

2.2 Data Limitations

It is crucial to acknowledge the following limitations of this dataset, as they influence the scope and generalizability of our findings:

Sample Size: The data represents a very small group of users (35 unique IDs in the daily activity file and 34 in the hourly steps file). This limits the ability to generalize findings to a larger population or Bellabeat’s diverse user base.

Timeliness: The data was collected in 2016. User behaviors, fitness trends, and wearable technology capabilities may have significantly evolved since then.

Representativeness: The data is from Fitbit users, not directly from Bellabeat users. While providing general insights into fitness tracker usage, it may not perfectly reflect the specific habits or demographics of Bellabeat’s target audience.

Lack of Context: The dataset lacks crucial demographic information (age, gender, location, health conditions) or contextual data (e.g., job type, lifestyle factors) that would allow for more granular segmentation and tailored recommendations.

No Bellabeat-Specific Metrics: The dataset does not contain metrics unique to Bellabeat’s holistic approach (e.g., stress levels, reproductive health data), which are core to Bellabeat’s product offerings.

Despite these limitations, this analysis serves as a valuable exercise in data processing, exploration, and deriving actionable insights from real-world fitness data.

3. Data Processing

This section details the steps taken to clean, transform, and prepare the raw data for effective analysis in R.

3.1 Load Data

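library(tidyverse)  # readr, dplyr, ggplot2
library(lubridate)  # mdy(), mdy_hms(), wday(), hour()
library(janitor)    # clean_names()

daily_activity <- read_csv("/cloud/project/Bellabeat_dataset/Bellabeat_fitbit/dailyActivity_merged.csv")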
## Rows: 457 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hourly_steps <- read_csv("/cloud/project/Bellabeat_dataset/Bellabeat_fitbit/hourlySteps_merged.csv")
## Rows: 24084 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (2): Id, StepTotal
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

3.2 Initial Data Inspection

daily_activity Inspection

print("Glimpse of daily_activity:")
## [1] "Glimpse of daily_activity:"
glimpse(daily_activity)
## Rows: 457
## Columns: 15
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate             <chr> "3/25/2016", "3/26/2016", "3/27/2016", "3/28/…
## $ TotalSteps               <dbl> 11004, 17609, 12736, 13231, 12041, 10970, 122…
## $ TotalDistance            <dbl> 7.11, 11.55, 8.53, 8.93, 7.85, 7.16, 7.86, 7.…
## $ TrackerDistance          <dbl> 7.11, 11.55, 8.53, 8.93, 7.85, 7.16, 7.86, 7.…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance       <dbl> 2.57, 6.92, 4.66, 3.19, 2.16, 2.36, 2.29, 3.3…
## $ ModeratelyActiveDistance <dbl> 0.46, 0.73, 0.16, 0.79, 1.09, 0.51, 0.49, 0.8…
## $ LightActiveDistance      <dbl> 4.07, 3.91, 3.71, 4.95, 4.61, 4.29, 5.04, 3.6…
## $ SedentaryActiveDistance  <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.0…
## $ VeryActiveMinutes        <dbl> 33, 89, 56, 39, 28, 30, 33, 47, 40, 15, 43, 3…
## $ FairlyActiveMinutes      <dbl> 12, 17, 5, 20, 28, 13, 12, 21, 11, 30, 18, 18…
## $ LightlyActiveMinutes     <dbl> 205, 274, 268, 224, 243, 223, 239, 200, 244, …
## $ SedentaryMinutes         <dbl> 804, 588, 605, 1080, 763, 1174, 820, 866, 636…
## $ Calories                 <dbl> 1819, 2154, 1944, 1932, 1886, 1820, 1889, 186…
print("Column names of daily_activity:")
## [1] "Column names of daily_activity:"
colnames(daily_activity)
##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"
print("Number of unique IDs in daily_activity:")
## [1] "Number of unique IDs in daily_activity:"
n_distinct(daily_activity$Id)
## [1] 35
print("Summary of daily_activity (before cleaning):")
## [1] "Summary of daily_activity (before cleaning):"
summary(daily_activity)
##        Id            ActivityDate         TotalSteps    TotalDistance   
##  Min.   :1.504e+09   Length:457         Min.   :    0   Min.   : 0.000  
##  1st Qu.:2.347e+09   Class :character   1st Qu.: 1988   1st Qu.: 1.410  
##  Median :4.057e+09   Mode  :character   Median : 5986   Median : 4.090  
##  Mean   :4.629e+09                      Mean   : 6547   Mean   : 4.664  
##  3rd Qu.:6.392e+09                      3rd Qu.:10198   3rd Qu.: 7.160  
##  Max.   :8.878e+09                      Max.   :28497   Max.   :27.530  
##  TrackerDistance LoggedActivitiesDistance VeryActiveDistance
##  Min.   : 0.00   Min.   :0.0000           Min.   : 0.000    
##  1st Qu.: 1.28   1st Qu.:0.0000           1st Qu.: 0.000    
##  Median : 4.09   Median :0.0000           Median : 0.000    
##  Mean   : 4.61   Mean   :0.1794           Mean   : 1.181    
##  3rd Qu.: 7.11   3rd Qu.:0.0000           3rd Qu.: 1.310    
##  Max.   :27.53   Max.   :6.7271           Max.   :21.920    
##  ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
##  Min.   :0.0000           Min.   : 0.00       Min.   :0.000000       
##  1st Qu.:0.0000           1st Qu.: 0.87       1st Qu.:0.000000       
##  Median :0.0200           Median : 2.93       Median :0.000000       
##  Mean   :0.4786           Mean   : 2.89       Mean   :0.001904       
##  3rd Qu.:0.6700           3rd Qu.: 4.46       3rd Qu.:0.000000       
##  Max.   :6.4000           Max.   :12.51       Max.   :0.100000       
##  VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
##  Min.   :  0.00    Min.   :  0.00      Min.   :  0.0        Min.   :  32.0  
##  1st Qu.:  0.00    1st Qu.:  0.00      1st Qu.: 64.0        1st Qu.: 728.0  
##  Median :  0.00    Median :  1.00      Median :181.0        Median :1057.0  
##  Mean   : 16.62    Mean   : 13.07      Mean   :170.1        Mean   : 995.3  
##  3rd Qu.: 25.00    3rd Qu.: 16.00      3rd Qu.:257.0        3rd Qu.:1285.0  
##  Max.   :202.00    Max.   :660.00      Max.   :720.0        Max.   :1440.0  
##     Calories   
##  Min.   :   0  
##  1st Qu.:1776  
##  Median :2062  
##  Mean   :2189  
##  3rd Qu.:2667  
##  Max.   :4562

hourly_steps Inspection

print("Glimpse of hourly_steps:")
## [1] "Glimpse of hourly_steps:"
glimpse(hourly_steps)
## Rows: 24,084
## Columns: 3
## $ Id           <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityHour <chr> "3/12/2016 12:00:00 AM", "3/12/2016 1:00:00 AM", "3/12/20…
## $ StepTotal    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 551, 1764, 1259, 253, 4470,…
print("Column names of hourly_steps:")
## [1] "Column names of hourly_steps:"
colnames(hourly_steps)
## [1] "Id"           "ActivityHour" "StepTotal"
print("Number of unique IDs in hourly_steps:")
## [1] "Number of unique IDs in hourly_steps:"
n_distinct(hourly_steps$Id)
## [1] 34
print("Summary of hourly_steps (before cleaning):")
## [1] "Summary of hourly_steps (before cleaning):"
summary(hourly_steps)
##        Id            ActivityHour         StepTotal      
##  Min.   :1.504e+09   Length:24084       Min.   :    0.0  
##  1st Qu.:2.347e+09   Class :character   1st Qu.:    0.0  
##  Median :4.559e+09   Mode  :character   Median :   10.0  
##  Mean   :4.889e+09                      Mean   :  286.2  
##  3rd Qu.:6.962e+09                      3rd Qu.:  289.0  
##  Max.   :8.878e+09                      Max.   :10565.0
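
The two files report different numbers of unique IDs (35 in daily_activity, 34 in hourly_steps), so it can be worth checking which participants appear in only one file before comparing daily and hourly results. The following is a minimal sketch on toy data; `toy_daily` and `toy_hourly` are hypothetical stand-ins for the two loaded tables, and the short IDs are made up for readability.

```r
library(dplyr)

# Toy stand-ins for the two loaded tables (real IDs are 10-digit numbers)
toy_daily  <- tibble::tibble(Id = c(101, 101, 102, 103, 104))
toy_hourly <- tibble::tibble(Id = c(101, 102, 103, 103))

# IDs that appear in one file but not the other
only_in_daily  <- base::setdiff(unique(toy_daily$Id), unique(toy_hourly$Id))
only_in_hourly <- base::setdiff(unique(toy_hourly$Id), unique(toy_daily$Id))

# Number of daily rows per participant, to spot very short records
days_per_user <- toy_daily %>% count(Id, name = "n_rows")

only_in_daily
only_in_hourly
days_per_user
```

On the real data, the same calls would be run against `daily_activity$Id` and `hourly_steps$Id`.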

3.3 Clean and Transform daily_activity

daily_activity_cleaned <- daily_activity %>%
  clean_names() %>%
  distinct() %>%
  mutate(
    activity_date = mdy(activity_date),
    day_of_week = wday(activity_date, label = TRUE, abbr = FALSE),
    total_active_minutes = lightly_active_minutes + fairly_active_minutes + very_active_minutes,
    activity_level = case_when(
      total_steps < 5000 ~ "Sedentary",
      total_steps >= 5000 & total_steps < 7500 ~ "Lightly Active",
      total_steps >= 7500 & total_steps < 10000 ~ "Fairly Active",
      total_steps >= 10000 ~ "Very Active",
      TRUE ~ "Unknown"
    )
  )

print("Glimpse of cleaned daily_activity_cleaned:")
## [1] "Glimpse of cleaned daily_activity_cleaned:"
glimpse(daily_activity_cleaned)
## Rows: 457
## Columns: 18
## $ id                         <dbl> 1503960366, 1503960366, 1503960366, 1503960…
## $ activity_date              <date> 2016-03-25, 2016-03-26, 2016-03-27, 2016-0…
## $ total_steps                <dbl> 11004, 17609, 12736, 13231, 12041, 10970, 1…
## $ total_distance             <dbl> 7.11, 11.55, 8.53, 8.93, 7.85, 7.16, 7.86, …
## $ tracker_distance           <dbl> 7.11, 11.55, 8.53, 8.93, 7.85, 7.16, 7.86, …
## $ logged_activities_distance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_distance       <dbl> 2.57, 6.92, 4.66, 3.19, 2.16, 2.36, 2.29, 3…
## $ moderately_active_distance <dbl> 0.46, 0.73, 0.16, 0.79, 1.09, 0.51, 0.49, 0…
## $ light_active_distance      <dbl> 4.07, 3.91, 3.71, 4.95, 4.61, 4.29, 5.04, 3…
## $ sedentary_active_distance  <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0…
## $ very_active_minutes        <dbl> 33, 89, 56, 39, 28, 30, 33, 47, 40, 15, 43,…
## $ fairly_active_minutes      <dbl> 12, 17, 5, 20, 28, 13, 12, 21, 11, 30, 18, …
## $ lightly_active_minutes     <dbl> 205, 274, 268, 224, 243, 223, 239, 200, 244…
## $ sedentary_minutes          <dbl> 804, 588, 605, 1080, 763, 1174, 820, 866, 6…
## $ calories                   <dbl> 1819, 2154, 1944, 1932, 1886, 1820, 1889, 1…
## $ day_of_week                <ord> Friday, Saturday, Sunday, Monday, Tuesday, …
## $ total_active_minutes       <dbl> 250, 380, 329, 283, 299, 266, 284, 268, 295…
## $ activity_level             <chr> "Very Active", "Very Active", "Very Active"…
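
The summary in 3.2 shows days with 0 total steps, 0 calories, and 1,440 sedentary minutes, which may be days when the tracker was not worn. The cleaning step above keeps those rows. As an optional check, the sketch below flags such days on toy data; the rule used (`total_steps == 0` or a full 1,440 sedentary minutes) is an assumption for illustration, not part of the original pipeline, and `toy_daily` is a hypothetical table.

```r
library(dplyr)

# Toy rows mimicking the cleaned daily table (illustrative values only)
toy_daily <- tibble::tibble(
  id                = c(1, 1, 2, 2),
  total_steps       = c(11004, 0, 5986, 0),
  sedentary_minutes = c(804, 1440, 1057, 1440),
  calories          = c(1819, 0, 2062, 1500)
)

# Flag days that look like non-wear days (assumed rule, adjust as needed)
toy_flagged <- toy_daily %>%
  mutate(likely_non_wear = total_steps == 0 | sedentary_minutes == 1440)

# Count the flagged days and see how the average changes without them
n_non_wear     <- sum(toy_flagged$likely_non_wear)
toy_worn       <- toy_flagged %>% filter(!likely_non_wear)
avg_steps_all  <- mean(toy_flagged$total_steps)
avg_steps_worn <- mean(toy_worn$total_steps)
```

Comparing `avg_steps_all` with `avg_steps_worn` indicates how sensitive the averages are to those rows; whether to exclude them is a judgment call that should be stated alongside the results.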

3.4 Clean and Transform hourly_steps

hourly_steps_cleaned <- hourly_steps %>%
  clean_names() %>%
  distinct() %>%
  mutate(
    activity_hour = mdy_hms(activity_hour),
    activity_date = as_date(activity_hour),
    hour = hour(activity_hour),
    day_of_week = wday(activity_date, label = TRUE, abbr = FALSE)
  )

print("Glimpse of cleaned hourly_steps_cleaned:")
## [1] "Glimpse of cleaned hourly_steps_cleaned:"
glimpse(hourly_steps_cleaned)
## Rows: 24,084
## Columns: 6
## $ id            <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 15039603…
## $ activity_hour <dttm> 2016-03-12 00:00:00, 2016-03-12 01:00:00, 2016-03-12 02…
## $ step_total    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 551, 1764, 1259, 253, 4470…
## $ activity_date <date> 2016-03-12, 2016-03-12, 2016-03-12, 2016-03-12, 2016-03…
## $ hour          <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ day_of_week   <ord> Saturday, Saturday, Saturday, Saturday, Saturday, Saturd…

4. Exploratory Data Analysis & Key Findings

This section delves into the cleaned data to identify patterns, trends, and relationships related to user activity.

4.1 Daily Activity Insights

Average Daily Activity Metrics Summary

daily_activity_cleaned %>%
  summarise(
    avg_steps = mean(total_steps, na.rm = TRUE),
    avg_calories = mean(calories, na.rm = TRUE),
    avg_sedentary_minutes = mean(sedentary_minutes, na.rm = TRUE),
    avg_lightly_active_minutes = mean(lightly_active_minutes, na.rm = TRUE),
    avg_fairly_active_minutes = mean(fairly_active_minutes, na.rm = TRUE),
    avg_very_active_minutes = mean(very_active_minutes, na.rm = TRUE),
    avg_total_active_minutes = mean(total_active_minutes, na.rm = TRUE)
  ) %>%
  knitr::kable(caption = "Average Daily Activity Metrics Across All Users (Days)")
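Average Daily Activity Metrics Across All Users (Days)

| avg_steps | avg_calories | avg_sedentary_minutes | avg_lightly_active_minutes | avg_fairly_active_minutes | avg_very_active_minutes | avg_total_active_minutes |
|----------:|-------------:|----------------------:|---------------------------:|--------------------------:|------------------------:|-------------------------:|
|  6546.562 |     2189.453 |              995.2823 |                     170.07 |                  13.07002 |                16.62363 |                 199.7637 |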

Insight: On average, users record about 6,500 steps per day, which is below the often-recommended 10,000 steps. A large portion of the day (about 16.6 hours of recorded sedentary time) is spent in a sedentary state, indicating a large window of inactivity.
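
To make the sedentary-versus-active comparison concrete, the averages in the table above can be converted into hours and shares. This is a small worked calculation using only the reported means; the variable names are chosen here for illustration.

```r
# Averages copied from the summary table above
avg_sedentary_minutes    <- 995.2823
avg_total_active_minutes <- 199.7637

# Sedentary time expressed in hours
sedentary_hours <- avg_sedentary_minutes / 60

# Share of a full 24-hour day (1,440 minutes)
share_of_24h <- avg_sedentary_minutes / 1440 * 100

# Share of the minutes that fall into one of the four intensity categories
share_of_tracked <- avg_sedentary_minutes /
  (avg_sedentary_minutes + avg_total_active_minutes) * 100

round(c(hours = sedentary_hours,
        pct_of_24h = share_of_24h,
        pct_of_tracked = share_of_tracked), 1)
```

This gives roughly 16.6 sedentary hours, about 69% of a 24-hour day and about 83% of the categorized minutes. The sedentary and active averages sum to about 1,195 minutes, so part of each day is not assigned to any intensity category in this file.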

Distribution of Daily Activity Levels

daily_activity_cleaned %>%
  count(activity_level, name = "count") %>%
  # Compute shares after counting so sum(count) spans all levels, not one group
  mutate(percentage = count / sum(count) * 100) %>%
  ggplot(aes(x = reorder(activity_level, -percentage), y = percentage, fill = activity_level)) +
  geom_col(show.legend = FALSE, color = "white") +
  geom_text(aes(label = sprintf("%.1f%%", percentage)), vjust = -0.5, size = 4) +
  labs(title = "Distribution of Daily Activity Levels Among Users",
       x = "Activity Level (Steps per Day)", y = "Percentage of Days (%)") +
  theme_minimal() +
  scale_fill_brewer(palette = "Pastel1")

Insight: More than half of user-days fall into the “Sedentary” or “Lightly Active” categories (the median is 5,986 steps, below the 7,500-step cutoff for “Fairly Active”). This suggests a potential opportunity for Bellabeat to encourage users to increase their overall activity, particularly pushing towards “Fairly Active” or “Very Active” goals.
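
A quick sanity check for category shares is that they sum to 100. The sketch below computes shares on a hypothetical four-row table (`toy_levels`); the key point is that the percentage is calculated after the per-level counts exist, so that `sum(count)` covers every level.

```r
library(dplyr)

# Hypothetical user-days, one row per day
toy_levels <- tibble::tibble(
  activity_level = c("Sedentary", "Sedentary", "Lightly Active", "Very Active")
)

level_shares <- toy_levels %>%
  count(activity_level, name = "count") %>%      # one row per level
  mutate(percentage = count / sum(count) * 100)  # sum() spans all levels here

# Shares across all levels should add up to 100
stopifnot(isTRUE(all.equal(sum(level_shares$percentage), 100)))

level_shares
```

If every bar in the chart shows 100%, the percentage was most likely computed inside a grouped `summarise()`, where `sum(count)` only sees a single group.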

Average Steps by Day of Week

daily_activity_cleaned %>%
  group_by(day_of_week) %>%
  summarise(avg_steps = mean(total_steps, na.rm = TRUE)) %>%
  ggplot(aes(x = day_of_week, y = avg_steps, fill = day_of_week)) +
  geom_col(show.legend = FALSE, color = "white") +
  geom_text(aes(label = round(avg_steps, 0)), vjust = -0.5, size = 4) +
  labs(title = "Average Steps by Day of Week",
       x = "Day of Week", y = "Average Steps") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3")

Insight: Weekdays, particularly Tuesday and Wednesday, generally show higher average step counts. There’s a noticeable dip in activity on weekends, suggesting users may be less consistent with their fitness routines when off work or school.
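
One way to quantify the weekday and weekend difference is to collapse the seven days into two groups and report the number of days behind each average. The sketch below uses the first four daily rows shown in the glimpse output (a single user), so its result illustrates the mechanics only and says nothing about the overall pattern; the `day_type` column is an addition made here for illustration.

```r
library(dplyr)
library(lubridate)

# First four daily rows from the glimpse output (one user only)
toy_days <- tibble::tibble(
  activity_date = as.Date(c("2016-03-25", "2016-03-26", "2016-03-27", "2016-03-28")),
  total_steps   = c(11004, 17609, 12736, 13231)
)

day_type_summary <- toy_days %>%
  # With week_start = 1, Monday is 1 and Saturday/Sunday are 6 and 7
  mutate(day_type = if_else(wday(activity_date, week_start = 1) >= 6,
                            "Weekend", "Weekday")) %>%
  group_by(day_type) %>%
  summarise(n_days = n(), avg_steps = mean(total_steps), .groups = "drop")

day_type_summary
```

Applied to `daily_activity_cleaned`, the `n_days` column helps show how much data supports each average before a difference is treated as meaningful.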

Relationship Between Total Steps and Calories Burned

ggplot(daily_activity_cleaned, aes(x = total_steps, y = calories)) +
  geom_point(alpha = 0.6, color = "#6A5ACD") +
  geom_smooth(method = "lm", se = FALSE, color = "#FF6347", linetype = "dashed") +
  labs(title = "Relationship Between Total Steps and Calories Burned",
       x = "Total Steps", y = "Calories Burned") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Insight: A positive linear relationship is visible between total steps and calories burned: days with more steps tend to show higher calorie expenditure. This is consistent with fitness principles, although the plot shows an association rather than proof of cause, and calories also depend on factors such as body size and basal metabolic rate. Calorie expenditure remains a key metric for many users.
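
The strength of the relationship can be stated numerically with a correlation coefficient and the slope of the fitted line. The sketch below uses the first five daily values of one user from the glimpse output, so the numbers it produces describe only those five days.

```r
# First five days of one user, copied from the glimpse output
steps    <- c(11004, 17609, 12736, 13231, 12041)
calories <- c(1819, 2154, 1944, 1932, 1886)

# Pearson correlation between steps and calories
r <- cor(steps, calories)

# Simple linear fit, the same model geom_smooth(method = "lm") draws
fit   <- lm(calories ~ steps)
slope <- coef(fit)[["steps"]]  # additional calories per additional step

r
slope
```

On the full table the equivalent call would be `cor(daily_activity_cleaned$total_steps, daily_activity_cleaned$calories)`; reporting that value alongside the plot makes the wording of the insight easier to verify.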

4.2 Hourly Activity Insights

Average Steps by Hour of Day

hourly_steps_cleaned %>%
  group_by(hour) %>%
  summarise(average_steps = mean(step_total, na.rm = TRUE)) %>%
  ggplot(aes(x = hour, y = average_steps)) +
  geom_col(fill = "#20B2AA", color = "white") +
  labs(title = "Average Steps by Hour of Day",
       x = "Hour of Day (0-23)", y = "Average Steps") +
  scale_x_continuous(breaks = 0:23) +
  theme_minimal()

Insight: Activity is very low during typical sleeping hours (e.g., 1 AM - 6 AM). There are two prominent peaks: a morning peak (around 8-9 AM) and a more sustained evening peak (from 5 PM to 7 PM), likely corresponding to commute times, work breaks, and post-work activities or exercise.
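
Peak hours can also be read from the data directly rather than from the chart. The following sketch ranks hours by average steps on a hypothetical table (`toy_hourly`, with made-up step values) using `slice_max()`.

```r
library(dplyr)

# Hypothetical hourly records: two observations for each of four hours
toy_hourly <- tibble::tibble(
  hour       = rep(c(3, 8, 13, 18), times = 2),
  step_total = c(0, 400, 300, 600, 10, 500, 350, 700)
)

top_hours <- toy_hourly %>%
  group_by(hour) %>%
  summarise(average_steps = mean(step_total), .groups = "drop") %>%
  slice_max(average_steps, n = 2)  # the two busiest hours, highest first

top_hours
```

Running the same pipeline on `hourly_steps_cleaned` would list the busiest hours with their averages, which can then be quoted exactly in the insight.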

Average Hourly Steps by Day of Week (Heatmap)

hourly_steps_cleaned %>%
  group_by(day_of_week, hour) %>%
  summarise(average_steps = mean(step_total, na.rm = TRUE)) %>%
  ggplot(aes(x = hour, y = day_of_week, fill = average_steps)) +
  geom_tile(color = "white", linewidth = 0.5) +
  scale_fill_viridis_c(option = "plasma", direction = -1, name = "Average Steps", begin = 0.1, end = 0.9) +
  labs(title = "Average Hourly Steps by Day of Week",
       x = "Hour of Day (0-23)", y = "Day of Week") +
  scale_x_continuous(breaks = c(0, 6, 12, 18, 23)) +
  theme_minimal() +
  theme(axis.text.y = element_text(angle = 0, hjust = 1))
## `summarise()` has grouped output by 'day_of_week'. You can override using the
## `.groups` argument.

Insight:

Weekday Consistency: Weekdays show similar patterns with clear morning and evening peaks. The evening peak (5-7 PM) is particularly strong.

Weekend Shift: Weekends, especially Sunday, show a later start to morning activity (e.g., peak at 9-10 AM instead of 8 AM). Activity also appears more spread out through the afternoon compared to weekdays, which might have more distinct drops after the evening peak. This suggests different routines on non-work days.
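
The weekday and weekend comparison can be checked by finding the peak hour separately for each day type. This is a minimal sketch on four hypothetical records; the timestamps follow the file's `m/d/Y h:m:s AM/PM` format, the step values are invented, and `day_type` is a column introduced here for illustration.

```r
library(dplyr)
library(lubridate)

# Hypothetical records: 3/12/2016 is a Saturday, 3/14/2016 a Monday
toy_hourly <- tibble::tibble(
  activity_hour = mdy_hms(c(
    "3/12/2016 9:00:00 AM", "3/12/2016 6:00:00 PM",
    "3/14/2016 8:00:00 AM", "3/14/2016 6:00:00 PM"
  )),
  step_total = c(900, 500, 700, 1100)
)

peak_by_day_type <- toy_hourly %>%
  mutate(
    hour     = hour(activity_hour),
    # With week_start = 1, Saturday and Sunday are 6 and 7
    day_type = if_else(wday(activity_hour, week_start = 1) >= 6,
                       "Weekend", "Weekday")
  ) %>%
  group_by(day_type, hour) %>%
  summarise(average_steps = mean(step_total), .groups = "drop") %>%
  group_by(day_type) %>%
  slice_max(average_steps, n = 1) %>%  # busiest hour within each day type
  ungroup()

peak_by_day_type
```

On the real table, the result would show whether the weekend peak actually falls later than the weekday peak, and by how many hours.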

5. Recommendations for Bellabeat Marketing Strategy

Based on the insights derived from this analysis, here are actionable recommendations for Bellabeat to enhance its marketing efforts and user engagement:

Time-Sensitive Marketing Campaigns:

Action: Schedule push notifications, app prompts, and social media posts to align with observed peak activity hours (e.g., 8-9 AM for morning motivation, 5-7 PM for “end-of-day push” or “unwind with movement” prompts).

Benefit: Increases the likelihood of user engagement when they are naturally more inclined to be active or check their devices.

Promote Micro-Movements to Combat Sedentary Time:

Action: Emphasize features that encourage breaking up long periods of inactivity. Market the benefits of short walks, standing breaks, or stretches during periods of low hourly activity (e.g., mid-morning 10 AM-12 PM, early afternoon 1 PM-4 PM).

Benefit: Addresses a significant user behavior (high sedentary minutes) and positions Bellabeat as a partner in holistic well-being, not just intense exercise.

Develop Weekend-Specific Engagement Strategies:

Action: Create weekend-themed challenges, guided activities (e.g., “Sunday strolls,” “Saturday family fun,” “weekend mindfulness walks”), or encouraging content that acknowledges the different activity patterns on non-work days.

Benefit: Counters the observed dip in weekend activity and maintains consistent engagement, fostering long-term habit formation.

Educate on “Active Minutes” vs. “Steps”:

Action: While steps are popular, market the importance of ‘Fairly Active’ and ‘Very Active’ minutes. Bellabeat can develop content and in-app goals that help users understand the intensity of their activity and its health benefits beyond just step count.

Benefit: Provides a more nuanced understanding of fitness and empowers users to pursue more impactful exercise.

Leverage Data for Personalized Goal Setting:

Action: Highlight Bellabeat’s ability to offer personalized insights. Market how the app can learn a user’s unique daily and hourly patterns and suggest realistic, achievable goals that evolve with their progress, rather than generic, one-size-fits-all targets.

Benefit: Increases user adherence and satisfaction by making fitness goals feel more attainable and relevant to their lifestyle.

6. Conclusion & Next Steps

This analysis of public fitness tracker data provides valuable foundational insights into user activity patterns, identifying peak engagement times and areas for improvement (like sedentary behavior). These findings can directly inform Bellabeat’s marketing and product development strategies to better resonate with their target audience.

To build upon this analysis and provide even more specific and impactful recommendations for Bellabeat, the following next steps are crucial:

Access Bellabeat’s Proprietary User Data: Analyzing data directly from Bellabeat’s own user base would be paramount. This would provide insights specific to their product users, allowing for more precise segmentation and targeted strategies.

Integrate Demographic and Contextual Data: Incorporating user demographics (age, lifestyle, location) and qualitative data (user surveys, interviews) would provide deeper context for observed behaviors and enable highly personalized solutions.

Analyze Other Bellabeat-Relevant Metrics: Investigate sleep patterns, stress levels, and menstrual cycle data (if available), as these are core to Bellabeat’s holistic health philosophy. Cross-referencing these with activity data could reveal powerful integrated insights.

A/B Test Marketing Initiatives: Implement and A/B test marketing campaigns based on these recommendations to empirically measure their impact on user engagement, retention, and new user acquisition.
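
As a rough illustration of how such an A/B test might be evaluated, the sketch below compares engagement rates between two groups with base R's `prop.test()`. All counts are hypothetical and invented purely to show the mechanics; they are not results from this dataset or from Bellabeat.

```r
# Hypothetical counts, invented purely for illustration
users_per_arm <- c(control = 1000, variant = 1000)  # users who received each version
engaged       <- c(control = 120,  variant = 150)   # users who engaged afterwards

# Two-sample test for equality of proportions
test_result <- prop.test(x = engaged, n = users_per_arm)

test_result$estimate  # observed engagement rate in each arm
test_result$p.value   # evidence against "no difference between arms"
```

In practice the sample size, the engagement metric, and the significance threshold would be fixed before the campaign starts.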

By continually leveraging data-driven insights and a deeper understanding of its unique user base, Bellabeat can strategically position itself for sustained growth and continue to empower women in their health journeys.