Intro

This case study explores the question “How can a wellness company play it smart?”

The company

Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products. Sršen used her background as an artist to develop beautifully designed technology that informs and inspires women around the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits. Since it was founded in 2013, Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for women.

ASK - Business challenge

How can Bellabeat find new growth opportunities by analyzing smart device usage data to understand how users can improve their health and wellness?

The key stakeholders include:

Data Sources

  • FitBit Fitness Tracker Data: Kaggle Dataset by Arash Nic (CC0 Public Domain). Accessed May 2025.

PREPARE

PROCESS

For this case study, I’m working with the following dataframes:

Using Sheets, I removed duplicates; sleepDay_merged.csv had 3 duplicate rows. I also renamed the columns and checked for stray whitespace to make the later R code easier to write.
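The duplicate removal described above was done in Sheets. As a rough sketch, the same check could also be done in R. The toy table and its column names below are assumptions for illustration, not the actual contents of sleepDay_merged.csv.

```r
library(dplyr)

# Toy stand-in for sleepDay_merged.csv; values and column names are illustrative only
sleep_raw <- tibble(
  id = c(1, 1, 1, 2),
  sleep_day = c("4/12/2016", "4/12/2016", "4/13/2016", "4/12/2016"),
  total_time_in_bed = c(346, 346, 407, 380)
)

# Count rows that are exact copies of an earlier row
n_duplicates <- sum(duplicated(sleep_raw))

# Keep one copy of each distinct row
sleep_dedup <- distinct(sleep_raw)

n_duplicates
nrow(sleep_dedup)
```

Doing this step in code keeps the cleaning reproducible, since the raw file never has to be edited by hand.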

Loading relevant libraries.
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)    # For data manipulation and visualization
library(janitor)      # For data cleaning functions
library(dplyr)        # For data manipulation functions
library(lubridate)    # For parsing dates
library(ggplot2)      # For visualization
library(tidyr)        # For reshaping and tidying data
I created copies of the raw data, unified the naming convention, cleaned the column names, and loaded the files into R.
daily_activity <- read_csv("~/Documents/Data Science/Data /bellabeat_casestudy_andressa/data/Copy_dailyActivity_merged.csv") %>% clean_names()
daily_calories <- read_csv("~/Documents/Data Science/Data /bellabeat_casestudy_andressa/data/Copy_dailyCalories_merged.csv") %>% clean_names()
daily_intensitie <- read_csv("~/Documents/Data Science/Data /bellabeat_casestudy_andressa/data/Copy_dailyIntensities_merged.csv") %>% clean_names()
daily_steps <- read_csv("~/Documents/Data Science/Data /bellabeat_casestudy_andressa/data/Copy_dailySteps_merged.csv") %>% clean_names()
daily_sleep <- read_csv("~/Documents/Data Science/Data /bellabeat_casestudy_andressa/data/Copy_sleepDay_merged.csv") %>% clean_names()

ANALYZE

Checking unique participants for each dataset
n_distinct(daily_activity$id)
## [1] 33
n_distinct(daily_calories$id)
## [1] 33
n_distinct(daily_intensitie$id)
## [1] 33
n_distinct(daily_steps$id)
## [1] 33
n_distinct(daily_sleep$id)
## [1] 24
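The counts above show that fewer participants appear in the sleep data, but not which ones. A minimal sketch of how the missing ids could be listed, using made-up toy tables in place of daily_activity and daily_sleep:

```r
library(dplyr)

# Toy stand-ins for daily_activity and daily_sleep; the ids are made up
activity_toy <- tibble(
  id = c(101, 101, 102, 103),
  total_steps = c(9000, 7000, 4000, 12000)
)
sleep_toy <- tibble(
  id = c(101, 103),
  total_time_in_bed = c(400, 350)
)

# Participants present in the activity data but absent from the sleep data
missing_sleep_ids <- activity_toy %>%
  distinct(id) %>%
  anti_join(sleep_toy, by = "id") %>%
  pull(id)

missing_sleep_ids
```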

Important insight

Not every participant logged daily sleep. The difference in participant counts between the sleep dataset (24 users) and the other datasets (33 users) is significant and worth exploring; we’ll look into possible reasons later in the Analyze phase. Let’s first merge everything into a single dataframe to simplify the coding:

dataset_list <- list(
  daily_activity, 
  daily_calories, 
  daily_intensitie, 
  daily_sleep, 
  daily_steps
)

merged_data <- dataset_list %>%
  reduce(full_join, by = c("id", "activitydate"))
Let’s take a look at the merged result!
glimpse(merged_data)
## Rows: 940
## Columns: 26
## $ id                           <dbl> 1503960366, 1503960366, 1503960366, 15039…
## $ activitydate                 <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4…
## $ total_steps                  <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 1…
## $ total_distance               <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59,…
## $ tracker_distance             <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59,…
## $ logged_activities_distance   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ very_active_distance.x       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25,…
## $ moderately_active_distance.x <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64,…
## $ light_active_distance.x      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71,…
## $ sedentary_active_distance.x  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ very_active_minutes.x        <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 6…
## $ fairly_active_minutes.x      <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27…
## $ lightly_active_minutes.x     <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 2…
## $ sedentary_minutes.x          <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775,…
## $ calories.x                   <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921,…
## $ calories.y                   <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921,…
## $ sedentary_minutes.y          <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775,…
## $ lightly_active_minutes.y     <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 2…
## $ fairly_active_minutes.y      <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27…
## $ very_active_minutes.y        <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 6…
## $ sedentary_active_distance.y  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ light_active_distance.y      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71,…
## $ moderately_active_distance.y <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64,…
## $ very_active_distance.y       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25,…
## $ total_time_in_bed            <dbl> 346, 407, NA, 442, 367, 712, NA, 320, 377…
## $ step_total                   <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 1…
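The .x and .y suffixes in the output above appear because the same column exists in more than one source table while the join uses only id and activitydate as keys. One possible way to handle this, sketched on toy data with assumed values, is to confirm that the two copies agree and then keep a single unsuffixed column:

```r
library(dplyr)

# Toy stand-ins for two overlapping tables; values are illustrative
activity_toy <- tibble(
  id = c(1, 1),
  activitydate = c("4/12/2016", "4/13/2016"),
  total_steps = c(13162, 10735),
  calories = c(1985, 1797)
)
calories_toy <- tibble(
  id = c(1, 1),
  activitydate = c("4/12/2016", "4/13/2016"),
  calories = c(1985, 1797)
)

# Joining on the keys only keeps both copies as calories.x and calories.y
joined_keys <- full_join(activity_toy, calories_toy, by = c("id", "activitydate"))

# Confirm the two copies agree before dropping one of them
copies_match <- identical(joined_keys$calories.x, joined_keys$calories.y)

# Keep a single, unsuffixed column
joined_clean <- joined_keys %>%
  select(-calories.y) %>%
  rename(calories = calories.x)

names(joined_clean)
```

If the copies did not match, that would be a reason to investigate the source files rather than drop a column.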
Now for cleaning, let’s check for NA values in the key columns, clean all column names, and convert data types.
sum(is.na(merged_data$id))
## [1] 0
sum(is.na(merged_data$activitydate))
## [1] 0
merged_clean <- merged_data %>%
  clean_names() %>% 
  rename_with(~str_remove_all(., "\\W+")) %>%  
  rename(
    id = matches("^id$|^i_d$|participant"),  
    activity_date = matches("activitydate|date|^day$") 
  )

merged_clean <- merged_clean %>%
  mutate(
    activity_date = parse_date_time(
      activity_date,
      orders = c("ymd", "mdy", "dmy", "Y-m-d", "m/d/Y")
    ) %>% as.Date()
  ) 
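Listing several candidate orders leaves the parser to guess between month-first and day-first for strings such as "4/12/2016". Since the glimpse output shows month/day/year strings, a stricter alternative could be lubridate's mdy(). This is a sketch on sample strings, not a change to the pipeline above:

```r
library(lubridate)

# Sample strings in the month/day/year layout seen in the glimpse() output
raw_dates <- c("4/12/2016", "4/13/2016", "5/1/2016")

# mdy() reads every value as month/day/year, so "4/12/2016"
# cannot be interpreted as 4 December
parsed <- mdy(raw_dates)

format(parsed, "%Y-%m-%d")
```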
For the analysis, I’ll create some derived variables: day_of_week, is_weekend, week_of_year, and activity_level.
merged_analysis <- merged_clean %>%
  mutate(
    day_of_week = weekdays(activity_date),
    is_weekend = day_of_week %in% c("Saturday", "Sunday"),
    week_of_year = week(activity_date),
    
    activity_level = case_when(
      step_total > 10000 ~ "high",
      step_total > 5000 ~ "medium",
      TRUE ~ "low"
    ))
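Note that weekdays() returns day names in the language of the R session, so the comparison with "Saturday" and "Sunday" only works in an English locale. A locale-independent variant could rely on weekday numbers instead. The dates below are sample values chosen for illustration:

```r
library(lubridate)

# Friday through Monday, within the period covered by the data
dates <- as.Date(c("2016-04-15", "2016-04-16", "2016-04-17", "2016-04-18"))

# With week_start = 1, wday() returns 1 for Monday through 7 for Sunday,
# regardless of the language of the R session
weekday_number <- wday(dates, week_start = 1)
is_weekend <- weekday_number >= 6

is_weekend
```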
To preserve the cleaning and preparation done so far, let’s save the result to a new .csv file.
write_csv(merged_analysis, "fitbit_data_merged.csv")
Now we can return to the question of why daily_sleep has fewer participants.
merged_analysis %>%
  mutate(has_sleep_data = !is.na(total_time_in_bed)) %>%
  group_by(has_sleep_data) %>%
  summarise(avg_steps = mean(total_steps, na.rm = TRUE),
            avg_active_mins = mean(very_active_minutes_x, na.rm = TRUE))
## # A tibble: 2 × 3
##   has_sleep_data avg_steps avg_active_mins
##   <lgl>              <dbl>           <dbl>
## 1 FALSE              6959.            18.2
## 2 TRUE               8515.            25.0
We see that days without sleep data show fewer average steps (6,959 vs. 8,515) and fewer very active minutes (18.2 vs. 25.0) than days with sleep data. Possible explanations include:
  • Some users may consistently remove their devices at night
  • Some trackers may not have sleep tracking capabilities
  • Technical or syncing issues specific to sleep data
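To tell these explanations apart, it may help to look at sleep coverage per participant rather than per day. The following sketch uses a toy table with assumed values; on the real data the same summary would run on merged_analysis:

```r
library(dplyr)

# Toy stand-in for merged_analysis; ids and values are illustrative
toy <- tibble(
  id = c(1, 1, 1, 2, 2, 3),
  total_time_in_bed = c(346, NA, 407, NA, NA, 380)
)

# For each participant: tracked days, days with sleep data, and the share
sleep_coverage <- toy %>%
  group_by(id) %>%
  summarise(
    days_tracked = n(),
    days_with_sleep = sum(!is.na(total_time_in_bed)),
    sleep_share = days_with_sleep / days_tracked
  )

sleep_coverage
```

A participant with a share of zero never recorded sleep, which would point to a device or habit issue, while a partial share suggests occasional gaps.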
Let’s follow up with some more visualization.
ggplot(merged_analysis, aes(x = total_steps, fill = activity_level)) +
  geom_density(alpha = 0.6) +
  labs(title = "Step Count Distribution by Activity Level",
       x = "Total Daily Steps",
       y = "Density",
       fill = "Activity Level") +
  scale_fill_manual(
    values = c("low" = "#d62828",       
               "medium" = "#fcbf49", 
               "high" = "#003049"),  
    name = "Activity Level"      
  ) +
  theme_minimal()

Plotting the step count distribution by activity level shows a natural break point among low-activity users.
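The heat map code that follows uses heatmap_data, which is not defined in this write-up. A minimal sketch of how such a table might be built, assuming it holds the average minutes per intensity and day of week, is shown here on toy values. The exact construction used for the original plot is unknown.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for merged_analysis; the minute values are illustrative
toy <- tibble(
  day_of_week = c("Monday", "Monday", "Saturday"),
  very_active_minutes_x = c(20, 30, 40),
  fairly_active_minutes_x = c(10, 20, 15),
  lightly_active_minutes_x = c(200, 220, 250),
  sedentary_minutes_x = c(800, 780, 700)
)

# One row per day of week and intensity, holding the average minutes
heatmap_data <- toy %>%
  pivot_longer(
    cols = ends_with("_minutes_x"),
    names_to = "intensity",
    values_to = "minutes"
  ) %>%
  group_by(day_of_week, intensity) %>%
  summarise(minutes = mean(minutes), .groups = "drop")

heatmap_data
```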
ggplot(heatmap_data, aes(x = day_of_week, y = intensity, fill = minutes)) +
  geom_tile(color = "#c7f9cc") +
  scale_fill_gradient(low = "#fdcc6d", high = "#e75414") +
  labs(title = "Average Activity Intensity by Day of Week",
       x = "",
       y = "Activity Intensity",
       fill = "Minutes") +
  theme_minimal() +
  geom_text(aes(label = round(minutes)), color = "black", size = 3)

This heat map shows users’ activity intensity by day of the week.
  • Sedentary time dominates every day by a wide margin.
  • Users spend most of their time inactive, leaving very little room for meaningful intense activity.
  • Monday is the most sedentary day of the week; in contrast, participants are less sedentary on the weekends!
ggplot(merged_analysis, aes(x = total_steps, y = total_distance)) +
  geom_point(aes(color = activity_level), alpha = 0.6) +
  geom_smooth(method = "lm", color = "black") +
  labs(title = "Relationship Between Steps and Distance",
       x = "Total Steps",
       y = "Total Distance (miles/km)",
       color = "Activity Level") +
  scale_color_manual(
    values = c("low" = "#d62828",   
               "medium" = "#fcbf49",
               "high" = "#003049"),
    name = "Activity Level"        
  )
## `geom_smooth()` using formula = 'y ~ x'

The step-to-distance ratio is consistent across users!
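The consistency seen in the scatter plot could also be checked numerically. This is a sketch on toy values with assumed column contents: it computes the distance covered per 1,000 steps for each participant and the overall correlation between steps and distance.

```r
library(dplyr)

# Toy stand-in for merged_analysis; the values are illustrative
toy <- tibble(
  id = c(1, 1, 2, 2),
  total_steps = c(10000, 5000, 8000, 4000),
  total_distance = c(7.0, 3.5, 6.0, 3.0)
)

# Distance covered per 1,000 steps, per participant
distance_per_1000_steps <- toy %>%
  filter(total_steps > 0) %>%
  group_by(id) %>%
  summarise(distance_per_1000 = 1000 * sum(total_distance) / sum(total_steps))

# Pearson correlation between daily steps and daily distance
steps_distance_cor <- cor(toy$total_steps, toy$total_distance)

distance_per_1000_steps
steps_distance_cor
```

Similar per-participant values would support the claim that the ratio is consistent across users.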

Throughout the entire data collection period, activity levels remained about the same.
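One way to check this kind of stability is to average steps by week_of_year. The sketch below uses toy values, not figures from the dataset:

```r
library(dplyr)

# Toy stand-in for merged_analysis; week numbers and steps are illustrative
toy <- tibble(
  week_of_year = c(15, 15, 16, 16, 17),
  total_steps = c(8000, 9000, 8500, 8700, 8600)
)

# Average daily steps and number of records per week
weekly_steps <- toy %>%
  group_by(week_of_year) %>%
  summarise(avg_steps = mean(total_steps), n_days = n())

weekly_steps
```

Weekly averages that stay within a narrow band would be consistent with the statement above.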

SHARE

Final conclusion and next steps

It would be valuable to capture data for an entire year to check whether this trend is seasonal and whether weather, holiday seasons, hormonal cycles, or other factors directly influence the results.

For the activity intensity by day of the week:

  • Focus on reducing sedentary time rather than just increasing vigorous activity
  • Campaign for Active Monday
  • Create engaging challenges to reduce sedentary time throughout the week

For the natural break point in low activity level users:

  • Investigate what could be causing the break point, and whether it is possible to identify users who are likely to fall into the low activity level
  • Consider implementing a feature that motivates users to maintain their activity streak

For the sleep data having fewer participants than the other datasets:

  • Investigate if the sleep tracking feature is difficult to use
  • Check if users need education about sleep tracking benefits
  • Consider automatic sleep detection

Here I end my case study! Thank you for reading; feedback will be greatly appreciated!