Case Study: How can a wellness company play it smart?

Intro

This case study explores the question “How can a wellness company play it smart?”.

The company

Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products. Sršen used her background as an artist to develop beautifully designed technology that informs and inspires women around the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits. Since it was founded in 2013, Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for women.

ASK - Business challenge

How can Bellabeat find more opportunities to grow by analyzing smart device usage data to understand how users look after their health and wellness?

The key stakeholders include:

Urška Sršen: Chief Creative Officer and Bellabeat’s Co-founder.
Sando Mur: Mathematician and Bellabeat’s Co-founder.
Bellabeat’s marketing analytics team: a team of data analysts.

Data Sources

FitBit Fitness Tracker Data: Kaggle Dataset by Arash Nic (CC0 Public Domain). Accessed May 2025.

PREPARE

The data is in long format, with ID and date variables that make it easy to merge multiple datasets into a single wide-format table.
The data meets the Reliable, Original, Comprehensive and Cited criteria of the ROCCC checklist, but it fails the Current criterion: it is from 2016 and does not reflect today's technology (2025).
The datasets were verified to be public domain. I checked the sources to make sure I had the raw, original data. The fields in the dataset are relevant to understanding how smart devices are used.
Finally, no problems were detected in the data.

PROCESS

I used Google Sheets for an overall view of the data, since I find it faster for checking column names and spotting obvious problems. I then used R for cleaning, preparation and visualization.
I ensured the data's integrity by making copies before any manipulation and by double-checking the number of columns and rows after every modification.
I checked for unique IDs, NA values in key columns and duplicates, made the date format uniform, and converted the column names to lowercase with underscores, removing special characters.
I verified that the data was clean and ready to analyze by checking its summary with commands such as head(dataframe), str(dataframe) and glimpse(dataframe).
The cleaning process is also documented in comments as I code, so that I can review and share my results.
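
The checks described above can be sketched in a few lines of base R. The data frame `df` and its values are made up for illustration; in this report the same calls would be applied to each loaded dataset.

```r
# Toy stand-in for one of the Fitbit tables (values are illustrative only)
df <- data.frame(
  id = c(1, 1, 2, 2, 2),
  activity_date = c("4/12/2016", "4/13/2016", "4/12/2016", "4/12/2016", "4/13/2016"),
  total_steps = c(100, 200, 300, 300, 400)
)

n_rows <- nrow(df)                       # row count, to compare after each change
n_cols <- ncol(df)                       # column count, to compare after each change
n_ids <- length(unique(df$id))           # number of distinct participants
n_missing_keys <- sum(is.na(df$id) | is.na(df$activity_date))  # NAs in key columns
n_duplicates <- sum(duplicated(df))      # rows that repeat an earlier row exactly

print(c(rows = n_rows, cols = n_cols, ids = n_ids,
        missing_keys = n_missing_keys, duplicates = n_duplicates))
```

Recording these numbers before and after each transformation makes it easy to notice rows or columns that were dropped by accident.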

For this case study I'm working with the following datasets:

dailyActivity_merged.csv
dailyCalories_merged.csv
dailyIntensities_merged.csv
dailySteps_merged.csv
sleepDay_merged.csv

Using Sheets, I removed duplicates; sleepDay_merged.csv had 3 duplicate rows. I also renamed the columns and checked for whitespace to make further coding in R easier.
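
The same duplicate removal could also be done in R, which keeps the step reproducible. This is only a sketch: `sleep_raw`, its column names and its values are hypothetical and do not come from the real file.

```r
# Hypothetical sleep table with one exact duplicate row
sleep_raw <- data.frame(
  id = c(1, 1, 1, 2),
  sleep_day = c("4/12/2016", "4/12/2016", "4/13/2016", "4/12/2016"),
  total_time_in_bed = c(346, 346, 407, 320)
)

rows_before <- nrow(sleep_raw)
sleep_dedup <- sleep_raw[!duplicated(sleep_raw), ]  # keep the first copy of each row
rows_removed <- rows_before - nrow(sleep_dedup)

message(rows_removed, " duplicate row(s) removed")
```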

Loading relevant libraries.

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)    # For data manipulation and visualization
library(janitor)      # For data cleaning functions
library(dplyr)        # For data manipulation functions
library(lubridate)    # For parsing dates
library(ggplot2)      # For visualization
library(tidyr)        # For cleaning and reshaping data

I created copies of the raw data, unified the file naming, and loaded the copies into R, cleaning the column names on import.

daily_activity <- read_csv("~/Documents/Data Science/Data /bellabeat_casestudy_andressa/data/Copy_dailyActivity_merged.csv") %>% clean_names()
daily_calories <- read_csv("~/Documents/Data Science/Data /bellabeat_casestudy_andressa/data/Copy_dailyCalories_merged.csv") %>% clean_names()
daily_intensitie <- read_csv("~/Documents/Data Science/Data /bellabeat_casestudy_andressa/data/Copy_dailyIntensities_merged.csv") %>% clean_names()
daily_steps <- read_csv("~/Documents/Data Science/Data /bellabeat_casestudy_andressa/data/Copy_dailySteps_merged.csv") %>% clean_names()
daily_sleep <- read_csv("~/Documents/Data Science/Data /bellabeat_casestudy_andressa/data/Copy_sleepDay_merged.csv") %>% clean_names()

ANALYZE

Checking unique participants for each dataset

n_distinct(daily_activity$id)

## [1] 33

n_distinct(daily_calories$id)

## [1] 33

n_distinct(daily_intensitie$id)

## [1] 33

n_distinct(daily_steps$id)

## [1] 33

n_distinct(daily_sleep$id)

## [1] 24
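
To see exactly which participants are missing from the sleep table, the two sets of IDs can be compared directly. This is a sketch with made-up ID vectors; with the real data they would be `daily_activity$id` and `daily_sleep$id`.

```r
# Hypothetical IDs; repeats mimic one row per user and day
activity_ids <- c(101, 102, 103, 104, 101)
sleep_ids <- c(101, 103, 103)

# IDs present in the activity data but absent from the sleep data
missing_sleep_ids <- setdiff(unique(activity_ids), unique(sleep_ids))
print(missing_sleep_ids)
```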

Important insight

Not every participant logged daily sleep. The difference in participant counts between the sleep dataset (24 users) and the other datasets (33 users) is significant and worth exploring; we'll look at possible reasons later in the Analyze phase. First, let's create a single dataframe to work with, to simplify the coding:

dataset_list <- list(
  daily_activity, 
  daily_calories, 
  daily_intensitie, 
  daily_sleep, 
  daily_steps
)

merged_data <- dataset_list %>%
  reduce(full_join, by = c("id", "activitydate"))

Let’s take a look at the merged result!

glimpse(merged_data)

## Rows: 940
## Columns: 26
## $ id                           <dbl> 1503960366, 1503960366, 1503960366, 15039…
## $ activitydate                 <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4…
## $ total_steps                  <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 1…
## $ total_distance               <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59,…
## $ tracker_distance             <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59,…
## $ logged_activities_distance   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ very_active_distance.x       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25,…
## $ moderately_active_distance.x <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64,…
## $ light_active_distance.x      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71,…
## $ sedentary_active_distance.x  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ very_active_minutes.x        <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 6…
## $ fairly_active_minutes.x      <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27…
## $ lightly_active_minutes.x     <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 2…
## $ sedentary_minutes.x          <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775,…
## $ calories.x                   <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921,…
## $ calories.y                   <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921,…
## $ sedentary_minutes.y          <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775,…
## $ lightly_active_minutes.y     <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 2…
## $ fairly_active_minutes.y      <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27…
## $ very_active_minutes.y        <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 6…
## $ sedentary_active_distance.y  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ light_active_distance.y      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71,…
## $ moderately_active_distance.y <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64,…
## $ very_active_distance.y       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25,…
## $ total_time_in_bed            <dbl> 346, 407, NA, 442, 367, 712, NA, 320, 377…
## $ step_total                   <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 1…
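
The `.x` and `.y` suffixes in the output above appear because the same measures exist in more than one source table and are not part of the join key. One way to handle this, sketched here on a made-up data frame, is to confirm that each pair holds the same values and then keep a single copy. The helper name `drop_identical_pairs` is hypothetical and not part of any package.

```r
# Toy merged table with one duplicated measure (values are illustrative)
merged_toy <- data.frame(
  id = c(1, 2, 3),
  calories.x = c(1985, 1797, 1776),
  calories.y = c(1985, 1797, 1776),
  total_steps = c(13162, 10735, 10460)
)

# For each column ending in ".x": if its ".y" twin holds the same values,
# drop the twin and strip the ".x" suffix from the surviving column.
drop_identical_pairs <- function(df) {
  x_cols <- grep("\\.x$", names(df), value = TRUE)
  for (x_col in x_cols) {
    stem <- sub("\\.x$", "", x_col)
    y_col <- paste0(stem, ".y")
    if (y_col %in% names(df) && isTRUE(all.equal(df[[x_col]], df[[y_col]]))) {
      df[[y_col]] <- NULL
      names(df)[names(df) == x_col] <- stem
    }
  }
  df
}

merged_dedup <- drop_identical_pairs(merged_toy)
print(names(merged_dedup))
```

Pairs that differ are left untouched, so any real disagreement between source tables stays visible instead of being silently discarded.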

Now for cleaning: let’s check for NA values in the key columns, clean all column names and convert the date column to a proper date type.

sum(is.na(merged_data$id))

## [1] 0

sum(is.na(merged_data$activitydate))

## [1] 0

merged_clean <- merged_data %>%
  clean_names() %>% 
  rename_with(~str_remove_all(., "\\W+")) %>%  
  rename(
    id = matches("^id$|^i_d$|participant"),  
    activity_date = matches("activitydate|date|^day$") 
  )

merged_clean <- merged_clean %>%
  mutate(
    activity_date = parse_date_time(
      activity_date,
      orders = c("ymd", "mdy", "dmy", "Y-m-d", "m/d/Y")
    ) %>% as.Date()
  )
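
Since the glimpse output shows dates such as "4/12/2016", spelling out a single month/day/year format is one way to avoid any ambiguity between month-first and day-first orders. A minimal base R sketch on made-up strings, with a check that nothing failed to parse:

```r
# Illustrative date strings in the same month/day/year layout
raw_dates <- c("4/12/2016", "4/13/2016", "5/1/2016")

parsed <- as.Date(raw_dates, format = "%m/%d/%Y")
n_failed <- sum(is.na(parsed))   # strings that do not match the format become NA

print(parsed)
print(n_failed)
```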

For the analysis, I’ll create derived variables such as `day_of_week`, `is_weekend`, `week_of_year` and `activity_level`.

merged_analysis <- merged_clean %>%
  mutate(
    day_of_week = weekdays(activity_date),
    is_weekend = day_of_week %in% c("Saturday", "Sunday"),
    week_of_year = week(activity_date),
    
    activity_level = case_when(
      step_total > 10000 ~ "high",
      step_total > 5000 ~ "medium",
      TRUE ~ "low"
    ))
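
One caveat: `weekdays()` returns day names in the language of the current locale, so comparing against "Saturday" and "Sunday" only works in an English locale. A locale-independent alternative, sketched here on a few made-up dates, is to use the numeric weekday instead:

```r
# Friday to Monday in April 2016 (illustrative dates)
dates <- as.Date(c("2016-04-15", "2016-04-16", "2016-04-17", "2016-04-18"))

wday_num <- as.POSIXlt(dates)$wday    # 0 = Sunday, 1 = Monday, ..., 6 = Saturday
is_weekend <- wday_num %in% c(0, 6)

print(data.frame(dates, wday_num, is_weekend))
```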

To preserve the cleaning and preparation done so far, let’s save the result to a new .csv file.

write_csv(merged_analysis, "fitbit_data_merged.csv")

Now we can return to the question of why `daily_sleep` has fewer participants.

merged_analysis %>%
  mutate(has_sleep_data = !is.na(total_time_in_bed)) %>%
  group_by(has_sleep_data) %>%
  summarise(avg_steps = mean(total_steps, na.rm = TRUE),
            avg_active_mins = mean(very_active_minutes_x, na.rm = TRUE))

## # A tibble: 2 × 3
##   has_sleep_data avg_steps avg_active_mins
##   <lgl>              <dbl>           <dbl>
## 1 FALSE              6959.            18.2
## 2 TRUE               8515.            25.0

We see that records without sleep data show fewer steps and fewer very active minutes on average. Possible explanations include:

Some users may consistently remove their devices at night
Some trackers may not have sleep tracking capabilities
Tech issues with syncing problems specific to sleep data
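
The comparison above is made per record (one row per user and day). To tell users who never log sleep apart from users who log it only on some days, coverage can also be summarised per user. A sketch on made-up data:

```r
# Toy data: NA in total_time_in_bed means no sleep record for that day
toy <- data.frame(
  id = c(1, 1, 1, 2, 2, 3, 3),
  total_time_in_bed = c(346, NA, 407, NA, NA, 320, 377)
)

# Share of each user's days that have a sleep record
sleep_coverage <- tapply(!is.na(toy$total_time_in_bed), toy$id, mean)

# Users with no sleep record at all
never_logged <- names(sleep_coverage)[sleep_coverage == 0]

print(sleep_coverage)
print(never_logged)
```

A user with zero coverage is consistent with a device that does not track sleep or is never worn at night, while partial coverage points more towards habits or syncing gaps.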

Let’s follow up with some more visualization.

ggplot(merged_analysis, aes(x = total_steps, fill = activity_level)) +
  geom_density(alpha = 0.6) +
  labs(title = "Step Count Distribution by Activity Level",
       x = "Total Daily Steps",
       y = "Density",
       fill = "Activity Level") +
  scale_fill_manual(
    values = c("low" = "#d62828",       
               "medium" = "#fcbf49", 
               "high" = "#003049"),  
    name = "Activity Level"      
  ) +
  theme_minimal()

The step count distribution by activity level shows a natural break point among low-activity users.

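# heatmap_data is not defined in the report; this is an assumed reconstruction
# (average minutes per intensity level and weekday).
heatmap_data <- merged_analysis %>%
  pivot_longer(c(very_active_minutes_x, fairly_active_minutes_x,
                 lightly_active_minutes_x, sedentary_minutes_x),
               names_to = "intensity", values_to = "minutes") %>%
  group_by(day_of_week, intensity) %>%
  summarise(minutes = mean(minutes, na.rm = TRUE), .groups = "drop")

ggplot(heatmap_data, aes(x = day_of_week, y = intensity, fill = minutes)) +
  geom_tile(color = "#c7f9cc") +
  scale_fill_gradient(low = "#fdcc6d", high = "#e75414") +
  labs(title = "Average Activity Intensity by Day of Week",
       x = "",
       y = "Activity Intensity",
       fill = "Minutes") +
  theme_minimal() +
  geom_text(aes(label = round(minutes)), color = "black", size = 3)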

This heat map shows users’ activity intensity by day of the week.

Sedentary time dominates every day, far exceeding every other intensity level.
Users spend most of their time inactive, leaving very little room for meaningful intense activity.
Monday is the most inactive day of the week; in contrast, participants are less sedentary on the weekends.
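
Because `day_of_week` is stored as text, ggplot2 places the days on the axis in alphabetical order. Converting the column to a factor with explicit levels puts them in calendar order. A small sketch, assuming English day names and a week that starts on Monday:

```r
day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")

# Illustrative values as returned by weekdays() in an English locale
days <- c("Sunday", "Monday", "Friday", "Monday")

days_ordered <- factor(days, levels = day_levels)
print(sort(days_ordered))   # sorts by calendar position, not alphabetically
```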

ggplot(merged_analysis, aes(x = total_steps, y = total_distance)) +
  geom_point(aes(color = activity_level), alpha = 0.6) +
  geom_smooth(method = "lm", color = "black") +
  labs(title = "Relationship Between Steps and Distance",
       x = "Total Steps",
       y = "Total Distance (miles/km)",
       color = "Activity Level") +
  scale_color_manual(
    values = c("low" = "#d62828",   
               "medium" = "#fcbf49",
               "high" = "#003049"),
    name = "Activity Level"        
  )

## `geom_smooth()` using formula = 'y ~ x'

Andressa Silva

2025-05-02

The step-to-distance ratio is consistent across users.
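
One way to check this numerically is to compute the distance covered per 1,000 steps for each user and compare the values. The data below is made up, and the column `distance_per_1000_steps` is a hypothetical name:

```r
# Toy data: two users, two days each (values are illustrative)
toy <- data.frame(
  id = c(1, 1, 2, 2),
  total_steps = c(10000, 5000, 8000, 4000),
  total_distance = c(7.0, 3.5, 6.0, 3.0)
)

# Sum steps and distance per user, then take the ratio of the totals
totals <- aggregate(cbind(total_steps, total_distance) ~ id, data = toy, FUN = sum)
totals$distance_per_1000_steps <- 1000 * totals$total_distance / totals$total_steps

print(totals)
```

Using per-user totals rather than daily ratios avoids dividing by zero on days with no recorded steps.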

Throughout the whole data collection period, activity levels remained stable.

SHARE

Final conclusion and next steps
