Bellabeat Case Study

Introduction

This case study analyzes Bellabeat, a wellness tech company that designs smart products for women. The Bellabeat app tracks activity, sleep, stress, and mindfulness to help users make healthier decisions. Bellabeat’s marketing team manages campaigns across digital channels to continuously improve their strategy. As a junior data analyst on the marketing analytics team, I was asked to focus on the Bellabeat app and analyze smart-device data to understand how consumers use these product.

ASK

Business task

The goal of the project is to understand trends in the use of smart devices produced by Bellabeat in order to provide recommendations to the marketing team to help decide where to focus efforts to grow the company.

Key questions

What usage trends emerge from FitBit smart device data?
What marketing strategies can Bellabeat adopt based on insights?

Key Stakeholders

Urška Sršen: Bellabeat’s co-founder and Chief Creative Officer.
Sando Mur: Mathematician and Bellabeat’s co-founder. Bellabeat marketing analytics team.

PREPARE

Data source

The data is stored on Kaggle: FitBit Fitness Tracker Data. The dataset is public and openly accessible, provided by Fitbit Fitness Tracker, a well-known brand.

Data stored

C:/Users/User/OneDrive/Desktop/Data analyst/Portfolio/A. Wellness technology/data_wellness/Fitabase Data

Data selection

The dataset contains 18 .csv files, but I selected only two:
- dailyActivity_merged
- sleepDay_merged

Both files are in long format, where each ID has tracked data in multiple rows sorted by date, including:
- Physical activity: steps, distance
- Energy: calories burned
- Sedentary behavior in minutes
- Sleep data

Credibility of the data

I evaluated the dataset using the ROCCC criteria.

Reliable: medium. The data comes from Fitbit, a trusted source. However, users may forget to wear their device, causing incomplete records.
Original: low. The dataset was collected by a third party, not directly by Bellabeat.
Comprehensive: medium. Includes personal health metrics sufficient to answer business questions, but only from 30 users, limiting representativeness.
Current: low. Data was collected in 2016, making it outdated for current trend analysis.
Cited: medium. Well documented on Kaggle, not in peer-reviewed.

The dataset includes only 30 users and covers 31 days, without demographic details such as age, gender, or device model, which may bias results toward more health-conscious participants. Overall, the data is reliable but limited in representativeness and time span.

PROCESS

Load needed R packages

library(tidyverse) # includes dplyr, tidyr, ggplot2
library(lubridate) # for working with date formats
library(reshape2) # for melt() to create correlation heatmaps

Load data files

da = Daily activity data
ds = Sleep day data

da <- read.csv("C:/Users/User/OneDrive/Desktop/Data analyst/Portfolio/A_Wellness technology/data_wellness/Fitabase Data/dailyActivity_merged - copia.csv")
ds <- read.csv("C:/Users/User/OneDrive/Desktop/Data analyst/Portfolio/A_Wellness technology/data_wellness/Fitabase Data/sleepDay_merged - copia.csv")

Check and explore, clean data

Daily activity data

str(da)

## 'data.frame':    457 obs. of  15 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr  "3/25/2016" "3/26/2016" "3/27/2016" "3/28/2016" ...
##  $ TotalSteps              : int  11004 17609 12736 13231 12041 10970 12256 12262 11248 10016 ...
##  $ TotalDistance           : num  7.11 11.55 8.53 8.93 7.85 ...
##  $ TrackerDistance         : num  7.11 11.55 8.53 8.93 7.85 ...
##  $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num  2.57 6.92 4.66 3.19 2.16 ...
##  $ ModeratelyActiveDistance: num  0.46 0.73 0.16 0.79 1.09 ...
##  $ LightActiveDistance     : num  4.07 3.91 3.71 4.95 4.61 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : int  33 89 56 39 28 30 33 47 40 15 ...
##  $ FairlyActiveMinutes     : int  12 17 5 20 28 13 12 21 11 30 ...
##  $ LightlyActiveMinutes    : int  205 274 268 224 243 223 239 200 244 314 ...
##  $ SedentaryMinutes        : int  804 588 605 1080 763 1174 820 866 636 655 ...
##  $ Calories                : int  1819 2154 1944 1932 1886 1820 1889 1868 1843 1850 ...

Sleep data

str(ds)

## 'data.frame':    413 obs. of  5 variables:
##  $ Id                : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ SleepDay          : chr  "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
##  $ TotalSleepRecords : int  1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: int  327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : int  346 407 442 367 712 320 377 364 384 449 ...

Clean datasets: first, check formats and add useful derived metrics

# The column ActivityDate of Activity data is renamed as Date.
da <- da %>%     
  rename(Date = ActivityDate)

# The column Date of Sleep data contains also the time. I separate Date and time
ds <- ds %>%
  separate(SleepDay, into = c("Date", "Time"), sep= " ")

# Fix the Date format in both datasets
da$Date <- as.Date(da$Date, format="%m/%d/%Y")
ds$Date <- as.Date(ds$Date, format="%m/%d/%Y")

# Add day of week for analysis
Sys.setlocale("LC_TIME", "C")

## [1] "C"

da$Weekday <- weekdays(da$Date)
ds$Weekday <- weekdays(ds$Date)

# Fix Weekday format and order it
da$Weekday <- factor(da$Weekday,
                     levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                                "Friday", "Saturday", "Sunday"),
                     ordered = TRUE)

ds$Weekday <- factor(ds$Weekday,
                     levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                                "Friday", "Saturday", "Sunday"),
                     ordered = TRUE)

Clean datasets: check for errors

# Check for duplicates
sum(duplicated(da))

## [1] 0

sum(duplicated(ds))

## [1] 3

# Remove Duplicates
ds <- ds %>%
  distinct()

# Check for N/As
colSums(is.na(da))

##                       Id                     Date               TotalSteps 
##                        0                        0                        0 
##            TotalDistance          TrackerDistance LoggedActivitiesDistance 
##                        0                        0                        0 
##       VeryActiveDistance ModeratelyActiveDistance      LightActiveDistance 
##                        0                        0                        0 
##  SedentaryActiveDistance        VeryActiveMinutes      FairlyActiveMinutes 
##                        0                        0                        0 
##     LightlyActiveMinutes         SedentaryMinutes                 Calories 
##                        0                        0                        0 
##                  Weekday 
##                        0

colSums(is.na(ds))

##                 Id               Date               Time  TotalSleepRecords 
##                  0                  0                  0                  0 
## TotalMinutesAsleep     TotalTimeInBed            Weekday 
##                  0                  0                  0

# Remove trim spaces
da$Id <- trimws(da$Id)
ds$Id <- trimws(ds$Id)

I want to merge the two datasets.
There are multiple observations for each ID, so I check how many IDs are in each dataset.

n_distinct(da$Id)

## [1] 35

n_distinct(ds$Id)

## [1] 24

Only 24 participants contribute sleep data, while 35 contribute activity data. How many IDs are common between the datasets?

common_ids <- intersect(da$Id, ds$Id)
length(common_ids)

## [1] 24

How many Dates are common between the datasets?

common_dates <- intersect(da$Date, ds$Date)
length(common_dates)

## [1] 1

ANALYZE

I want to understand whether there are temporal (daily or weekly) trends, recurring over time, and user-level trends.

User-level trends: transform the data

Since I have 24 IDs in common but only one date shared between the two datasets, it means the dates don’t match for the same IDs. Therefore, I aggregate the data only by ID. This is consistent with the analysis goals, as we want to understand users’ habits, indipendently of the date.
First, I calculate the averages per ID, then I merge the two datasets, so I’ll have only one row per ID.

mean_da <- da %>%
  group_by(Id) %>% 
  summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)))  
mean_da

## # A tibble: 35 × 14
##    Id         TotalSteps TotalDistance TrackerDistance LoggedActivitiesDistance
##    <chr>           <dbl>         <dbl>           <dbl>                    <dbl>
##  1 1503960366     11641.          7.61            7.61                        0
##  2 1624580081      4226.          2.75            2.75                        0
##  3 1644430081      9275.          6.75            6.75                        0
##  4 1844505072      3641.          2.41            2.41                        0
##  5 1927972279      2181.          1.51            1.51                        0
##  6 2022484408     12175.          8.77            8.77                        0
##  7 2026352035      3393.          2.10            2.10                        0
##  8 2320127002      3138.          2.12            2.12                        0
##  9 2347167796      9800.          6.51            6.51                        0
## 10 2873212765      6637.          4.47            4.47                        0
## # ℹ 25 more rows
## # ℹ 9 more variables: VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## #   LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## #   VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## #   LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>

mean_ds <- ds %>%
  group_by(Id) %>%
  summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)))
mean_ds

## # A tibble: 24 × 4
##    Id         TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
##    <chr>                  <dbl>              <dbl>          <dbl>
##  1 1503960366              1.08               360.           383.
##  2 1644430081              1                  294            346 
##  3 1844505072              1                  652            961 
##  4 1927972279              1.6                417            438.
##  5 2026352035              1                  506.           538.
##  6 2320127002              1                   61             69 
##  7 2347167796              1                  447.           491.
##  8 3977333714              1.14               294.           461.
##  9 4020332650              1                  349.           380.
## 10 4319703577              1.04               477.           502.
## # ℹ 14 more rows

mean_da_ds <- inner_join(mean_da, mean_ds, by = "Id")
mean_da_ds

## # A tibble: 24 × 17
##    Id         TotalSteps TotalDistance TrackerDistance LoggedActivitiesDistance
##    <chr>           <dbl>         <dbl>           <dbl>                    <dbl>
##  1 1503960366     11641.          7.61            7.61                        0
##  2 1644430081      9275.          6.75            6.75                        0
##  3 1844505072      3641.          2.41            2.41                        0
##  4 1927972279      2181.          1.51            1.51                        0
##  5 2026352035      3393.          2.10            2.10                        0
##  6 2320127002      3138.          2.12            2.12                        0
##  7 2347167796      9800.          6.51            6.51                        0
##  8 3977333714      8664.          5.81            5.81                        0
##  9 4020332650      5777.          4.14            4.14                        0
## 10 4319703577      7821.          5.26            5.26                        0
## # ℹ 14 more rows
## # ℹ 12 more variables: VeryActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>, TotalSleepRecords <dbl>,
## #   TotalMinutesAsleep <dbl>, TotalTimeInBed <dbl>

colnames(mean_da_ds)

##  [1] "Id"                       "TotalSteps"              
##  [3] "TotalDistance"            "TrackerDistance"         
##  [5] "LoggedActivitiesDistance" "VeryActiveDistance"      
##  [7] "ModeratelyActiveDistance" "LightActiveDistance"     
##  [9] "SedentaryActiveDistance"  "VeryActiveMinutes"       
## [11] "FairlyActiveMinutes"      "LightlyActiveMinutes"    
## [13] "SedentaryMinutes"         "Calories"                
## [15] "TotalSleepRecords"        "TotalMinutesAsleep"      
## [17] "TotalTimeInBed"

dim(mean_da_ds)

## [1] 24 17

User-level trends: Activity VS Sleep

Select only relevant columns for correlation

merged_data_selected <- mean_da_ds %>%
  select(Id, TotalSteps, Calories, TotalMinutesAsleep, TotalTimeInBed)
corr_matrix <- cor(merged_data_selected %>% 
                     select(-Id,))

Temporal-trends: Activity per day

da %>% 
  select(TotalSteps, Calories) %>% 
  summary()

##    TotalSteps       Calories   
##  Min.   :    0   Min.   :   0  
##  1st Qu.: 1988   1st Qu.:1776  
##  Median : 5986   Median :2062  
##  Mean   : 6547   Mean   :2189  
##  3rd Qu.:10198   3rd Qu.:2667  
##  Max.   :28497   Max.   :4562

The participants averaged ~6,500 steps/day, which is less than the recommended 10,000 steps by Centers for Disease Control and Prevention (CDC). Since taking fewer than 5000 steps per day is considered sedentary, the participants in the dataset can be considered slightly more active than most of the population. The mean calories burned per day is 2,189.

da %>% 
  select(SedentaryMinutes, LightlyActiveMinutes, FairlyActiveMinutes, VeryActiveMinutes) %>% 
  summary()

##  SedentaryMinutes LightlyActiveMinutes FairlyActiveMinutes VeryActiveMinutes
##  Min.   :  32.0   Min.   :  0.0        Min.   :  0.00      Min.   :  0.00   
##  1st Qu.: 728.0   1st Qu.: 64.0        1st Qu.:  0.00      1st Qu.:  0.00   
##  Median :1057.0   Median :181.0        Median :  1.00      Median :  0.00   
##  Mean   : 995.3   Mean   :170.1        Mean   : 13.07      Mean   : 16.62   
##  3rd Qu.:1285.0   3rd Qu.:257.0        3rd Qu.: 16.00      3rd Qu.: 25.00   
##  Max.   :1440.0   Max.   :720.0        Max.   :660.00      Max.   :202.00

Mean sedentary minutes is 995.3 minutes, aprox 16 hours per day.

Temporal-trends: Sleep per day

names(ds)

## [1] "Id"                 "Date"               "Time"              
## [4] "TotalSleepRecords"  "TotalMinutesAsleep" "TotalTimeInBed"    
## [7] "Weekday"

Having the minutes spent in bed and the minutes asleep, I can calculate the percentage of time participants are asleep out of the time spent in bed.

sleep <- ds %>% 
  mutate(PercentSleep = (TotalMinutesAsleep/TotalTimeInBed)*100)
head(sleep)

##           Id       Date     Time TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 2016-04-12 12:00:00                 1                327
## 2 1503960366 2016-04-13 12:00:00                 2                384
## 3 1503960366 2016-04-15 12:00:00                 1                412
## 4 1503960366 2016-04-16 12:00:00                 2                340
## 5 1503960366 2016-04-17 12:00:00                 1                700
## 6 1503960366 2016-04-19 12:00:00                 1                304
##   TotalTimeInBed   Weekday PercentSleep
## 1            346   Tuesday     94.50867
## 2            407 Wednesday     94.34889
## 3            442    Friday     93.21267
## 4            367  Saturday     92.64305
## 5            712    Sunday     98.31461
## 6            320   Tuesday     95.00000

sleep %>% 
  select(TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed, PercentSleep) %>% 
  summary()

##  TotalSleepRecords TotalMinutesAsleep TotalTimeInBed   PercentSleep   
##  Min.   :1.00      Min.   : 58.0      Min.   : 61.0   Min.   : 49.84  
##  1st Qu.:1.00      1st Qu.:361.0      1st Qu.:403.8   1st Qu.: 91.18  
##  Median :1.00      Median :432.5      Median :463.0   Median : 94.26  
##  Mean   :1.12      Mean   :419.2      Mean   :458.5   Mean   : 91.65  
##  3rd Qu.:1.00      3rd Qu.:490.0      3rd Qu.:526.0   3rd Qu.: 96.06  
##  Max.   :3.00      Max.   :796.0      Max.   :961.0   Max.   :100.00

Users sleep about 419.2 minutes or 7 hours per night, which is considered healthy, as adults are recommended to sleep 7 or more hours per night. They spend 91% of their time in bed asleep.

Temporal-trends: Activity per Weekday

Day of the week when users are most active

steps_by_weekday <- da %>% 
     group_by(Weekday) %>%
     summarise(mean_steps = mean(TotalSteps, na.rm = TRUE)) %>%
     arrange(desc(mean_steps))

User-level trends: Activity VS Sleep

# Correlation matrix considering the mean for each user (mean_da_ds
melted_corr <- melt(corr_matrix) 
ggplot(data = melted_corr, aes(x=Var1, y=Var2, fill=value)) + 
  geom_tile() + 
  geom_text(aes(label=round(value, 2)), color="black") + 
  scale_fill_gradient2(low="violetred3", mid="white", high="springgreen1", midpoint=0) + 
  labs(title="Correlation heatmap", x="", y="") + 
  theme_minimal()

TotalSteps & Calories (0.35): people who take more steps tend to burn more calories.
TotalSteps & TotalMinutesAsleep (–0.17): more active people might sleep a little less.
Calories & TotalMinutesAsleep (–0.06): almost no connection between calories burned and sleep time.
TotalMinutesAsleep & TotalTimeInBed (0.94): Very strong relationship. People who sleep more also spend more time in bed.

We can see these correlation in the following plots:

ggplot(mean_da_ds, aes(x = TotalSteps, y = Calories)) + 
  geom_point(alpha = 0.6) + 
  geom_smooth() + 
  labs(title="Average Total Steps vs Calories per User",
       x = "Average Daily Steps",
       y = "Average Daily Calories")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Users who walk more burn more calories.

ggplot(mean_da_ds, aes(x = TotalSteps, y = TotalMinutesAsleep)) +
  geom_point() + 
  geom_smooth(method = "lm") +
  labs(title="Average Total Steps vs Sleep per User",
       x = "Average Daily Steps",
       y = "Average Minutes Asleep")

## `geom_smooth()` using formula = 'y ~ x'

Users with higher step counts logged slightly less sleep duration, but the negative correlation is very weak.

ggplot(data=ds, aes(x=TotalMinutesAsleep, y=TotalTimeInBed)) + 
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title="Total Minutes Asleep vs. Total Time in Bed",
       x = "Total Minutes Asleep",
       y = "Total Time in Bed")

## `geom_smooth()` using formula = 'y ~ x'

There is a linear relationship between Total Minutes Asleep and Total Time in Bed.

Temporal-trend: Activity per Weekday

steps_by_weekday

## # A tibble: 7 × 2
##   Weekday   mean_steps
##   <ord>          <dbl>
## 1 Wednesday      7511.
## 2 Monday         7119.
## 3 Saturday       7090.
## 4 Thursday       6847.
## 5 Friday         6738.
## 6 Sunday         6058.
## 7 Tuesday        4915.

ggplot(steps_by_weekday, aes(x = Weekday, y = mean_steps)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Average steps per weekday",
       x = "Weekday",
       y = "Average steps") +
  theme_minimal()

Activity peaks on Wednesdays, followed by Mondays and Saturdays.

ACT

Activity: users average 6,500 steps/day, below the recommended 10,000 goal. Activity peaks on Wednesdays, with similar levels on Mondays and Saturdays, suggesting that motivation may rise mid-week and drop toward the end. → Notifications could reinforce movement consistency throughout the week, especially on low-activity days (Thursday–Friday).
Sedentary behavior: users spend ~16 h/day inactive → I recommend to conduct further analysis by hour of the day to identify when inactivity peaks and send targeted reminders (“time to move!”) during those hours.
Sleep: users sleep ~7 h/night, with a 91% efficiency → add “go to sleep” reminders and suggest optimal bedtime.
Behavior patterns: higher activity on weekends → launch weekend challenges to keep engagement stable.
Positioning: present Bellabeat as a wellness coach, not just a fitness app → use personalized insights and data-driven nudges to motivate healthy habits.

Thanks for reading! I’ve worked with R during my PhD research, but this is my first project shared publicly as part of my portfolio. I’d really appreciate any comments or advice to help me improve.

Bellabeat Case Study

Elena S.

2025-11-05

Introduction

ASK

Business task

Key questions

Key Stakeholders

PREPARE

Data source

Data stored

Data selection

Credibility of the data

PROCESS

Load needed R packages

Load data files

Check and explore, clean data

ANALYZE

User-level trends: transform the data

User-level trends: Activity VS Sleep

Temporal-trends: Activity per day

Temporal-trends: Sleep per day

Temporal-trends: Activity per Weekday

ACT

Bellabeat Case Study

Elena S.

2025-11-05

Introduction

ASK

Business task

Key questions

Key Stakeholders

PREPARE

Data source

Data stored

Data selection

Credibility of the data

PROCESS

Load needed R packages

Load data files

Check and explore, clean data

ANALYZE

User-level trends: transform the data

User-level trends: Activity VS Sleep

Temporal-trends: Activity per day

Temporal-trends: Sleep per day

Temporal-trends: Activity per Weekday

SHARE

User-level trends: Activity VS Sleep

Temporal-trend: Activity per Weekday

ACT