Introduction

This case study analyzes Bellabeat, a wellness tech company that designs smart products for women. The Bellabeat app tracks activity, sleep, stress, and mindfulness to help users make healthier decisions. Bellabeat’s marketing team manages campaigns across digital channels to continuously improve their strategy. As a junior data analyst on the marketing analytics team, I was asked to focus on the Bellabeat app and analyze smart-device data to understand how consumers use these product.


ASK

Business task

The goal of the project is to understand trends in the use of smart devices produced by Bellabeat in order to provide recommendations to the marketing team to help decide where to focus efforts to grow the company.

Key questions

  • What usage trends emerge from FitBit smart device data?
  • What marketing strategies can Bellabeat adopt based on insights?

Key Stakeholders

  • Urška Sršen: Bellabeat’s co-founder and Chief Creative Officer.
  • Sando Mur: Mathematician and Bellabeat’s co-founder. Bellabeat marketing analytics team.


PREPARE

Data source

The data is stored on Kaggle: FitBit Fitness Tracker Data. The dataset is public and openly accessible, provided by Fitbit Fitness Tracker, a well-known brand.

Data stored

C:/Users/User/OneDrive/Desktop/Data analyst/Portfolio/A. Wellness technology/data_wellness/Fitabase Data

Data selection

The dataset contains 18 .csv files, but I selected only two:
- dailyActivity_merged
- sleepDay_merged

Both files are in long format, where each ID has tracked data in multiple rows sorted by date, including:
- Physical activity: steps, distance
- Energy: calories burned
- Sedentary behavior in minutes
- Sleep data

Credibility of the data

I evaluated the dataset using the ROCCC criteria.

  • Reliable: medium. The data comes from Fitbit, a trusted source. However, users may forget to wear their device, causing incomplete records.
  • Original: low. The dataset was collected by a third party, not directly by Bellabeat.
  • Comprehensive: medium. Includes personal health metrics sufficient to answer business questions, but only from 30 users, limiting representativeness.
  • Current: low. Data was collected in 2016, making it outdated for current trend analysis.
  • Cited: medium. Well documented on Kaggle, not in peer-reviewed.

The dataset includes only 30 users and covers 31 days, without demographic details such as age, gender, or device model, which may bias results toward more health-conscious participants. Overall, the data is reliable but limited in representativeness and time span.


PROCESS

Load needed R packages

library(tidyverse) # includes dplyr, tidyr, ggplot2
library(lubridate) # for working with date formats
library(reshape2) # for melt() to create correlation heatmaps

Load data files

da = Daily activity data
ds = Sleep day data

da <- read.csv("C:/Users/User/OneDrive/Desktop/Data analyst/Portfolio/A_Wellness technology/data_wellness/Fitabase Data/dailyActivity_merged - copia.csv")
ds <- read.csv("C:/Users/User/OneDrive/Desktop/Data analyst/Portfolio/A_Wellness technology/data_wellness/Fitabase Data/sleepDay_merged - copia.csv")

Check and explore, clean data

Daily activity data

str(da)
## 'data.frame':    457 obs. of  15 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr  "3/25/2016" "3/26/2016" "3/27/2016" "3/28/2016" ...
##  $ TotalSteps              : int  11004 17609 12736 13231 12041 10970 12256 12262 11248 10016 ...
##  $ TotalDistance           : num  7.11 11.55 8.53 8.93 7.85 ...
##  $ TrackerDistance         : num  7.11 11.55 8.53 8.93 7.85 ...
##  $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num  2.57 6.92 4.66 3.19 2.16 ...
##  $ ModeratelyActiveDistance: num  0.46 0.73 0.16 0.79 1.09 ...
##  $ LightActiveDistance     : num  4.07 3.91 3.71 4.95 4.61 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : int  33 89 56 39 28 30 33 47 40 15 ...
##  $ FairlyActiveMinutes     : int  12 17 5 20 28 13 12 21 11 30 ...
##  $ LightlyActiveMinutes    : int  205 274 268 224 243 223 239 200 244 314 ...
##  $ SedentaryMinutes        : int  804 588 605 1080 763 1174 820 866 636 655 ...
##  $ Calories                : int  1819 2154 1944 1932 1886 1820 1889 1868 1843 1850 ...

Sleep data

str(ds)
## 'data.frame':    413 obs. of  5 variables:
##  $ Id                : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ SleepDay          : chr  "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
##  $ TotalSleepRecords : int  1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: int  327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : int  346 407 442 367 712 320 377 364 384 449 ...

Clean datasets: first, check formats and add useful derived metrics

# The column ActivityDate of Activity data is renamed as Date.
da <- da %>%     
  rename(Date = ActivityDate)

# The column Date of Sleep data contains also the time. I separate Date and time
ds <- ds %>%
  separate(SleepDay, into = c("Date", "Time"), sep= " ")

# Fix the Date format in both datasets
da$Date <- as.Date(da$Date, format="%m/%d/%Y")
ds$Date <- as.Date(ds$Date, format="%m/%d/%Y")

# Add day of week for analysis
Sys.setlocale("LC_TIME", "C")
## [1] "C"
da$Weekday <- weekdays(da$Date)
ds$Weekday <- weekdays(ds$Date)

# Fix Weekday format and order it
da$Weekday <- factor(da$Weekday,
                     levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                                "Friday", "Saturday", "Sunday"),
                     ordered = TRUE)

ds$Weekday <- factor(ds$Weekday,
                     levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                                "Friday", "Saturday", "Sunday"),
                     ordered = TRUE)

Clean datasets: check for errors

# Check for duplicates
sum(duplicated(da))
## [1] 0
sum(duplicated(ds))
## [1] 3
# Remove Duplicates
ds <- ds %>%
  distinct()

# Check for N/As
colSums(is.na(da))
##                       Id                     Date               TotalSteps 
##                        0                        0                        0 
##            TotalDistance          TrackerDistance LoggedActivitiesDistance 
##                        0                        0                        0 
##       VeryActiveDistance ModeratelyActiveDistance      LightActiveDistance 
##                        0                        0                        0 
##  SedentaryActiveDistance        VeryActiveMinutes      FairlyActiveMinutes 
##                        0                        0                        0 
##     LightlyActiveMinutes         SedentaryMinutes                 Calories 
##                        0                        0                        0 
##                  Weekday 
##                        0
colSums(is.na(ds))
##                 Id               Date               Time  TotalSleepRecords 
##                  0                  0                  0                  0 
## TotalMinutesAsleep     TotalTimeInBed            Weekday 
##                  0                  0                  0
# Remove trim spaces
da$Id <- trimws(da$Id)
ds$Id <- trimws(ds$Id)

I want to merge the two datasets.
There are multiple observations for each ID, so I check how many IDs are in each dataset.

n_distinct(da$Id)
## [1] 35
n_distinct(ds$Id)
## [1] 24

Only 24 participants contribute sleep data, while 35 contribute activity data. How many IDs are common between the datasets?

common_ids <- intersect(da$Id, ds$Id)
length(common_ids) 
## [1] 24

How many Dates are common between the datasets?

common_dates <- intersect(da$Date, ds$Date)
length(common_dates)
## [1] 1



ANALYZE

I want to understand whether there are temporal (daily or weekly) trends, recurring over time, and user-level trends.

SHARE

Temporal-trend: Activity per Weekday

steps_by_weekday
## # A tibble: 7 × 2
##   Weekday   mean_steps
##   <ord>          <dbl>
## 1 Wednesday      7511.
## 2 Monday         7119.
## 3 Saturday       7090.
## 4 Thursday       6847.
## 5 Friday         6738.
## 6 Sunday         6058.
## 7 Tuesday        4915.
ggplot(steps_by_weekday, aes(x = Weekday, y = mean_steps)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Average steps per weekday",
       x = "Weekday",
       y = "Average steps") +
  theme_minimal()

Activity peaks on Wednesdays, followed by Mondays and Saturdays.


ACT

Thanks for reading! I’ve worked with R during my PhD research, but this is my first project shared publicly as part of my portfolio. I’d really appreciate any comments or advice to help me improve.