This case study analyzes Bellabeat, a wellness tech company that
designs smart products for women. The Bellabeat app tracks activity,
sleep, stress, and mindfulness to help users make healthier decisions.
Bellabeat’s marketing team manages campaigns across digital channels to
continuously improve their strategy. As a junior data analyst on the
marketing analytics team, I was asked to focus on the Bellabeat app and
analyze smart-device data to understand how consumers use these
product.
The goal of the project is to understand trends in the use of smart devices produced by Bellabeat in order to provide recommendations to the marketing team to help decide where to focus efforts to grow the company.
The data is stored on Kaggle: FitBit Fitness Tracker Data. The dataset is public and openly accessible, provided by Fitbit Fitness Tracker, a well-known brand.
C:/Users/User/OneDrive/Desktop/Data analyst/Portfolio/A. Wellness technology/data_wellness/Fitabase Data
The dataset contains 18 .csv files, but I selected only two:
- dailyActivity_merged
- sleepDay_merged
Both files are in long format, where each ID has tracked data in
multiple rows sorted by date, including:
- Physical activity: steps, distance
- Energy: calories burned
- Sedentary behavior in minutes
- Sleep data
I evaluated the dataset using the ROCCC criteria.
The dataset includes only 30 users and covers 31 days, without
demographic details such as age, gender, or device model, which may bias
results toward more health-conscious participants. Overall, the data is
reliable but limited in representativeness and time span.
library(tidyverse) # includes dplyr, tidyr, ggplot2
library(lubridate) # for working with date formats
library(reshape2) # for melt() to create correlation heatmaps
da = Daily activity data
ds = Sleep day data
da <- read.csv("C:/Users/User/OneDrive/Desktop/Data analyst/Portfolio/A_Wellness technology/data_wellness/Fitabase Data/dailyActivity_merged - copia.csv")
ds <- read.csv("C:/Users/User/OneDrive/Desktop/Data analyst/Portfolio/A_Wellness technology/data_wellness/Fitabase Data/sleepDay_merged - copia.csv")
Daily activity data
str(da)
## 'data.frame': 457 obs. of 15 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : chr "3/25/2016" "3/26/2016" "3/27/2016" "3/28/2016" ...
## $ TotalSteps : int 11004 17609 12736 13231 12041 10970 12256 12262 11248 10016 ...
## $ TotalDistance : num 7.11 11.55 8.53 8.93 7.85 ...
## $ TrackerDistance : num 7.11 11.55 8.53 8.93 7.85 ...
## $ LoggedActivitiesDistance: num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num 2.57 6.92 4.66 3.19 2.16 ...
## $ ModeratelyActiveDistance: num 0.46 0.73 0.16 0.79 1.09 ...
## $ LightActiveDistance : num 4.07 3.91 3.71 4.95 4.61 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : int 33 89 56 39 28 30 33 47 40 15 ...
## $ FairlyActiveMinutes : int 12 17 5 20 28 13 12 21 11 30 ...
## $ LightlyActiveMinutes : int 205 274 268 224 243 223 239 200 244 314 ...
## $ SedentaryMinutes : int 804 588 605 1080 763 1174 820 866 636 655 ...
## $ Calories : int 1819 2154 1944 1932 1886 1820 1889 1868 1843 1850 ...
Sleep data
str(ds)
## 'data.frame': 413 obs. of 5 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ SleepDay : chr "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
## $ TotalSleepRecords : int 1 2 1 2 1 1 1 1 1 1 ...
## $ TotalMinutesAsleep: int 327 384 412 340 700 304 360 325 361 430 ...
## $ TotalTimeInBed : int 346 407 442 367 712 320 377 364 384 449 ...
Clean datasets: first, check formats and add useful derived metrics
# The column ActivityDate of Activity data is renamed as Date.
da <- da %>%
rename(Date = ActivityDate)
# The column Date of Sleep data contains also the time. I separate Date and time
ds <- ds %>%
separate(SleepDay, into = c("Date", "Time"), sep= " ")
# Fix the Date format in both datasets
da$Date <- as.Date(da$Date, format="%m/%d/%Y")
ds$Date <- as.Date(ds$Date, format="%m/%d/%Y")
# Add day of week for analysis
Sys.setlocale("LC_TIME", "C")
## [1] "C"
da$Weekday <- weekdays(da$Date)
ds$Weekday <- weekdays(ds$Date)
# Fix Weekday format and order it
da$Weekday <- factor(da$Weekday,
levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday"),
ordered = TRUE)
ds$Weekday <- factor(ds$Weekday,
levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday"),
ordered = TRUE)
Clean datasets: check for errors
# Check for duplicates
sum(duplicated(da))
## [1] 0
sum(duplicated(ds))
## [1] 3
# Remove Duplicates
ds <- ds %>%
distinct()
# Check for N/As
colSums(is.na(da))
## Id Date TotalSteps
## 0 0 0
## TotalDistance TrackerDistance LoggedActivitiesDistance
## 0 0 0
## VeryActiveDistance ModeratelyActiveDistance LightActiveDistance
## 0 0 0
## SedentaryActiveDistance VeryActiveMinutes FairlyActiveMinutes
## 0 0 0
## LightlyActiveMinutes SedentaryMinutes Calories
## 0 0 0
## Weekday
## 0
colSums(is.na(ds))
## Id Date Time TotalSleepRecords
## 0 0 0 0
## TotalMinutesAsleep TotalTimeInBed Weekday
## 0 0 0
# Remove trim spaces
da$Id <- trimws(da$Id)
ds$Id <- trimws(ds$Id)
I want to merge the two datasets.
There are multiple observations for each ID, so I check how many IDs are
in each dataset.
n_distinct(da$Id)
## [1] 35
n_distinct(ds$Id)
## [1] 24
Only 24 participants contribute sleep data, while 35 contribute activity data. How many IDs are common between the datasets?
common_ids <- intersect(da$Id, ds$Id)
length(common_ids)
## [1] 24
How many Dates are common between the datasets?
common_dates <- intersect(da$Date, ds$Date)
length(common_dates)
## [1] 1
I want to understand whether there are temporal (daily or weekly) trends, recurring over time, and user-level trends.
Since I have 24 IDs in common but only one date shared between the
two datasets, it means the dates don’t match for the same IDs.
Therefore, I aggregate the data only by ID. This is consistent with the
analysis goals, as we want to understand users’ habits, indipendently of
the date.
First, I calculate the averages per ID, then I merge the two datasets,
so I’ll have only one row per ID.
mean_da <- da %>%
group_by(Id) %>%
summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)))
mean_da
## # A tibble: 35 × 14
## Id TotalSteps TotalDistance TrackerDistance LoggedActivitiesDistance
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1503960366 11641. 7.61 7.61 0
## 2 1624580081 4226. 2.75 2.75 0
## 3 1644430081 9275. 6.75 6.75 0
## 4 1844505072 3641. 2.41 2.41 0
## 5 1927972279 2181. 1.51 1.51 0
## 6 2022484408 12175. 8.77 8.77 0
## 7 2026352035 3393. 2.10 2.10 0
## 8 2320127002 3138. 2.12 2.12 0
## 9 2347167796 9800. 6.51 6.51 0
## 10 2873212765 6637. 4.47 4.47 0
## # ℹ 25 more rows
## # ℹ 9 more variables: VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## # LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## # VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## # LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>
mean_ds <- ds %>%
group_by(Id) %>%
summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)))
mean_ds
## # A tibble: 24 × 4
## Id TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## <chr> <dbl> <dbl> <dbl>
## 1 1503960366 1.08 360. 383.
## 2 1644430081 1 294 346
## 3 1844505072 1 652 961
## 4 1927972279 1.6 417 438.
## 5 2026352035 1 506. 538.
## 6 2320127002 1 61 69
## 7 2347167796 1 447. 491.
## 8 3977333714 1.14 294. 461.
## 9 4020332650 1 349. 380.
## 10 4319703577 1.04 477. 502.
## # ℹ 14 more rows
mean_da_ds <- inner_join(mean_da, mean_ds, by = "Id")
mean_da_ds
## # A tibble: 24 × 17
## Id TotalSteps TotalDistance TrackerDistance LoggedActivitiesDistance
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1503960366 11641. 7.61 7.61 0
## 2 1644430081 9275. 6.75 6.75 0
## 3 1844505072 3641. 2.41 2.41 0
## 4 1927972279 2181. 1.51 1.51 0
## 5 2026352035 3393. 2.10 2.10 0
## 6 2320127002 3138. 2.12 2.12 0
## 7 2347167796 9800. 6.51 6.51 0
## 8 3977333714 8664. 5.81 5.81 0
## 9 4020332650 5777. 4.14 4.14 0
## 10 4319703577 7821. 5.26 5.26 0
## # ℹ 14 more rows
## # ℹ 12 more variables: VeryActiveDistance <dbl>,
## # ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## # SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## # FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## # SedentaryMinutes <dbl>, Calories <dbl>, TotalSleepRecords <dbl>,
## # TotalMinutesAsleep <dbl>, TotalTimeInBed <dbl>
colnames(mean_da_ds)
## [1] "Id" "TotalSteps"
## [3] "TotalDistance" "TrackerDistance"
## [5] "LoggedActivitiesDistance" "VeryActiveDistance"
## [7] "ModeratelyActiveDistance" "LightActiveDistance"
## [9] "SedentaryActiveDistance" "VeryActiveMinutes"
## [11] "FairlyActiveMinutes" "LightlyActiveMinutes"
## [13] "SedentaryMinutes" "Calories"
## [15] "TotalSleepRecords" "TotalMinutesAsleep"
## [17] "TotalTimeInBed"
dim(mean_da_ds)
## [1] 24 17
Select only relevant columns for correlation
merged_data_selected <- mean_da_ds %>%
select(Id, TotalSteps, Calories, TotalMinutesAsleep, TotalTimeInBed)
corr_matrix <- cor(merged_data_selected %>%
select(-Id,))
da %>%
select(TotalSteps, Calories) %>%
summary()
## TotalSteps Calories
## Min. : 0 Min. : 0
## 1st Qu.: 1988 1st Qu.:1776
## Median : 5986 Median :2062
## Mean : 6547 Mean :2189
## 3rd Qu.:10198 3rd Qu.:2667
## Max. :28497 Max. :4562
The participants averaged ~6,500 steps/day, which is less than the recommended 10,000 steps by Centers for Disease Control and Prevention (CDC). Since taking fewer than 5000 steps per day is considered sedentary, the participants in the dataset can be considered slightly more active than most of the population. The mean calories burned per day is 2,189.
da %>%
select(SedentaryMinutes, LightlyActiveMinutes, FairlyActiveMinutes, VeryActiveMinutes) %>%
summary()
## SedentaryMinutes LightlyActiveMinutes FairlyActiveMinutes VeryActiveMinutes
## Min. : 32.0 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 728.0 1st Qu.: 64.0 1st Qu.: 0.00 1st Qu.: 0.00
## Median :1057.0 Median :181.0 Median : 1.00 Median : 0.00
## Mean : 995.3 Mean :170.1 Mean : 13.07 Mean : 16.62
## 3rd Qu.:1285.0 3rd Qu.:257.0 3rd Qu.: 16.00 3rd Qu.: 25.00
## Max. :1440.0 Max. :720.0 Max. :660.00 Max. :202.00
Mean sedentary minutes is 995.3 minutes, aprox 16 hours per day.
names(ds)
## [1] "Id" "Date" "Time"
## [4] "TotalSleepRecords" "TotalMinutesAsleep" "TotalTimeInBed"
## [7] "Weekday"
Having the minutes spent in bed and the minutes asleep, I can calculate the percentage of time participants are asleep out of the time spent in bed.
sleep <- ds %>%
mutate(PercentSleep = (TotalMinutesAsleep/TotalTimeInBed)*100)
head(sleep)
## Id Date Time TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 2016-04-12 12:00:00 1 327
## 2 1503960366 2016-04-13 12:00:00 2 384
## 3 1503960366 2016-04-15 12:00:00 1 412
## 4 1503960366 2016-04-16 12:00:00 2 340
## 5 1503960366 2016-04-17 12:00:00 1 700
## 6 1503960366 2016-04-19 12:00:00 1 304
## TotalTimeInBed Weekday PercentSleep
## 1 346 Tuesday 94.50867
## 2 407 Wednesday 94.34889
## 3 442 Friday 93.21267
## 4 367 Saturday 92.64305
## 5 712 Sunday 98.31461
## 6 320 Tuesday 95.00000
sleep %>%
select(TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed, PercentSleep) %>%
summary()
## TotalSleepRecords TotalMinutesAsleep TotalTimeInBed PercentSleep
## Min. :1.00 Min. : 58.0 Min. : 61.0 Min. : 49.84
## 1st Qu.:1.00 1st Qu.:361.0 1st Qu.:403.8 1st Qu.: 91.18
## Median :1.00 Median :432.5 Median :463.0 Median : 94.26
## Mean :1.12 Mean :419.2 Mean :458.5 Mean : 91.65
## 3rd Qu.:1.00 3rd Qu.:490.0 3rd Qu.:526.0 3rd Qu.: 96.06
## Max. :3.00 Max. :796.0 Max. :961.0 Max. :100.00
Users sleep about 419.2 minutes or 7 hours per night, which is considered healthy, as adults are recommended to sleep 7 or more hours per night. They spend 91% of their time in bed asleep.
Day of the week when users are most active
steps_by_weekday <- da %>%
group_by(Weekday) %>%
summarise(mean_steps = mean(TotalSteps, na.rm = TRUE)) %>%
arrange(desc(mean_steps))
Activity: users average 6,500 steps/day, below the recommended 10,000 goal. Activity peaks on Wednesdays, with similar levels on Mondays and Saturdays, suggesting that motivation may rise mid-week and drop toward the end. → Notifications could reinforce movement consistency throughout the week, especially on low-activity days (Thursday–Friday).
Sedentary behavior: users spend ~16 h/day inactive → I recommend to conduct further analysis by hour of the day to identify when inactivity peaks and send targeted reminders (“time to move!”) during those hours.
Sleep: users sleep ~7 h/night, with a 91% efficiency → add “go to
sleep” reminders and suggest optimal bedtime.
Behavior patterns: higher activity on weekends → launch weekend challenges to keep engagement stable.
Positioning: present Bellabeat as a wellness coach, not just a fitness app → use personalized insights and data-driven nudges to motivate healthy habits.
Thanks for reading! I’ve worked with R during my PhD research, but this is my first project shared publicly as part of my portfolio. I’d really appreciate any comments or advice to help me improve.