Introduction and background

Bellabeat is a wellness-focused tech company that manufactures smart devices to help women track their daily habits. Their product suite includes wellness trackers, a mobile app, and a subscription-based coaching service. The company has been growing steadily and wants to leverage data to make smarter marketing decisions. By exploring data collected from Fitbit users, Bellabeat aims to understand how consumers use health tracking devices. The results will help shape Bellabeat’s marketing strategy and inform improvements to the Bellabeat app.

This case study is part of the final capstone project for the Google Data Analytics Professional Certificate. The project follows the six-step data analysis process: Ask, Prepare, Process, Analyze, Share, and Act.

1. Business Task (ASK)

Bellabeat, a high-tech company, manufactures health-focused smart products for women. The marketing analytics team at Bellabeat wants to understand trends in smart device usage, specifically in terms of activity, sleep, weight, heart rate, and calorie data. By analyzing Fitbit user data, Bellabeat seeks insights to improve the user experience of its app and inform marketing strategies.

Objective: Identify patterns in user activity, sleep, and other health behaviors that Bellabeat can use to enhance its app’s features and boost user engagement.

2. Prepare the Data

Upload the CSV files to R

We use public Fitbit datasets sourced from Kaggle. These datasets include daily and second-level data from 30+ users; the daily tables cover 31 days, from 2016-04-12 to 2016-05-12. The goal is to load and inspect multiple datasets to determine their usability.

Five CSV files were uploaded from the data source: https://www.kaggle.com/arashnic/fitbit

Setting up my environment

Install and load packages

Load the tidyverse package with the library() function

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load required packages

# install.packages(c("janitor", "skimr"))  # run once, before loading, if the packages are missing
# lubridate, ggplot2, and readr are already attached by library(tidyverse) above
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(skimr)

Load CSV files and create data frames

activity <- read.csv("C:/Users/elysh/OneDrive/Data Analytics/Bellabeat_Capstone_Project/datasets_bellabeat_project/dailyActivity_merged.csv")
sleep <- read_csv("C:/Users/elysh/OneDrive/Data Analytics/Bellabeat_Capstone_Project/datasets_bellabeat_project/sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
weight <- read_csv("C:/Users/elysh/OneDrive/Data Analytics/Bellabeat_Capstone_Project/datasets_bellabeat_project/weightLogInfo_merged.csv")
## Rows: 67 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
## lgl (1): IsManualReport
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
heartrate <- read_csv("C:/Users/elysh/OneDrive/Data Analytics/Bellabeat_Capstone_Project/datasets_bellabeat_project/heartrate_seconds_merged.csv")
## Rows: 2483658 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Time
## dbl (2): Id, Value
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
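# Despite its name, the sedentary data frame holds the daily calories file (dailyCalories_merged.csv)
sedentary <- read_csv("C:/Users/elysh/OneDrive/Data Analytics/Bellabeat_Capstone_Project/datasets_bellabeat_project/dailyCalories_merged.csv")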
## Rows: 940 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, Calories
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
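
The five loads above repeat the same absolute folder path. A minimal sketch of reading the files through one small helper, so the folder is defined in a single place, could look like this. The folder name `data_dir` and the helper `load_fitbit()` are hypothetical names introduced for illustration, and base `read.csv()` is used only to keep the sketch dependency-free.

```r
# Hypothetical folder holding the Kaggle CSV files; adjust to your setup
data_dir <- "datasets_bellabeat_project"

# Illustrative helper: builds the full path, checks it, and reads one file
load_fitbit <- function(file_name, dir = data_dir) {
  path <- file.path(dir, file_name)
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  read.csv(path, stringsAsFactors = FALSE)
}

# Example call, guarded so the sketch runs even when the data is absent
if (file.exists(file.path(data_dir, "dailyActivity_merged.csv"))) {
  activity <- load_fitbit("dailyActivity_merged.csv")
  str(activity)
}
```

Keeping the folder in one variable makes the report easier to rerun on another machine, since only `data_dir` needs to change.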

Clean data

activity <- clean_names(activity)
sleep <- clean_names(sleep)
weight <- clean_names(weight)
heartrate <- clean_names(heartrate)
sedentary <- clean_names(sedentary)
activity$date <- as.Date(activity$activity_date, format="%m/%d/%Y")
sleep$date <- as.Date(sleep$sleep_day, format="%m/%d/%Y")
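# No format is given here, so the date strings are not read as intended (see the skim(weight) output)
weight$date <- as.Date(weight$date)
# ymd_hms() expects year-month-day order; the timestamps do not match it, which triggers the warning
heartrate$time <- ymd_hms(heartrate$time)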
## Warning: All formats failed to parse. No formats found.
sedentary$date <- as.Date(sedentary$activity_day, format="%m/%d/%Y")
activity <- distinct(activity)
sleep <- distinct(sleep)
weight <- distinct(weight)
heartrate <- distinct(heartrate)
sedentary <- distinct(sedentary)
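
The parse warning above suggests that the timestamp strings are not in year-month-day order. In the Kaggle files they appear to be written as month/day/year with a 12-hour clock (for example `4/12/2016 7:21:00 AM`). Treat that format as an assumption and check it against the raw column first. A minimal sketch of parsing such strings with lubridate's `mdy_hms()`, using made-up sample values:

```r
library(lubridate)

# Hypothetical sample strings in month/day/year, 12-hour format
raw_time <- c("4/12/2016 7:21:00 AM", "5/2/2016 11:59:59 PM")

# mdy_hms() reads month-day-year order and handles the AM/PM marker
parsed <- mdy_hms(raw_time)

# Keep only the calendar date when a daily key is needed
parsed_date <- as.Date(parsed)

print(parsed)
print(parsed_date)
```

Parsing the column before calling `distinct()` matters: if every timestamp is `NA`, rows that differ only in time become indistinguishable and are dropped as duplicates.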

3. Process the Data

After loading the datasets, we assess data quality by checking for duplicates, missing values, and incorrect data types. We also perform data type conversion for dates and timestamps.

sum(is.na(activity))
## [1] 0
sum(is.na(sleep))
## [1] 0
sum(is.na(weight))
## [1] 103
sum(is.na(heartrate))
## [1] 1769
sum(is.na(sedentary))
## [1] 0
skim(activity)
Data summary
Name activity
Number of rows 940
Number of columns 16
_______________________
Column type frequency:
character 1
Date 1
numeric 14
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
activity_date 0 1 8 9 0 31 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2016-04-12 2016-05-12 2016-04-26 31

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1 4.855407e+09 2.424805e+09 1503960366 2.320127e+09 4.445115e+09 6.962181e+09 8.877689e+09 ▇▅▃▅▅
total_steps 0 1 7.637910e+03 5.087150e+03 0 3.789750e+03 7.405500e+03 1.072700e+04 3.601900e+04 ▇▇▁▁▁
total_distance 0 1 5.490000e+00 3.920000e+00 0 2.620000e+00 5.240000e+00 7.710000e+00 2.803000e+01 ▇▆▁▁▁
tracker_distance 0 1 5.480000e+00 3.910000e+00 0 2.620000e+00 5.240000e+00 7.710000e+00 2.803000e+01 ▇▆▁▁▁
logged_activities_distance 0 1 1.100000e-01 6.200000e-01 0 0.000000e+00 0.000000e+00 0.000000e+00 4.940000e+00 ▇▁▁▁▁
very_active_distance 0 1 1.500000e+00 2.660000e+00 0 0.000000e+00 2.100000e-01 2.050000e+00 2.192000e+01 ▇▁▁▁▁
moderately_active_distance 0 1 5.700000e-01 8.800000e-01 0 0.000000e+00 2.400000e-01 8.000000e-01 6.480000e+00 ▇▁▁▁▁
light_active_distance 0 1 3.340000e+00 2.040000e+00 0 1.950000e+00 3.360000e+00 4.780000e+00 1.071000e+01 ▆▇▆▁▁
sedentary_active_distance 0 1 0.000000e+00 1.000000e-02 0 0.000000e+00 0.000000e+00 0.000000e+00 1.100000e-01 ▇▁▁▁▁
very_active_minutes 0 1 2.116000e+01 3.284000e+01 0 0.000000e+00 4.000000e+00 3.200000e+01 2.100000e+02 ▇▁▁▁▁
fairly_active_minutes 0 1 1.356000e+01 1.999000e+01 0 0.000000e+00 6.000000e+00 1.900000e+01 1.430000e+02 ▇▁▁▁▁
lightly_active_minutes 0 1 1.928100e+02 1.091700e+02 0 1.270000e+02 1.990000e+02 2.640000e+02 5.180000e+02 ▅▇▇▃▁
sedentary_minutes 0 1 9.912100e+02 3.012700e+02 0 7.297500e+02 1.057500e+03 1.229500e+03 1.440000e+03 ▁▁▇▅▇
calories 0 1 2.303610e+03 7.181700e+02 0 1.828500e+03 2.134000e+03 2.793250e+03 4.900000e+03 ▁▆▇▃▁
skim(sleep)
Data summary
Name sleep
Number of rows 410
Number of columns 6
_______________________
Column type frequency:
character 1
Date 1
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
sleep_day 0 1 20 21 0 31 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2016-04-12 2016-05-12 2016-04-27 31

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1 4.994963e+09 2.060863e+09 1503960366 3.977334e+09 4702921684.0 6962181067 8792009665 ▆▆▇▅▃
total_sleep_records 0 1 1.120000e+00 3.500000e-01 1 1.000000e+00 1.0 1 3 ▇▁▁▁▁
total_minutes_asleep 0 1 4.191700e+02 1.186400e+02 58 3.610000e+02 432.5 490 796 ▁▂▇▃▁
total_time_in_bed 0 1 4.584800e+02 1.274600e+02 61 4.037500e+02 463.0 526 961 ▁▃▇▁▁
skim(weight)
Data summary
Name weight
Number of rows 67
Number of columns 8
_______________________
Column type frequency:
Date 1
logical 1
numeric 6
________________________
Group variables None

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 38 0.43 0004-12-20 0005-12-20 0005-05-20 13

Variable type: logical

skim_variable n_missing complete_rate mean count
is_manual_report 0 1 0.61 TRU: 41, FAL: 26

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 7.009282e+09 1.950322e+09 1.503960e+09 6.962181e+09 6.962181e+09 8.877689e+09 8.877689e+09 ▁▁▂▇▆
weight_kg 0 1.00 7.204000e+01 1.392000e+01 5.260000e+01 6.140000e+01 6.250000e+01 8.505000e+01 1.335000e+02 ▇▃▃▁▁
weight_pounds 0 1.00 1.588100e+02 3.070000e+01 1.159600e+02 1.353600e+02 1.377900e+02 1.875000e+02 2.943200e+02 ▇▃▃▁▁
fat 65 0.03 2.350000e+01 2.120000e+00 2.200000e+01 2.275000e+01 2.350000e+01 2.425000e+01 2.500000e+01 ▇▁▁▁▇
bmi 0 1.00 2.519000e+01 3.070000e+00 2.145000e+01 2.396000e+01 2.439000e+01 2.556000e+01 4.754000e+01 ▇▁▁▁▁
log_id 0 1.00 1.461772e+12 7.829948e+08 1.460444e+12 1.461079e+12 1.461802e+12 1.462375e+12 1.463098e+12 ▇▇▆▇▇
skim(heartrate)
## Warning: There was 1 warning in `dplyr::summarize()`.
## ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
##   mangled_skimmers$funs)`.
## ℹ In group 0: .
## Caused by warning:
## ! There were 2 warnings in `dplyr::summarize()`.
## The first warning was:
## ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
##   mangled_skimmers$funs)`.
## Caused by warning in `min.default()`:
## ! no non-missing arguments to min; returning Inf
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
Data summary
Name heartrate
Number of rows 1769
Number of columns 3
_______________________
Column type frequency:
numeric 2
POSIXct 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1 5.307499e+09 2.124234e+09 2022484408 4020332650 5553957443 6962181067 8877689391 ▆▇▇▆▅
value 0 1 1.114600e+02 3.959000e+01 36 78 110 143 203 ▆▇▇▆▂

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
time 1769 0 Inf -Inf NA 0
skim(sedentary)
Data summary
Name sedentary
Number of rows 940
Number of columns 4
_______________________
Column type frequency:
character 1
Date 1
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
activity_day 0 1 8 9 0 31 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2016-04-12 2016-05-12 2016-04-26 31

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1 4.855407e+09 2.424805e+09 1503960366 2320127002.0 4445114986 6.962181e+09 8877689391 ▇▅▃▅▅
calories 0 1 2.303610e+03 7.181700e+02 0 1828.5 2134 2.793250e+03 4900 ▁▆▇▃▁

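Note: The activity, sleep, and calories tables have no missing values. In the weight table, fat is missing for 65 of 67 rows and date is missing for 38 rows; the remaining weight dates fall in the years 0004 and 0005, which shows that the date strings were not parsed correctly. In the heartrate table, every time value is missing because parsing failed, and distinct() then reduced the 2,483,658 loaded rows to 1,769 unique id and value pairs. We'll proceed with complete cases or summary-level analysis where appropriate, and results that rely on weight dates or heart rate timestamps should be treated with caution.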
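
A single `sum(is.na(df))` total does not show which columns are affected. As a small sketch on made-up data (the `toy` data frame below is illustrative, not taken from the Fitbit files), missing values can be counted per column and fully duplicated rows can be counted before they are removed:

```r
# Toy data frame standing in for a tracker table (values are made up)
toy <- data.frame(
  id  = c(1, 1, 2, 2, 2),
  fat = c(22, NA, NA, 25, 25),
  bmi = c(21.5, 22.0, 24.4, 25.6, 25.6)
)

# Missing values per column instead of one total for the whole table
na_by_column <- colSums(is.na(toy))

# Number of fully duplicated rows that distinct() or unique() would drop
n_duplicates <- sum(duplicated(toy))

print(na_by_column)
print(n_duplicates)
```

Recording the duplicate count before calling `distinct()` makes it easy to confirm that the number of rows removed is plausible.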

4. Analyze the Data

We merge datasets where needed and visualize relationships among variables. The goal is to detect trends and correlations that reveal how users interact with their fitness trackers.

# Match each activity day to the same user's sleep record for that day
merged_data <- merge(activity, sleep, by = c("id", "date"))
# Steps vs Calories
ggplot(activity, aes(x = total_steps, y = calories)) +
  geom_point(color = 'darkgreen') +
  geom_smooth(method = lm) +
  labs(title = "Steps vs Calories", x = "Total Steps", y = "Calories")
## `geom_smooth()` using formula = 'y ~ x'

# Sleep vs Steps
ggplot(merged_data, aes(x = total_minutes_asleep, y = total_steps)) +
  geom_point(color = 'blue') +
  geom_smooth(method = lm) +
  labs(title = "Sleep Duration vs Steps", x = "Minutes Asleep", y = "Total Steps")
## `geom_smooth()` using formula = 'y ~ x'
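
When two daily tables are merged, the choice of key columns determines how many rows come out. The sketch below uses made-up data (`activity_toy` and `sleep_toy` are illustrative names) to compare merging on `id` alone with merging on `id` and `date`:

```r
# Toy daily tables: one user, two days each (values are made up)
activity_toy <- data.frame(
  id = c(1, 1),
  date = as.Date(c("2016-04-12", "2016-04-13")),
  total_steps = c(8000, 12000)
)
sleep_toy <- data.frame(
  id = c(1, 1),
  date = as.Date(c("2016-04-12", "2016-04-13")),
  total_minutes_asleep = c(420, 390)
)

# Joining on id alone pairs every activity day with every sleep day
by_id <- merge(activity_toy, sleep_toy, by = "id")

# Joining on id and date keeps one row per user-day
by_id_date <- merge(activity_toy, sleep_toy, by = c("id", "date"))

print(nrow(by_id))       # 4
print(nrow(by_id_date))  # 2
```

Comparing `nrow()` of the merged table with the row counts of its inputs is a quick check that a join has not multiplied records.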

# Distribution of daily calories (the sedentary data frame holds dailyCalories_merged.csv)
ggplot(sedentary, aes(x = calories)) +
  geom_density(fill = 'purple', alpha = 0.5) +
  labs(title = "Distribution of Daily Calories", x = "Calories", y = "Density")

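# Order the days Monday to Sunday (2016-04-11 was a Monday) instead of alphabetically
activity$day_of_week <- factor(weekdays(activity$date), levels = weekdays(as.Date("2016-04-11") + 0:6))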
ggplot(activity, aes(x = day_of_week, y = total_steps)) +
  stat_summary(fun = mean, geom = "bar", fill = "steelblue") +
  labs(title = "Average Steps by Day of Week", x = "Day", y = "Average Steps") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
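
To compare weekdays with weekends as numbers rather than by reading the bar chart, average steps can be summarised by day type. This is a sketch on made-up values (the `toy` data frame and the `day_type` column are illustrative):

```r
# Toy daily step counts over one week (values are made up)
toy <- data.frame(
  date = as.Date("2016-04-11") + 0:6,   # Monday through Sunday
  total_steps = c(9000, 8500, 8000, 9500, 7000, 6000, 5000)
)

# as.POSIXlt()$wday is 0 for Sunday through 6 for Saturday, independent of locale
wday <- as.POSIXlt(toy$date)$wday
toy$day_type <- ifelse(wday %in% c(0, 6), "weekend", "weekday")

# Mean steps per day type
avg_steps <- aggregate(total_steps ~ day_type, data = toy, FUN = mean)
print(avg_steps)
```

Using the numeric `wday` field avoids depending on the language of the day names returned by `weekdays()`.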

ggplot(weight, aes(x = weight_kg, y = bmi)) +
  geom_point(color = "darkorange") +
  geom_smooth(method = "lm") +
  labs(title = "Relationship Between Weight and BMI", x = "Weight (kg)", y = "BMI")
## `geom_smooth()` using formula = 'y ~ x'

5. Share Results

We present key insights with supporting visualizations that demonstrate how different variables relate to each other.

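Key Findings:
- Activity vs Calories: The scatter plot and fitted line show a positive relationship between daily step count and calories burned. It appears strong, although no correlation coefficient is computed here.
- Sleep vs Activity: Users who sleep longer appear to be more active in the sleep-versus-steps plot.
- Heart Rate Patterns: Resting heart rates are expected to fall within roughly 60–90 bpm, but heart rate is not plotted in this analysis, and the summary of the de-duplicated readings (median 110 bpm) does not confirm that range.
- Sedentary Behavior: Daily calorie burn is substantial (median 2,134 calories) even though users average about 991 sedentary minutes per day, which is consistent with a large share coming from basal metabolic rate (BMR).
- Day of Week Patterns: Users tend to walk more on weekdays than on weekends, based on the average-steps bar chart.
- Weight and BMI: Weight and BMI rise together roughly linearly, as expected because BMI is calculated from weight and height.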
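
A statement about how strong a relationship is can be backed with a correlation coefficient. The sketch below uses made-up numbers in place of the real `total_steps` and `calories` columns, so its result says nothing about the Fitbit data:

```r
# Toy values standing in for daily steps and calories (made up for illustration)
steps    <- c(2000, 4000, 6000, 8000, 10000)
calories <- c(1700, 1900, 2200, 2300, 2600)

# Pearson correlation coefficient, between -1 and 1
r <- cor(steps, calories)

# cor.test() adds a confidence interval and p-value for the same coefficient
result <- cor.test(steps, calories)

print(round(r, 2))
print(result$conf.int)
```

On the real data the same calls would take the columns directly, for example `cor(activity$total_steps, activity$calories)`.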

These visualizations support recommendations for improving user engagement.

6. Act: Recommendations

Based on the analysis, Bellabeat can take the following actions:

Appendix

Data Source: Fitbit Fitness Tracker Data