Introduction and background

Bellabeat is a wellness-focused tech company that manufactures smart devices to help women track their daily habits. Their product suite includes wellness trackers, a mobile app, and a subscription-based coaching service. The company has been growing steadily and wants to leverage data to make smarter marketing decisions. By exploring data collected from Fitbit users, Bellabeat aims to understand how consumers use health tracking devices. The results will help shape Bellabeat’s marketing strategy and inform improvements to the Bellabeat app.

This case study is part of the final capstone project for the Google Data Analytics Professional Certificate. The project follows the six-step data analysis process: Ask, Prepare, Process, Analyze, Share, and Act.

1. Business Task (ASK)

Bellabeat, a high-tech company, manufactures health-focused smart products for women. The marketing analytics team at Bellabeat wants to understand trends in smart device usage, specifically in terms of activity, sleep, weight, heart rate, and calorie data. By analyzing Fitbit user data, Bellabeat seeks insights to improve the user experience of its app and inform marketing strategies.

Objective: Identify patterns in user activity, sleep, and other health behaviors that Bellabeat can use to enhance its app’s features and boost user engagement.

2. Prepare the Data

Upload your CSV files to R

We use the public Fitbit Fitness Tracker datasets from Kaggle. They contain daily and per-second records from 30+ users collected over roughly one month (the daily tables summarized below span 2016-04-12 to 2016-05-12, 31 distinct dates). The goal is to load and inspect several of these datasets to determine their usability.

Uploaded 5 CSV files from the data source: https://www.kaggle.com/arashnic/fitbit

Setting up my environment

Install and load packages

Load the tidyverse package with the library() function

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load required packages

# Install janitor and skimr only if they are missing, before attaching them
if (!requireNamespace("janitor", quietly = TRUE)) install.packages("janitor")
if (!requireNamespace("skimr", quietly = TRUE)) install.packages("skimr")

library(lubridate)
library(janitor)

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(ggplot2)
library(skimr)
library(readr)

Load csv files and create dataframes

activity <- read.csv("C:/Users/elysh/OneDrive/Data Analytics/Bellabeat_Capstone_Project/datasets_bellabeat_project/dailyActivity_merged.csv")

sleep <- read_csv("C:/Users/elysh/OneDrive/Data Analytics/Bellabeat_Capstone_Project/datasets_bellabeat_project/sleepDay_merged.csv")

## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

weight <- read_csv("C:/Users/elysh/OneDrive/Data Analytics/Bellabeat_Capstone_Project/datasets_bellabeat_project/weightLogInfo_merged.csv")

## Rows: 67 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
## lgl (1): IsManualReport
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

heartrate <- read_csv("C:/Users/elysh/OneDrive/Data Analytics/Bellabeat_Capstone_Project/datasets_bellabeat_project/heartrate_seconds_merged.csv")

## Rows: 2483658 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Time
## dbl (2): Id, Value
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

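# Note: this file holds daily calorie totals; the data frame keeps the name `sedentary` used in the rest of the report
sedentary <- read_csv("C:/Users/elysh/OneDrive/Data Analytics/Bellabeat_Capstone_Project/datasets_bellabeat_project/dailyCalories_merged.csv")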

## Rows: 940 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, Calories
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
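The load calls above repeat the same folder path for every file. As a minimal sketch of one alternative, the folder can be stored once and the file name built from the table name. The helper `load_table()`, the variable `data_dir`, and the two-row demo file are hypothetical and exist only for illustration; a temporary folder stands in for the real data folder so the example runs anywhere.

```r
# Stand-in for the real data folder (assumption: all CSVs live in one folder)
data_dir <- tempdir()

# Write a tiny made-up file so the sketch is self-contained
write.csv(
  data.frame(Id = c(1, 2), Calories = c(2000, 2100)),
  file.path(data_dir, "dailyCalories_merged.csv"),
  row.names = FALSE
)

# Hypothetical helper: build "<name>_merged.csv" inside `dir` and read it
load_table <- function(name, dir = data_dir) {
  read.csv(file.path(dir, paste0(name, "_merged.csv")))
}

daily_calories_demo <- load_table("dailyCalories")
str(daily_calories_demo)
```

With the real files, only `data_dir` would need to change when the project folder moves.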

Clean data

activity <- clean_names(activity)
sleep <- clean_names(sleep)
weight <- clean_names(weight)
heartrate <- clean_names(heartrate)
sedentary <- clean_names(sedentary)

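# Date strings in these files are month/day/year, so the format is given explicitly
activity$date <- as.Date(activity$activity_date, format = "%m/%d/%Y")
sleep$date <- as.Date(sleep$sleep_day, format = "%m/%d/%Y")
# Without a format, as.Date() reads a string such as "4/12/2016" as year 4, month 12
weight$date <- as.Date(weight$date, format = "%m/%d/%Y")
# Assumes heartrate timestamps use the same month/day/year 12-hour layout,
# which ymd_hms() cannot parse
heartrate$time <- mdy_hms(heartrate$time)

sedentary$date <- as.Date(sedentary$activity_day, format = "%m/%d/%Y")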

activity <- distinct(activity)
sleep <- distinct(sleep)
weight <- distinct(weight)
heartrate <- distinct(heartrate)
sedentary <- distinct(sedentary)
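Date parsing is the step most likely to fail silently, so it helps to test it on a few strings before converting whole columns. The sketch below uses made-up sample strings in a month/day/year, 12-hour layout (an assumption about the timestamp format, consistent with the 20 to 21 character `sleep_day` values reported later) and checks that nothing comes back missing.

```r
library(lubridate)

# Made-up sample strings in the assumed month/day/year, 12-hour layout
raw_times <- c("4/12/2016 7:21:00 AM", "5/2/2016 11:59:59 PM")

# mdy_hms() reads month/day/year plus a clock time with AM/PM
parsed_times <- mdy_hms(raw_times)

# as.Date() with an explicit format keeps only the calendar date
parsed_dates <- as.Date(raw_times, format = "%m/%d/%Y")

# Sanity checks: nothing missing, and every year is plausible
stopifnot(!anyNA(parsed_times), all(year(parsed_times) == 2016))

print(parsed_times)
print(parsed_dates)
```

Running the same checks on a real column before calling `distinct()` guards against de-duplicating on a column that failed to parse.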

3. Process the Data

After loading the datasets, we assess data quality by checking for duplicates, missing values, and incorrect data types. We also perform data type conversion for dates and timestamps.

sum(is.na(activity))

## [1] 0

sum(is.na(sleep))

## [1] 0

sum(is.na(weight))

## [1] 103

sum(is.na(heartrate))

## [1] 1769

sum(is.na(sedentary))

## [1] 0
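A single total from `sum(is.na(...))` does not say which columns hold the missing values. A minimal sketch of a per-column count follows, using a made-up three-row data frame (`weight_na_demo` is hypothetical, not the real weight table).

```r
# Made-up rows for illustration only
weight_na_demo <- data.frame(
  weight_kg = c(52.6, 61.4, 85.0),
  fat       = c(22, NA, NA),
  bmi       = c(22.6, 24.0, 25.7)
)

# Count of missing values in each column
na_by_column <- colSums(is.na(weight_na_demo))

# Share of missing values in each column (0 = complete, 1 = entirely missing)
na_share <- colMeans(is.na(weight_na_demo))

print(na_by_column)
print(round(na_share, 2))
```

Applied to the real tables, the same two calls would show whether missing values are concentrated in one column or spread across several.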

skim(activity)

Data summary
Name	activity
Number of rows	940
Number of columns	16
_______________________
Column type frequency:
character	1
Date	1
numeric	14
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
activity_date	0	1	8	9	0	31	0

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
date	0	1	2016-04-12	2016-05-12	2016-04-26	31

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
id	1	4.855407e+09	2.424805e+09	1503960366	2.320127e+09	4.445115e+09	6.962181e+09	8.877689e+09	▇▅▃▅▅
total_steps	1	7.637910e+03	5.087150e+03	0	3.789750e+03	7.405500e+03	1.072700e+04	3.601900e+04	▇▇▁▁▁
total_distance	1	5.490000e+00	3.920000e+00	0	2.620000e+00	5.240000e+00	7.710000e+00	2.803000e+01	▇▆▁▁▁
tracker_distance	1	5.480000e+00	3.910000e+00	0	2.620000e+00	5.240000e+00	7.710000e+00	2.803000e+01	▇▆▁▁▁
logged_activities_distance	1	1.100000e-01	6.200000e-01	0	0.000000e+00	0.000000e+00	0.000000e+00	4.940000e+00	▇▁▁▁▁
very_active_distance	1	1.500000e+00	2.660000e+00	0	0.000000e+00	2.100000e-01	2.050000e+00	2.192000e+01	▇▁▁▁▁
moderately_active_distance	1	5.700000e-01	8.800000e-01	0	0.000000e+00	2.400000e-01	8.000000e-01	6.480000e+00	▇▁▁▁▁
light_active_distance	1	3.340000e+00	2.040000e+00	0	1.950000e+00	3.360000e+00	4.780000e+00	1.071000e+01	▆▇▆▁▁
sedentary_active_distance	1	0.000000e+00	1.000000e-02	0	0.000000e+00	0.000000e+00	0.000000e+00	1.100000e-01	▇▁▁▁▁
very_active_minutes	1	2.116000e+01	3.284000e+01	0	0.000000e+00	4.000000e+00	3.200000e+01	2.100000e+02	▇▁▁▁▁
fairly_active_minutes	1	1.356000e+01	1.999000e+01	0	0.000000e+00	6.000000e+00	1.900000e+01	1.430000e+02	▇▁▁▁▁
lightly_active_minutes	1	1.928100e+02	1.091700e+02	0	1.270000e+02	1.990000e+02	2.640000e+02	5.180000e+02	▅▇▇▃▁
sedentary_minutes	1	9.912100e+02	3.012700e+02	0	7.297500e+02	1.057500e+03	1.229500e+03	1.440000e+03	▁▁▇▅▇
calories	1	2.303610e+03	7.181700e+02	0	1.828500e+03	2.134000e+03	2.793250e+03	4.900000e+03	▁▆▇▃▁

skim(sleep)

Data summary
Name	sleep
Number of rows	410
Number of columns	6
_______________________
Column type frequency:
character	1
Date	1
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
sleep_day	0	1	20	21	0	31	0

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
date	0	1	2016-04-12	2016-05-12	2016-04-27	31

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
id	1	4.994963e+09	2.060863e+09	1503960366	3.977334e+09	4702921684.0	6962181067	8792009665	▆▆▇▅▃
total_sleep_records	1	1.120000e+00	3.500000e-01	1	1.000000e+00	1.0	1	3	▇▁▁▁▁
total_minutes_asleep	1	4.191700e+02	1.186400e+02	58	3.610000e+02	432.5	490	796	▁▂▇▃▁
total_time_in_bed	1	4.584800e+02	1.274600e+02	61	4.037500e+02	463.0	526	961	▁▃▇▁▁

skim(weight)

Data summary
Name	weight
Number of rows	67
Number of columns	8
_______________________
Column type frequency:
Date	1
logical	1
numeric	6
________________________
Group variables	None

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
date	38	0.43	0004-12-20	0005-12-20	0005-05-20	13

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
is_manual_report	0	1	0.61	TRU: 41, FAL: 26

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
id	0	1.00	7.009282e+09	1.950322e+09	1.503960e+09	6.962181e+09	6.962181e+09	8.877689e+09	8.877689e+09	▁▁▂▇▆
weight_kg	0	1.00	7.204000e+01	1.392000e+01	5.260000e+01	6.140000e+01	6.250000e+01	8.505000e+01	1.335000e+02	▇▃▃▁▁
weight_pounds	0	1.00	1.588100e+02	3.070000e+01	1.159600e+02	1.353600e+02	1.377900e+02	1.875000e+02	2.943200e+02	▇▃▃▁▁
fat	65	0.03	2.350000e+01	2.120000e+00	2.200000e+01	2.275000e+01	2.350000e+01	2.425000e+01	2.500000e+01	▇▁▁▁▇
bmi	0	1.00	2.519000e+01	3.070000e+00	2.145000e+01	2.396000e+01	2.439000e+01	2.556000e+01	4.754000e+01	▇▁▁▁▁
log_id	0	1.00	1.461772e+12	7.829948e+08	1.460444e+12	1.461079e+12	1.461802e+12	1.462375e+12	1.463098e+12	▇▇▆▇▇

skim(heartrate)

## Warning: There was 1 warning in `dplyr::summarize()`.
## ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
##   mangled_skimmers$funs)`.
## ℹ In group 0: .
## Caused by warning:
## ! There were 2 warnings in `dplyr::summarize()`.
## The first warning was:
## ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
##   mangled_skimmers$funs)`.
## Caused by warning in `min.default()`:
## ! no non-missing arguments to min; returning Inf
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.

Data summary
Name	heartrate
Number of rows	1769
Number of columns	3
_______________________
Column type frequency:
numeric	2
POSIXct	1
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
id	0	1	5.307499e+09	2.124234e+09	2022484408	4020332650	5553957443	6962181067	8877689391	▆▇▇▆▅
value	0	1	1.114600e+02	3.959000e+01	36	78	110	143	203	▆▇▇▆▂

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
time	1769	0	Inf	-Inf	NA	0

skim(sedentary)

Data summary
Name	sedentary
Number of rows	940
Number of columns	4
_______________________
Column type frequency:
character	1
Date	1
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
activity_day	0	1	8	9	0	31	0

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
date	0	1	2016-04-12	2016-05-12	2016-04-26	31

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
id	0	1	4.855407e+09	2.424805e+09	1503960366	2320127002.0	4445114986	6.962181e+09	8877689391	▇▅▃▅▅
calories	0	1	2.303610e+03	7.181700e+02	0	1828.5	2134	2.793250e+03	4900	▁▆▇▃▁

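Note: The activity, sleep, and calorie tables have no missing values. The missing values reported above are in the weight table (fat is empty in 65 of 67 rows; date shows 38 missing values and implausible years of 0004 to 0005) and in the heartrate time column (missing in all 1,769 rows). Implausible years and an all-missing timestamp point to a date-format mismatch at parse time rather than to gaps in the source files, and the drop from 2,483,658 heart rate rows at load to 1,769 follows from distinct() comparing only id and value once every time is NA. These printed summaries should be regenerated once the timestamps parse as month/day/year. We’ll proceed with complete cases or summary-level analysis where appropriate.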
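Proceeding with complete cases can mean two different things: rows complete in every column, or rows complete in just the columns an analysis uses. The sketch below contrasts the two on made-up data (`weight_cc_demo` and the column choice in `needed` are hypothetical).

```r
# Made-up rows for illustration only
weight_cc_demo <- data.frame(
  id        = c(1, 1, 2, 3),
  weight_kg = c(52.6, 52.9, NA, 85.0),
  fat       = c(22, NA, NA, NA),
  bmi       = c(22.6, 22.7, 24.0, 25.7)
)

# Rows complete in the columns a weight-vs-BMI plot would need
needed <- c("weight_kg", "bmi")
weight_complete <- weight_cc_demo[complete.cases(weight_cc_demo[, needed]), ]

# Rows complete in every column; a sparse column like `fat` removes most rows
weight_strict <- weight_cc_demo[complete.cases(weight_cc_demo), ]

nrow(weight_complete)
nrow(weight_strict)
```

Restricting the check to the needed columns keeps far more rows when one rarely recorded column is mostly empty.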

4. Analyze the Data

We merge datasets where needed and visualize relationships among variables. The goal is to detect trends and correlations that reveal how users interact with their fitness trackers.

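# Join on both id and date so each sleep record pairs with the same user's activity on the same day
merged_data <- merge(activity, sleep, by = c("id", "date"))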
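The key columns passed to `merge()` decide which rows get paired. A minimal sketch with made-up two-user tables (`activity_demo` and `sleep_demo` are hypothetical) shows how the row count changes between joining on the user alone and joining on user plus date.

```r
# Two users, two days each; all values are made up
days <- as.Date(c("2016-04-12", "2016-04-13"))
activity_demo <- data.frame(
  id = c(1, 1, 2, 2),
  date = rep(days, 2),
  total_steps = c(5000, 9000, 12000, 3000)
)
sleep_demo <- data.frame(
  id = c(1, 1, 2, 2),
  date = rep(days, 2),
  total_minutes_asleep = c(400, 450, 380, 500)
)

# Joining on id alone pairs every day of a user with every other day of that user
by_id <- merge(activity_demo, sleep_demo, by = "id")

# Joining on id and date keeps one row per user per day
by_id_date <- merge(activity_demo, sleep_demo, by = c("id", "date"))

nrow(by_id)
nrow(by_id_date)
```

Comparing `nrow()` of the merged table with `nrow()` of the inputs is a quick check that a join did not multiply rows.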

# Steps vs Calories
ggplot(activity, aes(x = total_steps, y = calories)) +
  geom_point(color = 'darkgreen') +
  geom_smooth(method = lm) +
  labs(title = "Steps vs Calories", x = "Total Steps", y = "Calories")

## `geom_smooth()` using formula = 'y ~ x'
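A fitted line on a scatter plot shows direction but not strength. One common way to quantify the relationship is Pearson's correlation coefficient, sketched here on made-up numbers (`steps_demo` and `kcal_demo` are hypothetical, not values from the dataset).

```r
# Made-up daily values for illustration only
steps_demo <- c(1000, 4000, 7000, 10000, 13000)
kcal_demo  <- c(1600, 1900, 2200, 2400, 2900)

# Pearson correlation: +1 is a perfect increasing line, 0 is no linear relation
r <- cor(steps_demo, kcal_demo)

# The same relationship as a linear model; for one predictor, R-squared equals r^2
fit <- lm(kcal_demo ~ steps_demo)
r_squared <- summary(fit)$r.squared

print(r)
print(r_squared)
```

On the real data the equivalent call would be `cor(activity$total_steps, activity$calories)`, assuming both columns are numeric and free of missing values.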

# Sleep vs Steps
ggplot(merged_data, aes(x = total_minutes_asleep, y = total_steps)) +
  geom_point(color = 'blue') +
  geom_smooth(method = lm) +
  labs(title = "Sleep Duration vs Steps", x = "Minutes Asleep", y = "Total Steps")

## `geom_smooth()` using formula = 'y ~ x'

# Distribution of daily calories
# (the `sedentary` data frame holds dailyCalories_merged.csv, not sedentary-time data)
ggplot(sedentary, aes(x = calories)) +
  geom_density(fill = 'purple', alpha = 0.5) +
  labs(title = "Distribution of Daily Calories", x = "Calories", y = "Density")

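# Average steps by day of week
# weekdays() returns locale-dependent names; the levels below assume an English locale
day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
activity$day_of_week <- factor(weekdays(activity$date), levels = day_levels)
ggplot(activity, aes(x = day_of_week, y = total_steps)) +
  stat_summary(fun = mean, geom = "bar", fill = "steelblue") +
  labs(title = "Average Steps by Day of Week", x = "Day", y = "Average Steps") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))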

ggplot(weight, aes(x = weight_kg, y = bmi)) +
  geom_point(color = "darkorange") +
  geom_smooth(method = "lm") +
  labs(title = "Relationship Between Weight and BMI", x = "Weight (kg)", y = "BMI")

## `geom_smooth()` using formula = 'y ~ x'

5. Share Results

We present key insights with supporting visualizations that demonstrate how different variables relate to each other.

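Key Findings:

- Activity vs Calories: The scatter plot and fitted line show calories burned rising with step count. No correlation coefficient was computed in this report, so the strength of the relationship is not quantified.
- Sleep vs Activity: The plot suggests that users who sleep longer tend to be more active, but this holds only if each sleep record is matched to the same user's activity on the same date, so treat it as tentative.
- Heart Rate Patterns: No heart rate chart was produced. The only summary available (1,769 rows, values from 36 to 203 bpm, median 110 bpm) does not show a distribution concentrated in the 60–90 bpm resting range, so no resting heart rate conclusion is drawn here.
- Daily Calories and Sedentary Time: The density plot is built from daily calorie totals (median 2,134, mean about 2,304 per day), not from sedentary periods, so it does not isolate calories burned while inactive. The activity summary does show a mean of about 991 sedentary minutes (roughly 16.5 hours) per day.
- Day of Week Patterns: Users appear to walk more on weekdays than on weekends, based on the average-steps bar chart.
- Weight and BMI: Weight and BMI rise together roughly linearly across the 67 weight records, as expected given that BMI is computed from weight and height.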

These visualizations support recommendations for improving user engagement.

6. Act: Recommendations

Based on the analysis, Bellabeat can take the following actions:

Promote Daily Engagement: Use in-app reminders to encourage daily step goals and sleep tracking.
Highlight Health Metrics: Include heart rate summaries in daily reports to drive awareness.
Expand Weight Tracking: Add educational content around weight management and visual incentives to log weight.
Sedentary Alerts: Notify users when they’ve been inactive for extended periods to prompt movement.
Personalized Insights: Tailor feedback based on user trends (e.g., poor sleepers, low step count days).
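Tailoring feedback to user trends first requires grouping users by their typical behavior. The sketch below shows one way to do that with dplyr on made-up data: it averages steps per user and assigns a segment label. The data frame `daily_demo`, the 5,000 and 10,000 step cut-offs, and the segment names are all hypothetical choices for illustration, not values derived from this analysis.

```r
library(dplyr)

# Made-up daily step counts for three users
daily_demo <- data.frame(
  id = c(1, 1, 2, 2, 3, 3),
  total_steps = c(3000, 4000, 8000, 9000, 12000, 13000)
)

user_segments <- daily_demo %>%
  group_by(id) %>%
  summarise(mean_steps = mean(total_steps), .groups = "drop") %>%
  mutate(
    # Hypothetical thresholds; intervals are (-Inf, 5000], (5000, 10000], (10000, Inf)
    segment = cut(
      mean_steps,
      breaks = c(-Inf, 5000, 10000, Inf),
      labels = c("low", "moderate", "high")
    )
  )

print(user_segments)
```

The same pattern could segment users on average minutes asleep or sedentary minutes, with thresholds chosen to suit the product.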

Appendix

Data Source: Fitbit Fitness Tracker Data

Bellabeat Capstone Project

Elyshea Devore

2025-06-11
