Bellabeat is a wellness-focused tech company that manufactures smart devices to help women track their daily habits. Their product suite includes wellness trackers, a mobile app, and a subscription-based coaching service. The company has been growing steadily and wants to leverage data to make smarter marketing decisions. By exploring data collected from Fitbit users, Bellabeat aims to understand how consumers use health tracking devices. The results will help shape Bellabeat’s marketing strategy and inform improvements to the Bellabeat app.
This case study is part of the final capstone project for the Google Data Analytics Professional Certificate. The project follows the six-step data analysis process: Ask, Prepare, Process, Analyze, Share, and Act.
Bellabeat, a high-tech company, manufactures health-focused smart products for women. The marketing analytics team at Bellabeat wants to understand trends in smart device usage, specifically in terms of activity, sleep, weight, heart rate, and calorie data. By analyzing Fitbit user data, Bellabeat seeks insights to improve the user experience of its app and inform marketing strategies.
Objective: Identify patterns in user activity, sleep, and other health behaviors that Bellabeat can use to enhance its app’s features and boost user engagement.
We use public Fitbit datasets sourced from Kaggle. They contain daily and second-level data from 30+ users over a 31-day period (12 April to 12 May 2016). The goal is to load and inspect several of these datasets to determine their usability.
Five CSV files were loaded from the data source: https://www.kaggle.com/arashnic/fitbit
Load the tidyverse package with the library() function
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Load the remaining packages (run install.packages() once beforehand for janitor and skimr if they are not yet installed)
library(lubridate)
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(ggplot2)
library(skimr)
library(readr)
activity <- read.csv("C:/Users/elysh/OneDrive/Data Analytics/Bellabeat_Capstone_Project/datasets_bellabeat_project/dailyActivity_merged.csv")
sleep <- read_csv("C:/Users/elysh/OneDrive/Data Analytics/Bellabeat_Capstone_Project/datasets_bellabeat_project/sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
weight <- read_csv("C:/Users/elysh/OneDrive/Data Analytics/Bellabeat_Capstone_Project/datasets_bellabeat_project/weightLogInfo_merged.csv")
## Rows: 67 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
## lgl (1): IsManualReport
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
heartrate <- read_csv("C:/Users/elysh/OneDrive/Data Analytics/Bellabeat_Capstone_Project/datasets_bellabeat_project/heartrate_seconds_merged.csv")
## Rows: 2483658 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Time
## dbl (2): Id, Value
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
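# Note: this object is named `sedentary`, but dailyCalories_merged.csv holds daily calorie totals (Id, ActivityDay, Calories)
sedentary <- read_csv("C:/Users/elysh/OneDrive/Data Analytics/Bellabeat_Capstone_Project/datasets_bellabeat_project/dailyCalories_merged.csv")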
## Rows: 940 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, Calories
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
activity <- clean_names(activity)
sleep <- clean_names(sleep)
weight <- clean_names(weight)
heartrate <- clean_names(heartrate)
sedentary <- clean_names(sedentary)
activity$date <- as.Date(activity$activity_date, format="%m/%d/%Y")
sleep$date <- as.Date(sleep$sleep_day, format="%m/%d/%Y")
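# The weight log stores dates as month/day/year, so the format must be given explicitly
weight$date <- as.Date(weight$date, format="%m/%d/%Y")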
# Heart-rate timestamps are month/day/year with an AM/PM clock, so mdy_hms() is used instead of ymd_hms()
heartrate$time <- mdy_hms(heartrate$time)
sedentary$date <- as.Date(sedentary$activity_day, format="%m/%d/%Y")
activity <- distinct(activity)
sleep <- distinct(sleep)
weight <- distinct(weight)
heartrate <- distinct(heartrate)
sedentary <- distinct(sedentary)
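The date columns in these files are month/day/year strings, some with a 12-hour clock time attached, so each conversion depends on the format it is given. The sketch below uses base R on two made-up strings in that style (illustrative values, not rows from the dataset) to compare parsing with and without an explicit format.

```r
# Illustrative timestamps in month/day/year style (made-up values, not dataset rows).
raw <- c("4/12/2016 7:21:00 AM", "5/2/2016 11:59:59 PM")

# Without a format, as.Date() tries year-first layouts and misreads the string.
wrong <- as.Date(raw)

# An explicit format reads the date part; the trailing time text is ignored.
dates <- as.Date(raw, format = "%m/%d/%Y")

# Full timestamps need the 12-hour clock (%I) and AM/PM (%p) specifiers.
# %p depends on the session locale; an English or C locale is assumed here.
stamps <- as.POSIXct(raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

print(wrong)
print(dates)
print(stamps)
```

lubridate's mdy_hms() reads the same month/day/year layout, so either approach should work for the heart-rate timestamps.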
After loading the datasets, we assess data quality by checking for duplicates, missing values, and incorrect data types. We also perform data type conversion for dates and timestamps.
sum(is.na(activity))
## [1] 0
sum(is.na(sleep))
## [1] 0
sum(is.na(weight))
## [1] 103
sum(is.na(heartrate))
## [1] 1769
sum(is.na(sedentary))
## [1] 0
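A single total per table does not show which columns hold the missing values, or whether any rows are repeated. As a minimal sketch of a per-column check, here is the idea on a toy data frame (the column names and values are made up for illustration):

```r
# Toy data frame standing in for one of the loaded tables (values are made up).
toy <- data.frame(
  id     = c(1, 1, 2, 2, 2),
  weight = c(60.5, 60.5, 72.0, NA, 71.4),
  fat    = c(NA, NA, 22, NA, NA)
)

# Missing values per column, instead of one total for the whole table.
na_by_column <- colSums(is.na(toy))

# Number of rows that exactly repeat an earlier row.
n_duplicates <- sum(duplicated(toy))

print(na_by_column)
print(n_duplicates)
```

The same two calls could be applied to each of the five tables before and after distinct() to see what was removed.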
skim(activity)
| Name | activity |
| Number of rows | 940 |
| Number of columns | 16 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| Date | 1 |
| numeric | 14 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| activity_date | 0 | 1 | 8 | 9 | 0 | 31 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date | 0 | 1 | 2016-04-12 | 2016-05-12 | 2016-04-26 | 31 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 4.855407e+09 | 2.424805e+09 | 1503960366 | 2.320127e+09 | 4.445115e+09 | 6.962181e+09 | 8.877689e+09 | ▇▅▃▅▅ |
| total_steps | 0 | 1 | 7.637910e+03 | 5.087150e+03 | 0 | 3.789750e+03 | 7.405500e+03 | 1.072700e+04 | 3.601900e+04 | ▇▇▁▁▁ |
| total_distance | 0 | 1 | 5.490000e+00 | 3.920000e+00 | 0 | 2.620000e+00 | 5.240000e+00 | 7.710000e+00 | 2.803000e+01 | ▇▆▁▁▁ |
| tracker_distance | 0 | 1 | 5.480000e+00 | 3.910000e+00 | 0 | 2.620000e+00 | 5.240000e+00 | 7.710000e+00 | 2.803000e+01 | ▇▆▁▁▁ |
| logged_activities_distance | 0 | 1 | 1.100000e-01 | 6.200000e-01 | 0 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 4.940000e+00 | ▇▁▁▁▁ |
| very_active_distance | 0 | 1 | 1.500000e+00 | 2.660000e+00 | 0 | 0.000000e+00 | 2.100000e-01 | 2.050000e+00 | 2.192000e+01 | ▇▁▁▁▁ |
| moderately_active_distance | 0 | 1 | 5.700000e-01 | 8.800000e-01 | 0 | 0.000000e+00 | 2.400000e-01 | 8.000000e-01 | 6.480000e+00 | ▇▁▁▁▁ |
| light_active_distance | 0 | 1 | 3.340000e+00 | 2.040000e+00 | 0 | 1.950000e+00 | 3.360000e+00 | 4.780000e+00 | 1.071000e+01 | ▆▇▆▁▁ |
| sedentary_active_distance | 0 | 1 | 0.000000e+00 | 1.000000e-02 | 0 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.100000e-01 | ▇▁▁▁▁ |
| very_active_minutes | 0 | 1 | 2.116000e+01 | 3.284000e+01 | 0 | 0.000000e+00 | 4.000000e+00 | 3.200000e+01 | 2.100000e+02 | ▇▁▁▁▁ |
| fairly_active_minutes | 0 | 1 | 1.356000e+01 | 1.999000e+01 | 0 | 0.000000e+00 | 6.000000e+00 | 1.900000e+01 | 1.430000e+02 | ▇▁▁▁▁ |
| lightly_active_minutes | 0 | 1 | 1.928100e+02 | 1.091700e+02 | 0 | 1.270000e+02 | 1.990000e+02 | 2.640000e+02 | 5.180000e+02 | ▅▇▇▃▁ |
| sedentary_minutes | 0 | 1 | 9.912100e+02 | 3.012700e+02 | 0 | 7.297500e+02 | 1.057500e+03 | 1.229500e+03 | 1.440000e+03 | ▁▁▇▅▇ |
| calories | 0 | 1 | 2.303610e+03 | 7.181700e+02 | 0 | 1.828500e+03 | 2.134000e+03 | 2.793250e+03 | 4.900000e+03 | ▁▆▇▃▁ |
skim(sleep)
| Name | sleep |
| Number of rows | 410 |
| Number of columns | 6 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| Date | 1 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| sleep_day | 0 | 1 | 20 | 21 | 0 | 31 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date | 0 | 1 | 2016-04-12 | 2016-05-12 | 2016-04-27 | 31 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 4.994963e+09 | 2.060863e+09 | 1503960366 | 3.977334e+09 | 4702921684.0 | 6962181067 | 8792009665 | ▆▆▇▅▃ |
| total_sleep_records | 0 | 1 | 1.120000e+00 | 3.500000e-01 | 1 | 1.000000e+00 | 1.0 | 1 | 3 | ▇▁▁▁▁ |
| total_minutes_asleep | 0 | 1 | 4.191700e+02 | 1.186400e+02 | 58 | 3.610000e+02 | 432.5 | 490 | 796 | ▁▂▇▃▁ |
| total_time_in_bed | 0 | 1 | 4.584800e+02 | 1.274600e+02 | 61 | 4.037500e+02 | 463.0 | 526 | 961 | ▁▃▇▁▁ |
skim(weight)
| Name | weight |
| Number of rows | 67 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| Date | 1 |
| logical | 1 |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date | 38 | 0.43 | 0004-12-20 | 0005-12-20 | 0005-05-20 | 13 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| is_manual_report | 0 | 1 | 0.61 | TRU: 41, FAL: 26 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 7.009282e+09 | 1.950322e+09 | 1.503960e+09 | 6.962181e+09 | 6.962181e+09 | 8.877689e+09 | 8.877689e+09 | ▁▁▂▇▆ |
| weight_kg | 0 | 1.00 | 7.204000e+01 | 1.392000e+01 | 5.260000e+01 | 6.140000e+01 | 6.250000e+01 | 8.505000e+01 | 1.335000e+02 | ▇▃▃▁▁ |
| weight_pounds | 0 | 1.00 | 1.588100e+02 | 3.070000e+01 | 1.159600e+02 | 1.353600e+02 | 1.377900e+02 | 1.875000e+02 | 2.943200e+02 | ▇▃▃▁▁ |
| fat | 65 | 0.03 | 2.350000e+01 | 2.120000e+00 | 2.200000e+01 | 2.275000e+01 | 2.350000e+01 | 2.425000e+01 | 2.500000e+01 | ▇▁▁▁▇ |
| bmi | 0 | 1.00 | 2.519000e+01 | 3.070000e+00 | 2.145000e+01 | 2.396000e+01 | 2.439000e+01 | 2.556000e+01 | 4.754000e+01 | ▇▁▁▁▁ |
| log_id | 0 | 1.00 | 1.461772e+12 | 7.829948e+08 | 1.460444e+12 | 1.461079e+12 | 1.461802e+12 | 1.462375e+12 | 1.463098e+12 | ▇▇▆▇▇ |
skim(heartrate)
## Warning: There was 1 warning in `dplyr::summarize()`.
## ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
## mangled_skimmers$funs)`.
## ℹ In group 0: .
## Caused by warning:
## ! There were 2 warnings in `dplyr::summarize()`.
## The first warning was:
## ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
## mangled_skimmers$funs)`.
## Caused by warning in `min.default()`:
## ! no non-missing arguments to min; returning Inf
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
| Name | heartrate |
| Number of rows | 1769 |
| Number of columns | 3 |
| _______________________ | |
| Column type frequency: | |
| numeric | 2 |
| POSIXct | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 5.307499e+09 | 2.124234e+09 | 2022484408 | 4020332650 | 5553957443 | 6962181067 | 8877689391 | ▆▇▇▆▅ |
| value | 0 | 1 | 1.114600e+02 | 3.959000e+01 | 36 | 78 | 110 | 143 | 203 | ▆▇▇▆▂ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| time | 1769 | 0 | Inf | -Inf | NA | 0 |
skim(sedentary)
| Name | sedentary |
| Number of rows | 940 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| Date | 1 |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| activity_day | 0 | 1 | 8 | 9 | 0 | 31 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date | 0 | 1 | 2016-04-12 | 2016-05-12 | 2016-04-26 | 31 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 4.855407e+09 | 2.424805e+09 | 1503960366 | 2320127002.0 | 4445114986 | 6.962181e+09 | 8877689391 | ▇▅▃▅▅ |
| calories | 0 | 1 | 2.303610e+03 | 7.181700e+02 | 0 | 1828.5 | 2134 | 2.793250e+03 | 4900 | ▁▆▇▃▁ |
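Note: The activity, sleep, and calories tables have no missing values. In the weight table, fat is recorded for only 2 of the 67 entries (65 missing); the 38 missing dates and the year-0004 and year-0005 dates shown in the weight summary are what as.Date() produces when month/day/year strings are parsed without a format string. In the heart-rate summary, every time value is missing because the timestamps failed to parse, and distinct() then collapsed the 2,483,658 loaded readings to 1,769 unique id/value pairs, so that summary does not describe the full heart-rate data. We’ll proceed with complete cases or summary-level analysis where appropriate.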
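The two options named in the note can be sketched as follows, using a toy weight table with made-up values. The column names mirror the cleaned weight data, but nothing here is taken from the dataset.

```r
# Toy weight log (made-up values) with gaps in the fat column.
toy_weight <- data.frame(
  id        = c(1, 2, 3, 4),
  weight_kg = c(52.6, 61.4, 85.0, 70.2),
  fat       = c(22, NA, NA, 25)
)

# Complete cases: keep only rows with no missing value in the columns an analysis needs.
needed <- c("weight_kg", "fat")
complete_rows <- toy_weight[complete.cases(toy_weight[, needed]), ]

# Summary-level alternative: keep every row and skip NAs inside the summary itself.
mean_fat <- mean(toy_weight$fat, na.rm = TRUE)

print(complete_rows)
print(mean_fat)
```

Restricting complete.cases() to the needed columns avoids discarding rows because of gaps in a column the analysis never uses.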
We merge datasets where needed and visualize relationships among variables. The goal is to detect trends and correlations that reveal how users interact with their fitness trackers.
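# Join on both id and date so each activity day is paired with the same user's sleep record for that day
merged_data <- merge(activity, sleep, by = c("id", "date"))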
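When two daily tables are joined, the key decides whether a row pairs a user's activity with that same day's sleep or with every night that user logged. A minimal sketch on made-up data (the toy_ tables are hypothetical stand-ins for activity and sleep):

```r
# Two users, two activity days each (made-up values).
toy_activity <- data.frame(
  id          = c(1, 1, 2, 2),
  date        = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12", "2016-04-13")),
  total_steps = c(8000, 10500, 3000, 4200)
)

# User 1 logged sleep on both days, user 2 on one day only.
toy_sleep <- data.frame(
  id                   = c(1, 1, 2),
  date                 = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  total_minutes_asleep = c(420, 390, 510)
)

# Joining on id alone pairs every activity day with every sleep night of that user.
by_id_only <- merge(toy_activity, toy_sleep, by = "id")

# Joining on id and date keeps one row per user-day present in both tables.
by_id_date <- merge(toy_activity, toy_sleep, by = c("id", "date"))

print(nrow(by_id_only))
print(nrow(by_id_date))
```

Comparing nrow() of the merged result with the row counts of the inputs is a quick way to catch an unintended many-to-many join.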
# Steps vs Calories
ggplot(activity, aes(x = total_steps, y = calories)) +
  geom_point(color = 'darkgreen') +
  geom_smooth(method = "lm") +
  labs(title = "Steps vs Calories", x = "Total Steps", y = "Calories")
## `geom_smooth()` using formula = 'y ~ x'
# Sleep vs Steps
ggplot(merged_data, aes(x = total_minutes_asleep, y = total_steps)) +
  geom_point(color = 'blue') +
  geom_smooth(method = "lm") +
  labs(title = "Sleep Duration vs Steps", x = "Minutes Asleep", y = "Total Steps")
## `geom_smooth()` using formula = 'y ~ x'
# Distribution of daily calories (the `sedentary` object holds the daily calories table)
ggplot(sedentary, aes(x = calories)) +
  geom_density(fill = 'purple', alpha = 0.5) +
  labs(title = "Distribution of Daily Calories", x = "Calories", y = "Density")
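# Average steps by day of week, ordered Monday to Sunday
# (weekdays() returns these names only in an English locale, which is assumed here)
activity$day_of_week <- factor(weekdays(activity$date), levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
ggplot(activity, aes(x = day_of_week, y = total_steps)) +
  stat_summary(fun = mean, geom = "bar", fill = "steelblue") +
  labs(title = "Average Steps by Day of Week", x = "Day", y = "Average Steps") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))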
# Weight vs BMI
ggplot(weight, aes(x = weight_kg, y = bmi)) +
  geom_point(color = "darkorange") +
  geom_smooth(method = "lm") +
  labs(title = "Relationship Between Weight and BMI", x = "Weight (kg)", y = "BMI")
## `geom_smooth()` using formula = 'y ~ x'
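The fitted lines in the scatter plots show the direction of each relationship but not its strength. One way to put a number on it, sketched here on made-up values rather than the Fitbit data, is the Pearson correlation together with the slope of the same linear model that geom_smooth(method = "lm") fits:

```r
# Made-up daily totals for illustration only.
toy <- data.frame(
  total_steps = c(2000, 5000, 8000, 11000, 14000),
  calories    = c(1700, 2000, 2250, 2600, 2900)
)

# Pearson correlation: strength and direction of the linear relationship.
r <- cor(toy$total_steps, toy$calories)

# Slope of the least-squares line: change in calories per additional step.
fit   <- lm(calories ~ total_steps, data = toy)
slope <- coef(fit)[["total_steps"]]

print(r)
print(slope)
```

For columns with missing values, cor() accepts use = "complete.obs" so that incomplete pairs are dropped rather than returning NA.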
Based on the analysis, Bellabeat can take the following actions:
Data Source: Fitbit Fitness Tracker Data