Bellabeat, founded by Urska Srsen and Sando Mur, specializes in health-focused smart products designed to inform and inspire women. Since its inception in 2013, Bellabeat has rapidly grown and positioned itself as a leader in tech-driven wellness products for women.

Phase 1: Ask

Business Task

Analyze non-Bellabeat smart device usage data to gain insights that will enhance Bellabeat’s marketing strategy. Apply these insights to a selected Bellabeat product for targeted improvements.

Key Stakeholders

Urska Srsen: Bellabeat’s cofounder and Chief Creative Officer
Sando Mur: Bellabeat’s cofounder and key member of the executive team
Bellabeat Marketing Team

Market Insights

According to Fortune Business Insights, as well as studies and statistics from ValuePenguin (April 2022) and RunRepeat (October 2021), the following insights were observed:

Market Growth: The global fitness tracker market is projected to grow from USD 53.94 billion in 2023 to USD 290.85 billion by 2032, at a CAGR of 21.3%.
User Engagement: 92% of smartwatch wearers use their devices for health and fitness monitoring, and 88% say the devices helped them achieve their fitness goals. Smartwatches hold the largest market share among smart device types.
Application: Running is the most common use, with 42.8% of users globally prioritizing this feature.
Shipping Volume: 445 million wearable fitness devices were shipped, highlighting strong demand and market potential.
Pandemic Impact: Fitness tracker shipments and revenue grew over 31% during the COVID-19 pandemic, indicating increased consumer interest in health.
Gender: Women are nearly 40% more likely to use fitness trackers than men, benefiting female-focused brands like Bellabeat.

Phase 2: Prepare

Data Source

Fitabase dataset made available through Kaggle user Mobius and licensed under CC0: Public Domain
Dataset generated from the smart-device logs of eligible Fitbit users (the dataset description states 30 users; the files contain 33 distinct Ids)
Acknowledgements - Robert Furger, Julia Brinton, Michael Keating, Alexa Ortiz

Data Credibility

The data is reliable but dated, originating from 2016 and collected over 31 days from 33 participants.

The small sample of 33 participants limits the confidence that can be placed in the derived insights. Additionally, the absence of demographic information such as gender, age, and location makes it impossible to determine whether the data accurately represents the wider population. It also prevents the identification of female-specific trends.
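To give a rough sense of how much a sample of 33 limits precision, the sketch below computes the worst-case 95% margin of error for a proportion. It assumes simple random sampling, which this dataset may not satisfy, so the result is best read as an optimistic lower bound on the uncertainty.

```r
# Worst-case 95% margin of error for a proportion estimated from n users.
# Simple random sampling is an illustrative assumption, not a property of the data.
n <- 33
p <- 0.5                        # proportion that maximizes the variance
z <- qnorm(0.975)               # two-sided 95% critical value
margin_of_error <- z * sqrt(p * (1 - p) / n)
round(100 * margin_of_error, 1) # in percentage points
```

Under these assumptions the margin is roughly 17 percentage points, so any percentage reported for the 33 participants should be treated as indicative only.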

Chosen Tools

I initially examined the datasets in Excel to sort and filter the data. However, the largest files (minute-level intensities and second-level heart rate) contain over a million rows each, more than an Excel worksheet can hold. I therefore loaded all the data into RStudio for more efficient cleaning and analysis.

Excel
R
R Packages

install.packages("tidyverse")
install.packages("lubridate")
install.packages("janitor")
install.packages("readxl")
install.packages("data.table")

library(tidyverse)
library(lubridate)
library(janitor)
library(readxl)
library(data.table)

Chosen Datasets

dailyActivity_merged
dailyCalories_merged
dailyIntensities_merged
dailySteps_merged
heartrate_seconds_merged
hourlyCalories_merged
hourlyIntensities_merged
hourlySteps_merged
minuteIntensitiesNarrow_merged
minuteSleep_merged
sleepDay_merged
weightLogInfo_merged

daily_activity <- read.csv("mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
daily_calories <- read.csv("mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailyCalories_merged.csv")
daily_intensities <- read.csv("mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailyIntensities_merged.csv")
daily_steps <- read.csv("mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailySteps_merged.csv")
hourly_calories <- read.csv("mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/hourlyCalories_merged.csv")
hourly_intensities <- read.csv("mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/hourlyIntensities_merged.csv")
hourly_steps <- read.csv("mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/hourlySteps_merged.csv")
minute_sleep <- read.csv("mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/minuteSleep_merged.csv")
sleep_day <- read.csv("mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
weight_log <- read.csv("mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")

# Load separately to avoid crash
heartrate_seconds <- read.csv("mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv")

# Load separately to avoid crash
minute_intensities <- read_csv("mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/minuteIntensitiesNarrow_merged.csv")

## Rows: 1325580 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityMinute
## dbl (2): Id, Intensity
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
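The repeated path in the loading code above could also be built once. The following is a minimal sketch rather than the code used in this analysis: `load_fitabase`, `data_dir`, and the temporary demo folder are hypothetical names introduced for illustration, and the helper assumes every file follows the `<name>_merged.csv` pattern.

```r
# Hypothetical helper: build each path from one base directory and read the
# files into a named list. Names here are illustrative, not from the analysis.
load_fitabase <- function(data_dir, file_names) {
  paths <- file.path(data_dir, paste0(file_names, "_merged.csv"))
  not_found <- !file.exists(paths)
  if (any(not_found)) {
    stop("Missing files: ", paste(basename(paths[not_found]), collapse = ", "))
  }
  tables <- lapply(paths, read.csv)
  names(tables) <- file_names
  tables
}

# Self-contained demonstration with a tiny stand-in file in a temporary folder
demo_dir <- tempfile("fitabase_demo")
dir.create(demo_dir)
write.csv(data.frame(Id = c(1, 2), StepTotal = c(100, 200)),
          file.path(demo_dir, "dailySteps_merged.csv"), row.names = FALSE)

demo_tables <- load_fitabase(demo_dir, "dailySteps")
str(demo_tables$dailySteps)
```

Failing early on a missing file makes a mistyped file name visible immediately instead of surfacing later as a missing object.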

Phase 3: Process

Cleaning Log

Inspect and Preview All Datasets

Inspected the data using the ‘str()’ function
Searched for matching columns for potential merging, and checked for inconsistencies in column names and formats
Found “Id”, “ActivityDate”, and “ActivityHour” columns suitable for merging separate daily and hourly datasets
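The search for matching columns can also be done programmatically. The sketch below uses small stand-in data frames whose column names mirror the raw files; the `_demo` objects are illustrative and not part of the analysis.

```r
# Toy stand-ins for three daily tables, using the raw column names
daily_activity_demo <- data.frame(Id = 1, ActivityDate = "4/12/2016",
                                  TotalSteps = 13162, Calories = 1985)
daily_calories_demo <- data.frame(Id = 1, ActivityDay = "4/12/2016",
                                  Calories = 1985)
daily_steps_demo    <- data.frame(Id = 1, ActivityDay = "4/12/2016",
                                  StepTotal = 13162)

tables <- list(daily_activity_demo, daily_calories_demo, daily_steps_demo)

# Columns present in every table are candidate join keys
common_cols <- Reduce(intersect, lapply(tables, names))
common_cols

# Columns that appear in only some tables hint at naming inconsistencies
col_counts <- table(unlist(lapply(tables, names)))
names(col_counts)[col_counts < length(tables)]
```

Before any renaming, only `Id` is shared by all three tables, which is what makes the `ActivityDay` versus `ActivityDate` and `TotalSteps` versus `StepTotal` mismatches stand out.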

# Look for matching columns, inconsistencies in column names and in column formats
str(daily_activity)

## 'data.frame':    940 obs. of  15 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ TotalSteps              : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
##  $ TotalDistance           : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ Calories                : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...

str(daily_calories)

## 'data.frame':    940 obs. of  3 variables:
##  $ Id         : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDay: chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ Calories   : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...

str(daily_intensities)

## 'data.frame':    940 obs. of  10 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDay             : chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...

str(daily_steps)

## 'data.frame':    940 obs. of  3 variables:
##  $ Id         : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDay: chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ StepTotal  : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...

str(heartrate_seconds)

## 'data.frame':    2483658 obs. of  3 variables:
##  $ Id   : num  2.02e+09 2.02e+09 2.02e+09 2.02e+09 2.02e+09 ...
##  $ Time : chr  "4/12/2016 7:21:00 AM" "4/12/2016 7:21:05 AM" "4/12/2016 7:21:10 AM" "4/12/2016 7:21:20 AM" ...
##  $ Value: int  97 102 105 103 101 95 91 93 94 93 ...

str(hourly_calories)

## 'data.frame':    22099 obs. of  3 variables:
##  $ Id          : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityHour: chr  "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
##  $ Calories    : int  81 61 59 47 48 48 48 47 68 141 ...

str(hourly_intensities)

## 'data.frame':    22099 obs. of  4 variables:
##  $ Id              : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityHour    : chr  "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
##  $ TotalIntensity  : int  20 8 7 0 0 0 0 0 13 30 ...
##  $ AverageIntensity: num  0.333 0.133 0.117 0 0 ...

str(hourly_steps)

## 'data.frame':    22099 obs. of  3 variables:
##  $ Id          : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityHour: chr  "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
##  $ StepTotal   : int  373 160 151 0 0 0 0 0 250 1864 ...

str(minute_intensities)

## spc_tbl_ [1,325,580 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id            : num [1:1325580] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityMinute: chr [1:1325580] "4/12/2016 12:00:00 AM" "4/12/2016 12:01:00 AM" "4/12/2016 12:02:00 AM" "4/12/2016 12:03:00 AM" ...
##  $ Intensity     : num [1:1325580] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   ActivityMinute = col_character(),
##   ..   Intensity = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

str(minute_sleep)

## 'data.frame':    188521 obs. of  4 variables:
##  $ Id   : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ date : chr  "4/12/2016 2:47:30 AM" "4/12/2016 2:48:30 AM" "4/12/2016 2:49:30 AM" "4/12/2016 2:50:30 AM" ...
##  $ value: int  3 2 1 1 1 1 1 2 2 2 ...
##  $ logId: num  1.14e+10 1.14e+10 1.14e+10 1.14e+10 1.14e+10 ...

str(sleep_day)

## 'data.frame':    413 obs. of  5 variables:
##  $ Id                : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ SleepDay          : chr  "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
##  $ TotalSleepRecords : int  1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: int  327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : int  346 407 442 367 712 320 377 364 384 449 ...

str(weight_log)

## 'data.frame':    67 obs. of  8 variables:
##  $ Id            : num  1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
##  $ Date          : chr  "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
##  $ WeightKg      : num  52.6 52.6 133.5 56.7 57.3 ...
##  $ WeightPounds  : num  116 116 294 125 126 ...
##  $ Fat           : int  22 NA NA NA NA 25 NA NA NA NA ...
##  $ BMI           : num  22.6 22.6 47.5 21.5 21.7 ...
##  $ IsManualReport: chr  "True" "True" "False" "True" ...
##  $ LogId         : num  1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...

Dataset Transformations

Cleaned inconsistent column names to match appropriate datasets
Renamed date/time columns to either “ActivityDate”, “ActivityMinutes”, or “ActivitySeconds”
Renamed ‘daily_activity’/“TotalSteps” to “StepTotal” to align with other ‘steps’ datasets
Standardized inconsistent date/time formats
A total of 33 participants were identified, which is 3 more than originally stated in the dataset description.
Identified 65 missing values in the “Fat” column of the ‘weight_log’ dataset and chose not to delete these rows, as doing so would remove about 97% of the data (65 of 67 rows)
Identified and removed 3 duplicates from the ‘sleep_day’ dataset
Identified and removed 543 duplicates from the ‘minute_sleep’ dataset
Found the ‘minute_sleep’ dataset to be of no value and excluded it from the analysis
Merged Daily datasets together
Merged Hourly datasets together
Reduced datasets from 12 to 5: two merged datasets (‘daily_combined’ and ‘hourly_combined’) and three separate datasets (‘minute_intensities’, ‘heartrate_seconds’ and ‘weight_log’)
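The date standardization step relies on two format strings. The sketch below shows them on a few sample strings in the formats found in the files. Setting `tz = "UTC"` is an assumption made here to keep the result machine-independent, and `%p` is locale-dependent, so parsing AM/PM may behave differently outside English-like locales.

```r
# Raw strings in the two formats seen in the files
date_only <- c("4/12/2016", "4/13/2016")
date_time <- c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 PM")

# Date-only values become Date objects
parsed_dates <- as.Date(date_only, format = "%m/%d/%Y")

# 12-hour timestamps need %I (hour 01-12) together with %p (AM/PM)
parsed_times <- as.POSIXct(date_time, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

format(parsed_dates)                       # "2016-04-12" "2016-04-13"
format(parsed_times, "%Y-%m-%d %H:%M:%S")  # "2016-04-12 00:00:00" "2016-04-12 13:00:00"

# A failed parse yields NA without an error, so it is worth checking
stopifnot(!anyNA(parsed_dates), !anyNA(parsed_times))
```

Checking for `NA` after conversion is a cheap safeguard, because a wrong format string does not raise an error.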

# Rename columns to ensure consistency and match 'daily_activity' dataset
colnames(daily_activity)[colnames(daily_activity) == "TotalSteps"] <- "StepTotal"
colnames(daily_calories)[colnames(daily_calories) == "ActivityDay"] <- "ActivityDate"
colnames(daily_intensities)[colnames(daily_intensities) == "ActivityDay"] <- "ActivityDate"
colnames(daily_steps)[colnames(daily_steps) == "ActivityDay"] <- "ActivityDate"
colnames(heartrate_seconds)[colnames(heartrate_seconds) == "Time"] <- "ActivitySeconds"
colnames(heartrate_seconds)[colnames(heartrate_seconds) == "Value"] <- "Bpm"
colnames(sleep_day)[colnames(sleep_day) == "SleepDay"] <- "ActivityDate"
colnames(minute_intensities)[colnames(minute_intensities) == "ActivityMinute"] <- "ActivityMinutes"
colnames(minute_sleep)[colnames(minute_sleep) == "date"] <- "ActivityMinutes"
colnames(weight_log)[colnames(weight_log) == "Date"] <- "ActivityDate"

# Ensure date columns are in the same format
daily_activity$ActivityDate <- as.Date(daily_activity$ActivityDate, format="%m/%d/%Y")
daily_calories$ActivityDate <- as.Date(daily_calories$ActivityDate, format="%m/%d/%Y")
daily_intensities$ActivityDate <- as.Date(daily_intensities$ActivityDate, format="%m/%d/%Y")
daily_steps$ActivityDate <- as.Date(daily_steps$ActivityDate, format="%m/%d/%Y")
heartrate_seconds$ActivitySeconds <- as.POSIXct(heartrate_seconds$ActivitySeconds, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
hourly_calories$ActivityHour <- as.POSIXct(hourly_calories$ActivityHour, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
hourly_intensities$ActivityHour <- as.POSIXct(hourly_intensities$ActivityHour, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
hourly_steps$ActivityHour <- as.POSIXct(hourly_steps$ActivityHour, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
sleep_day$ActivityDate <- as.Date(sleep_day$ActivityDate, format="%m/%d/%Y")
minute_intensities$ActivityMinutes <- as.POSIXct(minute_intensities$ActivityMinutes, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
minute_sleep$ActivityMinutes <- as.POSIXct(minute_sleep$ActivityMinutes, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
weight_log$ActivityDate <- as.POSIXct(weight_log$ActivityDate, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())

# Confirm changes in final inspection
str(daily_activity)

## 'data.frame':    940 obs. of  15 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : Date, format: "2016-04-12" "2016-04-13" ...
##  $ StepTotal               : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
##  $ TotalDistance           : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ Calories                : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...

str(daily_calories)

## 'data.frame':    940 obs. of  3 variables:
##  $ Id          : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate: Date, format: "2016-04-12" "2016-04-13" ...
##  $ Calories    : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...

str(daily_intensities)

## 'data.frame':    940 obs. of  10 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : Date, format: "2016-04-12" "2016-04-13" ...
##  $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...

str(daily_steps)

## 'data.frame':    940 obs. of  3 variables:
##  $ Id          : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate: Date, format: "2016-04-12" "2016-04-13" ...
##  $ StepTotal   : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...

str(heartrate_seconds)

## 'data.frame':    2483658 obs. of  3 variables:
##  $ Id             : num  2.02e+09 2.02e+09 2.02e+09 2.02e+09 2.02e+09 ...
##  $ ActivitySeconds: POSIXct, format: "2016-04-12 07:21:00" "2016-04-12 07:21:05" ...
##  $ Bpm            : int  97 102 105 103 101 95 91 93 94 93 ...

str(hourly_calories)

## 'data.frame':    22099 obs. of  3 variables:
##  $ Id          : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityHour: POSIXct, format: "2016-04-12 00:00:00" "2016-04-12 01:00:00" ...
##  $ Calories    : int  81 61 59 47 48 48 48 47 68 141 ...

str(hourly_intensities)

## 'data.frame':    22099 obs. of  4 variables:
##  $ Id              : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityHour    : POSIXct, format: "2016-04-12 00:00:00" "2016-04-12 01:00:00" ...
##  $ TotalIntensity  : int  20 8 7 0 0 0 0 0 13 30 ...
##  $ AverageIntensity: num  0.333 0.133 0.117 0 0 ...

str(hourly_steps)

## 'data.frame':    22099 obs. of  3 variables:
##  $ Id          : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityHour: POSIXct, format: "2016-04-12 00:00:00" "2016-04-12 01:00:00" ...
##  $ StepTotal   : int  373 160 151 0 0 0 0 0 250 1864 ...

str(minute_intensities)

## spc_tbl_ [1,325,580 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id             : num [1:1325580] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityMinutes: POSIXct[1:1325580], format: "2016-04-12 00:00:00" "2016-04-12 00:01:00" ...
##  $ Intensity      : num [1:1325580] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   ActivityMinute = col_character(),
##   ..   Intensity = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

str(minute_sleep)

## 'data.frame':    188521 obs. of  4 variables:
##  $ Id             : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityMinutes: POSIXct, format: "2016-04-12 02:47:30" "2016-04-12 02:48:30" ...
##  $ value          : int  3 2 1 1 1 1 1 2 2 2 ...
##  $ logId          : num  1.14e+10 1.14e+10 1.14e+10 1.14e+10 1.14e+10 ...

str(sleep_day)

## 'data.frame':    413 obs. of  5 variables:
##  $ Id                : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate      : Date, format: "2016-04-12" "2016-04-13" ...
##  $ TotalSleepRecords : int  1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: int  327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : int  346 407 442 367 712 320 377 364 384 449 ...

str(weight_log)

## 'data.frame':    67 obs. of  8 variables:
##  $ Id            : num  1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
##  $ ActivityDate  : POSIXct, format: "2016-05-02 23:59:59" "2016-05-03 23:59:59" ...
##  $ WeightKg      : num  52.6 52.6 133.5 56.7 57.3 ...
##  $ WeightPounds  : num  116 116 294 125 126 ...
##  $ Fat           : int  22 NA NA NA NA 25 NA NA NA NA ...
##  $ BMI           : num  22.6 22.6 47.5 21.5 21.7 ...
##  $ IsManualReport: chr  "True" "True" "False" "True" ...
##  $ LogId         : num  1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...

# Identify missing values
# List of datasets
datasets <- list(
  daily_activity = daily_activity,
  daily_calories = daily_calories,
  daily_intensities = daily_intensities,
  daily_steps = daily_steps,
  heartrate_seconds = heartrate_seconds,
  hourly_calories = hourly_calories,
  hourly_intensities = hourly_intensities,
  hourly_steps = hourly_steps,
  minute_intensities = minute_intensities,
  minute_sleep = minute_sleep,
  sleep_day = sleep_day,
  weight_log = weight_log
)

# Check for missing values in each dataset
for (name in names(datasets)) {
  cat("\nChecking for missing values in:", name, "\n")
  print(sapply(datasets[[name]], function(x) sum(is.na(x))))
}

## 
## Checking for missing values in: daily_activity 
##                       Id             ActivityDate                StepTotal 
##                        0                        0                        0 
##            TotalDistance          TrackerDistance LoggedActivitiesDistance 
##                        0                        0                        0 
##       VeryActiveDistance ModeratelyActiveDistance      LightActiveDistance 
##                        0                        0                        0 
##  SedentaryActiveDistance        VeryActiveMinutes      FairlyActiveMinutes 
##                        0                        0                        0 
##     LightlyActiveMinutes         SedentaryMinutes                 Calories 
##                        0                        0                        0 
## 
## Checking for missing values in: daily_calories 
##           Id ActivityDate     Calories 
##            0            0            0 
## 
## Checking for missing values in: daily_intensities 
##                       Id             ActivityDate         SedentaryMinutes 
##                        0                        0                        0 
##     LightlyActiveMinutes      FairlyActiveMinutes        VeryActiveMinutes 
##                        0                        0                        0 
##  SedentaryActiveDistance      LightActiveDistance ModeratelyActiveDistance 
##                        0                        0                        0 
##       VeryActiveDistance 
##                        0 
## 
## Checking for missing values in: daily_steps 
##           Id ActivityDate    StepTotal 
##            0            0            0 
## 
## Checking for missing values in: heartrate_seconds 
##              Id ActivitySeconds             Bpm 
##               0               0               0 
## 
## Checking for missing values in: hourly_calories 
##           Id ActivityHour     Calories 
##            0            0            0 
## 
## Checking for missing values in: hourly_intensities 
##               Id     ActivityHour   TotalIntensity AverageIntensity 
##                0                0                0                0 
## 
## Checking for missing values in: hourly_steps 
##           Id ActivityHour    StepTotal 
##            0            0            0 
## 
## Checking for missing values in: minute_intensities 
##              Id ActivityMinutes       Intensity 
##               0               0               0 
## 
## Checking for missing values in: minute_sleep 
##              Id ActivityMinutes           value           logId 
##               0               0               0               0 
## 
## Checking for missing values in: sleep_day 
##                 Id       ActivityDate  TotalSleepRecords TotalMinutesAsleep 
##                  0                  0                  0                  0 
##     TotalTimeInBed 
##                  0 
## 
## Checking for missing values in: weight_log 
##             Id   ActivityDate       WeightKg   WeightPounds            Fat 
##              0              0              0              0             65 
##            BMI IsManualReport          LogId 
##              0              0              0
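The decision to keep rows with missing values can be backed by a quick calculation of how much data would be lost. This sketch uses a toy table with the same shape as the result above (67 rows, 65 of them without a “Fat” value); the positions of the missing values are made up.

```r
# Toy stand-in for weight_log: 67 rows, 65 of which have no Fat value
weight_demo <- data.frame(
  Id  = seq_len(67),
  Fat = c(22, 25, rep(NA, 65))
)

na_counts    <- colSums(is.na(weight_demo))       # missing values per column
rows_with_na <- sum(!complete.cases(weight_demo)) # rows that would be dropped
share_lost   <- 100 * rows_with_na / nrow(weight_demo)
round(share_lost, 1)                              # percent of rows lost
```

Dropping incomplete rows would discard about 97% of the table, which is why keeping the rows and ignoring the “Fat” column is the less destructive option.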

# Looking for duplicate rows
# List of datasets
datasets <- list(
  daily_activity = daily_activity,
  daily_calories = daily_calories,
  daily_intensities = daily_intensities,
  daily_steps = daily_steps,
  hourly_calories = hourly_calories,
  hourly_intensities = hourly_intensities,
  hourly_steps = hourly_steps,
  minute_intensities = minute_intensities,
  minute_sleep = minute_sleep,
  sleep_day = sleep_day,
  weight_log = weight_log
)

# Check for duplicate rows in each dataset
for (name in names(datasets)) {
  cat("\nChecking for duplicate rows in:", name, "\n")
  duplicates <- sum(duplicated(datasets[[name]]))
  cat("Number of duplicate rows:", duplicates, "\n")
}

## 
## Checking for duplicate rows in: daily_activity 
## Number of duplicate rows: 0 
## 
## Checking for duplicate rows in: daily_calories 
## Number of duplicate rows: 0 
## 
## Checking for duplicate rows in: daily_intensities 
## Number of duplicate rows: 0 
## 
## Checking for duplicate rows in: daily_steps 
## Number of duplicate rows: 0 
## 
## Checking for duplicate rows in: hourly_calories 
## Number of duplicate rows: 0 
## 
## Checking for duplicate rows in: hourly_intensities 
## Number of duplicate rows: 0 
## 
## Checking for duplicate rows in: hourly_steps 
## Number of duplicate rows: 0 
## 
## Checking for duplicate rows in: minute_intensities 
## Number of duplicate rows: 0 
## 
## Checking for duplicate rows in: minute_sleep 
## Number of duplicate rows: 543 
## 
## Checking for duplicate rows in: sleep_day 
## Number of duplicate rows: 3 
## 
## Checking for duplicate rows in: weight_log 
## Number of duplicate rows: 0

# Found 3 duplicates in 'sleep_day'
# Identify duplicates
duplicates <- sleep_day[duplicated(sleep_day) | duplicated(sleep_day, fromLast = TRUE), ]

# Print the duplicate rows to review them before removal
print(duplicates)

# Remove duplicate rows, keeping only the first occurrence
sleep_day <- sleep_day[!duplicated(sleep_day), ]

# Found 543 duplicates in 'minute_sleep'
# Identify duplicates
duplicates <- minute_sleep[duplicated(minute_sleep) | duplicated(minute_sleep, fromLast = TRUE), ]

# Print the duplicate rows to review them before removal
print(duplicates)

# Remove duplicate rows, keeping only the first occurrence
minute_sleep <- minute_sleep[!duplicated(minute_sleep), ]

# Running the heartrate_seconds check separately to reduce the likelihood of a crash
# Function that counts duplicates chunk by chunk (duplicates that span two chunks are not detected)
count_duplicates_in_chunks <- function(data, chunk_size = 100000) {
  num_chunks <- ceiling(nrow(data) / chunk_size)
  total_duplicates <- 0
  
  for (i in 1:num_chunks) {
    chunk <- data[((i - 1) * chunk_size + 1):min(i * chunk_size, nrow(data)), ]
    total_duplicates <- total_duplicates + sum(duplicated(chunk))
  }
  
  return(total_duplicates)
}

# Convert heartrate_seconds to data.table
heartrate_seconds_dt <- as.data.table(heartrate_seconds)

# Count duplicates using chunk processing
num_duplicates <- count_duplicates_in_chunks(heartrate_seconds_dt)

# Print the number of duplicates
cat("Number of duplicate rows in heartrate_seconds:", num_duplicates, "\n")

## Number of duplicate rows in heartrate_seconds: 0
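One caveat of counting duplicates chunk by chunk is that two identical rows in different chunks are never compared. The toy example below illustrates this; `count_within_chunks` is a simplified stand-in written for this illustration, not the function used above.

```r
# Toy data: rows 1 and 4 are identical, but with chunk_size = 2 they fall
# into different chunks (rows 1-2, rows 3-4, row 5).
toy <- data.frame(Id = c(1, 2, 3, 1, 4), Bpm = c(70, 80, 90, 70, 60))

count_within_chunks <- function(data, chunk_size) {
  starts <- seq(1, nrow(data), by = chunk_size)
  total <- 0
  for (s in starts) {
    rows <- s:min(s + chunk_size - 1, nrow(data))
    total <- total + sum(duplicated(data[rows, , drop = FALSE]))
  }
  total
}

chunked_count <- count_within_chunks(toy, chunk_size = 2)
full_count    <- sum(duplicated(toy))

chunked_count  # 0: the cross-chunk pair is missed
full_count     # 1: a whole-table check finds it
```

The zero reported for ‘heartrate_seconds’ therefore rules out duplicates within each block of 100,000 rows. A whole-table check such as `sum(duplicated(heartrate_seconds_dt))` would be needed to rule them out across blocks.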

# Merge multiple datasets (daily_activity, daily_intensities, daily_calories, daily_steps, sleep_day) into single dataset named 'daily_combined' using common columns "Id" and "ActivityDate"
# Removes any duplicate columns that have a ".y" suffix, ensuring only the original columns are retained in the daily_combined dataset.
daily_combined <- daily_activity %>%
  left_join(daily_intensities, by = c("Id", "ActivityDate")) %>%
  left_join(daily_calories, by = c("Id", "ActivityDate")) %>%
  left_join(daily_steps, by = c("Id", "ActivityDate")) %>%
  left_join(sleep_day, by = c("Id", "ActivityDate")) %>%
  select(-contains(".y"))

# Remove ".x" suffix from column names
colnames(daily_combined) <- gsub("\\.x$", "", colnames(daily_combined))

# Merge hourly datasets using common columns "Id" and "ActivityHour" and named 'hourly_combined'
# Remove any duplicate columns that have a ".y" suffix, ensuring only the original columns are retained in the hourly_combined dataset.
hourly_combined <- hourly_calories %>%
  left_join(hourly_intensities, by = c("Id", "ActivityHour")) %>%
  left_join(hourly_steps, by = c("Id", "ActivityHour")) %>%
  select(-contains(".y"))
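A left join keeps every row of the left table and fills unmatched columns with `NA`. The sketch below shows that behavior on toy data, using base R's `merge()` with `all.x = TRUE` as a stand-in for `left_join()` so that it runs without extra packages. The values are made up.

```r
activity <- data.frame(
  Id = c(1, 1, 2),
  ActivityDate = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  Calories = c(1985, 1797, 2100)
)
sleep <- data.frame(
  Id = c(1, 2),
  ActivityDate = as.Date(c("2016-04-12", "2016-04-12")),
  TotalMinutesAsleep = c(327, 400)
)

# all.x = TRUE keeps every activity row, like a left join
combined <- merge(activity, sleep, by = c("Id", "ActivityDate"), all.x = TRUE)

# A left join should not add rows unless the right table repeats a key
stopifnot(nrow(combined) == nrow(activity))

# Days without a sleep record are kept and filled with NA
missing_sleep <- sum(is.na(combined$TotalMinutesAsleep))
missing_sleep
```

Comparing the row count before and after the join is a quick way to catch repeated keys in the right-hand table, which would silently inflate the merged dataset.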

Phase 4: Analysis and Insights

User Engagement Patterns

The data measures four primary metrics: Physical Activity, Sleep, Heart Rate, and Weight.
Among these metrics, physical activity was tracked by the most users, while weight and heart rate were tracked by the fewest. This is an opportunity to investigate why these metrics are tracked less and how to encourage more users to record them.

# Calculate unique users for physical activity
activity_users <- daily_combined %>%
  summarise(Users = n_distinct(Id)) %>%
  mutate(Metric = "Physical Activity")

# Calculate unique users for heart rate
heartrate_users <- heartrate_seconds %>%
  summarise(Users = n_distinct(Id)) %>%
  mutate(Metric = "Heart Rate")

# Calculate unique users for sleep
sleep_users <- sleep_day %>%
  summarise(Users = n_distinct(Id)) %>%
  mutate(Metric = "Sleep")

# Calculate unique users for weight
weight_users <- weight_log %>%
  summarise(Users = n_distinct(Id)) %>%
  mutate(Metric = "Weight")

# Combine all user data into a single dataframe
user_data <- bind_rows(activity_users, heartrate_users, sleep_users, weight_users)

# Calculate percentages
total_users <- 33  # Total number of users
user_data <- user_data %>%
  mutate(Percentage = (Users / total_users) * 100)

# Create the horizontal bar chart with custom colors and annotations
ggplot(user_data, aes(x = Percentage, y = Metric, fill = Metric)) +
  geom_bar(stat = "identity", color = "lightgrey") +
  scale_fill_manual(values = c("Physical Activity" = "#ff6666", 
                               "Heart Rate" = "#ffb6b9", 
                               "Sleep" = "#FF8967", 
                               "Weight" = "#ffe6e6")) +
  geom_text(aes(label = paste0(round(Percentage, 1), "%")), 
            position = position_stack(vjust = 0.5), 
            color = "black", 
            size = 3) +
  theme_minimal() +
  labs(title = "Percentage of Users Tracking Each Metric",
       x = "Percentage of Users",
       y = "Metric Type") +
  scale_x_continuous(labels = scales::percent_format(scale = 1)) +
  theme(legend.position = "none",
        panel.grid.major = element_blank(),  
        panel.grid.minor = element_blank(),  
        plot.margin = margin(t = 20, r = 10, b = 10, l = 10, unit = "pt"))
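The chart code above divides by a hard-coded total of 33 users. As a hedged alternative, the denominator can be derived from the data itself. The sketch below uses made-up Id vectors standing in for the Id columns of the four tables.

```r
# Toy Id vectors standing in for the Id columns of the four tables
ids <- list(
  "Physical Activity" = c(1, 1, 2, 3, 4),
  "Sleep"             = c(1, 2, 2, 3),
  "Heart Rate"        = c(1, 2),
  "Weight"            = c(4)
)

# Distinct users per metric
users_per_metric <- sapply(ids, function(x) length(unique(x)))

# Total users derived from the union of all Ids instead of typed in
total_users <- length(unique(unlist(ids)))

percentage <- round(100 * users_per_metric / total_users, 1)
percentage
```

Deriving the total keeps the percentages correct if the set of participants changes, for example after filtering.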

The high percentage of “Consistent Use” participants suggests that the device/platform is highly effective in engaging users regularly. This provides a solid foundation for both marketing the product and developing new features.
At the same time, the presence of “Moderate Use” and “Minimal Use” users indicates an opportunity to further refine and personalize the user experience to increase engagement across the board.

# Calculate the number of unique days each user has recorded data
user_days <- daily_combined %>%
  group_by(Id) %>%
  summarise(NumberOfDays = n_distinct(ActivityDate))

# Categorize users based on the number of days they recorded data
user_days <- user_days %>%
  mutate(UsageCategory = case_when(
    NumberOfDays >= 1 & NumberOfDays <= 10 ~ "Minimal Use 1-10 Days",
    NumberOfDays >= 11 & NumberOfDays <= 20 ~ "Moderate Use 11-20 Days",
    NumberOfDays >= 21 & NumberOfDays <= 31 ~ "Consistent Use 21-31 Days"
  ))

# Adjust the order of the UsageCategory factor levels
user_days <- user_days %>%
  mutate(UsageCategory = factor(UsageCategory, levels = c("Consistent Use 21-31 Days", "Moderate Use 11-20 Days", "Minimal Use 1-10 Days")))

# Create a summary dataset for the pie chart
usage_summary <- user_days %>%
  count(UsageCategory) %>%
  mutate(Percentage = (n / sum(n)) * 100)

# Calculate label positions
usage_summary <- usage_summary %>%
  arrange(desc(UsageCategory)) %>%
  mutate(ypos = cumsum(Percentage) - 0.5 * Percentage)

# Create the pie chart with outside annotations
ggplot(usage_summary, aes(x = "", y = Percentage, fill = UsageCategory)) +
  geom_bar(width = 1, stat = "identity", color = "lightgrey") +
  coord_polar(theta = "y") +
  scale_fill_manual(values = c("Minimal Use 1-10 Days" = "#fff5f5", 
                               "Moderate Use 11-20 Days" = "#ffd9d9", 
                               "Consistent Use 21-31 Days" = "#FF8967")) +
  geom_text(aes(y = ypos, label = paste0(round(Percentage, 2), "%")), 
            color = "black", 
            size = 3.5, 
            nudge_x = 0.7) +  # Adjust position to place labels outside
  labs(title = "Usage Rate of Participants",
       x = "",
       y = "") +
  theme_minimal() +
  theme(axis.text.x = element_blank(),  # Remove x-axis text
        axis.ticks = element_blank(),   # Remove axis ticks
        panel.grid = element_blank(),   # Remove grid lines
        axis.title.x = element_blank(), # Remove x-axis title
        axis.title.y = element_blank()) # Remove y-axis title

## # A tibble: 1 × 4
##   AverageDays MinDays MaxDays PercentageOfTotalDays
##         <dbl>   <int>   <int>                 <dbl>
## 1        28.5       4      31                  91.9

On average, participants recorded data on 28.5 of the 31 days, meaning they used their devices on 91.9% of days.
Only one user recorded as few as 4 days; the next-lowest participant recorded 16 days.

# Calculate the number of unique days each user has recorded data
user_days <- daily_combined %>%
  group_by(Id) %>%
  summarise(NumberOfDays = n_distinct(ActivityDate))


# Calculate the average, minimum, and maximum number of days across all users
summary_days <- user_days %>%
  summarise(
    AverageDays = mean(NumberOfDays, na.rm = TRUE),
    MinDays = min(NumberOfDays, na.rm = TRUE),
    MaxDays = max(NumberOfDays, na.rm = TRUE)
  )

# Add the percentage column to the summary_days tibble
summary_days <- summary_days %>%
  mutate(PercentageOfTotalDays = (AverageDays / 31) * 100)

# Print the updated summary statistics
print(summary_days)

Activity Level Categories

Fitabase categorizes activity levels into four groups: Sedentary, Lightly Active, Fairly Active, and Very Active.
According to this LiveStrong article, the “10,000 Steps Project” suggests five levels of activity. I combined the last two levels into the Very Active category (10,000 steps and above). By averaging the Daily Total Steps measured for each participant, I was able to categorize each participant into their respective Activity Level.
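
The step thresholds described above can be sketched as a small helper. This is a minimal illustration, not the code used for the chart: `classify_steps` is a hypothetical name, the sample values are made up, and base R's `cut()` is used with half-open intervals so that non-integer averages (for example 7,499.5) still land in a category.

```r
# Hypothetical helper: map average daily steps to the four categories used here.
classify_steps <- function(avg_steps) {
  cut(
    avg_steps,
    breaks = c(-Inf, 5000, 7500, 10000, Inf),
    labels = c("Sedentary", "LightlyActive", "FairlyActive", "VeryActive"),
    right = FALSE  # intervals are [lower, upper), so exactly 7500 is FairlyActive
  )
}

# Made-up averages, including a fractional value just below a threshold
example_steps <- c(4200, 7499.5, 7500, 12000)
step_categories <- classify_steps(example_steps)
print(data.frame(example_steps, step_categories))
```

Because the result is already an ordered set of factor levels, no separate re-leveling step is needed before plotting.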

Even Distribution

The distribution across categories is relatively even, with no single category dominating the user base.
This suggests that users have varied levels of activity, and the platform caters to a diverse group of individuals with different activity levels.

Middle Range Dominance

The middle categories (Lightly Active and Fairly Active) each account for 27.3% of users, or 54.6% combined.
This indicates that just over half of users fall into moderate activity levels, which could be targeted for further engagement and improvement.

Opportunities for Engagement

Users in the Sedentary category could be targeted with personalized interventions and encouragement to increase their activity levels. Reward systems could be utilized to encourage movement.
Users in the Very Active category might be interested in advanced features or challenges to maintain their high activity levels.

# Calculate the average for each column for each unique Id
avg_daily_combined <- daily_combined %>%
  group_by(Id) %>%
  summarise(
    AvgStepTotal = mean(StepTotal, na.rm = TRUE),
    AvgTotalDistance = mean(TotalDistance, na.rm = TRUE),
    AvgVeryActiveMinutes = mean(VeryActiveMinutes, na.rm = TRUE),
    AvgFairlyActiveMinutes = mean(FairlyActiveMinutes, na.rm = TRUE),
    AvgLightlyActiveMinutes = mean(LightlyActiveMinutes, na.rm = TRUE),
    AvgSedentaryMinutes = mean(SedentaryMinutes, na.rm = TRUE),
    AvgCalories = mean(Calories, na.rm = TRUE),
    AvgTotalMinutesAsleep = mean(TotalMinutesAsleep, na.rm = TRUE),
    AvgTotalTimeInBed = mean(TotalTimeInBed, na.rm = TRUE)
  )

# Group users into categories based on the adjusted 'AvgStepTotal' definitions
categorized_data <- avg_daily_combined %>%
  mutate(
    ActivityCategory = case_when(
      # Half-open ranges: averages are rarely whole numbers, so "<= 7499" would leave 7499.5 as NA
      AvgStepTotal < 5000 ~ "Sedentary",
      AvgStepTotal >= 5000 & AvgStepTotal < 7500 ~ "LightlyActive",
      AvgStepTotal >= 7500 & AvgStepTotal < 10000 ~ "FairlyActive",
      AvgStepTotal >= 10000 ~ "VeryActive"
    )
  )

# Ensure the categories are ordered correctly
categorized_data$ActivityCategory <- factor(categorized_data$ActivityCategory, 
                                            levels = c("Sedentary", "LightlyActive", "FairlyActive", "VeryActive"))

# Create a summary dataset for the pie chart
pie_data <- categorized_data %>%
  count(ActivityCategory) %>%
  mutate(Percentage = n / sum(n) * 100)

# Define the custom colors
custom_colors <- c("Sedentary" = "#fff5f5", 
                   "LightlyActive" = "#ffd9d9", 
                   "FairlyActive" = "#FF8967", 
                   "VeryActive" = "#ff6666")

# Create the pie chart
ggplot(pie_data, aes(x = "", y = Percentage, fill = ActivityCategory)) +
  geom_bar(width = 1, stat = "identity", color = "lightgrey") +
  coord_polar(theta = "y") +
  scale_fill_manual(values = custom_colors) +
  geom_text(aes(label = paste0(round(Percentage, 2), "%")), 
            position = position_stack(vjust = 0.5), 
            color = "black", 
            size = 3) +
  labs(title = "Percentage of Users per Activity Category") +
  theme_minimal() +
  theme(axis.text.x = element_blank(),  # Remove x-axis text
        axis.ticks = element_blank(),   # Remove axis ticks
        panel.grid = element_blank(),   # Remove grid lines
        axis.title.x = element_blank(), # Remove x-axis title
        axis.title.y = element_blank()) # Remove y-axis title

Time Distribution Across Activity Levels

The majority of each day is spent at the sedentary level, accounting for 84.1% or 19 hours and 49 minutes.
This likely comprises about 7 hours in bed, 8-10 hours of working and commuting, and possibly an hour or two of watching TV.
Lightly Active time accounts for 14.4% or about 3 hours and 26 minutes, likely spent on routine daily tasks.
Very Active time is 1.97% or 28 minutes, potentially spent on vigorous exercise.
Lastly, 1.43%, or 20 minutes, is spent in the Fairly Active category, which could also involve exercise.
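
One caveat: the four shares quoted above add up to slightly more than 100%. A plausible reason is that the per-day counts only contain a row for an activity level when at least one minute was logged at that level, so days with zero minutes are left out of the averages instead of counting as zero. The sketch below shows one way to fill in those zeros before averaging. It uses base R and made-up numbers so that it stands alone; `daily_toy`, `avg_dropped`, and `avg_filled` are illustrative names, not objects from the analysis.

```r
# Made-up per-day counts: the user logs no VeryActive minutes on the second day
daily_toy <- data.frame(
  Id = c(1, 1, 1),
  Date = as.Date(c("2016-04-12", "2016-04-12", "2016-04-13")),
  ActivityLevel = c("Sedentary", "VeryActive", "Sedentary"),
  MinutesSpent = c(1400, 40, 1440)
)

levels_all <- c("Sedentary", "LightlyActive", "FairlyActive", "VeryActive")

# Build every (Id, Date, ActivityLevel) combination; merge() with no shared
# columns returns the cross join
days <- unique(daily_toy[, c("Id", "Date")])
grid <- merge(days, data.frame(ActivityLevel = levels_all))

# Attach the observed minutes and treat missing combinations as zero minutes
filled <- merge(grid, daily_toy, by = c("Id", "Date", "ActivityLevel"), all.x = TRUE)
filled$MinutesSpent[is.na(filled$MinutesSpent)] <- 0

# Averages with and without the zero days
avg_dropped <- aggregate(MinutesSpent ~ ActivityLevel, data = daily_toy, FUN = mean)
avg_filled  <- aggregate(MinutesSpent ~ ActivityLevel, data = filled, FUN = mean)

print(avg_dropped)  # VeryActive averages 40 minutes (only the active day counts)
print(avg_filled)   # VeryActive averages 20 minutes (the zero day is included)
```

With the zeros included, the per-level averages in this toy example sum to exactly 1,440 minutes, the length of a day.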

# Convert to data.table
setDT(minute_intensities)

# Get the range of numbers in the Intensity column
intensity_range <- minute_intensities[, .(min_intensity = min(Intensity, na.rm = TRUE), 
                                          max_intensity = max(Intensity, na.rm = TRUE))]

# Categorize activity levels
minute_intensities[, ActivityLevel := fifelse(Intensity == 0, "Sedentary",
                                              fifelse(Intensity == 1, "LightlyActive",
                                                      fifelse(Intensity == 2, "FairlyActive",
                                                              fifelse(Intensity == 3, "VeryActive", "Other"))))]

# Extract date from ActivityMinutes column
minute_intensities[, Date := as.Date(ActivityMinutes)]

# Calculate the amount of minutes spent in each ActivityLevel per user per day
daily_activity_levels <- minute_intensities[, .(MinutesSpent = .N), by = .(Id, Date, ActivityLevel)]

# Calculate the total minutes spent per user per day
total_minutes_per_day <- daily_activity_levels[, .(TotalMinutesSpent = sum(MinutesSpent, na.rm = TRUE)), by = .(Id, Date)]

# Merge to get the percentage of minutes spent in each activity level
daily_activity_levels <- merge(daily_activity_levels, total_minutes_per_day, by = c("Id", "Date"))
daily_activity_levels[, Percentage := (MinutesSpent / TotalMinutesSpent) * 100]

# Calculate the average percentage per activity level
average_percentage_per_level <- daily_activity_levels[, .(AvgPercentage = mean(Percentage, na.rm = TRUE)), by = .(Id, ActivityLevel)]

# Calculate average minutes spent per activity level per user
average_minutes_per_level <- daily_activity_levels[, .(AvgMinutesSpent = mean(MinutesSpent, na.rm = TRUE)), by = .(Id, ActivityLevel)]

# Merge average percentage back to the average_minutes_per_level dataset
average_minutes_per_level <- merge(average_minutes_per_level, average_percentage_per_level, by = c("Id", "ActivityLevel"))

# Add a column to convert minutes to hours
average_minutes_per_level[, AvgHoursSpent := AvgMinutesSpent / 60]

# Summarize the average minutes, percentage, and hours spent at each activity level across users
# (mean, not sum: summing per-user averages would scale the totals by the number of users)
summary_table <- average_minutes_per_level[, .(
  TotalAvgMinutesSpent = mean(AvgMinutesSpent, na.rm = TRUE),
  TotalAvgPercentage = mean(AvgPercentage, na.rm = TRUE),
  TotalAvgHoursSpent = mean(AvgHoursSpent, na.rm = TRUE)
), by = ActivityLevel]

# Keep the columns in a fixed order for the chart and the printed table
summary_table <- summary_table[, .(
  ActivityLevel,
  TotalAvgMinutesSpent,
  TotalAvgPercentage,
  TotalAvgHoursSpent
)]

# Define custom colors for the Activity Levels
activity_colors <- c(
  "Sedentary" = "#fff5f5",
  "LightlyActive" = "#ffd9d9",
  "FairlyActive" = "#FF8967",
  "VeryActive" = "#FF6666"
)

# Create labels for the legend
summary_table[, legend_label := paste0(ActivityLevel, " (", round(TotalAvgPercentage, 2), "%)")]

# Create the pie chart using ggplot2
ggplot(summary_table, aes(x = "", y = TotalAvgPercentage, fill = ActivityLevel)) +
  geom_bar(width = 1, stat = "identity", color = "lightgrey") +
  coord_polar(theta = "y") +
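  # Name the labels so each one is matched to its own level rather than by row order
  scale_fill_manual(values = activity_colors,
                    labels = setNames(summary_table$legend_label, summary_table$ActivityLevel)) +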
  labs(title = "Percentage of Time Spent in Each Activity Level") +
  theme_minimal() +
  theme(
    axis.text.x = element_blank(),  # Remove x-axis text
    axis.ticks = element_blank(),   # Remove axis ticks
    panel.grid = element_blank(),   # Remove grid lines
    axis.title.x = element_blank(), # Remove x-axis title
    axis.title.y = element_blank(), # Remove y-axis title
    legend.position = "right"       # Position legend to the right
  ) +
  guides(fill = guide_legend(title = "Activity Level"))

## # A tibble: 4 × 4
##   ActivityLevel TotalAvgMinutesSpent TotalAvgPercentage TotalAvgHoursSpent
##   <chr>                        <dbl>              <dbl>              <dbl>
## 1 FairlyActive                  20.4               1.43              0.339
## 2 LightlyActive                206.               14.4               3.43 
## 3 Sedentary                   1190.               84.1              19.8  
## 4 VeryActive                    28.3               1.97              0.471

# Summarize the total average minutes spent, average percentage, and average hours spent for each activity level
summary_table <- average_minutes_per_level[, .(
  TotalAvgMinutesSpent = mean(AvgMinutesSpent, na.rm = TRUE),
  TotalAvgPercentage = mean(AvgPercentage, na.rm = TRUE),
  TotalAvgHoursSpent = mean(AvgHoursSpent, na.rm = TRUE)
), by = ActivityLevel]

# Print the summary table
tibble(summary_table)

Activity Patterns

Peak Activity Times

High activity times range between 12 PM and 3 PM, with peak activity occurring between 5 PM and 7 PM.
Peak activity is most likely due to post-work workouts.
Send reminders or tips for physical activities during high activity times (12 PM to 3 PM) and peak times (5 PM to 7 PM). Additionally, launch fitness challenges or competitions that align with these peak activity periods to encourage participation and engagement.
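
To time reminders without reading the hours off the chart, the busiest hours can be pulled out of the hourly averages directly. The following is a rough sketch in base R on made-up values; `hourly_toy` stands in for the `average_steps_by_time` table and its numbers are purely illustrative.

```r
# Made-up hourly averages standing in for average_steps_by_time
hourly_toy <- data.frame(
  Time = sprintf("%02d:00:00", c(8L, 12L, 13L, 17L, 18L, 19L, 22L)),
  average_steps = c(420, 540, 535, 550, 600, 580, 150)
)

# Sort from busiest to quietest and keep the top three hours
top_hours <- head(hourly_toy[order(-hourly_toy$average_steps), ], 3)
print(top_hours)
```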

# Separate Date and Time
hourly_steps <- hourly_steps %>%
  mutate(
    Date = as.Date(ActivityHour),
    Time = format(ActivityHour, format = "%H:%M:%S")
  )

# Summarize average steps by time
average_steps_by_time <- hourly_steps %>%
  group_by(Time) %>%
  summarize(average_steps = mean(StepTotal, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(Time = factor(Time, levels = unique(Time)))
  
# Create the bar chart of average steps by hour
ggplot(average_steps_by_time, aes(x = Time, y = average_steps, fill = average_steps)) + 
  geom_col(color = "lightgrey") + 
  labs(title = "Peak Activity Hours", x = "Time", y = "Average Steps") + 
  scale_fill_gradient(low = "#ffe6e6", high = "#E04B4B") +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Average Steps, Distance, Calories

## # A tibble: 1 × 3
##   OverallAvgStepsPerDay OverallAvgDistancePerDay OverallAvgCaloriesPerDay
##                   <dbl>                    <dbl>                    <dbl>
## 1                 7527.                     5.41                    2283.

Steps

The overall average of about 7,527 steps per day sits right at the boundary between the Lightly Active and Fairly Active categories (7,500 steps). This is roughly 75% of the commonly recommended 10,000 steps per day, suggesting that users are relatively active but could still improve to reach the recommended goal through alerts, step challenges, encouragement, or reward systems.
Identify users with significantly lower step counts and provide them with personalized interventions or encouragement to increase their activity levels.
Segment users based on their average step counts to tailor content and features. For instance, beginners/sedentary users may need basic guidance and motivation, while more active users might benefit from advanced analytics and challenges.

Distance

It is not possible to determine with certainty whether distance is measured in kilometers or miles; however, according to the OmniCalculator site, it is most likely measured in kilometers.
In alignment with step counts, the average distance traveled could be slightly increased to healthier levels through encouragement features.

Calories

The average daily caloric expenditure for each user from the Fitabase data is about 2,283 calories, which is slightly below the average caloric burn for a moderately active person (2,437.5 calories) as referenced by WebMD. This suggests that participants in the Fitabase data may be slightly less active than the general moderately active population.
There is potential to encourage users to increase their activity levels slightly to meet or exceed the average caloric expenditure of a moderately active person. Note: This dataset is not gender-specific. Since Bellabeat is a female-focused brand, it is important to consider the average caloric expenditure of a moderately active female when utilizing this information for marketing and product improvements.
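
The gaps discussed in this section can be made concrete with a few lines of arithmetic. This sketch uses the rounded values from the printed summary (7,527 steps and 2,283 calories) together with the 10,000-step goal and the 2,437.5-calorie benchmark cited in the text; the variable names are illustrative.

```r
# Rounded values from the printed summary and the benchmarks cited in the text
avg_steps <- 7527
step_goal <- 10000
avg_calories <- 2283
benchmark_calories <- 2437.5

# How far the average user is from each benchmark
pct_of_step_goal <- avg_steps / step_goal * 100
steps_to_goal <- step_goal - avg_steps
calorie_gap <- benchmark_calories - avg_calories
calorie_gap_pct <- calorie_gap / benchmark_calories * 100

cat(sprintf("Steps: %.1f%% of the goal, %.0f steps short\n",
            pct_of_step_goal, steps_to_goal))
cat(sprintf("Calories: %.1f below the benchmark (%.1f%%)\n",
            calorie_gap, calorie_gap_pct))
```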

# Calculate the averages for each user
user_averages <- daily_combined %>%
  group_by(Id) %>%
  summarise(
    AvgStepsPerDay = mean(StepTotal, na.rm = TRUE),
    AvgDistancePerDay = mean(TotalDistance, na.rm = TRUE),
    AvgCaloriesPerDay = mean(Calories, na.rm = TRUE)
  )

# Summarize the averages across all users
overall_averages <- user_averages %>%
  summarise(
    OverallAvgStepsPerDay = mean(AvgStepsPerDay, na.rm = TRUE),
    OverallAvgDistancePerDay = mean(AvgDistancePerDay, na.rm = TRUE),
    OverallAvgCaloriesPerDay = mean(AvgCaloriesPerDay, na.rm = TRUE)
  )

# Print the overall averages
print(overall_averages)

Calories vs Steps

As expected, there is a strong positive correlation showing that as users’ physical activity (measured by steps) increases, their caloric expenditure also rises. This underscores the importance of physical activity in managing energy balance and weight.
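
The strength of this relationship can also be expressed as a number rather than read off the trendline. Below is a minimal sketch using Pearson's correlation from base R; `daily_toy` is a made-up stand-in for `daily_combined`, so the resulting value says nothing about the real data.

```r
# Made-up daily records standing in for daily_combined (one row has a missing step count)
daily_toy <- data.frame(
  StepTotal = c(2000, 4500, 6000, 8000, 10000, 12500, NA),
  Calories  = c(1700, 1950, 2100, 2300, 2500, 2900, 2200)
)

# Pearson correlation, ignoring rows where either value is missing
r <- cor(daily_toy$StepTotal, daily_toy$Calories, use = "complete.obs")

# Share of the variance in calories explained by a straight-line fit on steps
r_squared <- r^2

cat(sprintf("r = %.3f, r squared = %.3f\n", r, r_squared))
```

Running the same two lines on `daily_combined$StepTotal` and `daily_combined$Calories` would give the figure that backs up the word "strong".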

# Create the scatter plot with trendline
ggplot(daily_combined, aes(x = StepTotal, y = Calories)) +
  geom_point(color = "#FF8967", alpha = 0.5) +  # Scatter plot points
  geom_smooth(method = "lm", color = "black", se = FALSE) +  # Trendline
  labs(title = "Scatter Plot of Total Steps vs Calories",
       x = "Total Steps",
       y = "Calories") +
  theme_minimal() +  # Minimal theme
  theme(panel.grid.major = element_blank(),  # Remove major grid lines
        panel.grid.minor = element_blank(),  # Remove minor grid lines
        plot.margin = margin(t = 20, r = 20, b = 20, l = 20))  # Add margins

Daily Activity Stats

Tuesday (8125 steps) and Saturday (8153 steps) have the highest average steps, both falling into the “FairlyActive” category.
The weekend days (Saturday and Sunday) show a contrast in activity levels. Saturday is one of the most active days, while Sunday is one of the least active days.
Sunday has the lowest average steps (6933), falling into the “LightlyActive” category.
This could be an opportunity to develop, promote, and market weekend-specific fitness plans or programs that cater to both high-activity days like Saturday and low-activity days like Sunday.
Highlight the benefits of staying active over the weekend by offering specific workout challenges and promoting family-focused activities. This could broaden the appeal, encouraging users to involve their family members, potentially increasing the number of new users and enhancing overall user satisfaction and loyalty.

# Add a new column for the day of the week
daily_combined <- daily_combined %>%
  mutate(DayOfWeek = weekdays(as.Date(ActivityDate)))

# Calculate the average TotalSteps and Calories by day of the week
average_activity <- daily_combined %>%
  group_by(DayOfWeek) %>%
  summarise(
    AvgStepTotal = mean(StepTotal, na.rm = TRUE),
    AvgCalories = mean(Calories, na.rm = TRUE)
  )

# Reorder the days of the week for plotting
average_activity$DayOfWeek <- factor(average_activity$DayOfWeek, 
                                     levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))

# Define colors for steps categories
step_colors <- c("Sedentary" = "#fff5f5", 
                 "LightlyActive" = "#ffd9d9", 
                 "FairlyActive" = "#FF8967", 
                 "VeryActive" = "#ff6666")

# Categorize average steps
average_activity <- average_activity %>%
  mutate(StepCategory = case_when(
    AvgStepTotal < 5000 ~ "Sedentary",
    AvgStepTotal >= 5000 & AvgStepTotal < 7500 ~ "LightlyActive",
    AvgStepTotal >= 7500 & AvgStepTotal < 10000 ~ "FairlyActive",
    AvgStepTotal >= 10000 ~ "VeryActive"
  ))

# Plot average TotalSteps by day of the week with custom colors and horizontal lines
ggplot(average_activity, aes(x = DayOfWeek, y = AvgStepTotal, fill = StepCategory)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = step_colors) +
  geom_hline(yintercept = 5000, color = "black", linetype = "solid", linewidth = 0.25) +
  geom_hline(yintercept = 7500, color = "black", linetype = "solid", linewidth = 0.25) +
  geom_hline(yintercept = 10000, color = "black", linetype = "solid", linewidth = 0.25) +
  labs(title = "Average Total Steps by Day of the Week",
       x = "Day of the Week",
       y = "Average Total Steps") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid.major = element_blank(),  # Remove major grid lines
        panel.grid.minor = element_blank(),  # Remove minor grid lines
        plot.margin = margin(t = 20, r = 20, b = 20, l = 20, unit = "pt"))

AAP = Average Active Person Caloric Burn (2437.5)
AAF = Average Active Female Caloric Burn (2150)
Since the dataset contains no demographic data, we must assume these results represent a mixed-gender population. However, setting a daily caloric target based on gender and age could help motivate users more effectively.
The average calories burned range from about 2,200 to 2,356 per day across the week, showing a relatively consistent caloric burn.

# Plot average Calories by day of the week with reference lines and annotations
ggplot(average_activity, aes(x = DayOfWeek, y = AvgCalories)) +
  geom_bar(stat = "identity", fill = "lightgrey") +
  geom_hline(yintercept = 2437.5, color = "#ff6666", linetype = "solid", linewidth = 0.4) +
  geom_hline(yintercept = 2150, color = "#FF8967", linetype = "solid", linewidth = 0.4) +
  annotate("text", x = 4, y = 2437.5, label = "AAP", 
           color = "black", vjust = 0.5) +
  annotate("text", x = 4, y = 2150, label = "AAF", 
           color = "black", vjust = 1) +
  labs(title = "Average Calories Burned by Day of the Week",
       x = "Day of the Week",
       y = "Average Calories Burned") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid.major = element_blank(),  # Remove major grid lines
        panel.grid.minor = element_blank(),  # Remove minor grid lines
        plot.margin = margin(t = 20, r = 20, b = 20, l = 20, unit = "pt"))

Sleep Patterns

Sleep Metrics

Two metrics were recorded: TotalMinutesAsleep and TotalTimeInBed. Both were converted into hours.
On average, users spent 7 hours in bed and 6 hours 19 minutes asleep. This falls 41 minutes short of the 7-hour minimum recommended for adults by the Sleep Foundation.
This presents an opportunity to promote healthier sleep patterns by providing content and tools to help users manage their time better and develop relaxing end-of-day routines. Features can include bedtime reminders, relaxation techniques, and advice on minimizing screen time before bed.
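
The shortfall quoted above can be reproduced with a short calculation. This sketch assumes the comparison is against a 7-hour recommendation, which is what the 41-minute figure implies; `fmt_hm` is a hypothetical formatting helper.

```r
# Hypothetical helper: format a number of minutes as hours and minutes
fmt_hm <- function(minutes) {
  sprintf("%dh %02dm", as.integer(minutes %/% 60), as.integer(minutes %% 60))
}

minutes_in_bed <- 7 * 60        # 7 hours in bed, from the text
minutes_asleep <- 6 * 60 + 19   # 6 hours 19 minutes asleep, from the text
recommended_minutes <- 7 * 60   # assumption: 7-hour recommendation for adults

sleep_shortfall <- recommended_minutes - minutes_asleep
awake_in_bed <- minutes_in_bed - minutes_asleep

cat("Asleep:", fmt_hm(minutes_asleep), "\n")
cat("Short of recommendation by:", sleep_shortfall, "minutes\n")
cat("In bed but not asleep:", awake_in_bed, "minutes\n")
```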

# Calculate the number of distinct days each user recorded sleep data
user_days <- sleep_day %>%
  group_by(Id) %>%
  summarise(NumberOfDays = n_distinct(ActivityDate))

# Summarise total minutes asleep and total time in bed for each user
summary_sleep <- sleep_day %>%
  group_by(Id) %>%
  summarise(
    TotalMinutesAsleep = sum(TotalMinutesAsleep, na.rm = TRUE),
    TotalTimeInBed = sum(TotalTimeInBed, na.rm = TRUE)
  )

# Calculate averages for minutes asleep and time in bed, then convert to hours
summary_sleep <- summary_sleep %>%
  left_join(user_days, by = "Id") %>%
  mutate(
    AvgMinutesAsleep = TotalMinutesAsleep / NumberOfDays,
    AvgTimeInBed = TotalTimeInBed / NumberOfDays,
    AvgHoursAsleep = AvgMinutesAsleep / 60,
    AvgHoursInBed = AvgTimeInBed / 60
  )

# Calculate the overall average minutes asleep and time in bed
overall_averages <- summary_sleep %>%
  summarise(
    OverallAvgMinutesAsleep = mean(AvgMinutesAsleep, na.rm = TRUE),
    OverallAvgTimeInBed = mean(AvgTimeInBed, na.rm = TRUE),
    OverallAvgHoursAsleep = OverallAvgMinutesAsleep / 60,
    OverallAvgHoursInBed = OverallAvgTimeInBed / 60
  )

# Create a data frame for plotting
plot_data <- tibble(
  Metric = c("Average Hours Asleep", "Average Hours In Bed"),
  Value = c(overall_averages$OverallAvgHoursAsleep, overall_averages$OverallAvgHoursInBed),
  Color = c("#FF8967", "#ffb6b9")
)

# Plot the horizontal bar graph
ggplot(plot_data, aes(x = Value, y = Metric, fill = Color)) +
  geom_bar(stat = "identity", color = "lightgrey", width = 0.5) +
  geom_text(aes(label = sprintf("%.2f", Value)), hjust = -0.2, color = "black", size = 3.5) +  # Annotate each bar with exact numbers to 2 decimal places
  scale_fill_identity() +
  labs(title = "Overall Average Hours Asleep and In Bed",
       x = "Hours",
       y = "") +
  theme_minimal() +
  theme(
    axis.text.y = element_text(size = 10, face = "bold"),
    axis.text.x = element_text(size = 10),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    plot.title = element_text(hjust = 0.5)
  )

Currently, 72.7% of all participants utilize the sleep recording feature.
Among these users, only 50% can be categorized as consistent users of this feature, with 12.5% using it moderately and over a third (37.5%) using it minimally.
On average, users recorded their sleep data only 17 out of 31 days, or 55% of the time.
This lower-than-desired usage could be due to users charging their devices overnight; further exploration is needed.

# Calculate the number of distinct days each user recorded sleep data
user_days <- sleep_day %>%
  group_by(Id) %>%
  summarise(NumberOfDays = n_distinct(ActivityDate))

# Categorize users based on the number of days they recorded data
user_days <- user_days %>%
  mutate(UsageCategory = case_when(
    NumberOfDays >= 1 & NumberOfDays <= 10 ~ "Minimal Use 1-10 Days",
    NumberOfDays >= 11 & NumberOfDays <= 20 ~ "Moderate Use 11-20 Days",
    NumberOfDays >= 21 & NumberOfDays <= 31 ~ "Consistent Use 21-31 Days"
  ))

# Adjust the order of the UsageCategory factor levels
user_days <- user_days %>%
  mutate(UsageCategory = factor(UsageCategory, levels = c("Consistent Use 21-31 Days", "Moderate Use 11-20 Days", "Minimal Use 1-10 Days")))


# Create a summary dataset for the pie chart
usage_summary <- user_days %>%
  count(UsageCategory) %>%
  mutate(Percentage = (n / sum(n)) * 100)

# Calculate label positions
usage_summary <- usage_summary %>%
  arrange(desc(UsageCategory)) %>%
  mutate(ypos = cumsum(Percentage) - 0.5 * Percentage)

# Create the pie chart with outside annotations
ggplot(usage_summary, aes(x = "", y = Percentage, fill = UsageCategory)) +
  geom_bar(width = 1, stat = "identity", color = "lightgrey") +
  coord_polar(theta = "y") +
  scale_fill_manual(values = c("Minimal Use 1-10 Days" = "#fff5f5", 
                               "Moderate Use 11-20 Days" = "#ffd9d9", 
                               "Consistent Use 21-31 Days" = "#FF8967")) +
  geom_text(aes(y = ypos, label = paste0(round(Percentage, 2), "%")), 
            color = "black", 
            size = 3.5, 
            nudge_x = 0.7) +  # Adjust position to place labels outside
  labs(title = "Usage Rate of Sleep Feature",
       x = "",
       y = "") +
  theme_minimal() +
  theme(axis.text.x = element_blank(),  # Remove x-axis text
        axis.ticks = element_blank(),   # Remove axis ticks
        panel.grid = element_blank(),   # Remove grid lines
        axis.title.x = element_blank(), # Remove x-axis title
        axis.title.y = element_blank()) # Remove y-axis title

Sleep Quality Vs Physical Activity

Interestingly, there is a slight negative correlation between physical exertion and sleep duration, which contradicts the advice from the Sleep Foundation.
However, this could be due to other factors not recorded in the dataset, including diet, screen time, TV habits, water intake, stress, and work/life commitments.

# Create the scatter plot with trendline
ggplot(daily_combined, aes(x = StepTotal, y = TotalMinutesAsleep)) +
  geom_point(color = "#FF8967", alpha = 0.5) +  # Scatter plot points
  geom_smooth(method = "lm", color = "black", se = FALSE) +  # Trendline
  labs(title = "Scatter Plot of Total Steps vs Total Minutes Asleep",
       x = "Step Total",
       y = "Total Minutes Asleep") +
  theme_minimal() +  # Minimal theme
  theme(panel.grid.major = element_blank(),  # Remove major grid lines
        panel.grid.minor = element_blank(),  # Remove minor grid lines
        plot.margin = margin(t = 20, r = 20, b = 20, l = 20))  # Add margins

Heart Rate

Only 14 users (42.4%) utilized this feature.
The average bpm of 79.98 falls within the normal resting heart rate range (60-100 bpm), although this average covers all recorded readings, not only resting ones.
The maximum heart rate recorded was 203 bpm. A commonly used formula estimates maximum heart rate as 220 minus age; since the dataset contains no ages, this reading cannot be checked against any individual's expected maximum.
The lowest heart rate recorded was 36 bpm, which is lower than that of a well-trained athlete (40 bpm).
All general healthy heart rate ranges are referenced from the Mayo Clinic and Heart Online.
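
The "220 minus age" rule of thumb mentioned above is simple enough to sketch directly. The function names below are hypothetical, and the formula is only a rough population-level estimate, not a measurement for any individual.

```r
# Rule-of-thumb estimate of maximum heart rate (bpm) from age in years
estimate_max_hr <- function(age) {
  220 - age
}

# Inverse of the same rule: the age at which a given reading equals the estimated maximum
implied_age <- function(max_hr) {
  220 - max_hr
}

estimated_max <- estimate_max_hr(c(20, 30, 40))  # made-up example ages
age_for_203 <- implied_age(203)                  # the maximum reading in the dataset

print(estimated_max)
print(age_for_203)
```

Under this rule, the recorded maximum of 203 bpm matches the estimated maximum of a 17-year-old, which is one way to frame how high that reading is.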

# Summarize max, min, and average bpm by Id
bpm_summary <- heartrate_seconds_dt %>%
  group_by(Id) %>%
  summarize(
    max_bpm = max(Bpm, na.rm = TRUE),
    min_bpm = min(Bpm, na.rm = TRUE),
    average_bpm = mean(Bpm, na.rm = TRUE)
  )

# Overall Bpm Summary
overall_bpm_summary <- bpm_summary %>%
  summarize(
    max_bpm = max(max_bpm, na.rm = TRUE),
    min_bpm = min(min_bpm, na.rm = TRUE),
    average_bpm = mean(average_bpm, na.rm = TRUE)
  )

# Create a summary table
plot_data <- data.frame(
  Metric = c("Max Bpm", "Min Bpm", "Average Bpm"),
  Value = c(overall_bpm_summary$max_bpm,
            overall_bpm_summary$min_bpm,
            overall_bpm_summary$average_bpm)
)

# Create the bar plot
ggplot(plot_data, aes(x = Metric, y = Value, fill = Metric)) +
  geom_bar(stat = "identity", color = "lightgrey") +
  geom_text(aes(label = round(Value, 2)), vjust = -0.5, color = "black", size = 3) +  # Adjust size here
  scale_fill_manual(values = c("Max Bpm" = "#ff6666", "Min Bpm" = "#ffe6e6", "Average Bpm" = "#FF8967")) +
  labs(title = "Overall Bpm Summary", x = "", y = "Bpm") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 12, face = "bold"),
        axis.text.y = element_text(size = 10),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        plot.title = element_text(hjust = 0.5)) +
  ylim(0, max(plot_data$Value) * 1.1)  # Increase y-axis height

Weight Insights

This feature is utilized by the smallest group of users, accounting for 24.2% or 8 people.
Those who used this feature fall into a diverse weight range, from 52.6 kg (116 lbs) to 134 kg (294 lbs).
The average weight is 77.8 kg (171.54 lbs), and the average BMI is 28.
Since the average BMI of 28 falls into the overweight category, as referenced by the CDC, providing gentle support towards achieving a healthier range is recommended.
The BMI values range from 21.5 to 47.5. Creating and providing segmented advice based on BMI categories could significantly enhance user engagement and the utilization of this feature.
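
Segmenting advice by BMI category could start from a helper like the one below. It is a sketch that assumes the standard adult BMI cut-offs published by the CDC (below 18.5, 18.5 to under 25, 25 to under 30, and 30 or above); `bmi_category` is a hypothetical name, and the inputs are the minimum, average, and maximum BMI reported in this section.

```r
# Hypothetical helper: assign adult BMI values to the standard CDC categories
bmi_category <- function(bmi) {
  cut(
    bmi,
    breaks = c(-Inf, 18.5, 25, 30, Inf),
    labels = c("Underweight", "Healthy weight", "Overweight", "Obesity"),
    right = FALSE  # intervals are [lower, upper), so exactly 25 is Overweight
  )
}

# Minimum, average, and maximum BMI reported above
reported_bmi <- c(21.5, 28, 47.5)
bmi_groups <- bmi_category(reported_bmi)
print(data.frame(reported_bmi, bmi_groups))
```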

# Summarize max, min, and average WeightKg, WeightPounds, and BMI by Id
weight_summary <- weight_log %>%
  group_by(Id) %>%
  summarize(
    max_kg = max(WeightKg, na.rm = TRUE),
    min_kg = min(WeightKg, na.rm = TRUE),
    avg_kg = mean(WeightKg, na.rm = TRUE),
    max_lb = max(WeightPounds, na.rm = TRUE),
    min_lb = min(WeightPounds, na.rm = TRUE),
    avg_lb = mean(WeightPounds, na.rm = TRUE),
    max_BMI = max(BMI, na.rm = TRUE),
    min_BMI = min(BMI, na.rm = TRUE),
    avg_BMI = mean(BMI, na.rm = TRUE)
  )

# Overall weight and BMI summary
overall_weight_summary <- weight_summary %>%
  summarize(
    max_kg = max(max_kg, na.rm = TRUE),
    min_kg = min(min_kg, na.rm = TRUE),
    avg_kg = mean(avg_kg, na.rm = TRUE),
    max_lb = max(max_lb, na.rm = TRUE),
    min_lb = min(min_lb, na.rm = TRUE),
    avg_lb = mean(avg_lb, na.rm = TRUE),
    max_BMI = max(max_BMI, na.rm = TRUE),
    min_BMI = min(min_BMI, na.rm = TRUE),
    avg_BMI = mean(avg_BMI, na.rm = TRUE)
  )

# Create a summary table for the plot
plot_data <- data.frame(
  Metric = rep(c("Weight (kg)", "Weight (lb)", "BMI"), each = 3),
  Statistic = rep(c("Max", "Min", "Average"), 3),
  Value = c(overall_weight_summary$max_kg, overall_weight_summary$min_kg, overall_weight_summary$avg_kg,
            overall_weight_summary$max_lb, overall_weight_summary$min_lb, overall_weight_summary$avg_lb,
            overall_weight_summary$max_BMI, overall_weight_summary$min_BMI, overall_weight_summary$avg_BMI)
)

# Create the grouped bar plot
ggplot(plot_data, aes(x = Metric, y = Value, fill = Statistic)) +
  geom_bar(stat = "identity", position = "dodge", color = "lightgrey") +
  geom_text(aes(label = round(Value, 2)), position = position_dodge(width = 0.9), vjust = -0.5, color = "black", size = 3) + 
  scale_fill_manual(values = c("Max" = "#ff6666", "Min" = "#ffe6e6", "Average" = "#FF8967")) +
  labs(title = "Overall Weight and BMI Summary", x = "Metric", y = "Value") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 12, face = "bold", angle = 45, hjust = 1),
        axis.text.y = element_text(size = 10),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        plot.title = element_text(hjust = 0.5)) +
  ylim(0, max(plot_data$Value) * 1.1)  # Increase y-axis height

Phase 5: Share and Act

User Engagement:

Insight: Users highly value tracking their physical activity, showing effective engagement among consistent users. However, there are significant opportunities to further engage moderate and minimal use users.