The BellaBeat Smart Wellness Tracker

Osorio O. Matucurane

Overview

  1. The Bellabeat Wellness Smart Tracker

  2. Business Problem Statement

  3. Ascertaining the Data Quality

  4. Data Exploration with R Programming

  5. Summarizing the Results

  6. Communicate the Findings with GGPLOT2

  7. Final Considerations

BellaBeat - Wellness and Readiness Tracker

  • Bellabeat is a smart-tech wellness brand dedicated to women’s health

  • An absolute “Health & wellness game changer”

  • Fashionable health trackers designed and engineered for women

  • Versatile: it can be worn on the wrist, around the neck, or clipped onto a collar or clothes

Bellabeat Smart Jewelry

The Smart Tracker Regarded Best in the Market

  • The Bellabeat fashionable smart jewelry tracker has no display
  • The tracker is fitted with sensors and syncs with an app

What Bellabeat Really Tracks

  • The tracker works 24/7 whether you’re sleeping, being active, or meditating.
  • It tracks and monitors biometric data (respiratory rate, resting heart rate and heart rate variability, HRV) and sleep patterns
  • It tracks and monitors lifestyle data such as steps and distance moved

Business Statement

Explore daily usage data from the Bellabeat fitness tracker app to identify trends and patterns, and gather enough evidence to inform and support data-driven decision making.

  • Bellabeat has no substantial evidence on how customers actually use its products

  • Bellabeat lacks feedback on which features its customers value most

Research Questions

  • What are some trends in smart device usage?

  • How could these trends apply to Bellabeat customers?

  • How could these trends help influence Bellabeat’s marketing strategy?

Data Analysis Tool

  • R is the statistical software chosen for this analysis.

  • The following packages/libraries will be used.

Code
#|warning: false
#|message: false
#|error: false 

library(readr)
library(tidyr)
library(dplyr)
library(lubridate)
library(hms)
library(forcats)
library(ggplot2)
library(ggthemes)
library(RColorBrewer)
library(viridis)
library(gt)
library(scales)
library(plotly)
library(summarytools)
library(janitor)
library(flextable)
library(knitr)
library(glue)
library(tibble)

Importing and Loading Datasets

Next, we import 7 datasets into RStudio and perform some basic data preparation by chaining the following steps:

  • specify the data type of each variable/column

  • rename some variables

  • create new variables (month and weekday)

  • tidy (reshape) the data, keeping each column as a variable and each row as an observation

  • group (bin) numerical variables into categories

  • load each dataset and assign it to its own R object

Code
#|warning: false
setwd("C:\\Users\\USER\\Documents\\DataAnalytics\\Projects\\FitBitFitness\\Dataset")

# daily activity dataset
active <- read_csv("dailyActivity_merged.csv",
  col_types = cols(
    Id = col_character(),
    ActivityDate = col_date(format = "%m/%d/%Y")
  )
) %>%
  dplyr::rename(
    tracker_id = Id,
    tracker_date = ActivityDate,
    tracker_steps = TotalSteps,
    total_distance = TrackerDistance,
    high_dist = VeryActiveDistance,
    moder_dist = ModeratelyActiveDistance,
    light_dist = LightActiveDistance,
    sedent_dist = SedentaryActiveDistance,
    long_tm = VeryActiveMinutes,
    fair_tm = FairlyActiveMinutes,
    light_tm = LightlyActiveMinutes,
    sedent_tm = SedentaryMinutes,
    logged_kms = LoggedActivitiesDistance,
    calor_burnt = Calories
  ) %>%
  pivot_longer(
    cols = ends_with("dist"),
    names_to = "activ_move",
    values_to = "activ_distance"
  ) %>%
  pivot_longer(
    cols = ends_with("tm"),
    names_to = "activ_duration",
    values_to = "activ_time"
  ) %>%
  mutate(
    active_month = month(tracker_date, label = TRUE),
    active_day = wday(tracker_date, label = TRUE),
    active_wday = as.factor(if_else((active_day == "Sun" | active_day == "Sat"), "weekend", "busday")),
    activ_move = as.factor(activ_move),
    activ_duration = as.factor(activ_duration)
  ) %>%
  mutate(nr_steps = as.factor(case_when(
    tracker_steps < 5000 ~ "sedent",
    tracker_steps < 7500 ~ "active",
    tracker_steps <= 10000 ~ "moder_act",
    tracker_steps <= 12500 ~ "hyper_act",
    TRUE ~ "athlete"
  )), .after = 3)
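
The binning step is easier to check on a handful of boundary values. The sketch below wraps the same thresholds in a helper called `bin_steps()`, which is a hypothetical name introduced only for illustration, and applies it to made-up step counts that are not taken from the dataset.

```r
library(dplyr)

# Hypothetical helper: same thresholds as the nr_steps binning above
bin_steps <- function(steps) {
  case_when(
    steps < 5000   ~ "sedent",
    steps < 7500   ~ "active",
    steps <= 10000 ~ "moder_act",
    steps <= 12500 ~ "hyper_act",
    TRUE           ~ "athlete"
  )
}

# Made-up boundary values (assumption, not real observations)
toy_steps <- c(0, 4999, 5000, 7500, 10000, 10001, 12500, 12501)
binned <- bin_steps(toy_steps)

# Side-by-side view of each value and the category it lands in
data.frame(toy_steps, binned)
```

Because `case_when()` evaluates its conditions in order, each value lands in the first category whose condition holds. Note that a missing step count would fall through to the final `TRUE` branch, so it may be worth confirming that the column has no `NA` values before binning.
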
Code
#|warning: false
setwd("C:\\Users\\USER\\Documents\\DataAnalytics\\Projects\\FitBitFitness\\Dataset")


# 1. sleep dataset
sleep <- read_csv("sleepDay_merged.csv",
  col_types = cols(
    Id = col_character()
  )
) %>%
  mutate(
    sleep_date = as.Date(SleepDay, format = "%m/%d/%Y %H:%M:%S"),
    sleep_time = parse_date_time(SleepDay, "%m/%d/%Y %I:%M:%S %p"),
    sleep_hm = format(as.POSIXct(sleep_time), format = "%H:%M"),
    sleep_hms = as_hms(sleep_time),
    sleep_month = month(sleep_date, label = TRUE),
    sleep_day = wday(sleep_date, label = TRUE)
  ) %>%
  rename(
    tracker_id = Id,
    sleep_duration = TotalMinutesAsleep,
    count_sleep = TotalSleepRecords
  ) %>% select(-(SleepDay))

# 2. heartrate dataset
hrate <- read_csv("heartrate_seconds_merged.csv",
  col_types = cols(
    Id = col_character()
  )
) %>%
  mutate(
    hrate_date = as.Date(Time, format = "%m/%d/%Y %H:%M:%S"),
    hrate_time = parse_date_time(Time, "%m/%d/%Y %I:%M:%S %p"),
    hrate_hm = format(as.POSIXct(hrate_time), format = "%H:%M"),
    hrate_hms = as_hms(hrate_time),
    hrate_month = month(hrate_date, label = TRUE),
    hrate_day = wday(hrate_date, label = TRUE)
  ) %>%
  rename(
    tracker_id = Id
  ) %>%
  select(-c( Time))

# 3. weight dataset
weight <- read_csv("weightLogInfo_merged.csv",
  col_types = cols(
    Id = col_character()
  )
) %>%
  mutate(
    weight_date = as.Date(Date, format = "%m/%d/%Y %H:%M:%S"),
    weight_dtime = parse_date_time(Date, "%m/%d/%Y %I:%M:%S %p"),
    weight_time = as_hms(weight_dtime),
    weight_month = month(weight_date, label = TRUE),
    weight_day = wday(weight_date, label = TRUE)

  ) %>%
  rename(
    tracker_id = Id
  ) %>%
  select(-c(weight_dtime, Date))

# 4. daily calories burnt dataset
calories <- read_csv("dailyCalories_merged.csv",
   col_types = cols(
     Id = col_character(),
     ActivityDay = col_date(format = "%m/%d/%Y")
   )
 ) %>%
   rename(
     tracker_id = Id,
     calor_date = ActivityDay
   ) %>%
   mutate(
     calor_month = month(calor_date, label = TRUE),
     calor_day = wday(calor_date, label = TRUE))

# 5. daily Intensity dataset

intensities <- read_csv("dailyIntensities_merged.csv",
  col_types = cols(
    Id = col_character(),
    ActivityDay = col_date(format = "%m/%d/%Y")
  )
) %>%
  rename(
    tracker_id = Id,
    intensit_date = ActivityDay
  ) %>%
  mutate(
    intensit_month = month(intensit_date, label = TRUE),
    intensit_day = wday(intensit_date, label = TRUE))

# 6. daily Steps dataset

steps <- read_csv("dailySteps_merged.csv",
  col_types = cols(
    Id = col_character(),
    ActivityDay = col_date(format = "%m/%d/%Y")
  )
) %>%
  rename(
    tracker_id = Id,
    step_date = ActivityDay
  ) %>%
  mutate(
    step_month = month(step_date, label = TRUE),
    step_day = wday(step_date, label = TRUE)
  )

Setting the Theme

We set a common custom theme for all the charts and plots that follow.

Code
#|warning: false
my_theme <- theme_set(theme_classic() +
  theme(
    plot.subtitle = element_text(
      hjust = 0.5,
      size = 14,
      color = "skyblue",
      face = "bold",
      family = "Times",
      
    ),
    plot.caption = element_text(
      hjust = 1,
      size = 12, color = "grey",
      face = "italic"
    ),
    plot.title = element_text(
      hjust = 0.5,
      size = 16,
      color = "skyblue",
      face = "bold",
      family = "Tahoma"
    ),
    plot.tag = element_text(
      size = 14,
      color = "grey",
      face = "bold"
    ),
    axis.title = element_text(
      color = "steelblue",
      face = "bold",
      size = 15
    ),
    axis.line =  element_line(linewidth = 1.5, color = "lightgrey"),
    axis.text = element_text(
      face = "bold",
      color = "#993333",
      size = 14
    ),
    legend.title = element_blank(),
    legend.position = "top"
  ))

Dataset Dimensions (nrows, ncolumns)

How many observations and variables in each dataset?

  • Activities dataset : 15040, 15

  • intensities dataset : 940, 12

  • steps tracked dataset : 940, 5

  • calories burnt dataset : 940, 5

  • weight dataset : 67, 11

  • sleep dataset : 413, 10

  • heart rate dataset : 2483658, 8
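
The activities dataset has 15040 rows because the two successive `pivot_longer()` calls multiply each original daily record by 4 distance columns and 4 time columns (940 × 4 × 4 = 15040). A minimal sketch of that effect on a toy data frame, with made-up column names and values and only 2 columns of each kind:

```r
library(tidyr)

# Toy daily table (hypothetical data): 2 users, 2 distance and 2 time columns
toy_daily <- data.frame(
  id     = c("u1", "u2"),
  a_dist = c(1.0, 2.0),
  b_dist = c(0.5, 0.7),
  a_tm   = c(30, 45),
  b_tm   = c(600, 650)
)

# Two chained pivots, as in the import step: rows grow by 2 x 2 here
toy_long <- toy_daily %>%
  pivot_longer(ends_with("dist"), names_to = "move", values_to = "distance") %>%
  pivot_longer(ends_with("tm"), names_to = "duration", values_to = "time")

# Dimensions (rows, columns) of each object, one column per dataset
dims <- sapply(list(daily = toy_daily, long = toy_long), dim)
dims
```

The same `sapply(list(...), dim)` pattern could be pointed at the real objects (`active`, `intensities`, `steps`, and so on) to produce the figures listed above.
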

Relationships Between The Datasets

  • Do all the datasets share common elements (users)?
  • Which datasets are related in one way or another?
Code
library(dplyr)
# unique vectors of the identifier in the smalldatasets
list_slp <- sleep %>% select(tracker_id) %>% unique() %>% as.vector()
list_hrt <- hrate %>% select(tracker_id) %>% unique() %>% as.vector()
list_wgt <- weight %>% select(tracker_id) %>% unique() %>% as.vector()
# Are they common identifiers? How many?
active %>% filter(tracker_id %in% list_slp) %>% sum() 
[1] 0
Code
active %>% filter(tracker_id %in% list_hrt) %>% sum() 
[1] 0
Code
active %>% filter(tracker_id %in% list_wgt) %>% sum() 
[1] 0
Code
# Are they common identifiers between the small datasets? 
sleep %>% filter(tracker_id %in% list_wgt) %>% sum() 
[1] 0
Code
sleep %>% filter(tracker_id %in% list_hrt) %>% sum() 
[1] 0
Code
hrate %>% filter(tracker_id %in% list_wgt) %>% sum()
[1] 0
  • The checks above return 0 for every pair, so they find NO common elements (users) between the 3 datasets sleep, heart rate and weight.

  • On that basis, there is no meaningful way to merge these three datasets and analyse them as a single dataset
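
The zeros above deserve a cross-check. `select() %>% unique() %>% as.vector()` may hand `%in%` a list-like object rather than a plain character vector, and `sum()` on a filtered data frame adds up cell values instead of counting users. A minimal sketch of an alternative check, using made-up ids (the `toy_*` objects are hypothetical stand-ins for `active`, `sleep` and `weight`):

```r
# Toy id columns (hypothetical values, not the real tracker ids)
toy_active <- data.frame(tracker_id = c("1001", "1002", "1003", "1004"))
toy_sleep  <- data.frame(tracker_id = c("1001", "1003", "1003"))
toy_weight <- data.frame(tracker_id = c("1003", "1009"))

# Extract the ids as plain character vectors
ids_active <- unique(toy_active$tracker_id)
ids_sleep  <- unique(toy_sleep$tracker_id)
ids_weight <- unique(toy_weight$tracker_id)

# Users present in both datasets of each pair
common_active_sleep <- intersect(ids_active, ids_sleep)
common_sleep_weight <- intersect(ids_sleep, ids_weight)

length(common_active_sleep)  # number of users shared by activity and sleep
length(common_sleep_weight)  # number of users shared by sleep and weight
```

With the real data, `dplyr::pull(sleep, tracker_id)` would play the role of `toy_sleep$tracker_id`. If the counts come out above zero, the datasets can be joined on `tracker_id` after all.
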

Ascertaining Data Quality

  • The data is publicly available on [Kaggle: FitBit Fitness Tracker Data](https://www.kaggle.com/datasets/arashnic/fitbit) and stored in 18 csv files.

  • It contains personal fitness tracker data from Fitbit users who consented to the submission of information about their daily activity, steps, heart rate and sleep monitoring.

Activities Dataset Quality

Code
#|fig-cap: "Activities Tracking Record"
#|fig-subcap:
#|  - "Frequency of Records"
#|  - "Days of Records"
#|layout-ncol: 2
#|column: page

pal <- c(
  "<= 10% (poor)" = "red",
  "<= 25% (moderate)" = "orange", 
  "<= 50% (good)" = "yellow", 
  "<= 75% (great)" = "skyblue",
  "75-100% (superb)" = "forestgreen" 
) 

activities_label <- paste0(rep("fit-", 9), 
        seq(1,33,4))

active %>%
  select(tracker_id, tracker_date, activ_move, nr_steps, activ_duration, active_month) %>%
  count(tracker_id) %>%arrange(desc(n)) %>% 

  mutate(freq = round(n / (length(unique(active$tracker_date))*4*4)*100,2)) %>% 
  ggplot(aes(
    y = reorder(tracker_id, freq),
    x = freq,
    fill = case_when(freq <= 10 ~ "<= 10% (poor)",
                     freq <= 25 ~ "<= 25% (moderate)",
                     freq <= 50 ~ "<= 50% (good)",
                     freq <= 75 ~ "<= 75% (great)",
                       TRUE ~ "75-100% (superb)")
  )) +
  geom_col( alpha = 0.6) +
  scale_fill_manual(
    values = pal,
    limits = names(pal)
  )+
  scale_x_continuous(labels = percent_format(scale = 1))+
  scale_y_discrete(labels = activities_label) +
  theme(axis.text.y = element_blank() ,
        axis.ticks.y = element_blank()) +
  xlab("Rating of Respondents (percentage)") +
  ylab("Participants (Users)") +
  ggtitle("High Ratio of response for Wellness Records")

Code
intensities %>%
  group_by(tracker_id) %>%
  count() %>%
  arrange(desc(n)) %>%
  ggplot(aes(
    y = reorder(tracker_id, n),
    x = n,
    fill = n
  )) +
  geom_col() +
  scale_fill_gradient(low = "yellow", high = "lightgreen", na.value = NA) +
  ggtitle("Frequency of Activities Records in 30 days Period") +
  xlab("Days Tracked") +
  ylab("Individual Tracked Users") +
  theme(axis.text.y = element_blank())

Note: This data has a satisfactorily high completion ratio: most respondents have tracked data covering the whole data collection interval.

The Sleep Dataset Quality

Code
#|fig-cap: "Sleep Tracking Record"
#|fig-subcap:
#|  - "Sleep Frequency of Records"
#|  - "Days of Records"
#|layout-ncol: 2
#|column: page

pal <- c(
  "<= 10% (poor)" = "red",
  "<= 25% (moderate)" = "orange", 
  "<= 50% (good)" = "yellow", 
  "<= 75% (great)" = "lightgreen",
  "75-100% (superb)" = "forestgreen" 
) 

plt_sleep <- sleep %>%
  select(tracker_id, sleep_date) %>%
  count(tracker_id) %>%
  mutate(freq = n / length(unique(sleep$sleep_date))) %>%
  ggplot(aes(
    y = reorder(tracker_id, freq),
    x = freq,
    fill = case_when(
      freq <= .10 ~ "<= 10% (poor)",
      freq <= .25 ~ "<= 25% (moderate)",
      freq <= .50 ~ "<= 50% (good)",
      freq <= .75 ~ "<= 75% (great)",
      TRUE ~ "75-100% (superb)"
    )
  )) +
  geom_col(alpha = 0.6) +
  scale_fill_manual(
    values = pal,
    limits = names(pal)
  ) +
  scale_x_continuous(labels = percent) +
  scale_y_discrete(labels = activities_label) +
  theme(
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank()
  ) +
  xlab("Rating of Respondents (percentage)") +
  ylab("Participants (Users)") +
  ggtitle("Ratio  of Respondents for Sleep Records in 30 days")

plt_sleep

Code
sleep %>% 
  group_by(tracker_id) %>% 
  count() %>% 
  arrange(desc(n)) %>% 
  ggplot(aes(y = reorder(tracker_id,n),
             x = n,
             fill = n)) +
  geom_col() +
  scale_fill_gradient(low = "yellow", high = "lightgreen", na.value = NA) +
  ggtitle("Sleep Frequency For Tracked Users in 30 Days Period") +
  xlab("Tracking Days") +
  ylab("Tracked Users") +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

Note: This data set is incomplete.

There are 5 respondents (15%) with a low participation ratio, below 10%.

Around 75% of respondents have a complete record for the 30-day period

  • One user responded on 32 days - potential OUTLIER

  • Only 2 respondents responded on all 31 days of the selected period

  • Only 4 respondents responded on 28 days

  • Only 2 respondents responded on 25 days

  • 5 users responded on less than 10% of the days (fewer than 3 days)

Heart Beat Data Quality Issues

Note:

  • Fewer users monitored their heart rate during the 30 days.

  • The dataset is incomplete: only about 50% of these users have heart rate readings across the 30-day period

Weight Tracker DataSet

  • This dataset is extremely poor: it has only 8 respondents, 6 of whom tracked their weight for less than 10 days
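
The number of tracking days per user behind statements like these can be counted directly. A minimal sketch, assuming a 30-day collection window and using made-up weight log rows (the `toy_weight` object and its values are hypothetical):

```r
library(dplyr)

# Toy weight log (hypothetical rows); u1 has two entries on the same day
toy_weight <- data.frame(
  tracker_id  = c("u1", "u1", "u1", "u2", "u3", "u3"),
  weight_date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-13",
                          "2016-04-20", "2016-05-01", "2016-05-02"))
)

period_days <- 30  # assumption: length of the collection window in days

# Distinct tracking days per user and their share of the window
days_per_user <- toy_weight %>%
  group_by(tracker_id) %>%
  summarise(days_tracked = n_distinct(weight_date), .groups = "drop") %>%
  mutate(share_of_period = days_tracked / period_days)

days_per_user
```

Counting distinct dates rather than rows avoids overstating participation when a user logs more than one reading on the same day.
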

Data Quality Summary

  • Good data should be Reliable, Original, Comprehensive, Current, and Cited (ROCCC).

  • Our data is far from trustworthy, being marred by incomplete observations.

  • The sample size is small

  • The data dates back to 2016, so it is not current.

  • The data source remains credible

Exploring Data

Checking and Matching The Data Types

               Length Class   Mode     
tracker_id     15040  -none-  character
tracker_date   15040  Date    numeric  
tracker_steps  15040  -none-  numeric  
nr_steps       15040  factor  numeric  
TotalDistance  15040  -none-  numeric  
total_distance 15040  -none-  numeric  
logged_kms     15040  -none-  numeric  
calor_burnt    15040  -none-  numeric  
activ_move     15040  factor  numeric  
activ_distance 15040  -none-  numeric  
activ_duration 15040  factor  numeric  
activ_time     15040  -none-  numeric  
active_month   15040  ordered numeric  
active_day     15040  ordered numeric  
active_wday    15040  factor  numeric  
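
A per-column view of class and mode like the one above can be produced with base R. The sketch below runs on a toy data frame with made-up values and a subset of the column names; it is one possible way to get such a table, not necessarily the call used for the output shown.

```r
# Toy data frame (hypothetical values) with the same kinds of columns
toy <- data.frame(
  tracker_id    = c("u1", "u2"),
  tracker_date  = as.Date(c("2016-04-12", "2016-04-13")),
  tracker_steps = c(5000, 12000),
  nr_steps      = factor(c("active", "hyper_act")),
  stringsAsFactors = FALSE
)

# First class and storage mode of every column
col_classes <- sapply(toy, function(col) class(col)[1])
col_modes   <- sapply(toy, mode)

data.frame(class = col_classes, mode = col_modes)
```

Dates and factors report a numeric mode because both are stored as numbers internally, which matches the `Date numeric` and `factor numeric` rows in the table above.
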

Describing Data - Summary Table

  • A quick, broad overview of our data frame with the skimr package

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
tracker_id             0              1   10   10      0        33           0

Variable type: Date

skim_variable  n_missing  complete_rate  min         max         median      n_unique
tracker_date           0              1  2016-04-12  2016-05-12  2016-04-26        31

Variable type: factor

skim_variable   n_missing  complete_rate  ordered  n_unique  top_counts
nr_steps                0              1  FALSE           5  sed: 4848, act: 2736, mod: 2608, hyp: 2544
activ_move              0              1  FALSE           4  hig: 3760, lig: 3760, mod: 3760, sed: 3760
activ_duration          0              1  FALSE           4  fai: 3760, lig: 3760, lon: 3760, sed: 3760
active_month            0              1  TRUE            2  Apr: 9776, May: 5264, Jan: 0, Feb: 0
active_day              0              1  TRUE            7  Tue: 2432, Wed: 2400, Thu: 2352, Fri: 2016
active_wday             0              1  FALSE           2  bus: 11120, wee: 3920

Variable type: numeric

skim_variable   n_missing  complete_rate     mean       sd  p0      p25      p50       p75      p100  hist
tracker_steps           0              1  7637.91  5084.61   0  3789.75  7405.50  10727.00  36019.00  ▇▇▁▁▁
TotalDistance           0              1     5.49     3.92   0     2.62     5.24      7.71     28.03  ▇▆▁▁▁
total_distance          0              1     5.48     3.91   0     2.62     5.24      7.71     28.03  ▇▆▁▁▁
logged_kms              0              1     0.11     0.62   0     0.00     0.00      0.00      4.94  ▇▁▁▁▁
calor_burnt             0              1  2303.61   717.81   0  1828.50  2134.00   2793.25   4900.00  ▁▆▇▃▁
activ_distance          0              1     1.35     2.15   0     0.00     0.09      2.19     21.92  ▇▁▁▁▁
activ_time              0              1   304.69   433.90   0     2.00    61.00    417.50   1440.00  ▇▁▁▁▁

  • The summary is split into 4 data type categories
  • It gives a detailed profile of the data quality of each column
  • No missing values and no duplicated entries are reported
  1. $character - no issue

  2. $Date - no issue

  3. $factor - no issue

  4. $numeric - high dispersion, as shown by the large sd statistics

  5. The histogram sketches have long tails, suggesting some unusual data points
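
The two headline checks, missing values and duplicated entries, can also be confirmed without skimr. A minimal base R sketch on a toy data frame (made-up values, deliberately containing one `NA` and one repeated row so that both checks have something to find):

```r
# Toy data (hypothetical): row 2 repeats row 1, row 3 has a missing value
toy <- data.frame(
  tracker_id    = c("u1", "u1", "u2", "u2"),
  tracker_steps = c(5000, 5000, NA, 8000),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy))

# Number of rows that exactly repeat an earlier row
n_duplicate_rows <- sum(duplicated(toy))

missing_per_column
n_duplicate_rows
```

On a clean dataset both results would be zero throughout, which is what the summary above reports.
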

Dealing with “Potential Outliers”:

  • Hunting for unusual data points, i.e. observations that lie far from the others

1. Scanning the Number of Steps

  • Few observations lie above 17,500 steps
  • We set the limit at 17,500 steps
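
Applying the limit amounts to a single filter. A minimal sketch on made-up step counts, assuming the limit is inclusive (whether 17,500 itself is kept is an assumption, and `toy_active` is a hypothetical stand-in for the activity data):

```r
library(dplyr)

step_limit <- 17500  # limit stated above

# Toy data (hypothetical values)
toy_active <- data.frame(
  tracker_id    = c("u1", "u2", "u3", "u4", "u5"),
  tracker_steps = c(3200, 9800, 17500, 22000, 36019)
)

# Keep observations at or below the limit
active_clean <- toy_active %>%
  filter(tracker_steps <= step_limit)

# How many observations the limit removes
n_removed <- nrow(toy_active) - nrow(active_clean)
n_removed
```

Reporting `n_removed` alongside the cleaned data makes it easy to see how much of the sample the limit discards.
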

2. Scanning the Number of Calories Burnt

  • This pattern of observations for calories burnt seems to be common and plausible

3. Scanning the Distance

4. Scanning the Time Sync with the app

  • Time appears to be tracked in 30-minute intervals over 24 hours (1440 minutes)
  • The tracker device is worn 24 hours a day
  • There are 3 periods of substantial activity
  • Most records show no activity, followed by 30 minutes and 60 minutes
  • The most active period is 6 hours (360 minutes)

6. Scanning Total Sleep Duration

  • Sleeping duration [3 - 12 hours]
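
Sleep duration is recorded in minutes, so checking it against the 3 to 12 hour range needs a conversion first. A minimal sketch with made-up durations (the values and the inclusive bounds are assumptions):

```r
# Toy sleep durations in minutes (hypothetical values)
toy_sleep <- data.frame(sleep_duration = c(58, 180, 420, 720, 796))

# Convert minutes to hours
toy_sleep$sleep_hours <- toy_sleep$sleep_duration / 60

# Flag records inside the 3 to 12 hour range (bounds included)
toy_sleep$in_range <- toy_sleep$sleep_hours >= 3 & toy_sleep$sleep_hours <= 12

toy_sleep
```

Rows with `in_range` equal to `FALSE` are the candidates to inspect as unusual sleep records.
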

7. Scanning Heart Rate Records

Code
#|label: fig-heartrate
#|fig-cap: "Heart Rate Tracked"
#|fig-subcap:
#|  - "Heart Rate Recorded"
#|  - "Distribution of Cleaned Data"
#|layout-ncol: 2
#|column: page

hist_hrt1 <- hrate %>% select(Value) %>% 
  
 ggplot(aes(Value) )+
  geom_histogram(col = "tomato", fill = "chartreuse") +
  ggtitle("Distribution heart rate dataset") +
  scale_y_continuous(labels = label_comma())

ggplotly(hist_hrt1)
Code
hrate1 <- hrate %>% filter( Value <= 170)
hist_hrt2 <- hrate1 %>% select(Value) %>% 
  
 ggplot(aes(Value) )+
  geom_histogram(col = "#F4A582", fill = "#FDDBC7") +
  ggtitle("Cleaned Distribution heart rate dataset") +
  scale_y_continuous(labels = label_comma())

ggplotly(hist_hrt2)
  • Heart rate range: [60, 170]
  • Values below 60 and above 170 beats are suspicious.

Data Summary

Descriptive Summary of Numerical Variables

Table 1. Activity Distance Moved - Average Distance

activ_move   activ_distance_mean  activ_distance_stdv
high_dist            1.190099120          1.896446727
light_dist           3.260660794          1.985207264
moder_dist           0.550627752          0.870193560
sedent_dist          0.001508811          0.007059017

Table 2. Activity Duration - Average Time (hours)

activ_duration  active_hours_mean  active_hours_stdv
fair_tm                 0.2180617          0.3338661
light_tm                3.1798458          1.8275716
long_tm                 0.3101322          0.5017038
sedent_tm              16.5085903          5.0736653
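
Tables of this shape come from a grouped mean and standard deviation. A minimal sketch on made-up long-format rows (the `toy_long` values are hypothetical; the column names follow Table 1):

```r
library(dplyr)

# Toy long-format data (hypothetical values)
toy_long <- data.frame(
  activ_move     = c("high_dist", "high_dist", "light_dist", "light_dist"),
  activ_distance = c(1, 3, 2, 4)
)

# Mean and standard deviation of distance for each movement type
dist_summary <- toy_long %>%
  group_by(activ_move) %>%
  summarise(
    activ_distance_mean = mean(activ_distance),
    activ_distance_stdv = sd(activ_distance),
    .groups = "drop"
  )

dist_summary
```

For the duration table, the same pattern would apply with `activ_duration` as the grouping column and `activ_time / 60` as the value, so that minutes are expressed in hours.
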

Summary Categorical Variables

Tab. 3 Total Steps by Categories - Summary

Rank  Daily Steps  Steps Category  Total Users  Share
1     5 000        Sedentary             4,848  33.37%
2     10 000       Active                2,736  18.83%
3     7 500        Light Active          2,608  17.95%
4     12 500       Hyper Active          2,544  17.51%
5     12 500+      High Performer        1,792  12.33%
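
The Share column is each category's count divided by the total of all five counts. The sketch below recomputes it from the counts in Tab. 3; the object names are introduced here only for illustration.

```r
# Counts taken from Tab. 3
step_counts <- c(
  "Sedentary"      = 4848,
  "Active"         = 2736,
  "Light Active"   = 2608,
  "Hyper Active"   = 2544,
  "High Performer" = 1792
)

# Total across categories and each category's share in percent
total_count <- sum(step_counts)
share_pct   <- round(100 * step_counts / total_count, 2)

total_count
share_pct
```

The total is 14,528, and the recomputed shares agree with the table, which confirms that the percentages are relative to the rows left in the summary rather than to all 15,040 rows of the activities dataset.
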

DATA VISUALIZATION WITH GGPLOT2

1. The Sample Size

2. Activities Tracking During 30 Days

  • The number of tracked users has declined sharply over the period

2. Wellness Tracking 1. Moved Distance

The average distance moved is 5.06 km

  • Bellabeat users are mostly not very active (they move little)

  • On average they move 3.5 km daily as light movement, and 1.0 km and 0.5 km as high and moderate movement respectively.

Tracking Metric 3. Logged Distance

  • About 97% of tracked users logged the distance (pre setting the target distance)

Wellness Tracking Metric 4. Average Active Time

  • Daily average active time (hours) = 5.05
  • About 17 hours are spent inactive, in sedentary activities like reading, watching, eating, ….

The Proportion of the Main Activity

  • Sedentary activity is dominant among the tracked Bellabeat users: 33% stay below 5,000 daily steps.

  • Occasionally they hit the recommended 10,000 steps (18%).

  • Less frequently they go over 12,500 steps (12%)

Distribution for Users Active Time

Average active time is 5.05 hours

  • The very active and fairly active levels each account for less than one hour

Tracked Metric 5. Daily Average Steps

  • Average Steps = 7156.05
  • Tracked users rarely hit the recommended 10 000 daily steps.

  • Tracked users are apparently more active during weekdays
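
A weekday versus weekend comparison like this one boils down to a grouped mean. A minimal sketch on made-up daily totals (the `toy_steps` values are hypothetical; the `busday` and `weekend` labels follow the `active_wday` variable created at import):

```r
library(dplyr)

# Toy daily steps (hypothetical values): Fri, Sat, Sun, Mon
toy_steps <- data.frame(
  step_date = as.Date(c("2016-04-15", "2016-04-16", "2016-04-17", "2016-04-18")),
  steps     = c(9000, 4000, 6000, 8000)
)

# ISO weekday number (Monday = 1, Sunday = 7); independent of locale
toy_steps$iso_wday <- as.integer(format(toy_steps$step_date, "%u"))

# Mean steps for business days versus weekends
steps_by_daytype <- toy_steps %>%
  mutate(day_type = if_else(iso_wday >= 6, "weekend", "busday")) %>%
  group_by(day_type) %>%
  summarise(mean_steps = mean(steps), .groups = "drop")

steps_by_daytype
```

Using the numeric weekday instead of labels such as "Sat" and "Sun" keeps the comparison working when R runs under a non-English locale.
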

Metric 6. Calories Incinerated by Tracked Users

Average Calories burnt by the tracked users = 2260.96

  • Average calories burnt are slightly higher on business days, but fall drastically in May

Average Sleep Duration

  • Average sleep on weekdays is higher and close to the recommended 8 hours

Heart Rate

  • Min Heart Rate = 36

  • Average Heart Rate = 77.27

  • Max Heart Rate = 170

---

FINAL CONSIDERATIONS

Bellabeat activity tracking is solid. The analysis of 33 users reveals interesting patterns in how long they have been active, how far they have walked, how many calories they have burned and how many steps they have completed.

1. The activities are mostly tracked around the clock, 24 hours a day.

2. Tracked users are mostly sedentary: they spend about 16 hours inactive, average about 7,000 steps and burn about 2,300 calories a day.

3. They get slightly more active on weekdays, from Monday to Friday.

4. It seems that tracked users are not engaged in high-intensity cardio or workouts such as CrossFit training, running or jogging, which typically burn more calories.

5. Heart rate, sleep and weight are less tracked.

6. We spotted unusual observations: low and high heart rates, high numbers of steps and low sleeping hours. These could be related to low precision of the tracker, suggesting that it needs improvement.

7. A reminder could be added to alert users to get more active when they fall below the recommended activity scores.

8. A longer tracking period of at least 90 days and a larger sample covering many more users would strengthen the analysis.

9. Bellabeat may improve the tracker's utility. Users should be able to set an individual target, such as weight loss or improved sleep, and have their performance tracked and assessed against it.

10. Last but not least, based on these findings, although inconclusive, Bellabeat may consider designing a fitness plan aimed at improving users' activity scores and helping them get more out of the smart tracker.