Bellabeat Fitbit Project

Background

Bellabeat is a high-tech manufacturer of health-focused products for women, including Leaf, Bellabeat’s classic wellness tracker, and Time, a luxury hybrid wellness watch. The company believes that analyzing non-Bellabeat smart device fitness data could help unlock new growth opportunities. The goal of this project is to analyze smart device data to gain insight into how consumers use their smart devices, then apply this knowledge to Bellabeat’s products and marketing strategy. The results will be presented to the Bellabeat executive team along with high-level recommendations for the company’s marketing strategy.

Business Task

Analyze smart device usage data for trends and apply insights to both Bellabeat customers and marketing strategy.

Data Source

The data sets were downloaded from Kaggle and belong to the public domain. They contain the personal fitness tracking data from thirty Fitbit users who consented to the submission of their personal tracker data.

Setting Up the Environment

Installed and loaded the packages necessary for the analysis:

# install.packages("tidyverse")
# install.packages("skimr")
# install.packages("janitor")
# install.packages("compare")
# install.packages("ggpubr")
# install.packages("scales")
# install.packages("gridExtra")
# install.packages("reshape2")
library("tidyverse")
library("skimr")
library("janitor")
library("compare")
library("ggpubr")
library("scales")
library("gridExtra")
library("reshape2")

Importing the Data

Set the correct file path, then loaded and assigned the smart device usage data.

setwd("C:/Users/nredw/OneDrive/Documents/Nicholas/Learning/Data Analytics/Google Data Analytics Certificate/Capstone/Bellabeat Capstone/Data/Fitabase Data 4.12.16-5.12.16")
daily_activity <- read.csv("dailyActivity_merged.csv")
daily_calories <- read.csv("dailyCalories_merged.csv")
daily_intensities <- read.csv("dailyIntensities_merged.csv")
daily_steps <- read.csv("dailySteps_merged.csv")
daily_sleep <- read.csv("sleepDay_merged.csv")
weight_log <- read.csv("weightLogInfo_merged.csv")
heartrate <- read.csv("heartrate_seconds_merged.csv")

Investigating the Data

Compactly displayed the structure of the imported data sets and assessed them.

str(daily_activity)

## 'data.frame':    940 obs. of  15 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ TotalSteps              : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
##  $ TotalDistance           : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ Calories                : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...

str(daily_calories)

## 'data.frame':    940 obs. of  3 variables:
##  $ Id         : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDay: chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ Calories   : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...

str(daily_intensities)

## 'data.frame':    940 obs. of  10 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDay             : chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...

str(daily_steps)

## 'data.frame':    940 obs. of  3 variables:
##  $ Id         : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDay: chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ StepTotal  : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...

str(daily_sleep)

## 'data.frame':    413 obs. of  5 variables:
##  $ Id                : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ SleepDay          : chr  "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
##  $ TotalSleepRecords : int  1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: int  327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : int  346 407 442 367 712 320 377 364 384 449 ...

str(weight_log)

## 'data.frame':    67 obs. of  8 variables:
##  $ Id            : num  1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
##  $ Date          : chr  "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
##  $ WeightKg      : num  52.6 52.6 133.5 56.7 57.3 ...
##  $ WeightPounds  : num  116 116 294 125 126 ...
##  $ Fat           : int  22 NA NA NA NA 25 NA NA NA NA ...
##  $ BMI           : num  22.6 22.6 47.5 21.5 21.7 ...
##  $ IsManualReport: chr  "True" "True" "False" "True" ...
##  $ LogId         : num  1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...

str(heartrate)

## 'data.frame':    2483658 obs. of  3 variables:
##  $ Id   : num  2.02e+09 2.02e+09 2.02e+09 2.02e+09 2.02e+09 ...
##  $ Time : chr  "4/12/2016 7:21:00 AM" "4/12/2016 7:21:05 AM" "4/12/2016 7:21:10 AM" "4/12/2016 7:21:20 AM" ...
##  $ Value: int  97 102 105 103 101 95 91 93 94 93 ...

Observing the structure of the data sets above, it is apparent that the daily_calories, daily_intensities, and daily_steps data frames will not be necessary as the same data is contained within the daily_activity data frame. This can be seen from the snippets of data in the str() outputs, but also confirmed with the following comparisons:

compare(daily_activity$Calories, daily_calories$Calories)

## TRUE

compare(daily_activity$TotalSteps, daily_steps$StepTotal, ignoreNames = TRUE)

## TRUE

compare(daily_activity$VeryActiveDistance, daily_intensities$VeryActiveDistance)

## TRUE

compare(daily_activity$ModeratelyActiveDistance, daily_intensities$ModeratelyActiveDistance)

## TRUE

compare(daily_activity$LightActiveDistance, daily_intensities$LightActiveDistance)

## TRUE

compare(daily_activity$SedentaryActiveDistance, daily_intensities$SedentaryActiveDistance)

## TRUE

compare(daily_activity$VeryActiveMinutes, daily_intensities$VeryActiveMinutes)

## TRUE

compare(daily_activity$FairlyActiveMinutes, daily_intensities$FairlyActiveMinutes)

## TRUE

compare(daily_activity$LightlyActiveMinutes, daily_intensities$LightlyActiveMinutes)

## TRUE

compare(daily_activity$SedentaryMinutes, daily_intensities$SedentaryMinutes)

## TRUE
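The pairwise checks above can also be run in a single pass. The following is a minimal sketch in base R, using two small made-up data frames (activity_demo and intensities_demo are hypothetical stand-ins, not the Fitbit files). It compares every column the two frames share by name and assumes the rows are in the same order in both.

```r
# Hypothetical stand-ins for daily_activity and daily_intensities
activity_demo <- data.frame(
  Id = c(1, 1, 2),
  VeryActiveMinutes = c(25L, 21L, 30L),
  SedentaryMinutes = c(728L, 776L, 1218L),
  Calories = c(1985L, 1797L, 1776L)
)
intensities_demo <- data.frame(
  Id = c(1, 1, 2),
  VeryActiveMinutes = c(25L, 21L, 30L),
  SedentaryMinutes = c(728L, 776L, 1218L)
)

# Columns present in both data frames
shared_cols <- intersect(names(activity_demo), names(intensities_demo))

# identical() is TRUE only when the two columns match exactly
same_column <- vapply(
  shared_cols,
  function(col) identical(activity_demo[[col]], intensities_demo[[col]]),
  logical(1)
)
same_column
```

Any column that differs would show up as FALSE in the named result, which makes a mismatch easier to spot than scanning separate outputs.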

Additionally, the weight_log data frame contains only 67 observations, far fewer than the 940 in the daily_activity data frame and the 413 in the daily_sleep data frame. To investigate further, the number of unique Ids in each data frame was counted:

#Calculating the number of unique Ids in each of the data frames to be used
weight_log_ids <- weight_log$Id %>% unique %>% length
daily_activity_ids <- daily_activity$Id %>% unique %>% length
daily_sleep_ids <- daily_sleep$Id %>% unique %>% length
heartrate_ids <- heartrate$Id %>% unique %>% length

#Setting up the columns for the data frame to be used in the plot
Data_Frame <- c("weight_log", "daily_activity", "daily_sleep", "heartrate")
Unique_Ids <- c(weight_log_ids, daily_activity_ids, daily_sleep_ids, heartrate_ids)

#Creating the plot
data.frame(Data_Frame, Unique_Ids) %>%
  
  #Setting the fill with the as.factor() function forces discrete colors between the different data frames
  ggplot(aes(x = Data_Frame, y = Unique_Ids, fill = as.factor(Data_Frame))) +
  geom_bar(stat = "identity") +
  ylab("Unique Ids") +
  xlab("Data Frame") +
  
  #Renames the title on the legend (otherwise it will read "as.factor(Data_Frame)")
  scale_fill_discrete(name = "Legend") +
  
  #Add title and subtitle and adjusts the size
  labs(title = "Unique Ids in Each Data Frame", subtitle = "Most of the data frames contain fewer unique Ids than the total number of participants in the total data set (n=30).") +
  theme(plot.subtitle = element_text(size = 10))
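As a compact alternative to computing each count separately, the data frames can be placed in a named list and counted together. This is only a sketch on made-up data (frames_demo is a hypothetical list, not the imported files), written in base R.

```r
# Hypothetical stand-ins for the imported data frames
frames_demo <- list(
  weight_log = data.frame(Id = c(1, 1, 2)),
  daily_activity = data.frame(Id = c(1, 2, 3, 3, 4)),
  daily_sleep = data.frame(Id = c(1, 1, 3))
)

# Number of distinct Ids in each data frame, as a named integer vector
unique_id_counts <- vapply(
  frames_demo,
  function(df) length(unique(df$Id)),
  integer(1)
)
unique_id_counts
```

The names of the list carry through to the result, so the vector can be turned into the two plotting columns with names(unique_id_counts) and unname(unique_id_counts).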

The above demonstrates that only 8 individuals contributed to the weight_log data. A glance through this data frame in tabular form shows that most of the entries belong to just two distinct values in the Id column (i.e., two individuals), while the remaining Ids have entries on only 2-4 dates each. As such, any analysis of this data would be biased towards the activity of essentially two individuals. The following demonstrates that much of the data belongs to just two unique Ids:

weight_log_entries <- weight_log %>% 
  group_by(Id) %>% 
  count

weight_log_entries

#Establish simpler names for Ids for presenting (paste() already inserts a single space)
users <- paste("User", seq_along(weight_log_entries$Id))

#Calculating percentage of entries per Id
weight_log_chart <- weight_log_entries %>%
   mutate(percent = round(100 * n / sum(weight_log_entries$n),1))
weight_log_chart$user <- users

#Creating pie chart of entries per Id
weight_log_chart %>%
  ggplot(aes(x = "", y = percent, fill = as.factor(user))) +
  geom_col() +
  # Establishes the circular graph
  coord_polar(theta = "y") +
  # Changes color palette
  scale_fill_brewer(palette = "Set3", name = "Legend") +
  #Removes extraneous elements
  theme_void() +
  #Adds labels for the percentages and positions them
  geom_text(aes(x = 1.6, label = percent),
            position = position_stack(vjust = 0.5)) +
  labs(title = "Weight Log Entries by User", subtitle = "More than 75% of weight log entries came from just two distinct users.")
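The share of entries per user can also be checked numerically without a chart. The sketch below uses base R on a made-up Id vector (weight_ids_demo is hypothetical and its values are not taken from the weight log).

```r
# Hypothetical Id column of a weight log with 10 entries from 4 users
weight_ids_demo <- c(101, 101, 101, 101, 102, 102, 102, 103, 104, 104)

# Entries per Id, sorted from most to fewest
entries_per_id <- sort(table(weight_ids_demo), decreasing = TRUE)

# Percentage of all entries contributed by each Id
percent_per_id <- round(100 * prop.table(entries_per_id), 1)

# Combined share of the two most frequent loggers
top_two_share <- sum(percent_per_id[1:2])
top_two_share
```

Applied to the real Id column, the same three steps would give the figure quoted in the chart subtitle.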

Only 8 users out of ~30 in the total data set logged any weight entries. With 2 of those 8 logging most of the entries, only 2 out of 30 (~7%) consistently logged their weight. The lack of quality data around weight highlights an opportunity for encouraging users to be diligent and consistent with weight tracking through notifications. A possible marketing strategy could be to highlight the ease of tracking health data with the Bellabeat app, emphasizing the simplicity of logging weight entries. Future analysis could involve exploring whether those who track their activity, sleep, weight, and other health data more consistently tend to lose more weight. Demonstrating this relationship would provide a marketing strategy for selling Bellabeat products (i.e., using the app and the smart wellness products leads to losing weight). It would also be an effective means of encouraging users to continue using the Bellabeat app and the Leaf and Time products.

Cleaning the Data

After exploring the daily_activity data in tabular form, it was observed that many of the rows contain “zero” across all or many of the columns. Sorting by either TotalSteps or TotalDistance made this apparent:

daily_activity %>% arrange(TotalSteps)

It is likely that the rows containing 0 in the TotalSteps column indicate days on which the individual did not wear their smart device, as it is not realistic for a person to take no steps during their waking hours. Taking this further, many entries with low non-zero values likely indicate days on which the user did not wear their device consistently. Therefore, a cutoff of 1000 steps was chosen as the threshold for inclusion in the analysis:

daily_activity <- daily_activity %>% filter(TotalSteps >= 1000)

The data now contains only rows with at least 1000 total steps, which still accounts for most of the rows (831) from the original data set:

daily_activity %>% arrange(TotalSteps)
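Since the 1000-step cutoff is a judgment call, it can help to see how many rows survive at a few alternative cutoffs. The following base R sketch runs on a made-up vector of step totals (steps_demo and the list of thresholds are hypothetical choices for illustration).

```r
# Hypothetical daily step totals, including zero-step days
steps_demo <- c(0, 0, 4, 250, 980, 1000, 1202, 5450, 9105, 13162)

# Candidate cutoffs to compare
thresholds <- c(0, 1, 500, 1000, 2000)

# Number of days kept when requiring at least `cutoff` steps
rows_kept <- vapply(
  thresholds,
  function(cutoff) sum(steps_demo >= cutoff),
  integer(1)
)

data.frame(threshold = thresholds, rows_kept = rows_kept)
```

A table like this makes it easier to judge whether the conclusions are sensitive to the exact cutoff chosen.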

While the daily_activity data frame contains only dates in its time data (i.e., MM/DD/YYYY), the daily_sleep data frame’s time data contains a date along with a time component (i.e., MM/DD/YYYY HH:MM:SS). This time component does not actually contain any useful data (it is always 12:00:00 AM) and needed to be removed to match up with the daily_activity data.

# Code using pipes:

daily_sleep$SleepDay <- daily_sleep$SleepDay %>%
  substr(1, nchar(daily_sleep$SleepDay)-3) %>% #Removing the "AM" from the column
  mdy_hms %>% #Converting the string to a timestamp
  date ##Removing the time component of the timestamp

# Original Code that was written:

# daily_sleep$SleepDay <- paste0("0", daily_sleep$SleepDay)
# daily_sleep$SleepDay <- substr(daily_sleep$SleepDay, 1, nchar(daily_sleep$SleepDay)-3)
# daily_sleep$SleepDay <- mdy_hms(daily_sleep$SleepDay)
# daily_sleep$SleepDay <- date(daily_sleep$SleepDay)
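Because the time component is always midnight, the date can also be recovered by discarding everything after the first space. This is a base R sketch of that alternative, not the approach used above; the two sample strings are copied from the str() output of daily_sleep.

```r
# Sample values as they appear in the SleepDay column
sleep_day_raw <- c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM")

# Drop everything after the first space, leaving only "M/D/YYYY"
date_part <- sub(" .*$", "", sleep_day_raw)

# Parse month/day/year strings into Date objects
sleep_day <- as.Date(date_part, format = "%m/%d/%Y")
sleep_day
```

This avoids parsing the AM/PM marker at all, which sidesteps any locale-specific handling of that marker.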

The SleepDay column in the daily_sleep data frame is now just a date; however, in the process it was converted from a string to a date type. For consistency, the ActivityDate column in the daily_activity data frame was also converted.

daily_activity$ActivityDate <- daily_activity$ActivityDate %>%
  mdy %>%
  date

The daily_activity data and the daily_sleep data are now ready to be combined. The join was made onto the daily_sleep data, as it has significantly fewer observations (413 vs. the 940 originally in daily_activity). Joins are frequently performed so that the broader set is the one joined onto (i.e., a left_join below instead of a right_join), but here that would have produced many NA entries, which would have inhibited analyses of correlations between the daily_activity data and the daily_sleep data. As a last step, any rows containing NA as a result of merging the tables were removed:

daily_activity_sleep <- right_join(daily_activity, daily_sleep, by=c('Id','ActivityDate'='SleepDay')) %>%
  drop_na
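A right join followed by dropping NA rows keeps only the rows that matched in both tables, provided neither table had missing values to begin with. The sketch below illustrates this with base R's merge() on two made-up tables (activity_demo and sleep_demo are hypothetical).

```r
# Hypothetical activity and sleep tables with no missing values of their own
activity_demo <- data.frame(
  Id = c(1, 1, 2),
  ActivityDate = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  TotalSteps = c(13162L, 10735L, 9762L)
)
sleep_demo <- data.frame(
  Id = c(1, 2, 3),
  SleepDay = as.Date(c("2016-04-12", "2016-04-12", "2016-04-12")),
  TotalMinutesAsleep = c(327L, 412L, 360L)
)

# Right join: keep every sleep row, filling unmatched activity columns with NA
right_joined <- merge(activity_demo, sleep_demo,
                      by.x = c("Id", "ActivityDate"),
                      by.y = c("Id", "SleepDay"),
                      all.y = TRUE)

# Dropping rows with NA removes the sleep rows that had no activity match
right_then_dropped <- na.omit(right_joined)

# Inner join: keep only rows present in both tables
inner_joined <- merge(activity_demo, sleep_demo,
                      by.x = c("Id", "ActivityDate"),
                      by.y = c("Id", "SleepDay"))

nrow(right_joined)
nrow(right_then_dropped)
nrow(inner_joined)
```

In dplyr terms, the last call corresponds to inner_join(). One caveat is that dropping NA rows after a join would also discard rows with missing values that were already present before the join.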

The Time column in the heart rate data frame was converted from a string to POSIXct format. Additionally, the time-of-day component of this data was broken out into its own column to facilitate analysis.

## Converts the data into a POSIXct format

heartrate$Time <- heartrate$Time %>%
  parse_date_time2('%m:%d:%Y %I:%M:%S %p')

## This creates a new column in which only the time element of the Time column is retained; the date defaults to "0000-01-01", which is not an issue as the aim is to aggregate based on the time and not the date.

heartrate$DayTime <- format(as.POSIXct(heartrate$Time,format="%Y:%m:%d %H:%M:%S"),"%H:%M:%S") %>%
 parse_date_time2('%H:%M:%S')

# Previous attempts at cleaning the data:

# heartrate$Date <- heartrate$Time %>%
#   as.Date()
# 
# heartrate$DayTime <- heartrate$Time %>%
#   as.Time()

# heartrate$Time <- heartrate$Time %>%
#   substr(1, nchar(heartrate$Time)-3)
  

# heartrate$DayTime <- format(as.POSIXct(heartrate$Time,format="%Y:%m:%d %H:%M:%S"),"%H:%M:%S")
# heartrate$Date <- format(as.POSIXct(heartrate$Time,format="%Y:%m:%d %H:%M:%S"),"%Y:%m:%d")


# heartrate <- heartrate %>%
#   separate(Time, into = c('Date','Time'), sep = ' ')
# 
# heartrate$Date <- heartrate$Date %>% ymd
# heartrate$Time <- heartrate$Time %>% as.POSIXct(format = '%H:%M:%S')

Both the original Time column and the new DayTime column in the heart rate data frame are now POSIXct types, and the date has essentially been scrubbed in the DayTime column:

str(heartrate)

## 'data.frame':    2483658 obs. of  4 variables:
##  $ Id     : num  2.02e+09 2.02e+09 2.02e+09 2.02e+09 2.02e+09 ...
##  $ Time   : POSIXct, format: "2016-04-12 07:21:00" "2016-04-12 07:21:05" ...
##  $ Value  : int  97 102 105 103 101 95 91 93 94 93 ...
##  $ DayTime: POSIXct, format: "0000-01-01 07:21:00" "0000-01-01 07:21:05" ...

Now, the data in the heart rate data frame was used to create a new data frame in which the average heartbeat throughout the day was determined for each unique Id in the data set:

heartrate_agg <- heartrate %>%
  group_by(Id, DayTime) %>%
    summarise(Avg_Value = mean(Value))
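The idea behind the aggregation is that readings taken at the same clock time on different days are averaged together. Below is a small base R sketch of that idea on made-up readings (heartrate_demo is hypothetical); it keeps the clock time as a plain string instead of a POSIXct value, which is a simplification of the approach above.

```r
# Hypothetical readings: one Id measured at the same clock times on two days
heartrate_demo <- data.frame(
  Id = c(1, 1, 1, 1),
  Time = as.POSIXct(c("2016-04-12 07:21:00", "2016-04-13 07:21:00",
                      "2016-04-12 07:21:05", "2016-04-13 07:21:05"),
                    tz = "UTC"),
  Value = c(97, 103, 102, 98)
)

# Keep only the clock time, discarding the calendar date
heartrate_demo$DayTime <- format(heartrate_demo$Time, "%H:%M:%S")

# Mean heart rate for each Id at each clock time
heartrate_demo_agg <- aggregate(Value ~ Id + DayTime,
                                data = heartrate_demo, FUN = mean)
heartrate_demo_agg
```

Four readings collapse to two rows here, one per clock time, each holding the mean across the two days.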

Analyzing and Visualizing the Data

A new data frame, heartrate_agg_stats, was created to hold summary statistics for the newly created aggregated heart rate data frame. The aggregated heart rate data was then used in exploratory visualizations.

heartrate_agg_stats <- heartrate_agg %>%
  group_by(Id) %>%
  summarize(min = min(Avg_Value),
            q1 = quantile(Avg_Value, 0.25),
            median = median(Avg_Value),
            mean = mean(Avg_Value),
            q3 = quantile(Avg_Value, 0.75),
            max = max(Avg_Value)) %>%
  arrange(-mean)
heartrate_agg_stats

heartrate_agg %>%
  ggplot(aes(DayTime,Avg_Value)) +
  
    #as.factor(Id) forces a discrete color for each Id rather than a gradient; alpha is set outside aes() so the points are actually drawn at 10% opacity
    geom_point(aes(color = as.factor(Id)), alpha = 1/10) + 
  
    #scales = "free" allows for each plot to have its own scale on the y- and x-axis; however, the axis limits are rebound below
    facet_wrap(~Id, scales = "free", ncol = 2) + 
  
    #hides the legends that are unnecessary
    guides(color = "none", alpha = "none") + 
  
    #Adjusts the tick marks to consistently show between 00:00:00 and 24:00:00 on the x-axis and between 40 and 200 on the y-axis
    scale_x_datetime(date_labels = '%T', 
                     limits = c(as.POSIXct("0000-01-01 00:00:00", tz = 'UTC'), 
                                as.POSIXct("0000-01-01 24:00:00", tz = 'UTC'))
                     ) +
    scale_y_continuous(limits = c(40,200)) +
  
    #Adds title, subtitle, and y-axis label
    labs(title = "Averaged Heart Rate Throughout the Day",
         subtitle = "The heart rate of 14 FitBit users was averaged over the course of a day for approximately one month.",
         caption = "The dotted lines represent the 25th and 75th percentiles and the red dashed line represents the mean of all heart rate values across all individuals.") +
    ylab("Heart Rate (bpm)") +
  
    #Adjusting the angle of the x-axis tick mark labels, the size of the title and subtitle, and removes the x-axis label
    theme(axis.text.x = element_text(angle = 30),
          plot.title = element_text(size = 20),
          plot.subtitle = element_text(size = 10),
          axis.title.x = element_blank()) +
    
    #Adds trendline
    geom_smooth() +
  
    #Adds three horizontal lines: the 25th and 75th percentiles and the mean, each computed across all Ids (matching the caption above)
    geom_hline(aes(yintercept=quantile(Avg_Value, 0.75)), linetype = 'dotted') +
    geom_hline(aes(yintercept=mean(Avg_Value)), color = 'red', linetype = 'dashed') +
    geom_hline(aes(yintercept=quantile(Avg_Value, 0.25)), linetype = 'dotted')

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
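Summary expressions placed inside aes(), such as mean(Avg_Value), appear to be evaluated on the whole data set rather than once per facet panel, so the reference lines in the plot above are the same in every panel, in agreement with the caption. If per-user reference lines were wanted instead, the values would need to be computed per Id first. The base R sketch below contrasts the two on made-up data (hr_demo is hypothetical).

```r
# Hypothetical averaged heart rates for two users
hr_demo <- data.frame(
  Id = c(1, 1, 1, 1, 2, 2, 2, 2),
  Avg_Value = c(60, 62, 64, 66, 90, 92, 94, 96)
)

# One mean across all users (what a single shared reference line shows)
overall_mean <- mean(hr_demo$Avg_Value)

# One mean per user (what per-panel reference lines would need)
per_id_mean <- tapply(hr_demo$Avg_Value, hr_demo$Id, mean)

overall_mean
per_id_mean
```

A data frame of per-Id values like this could be supplied to geom_hline() through its data argument so that each facet receives its own line.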

The graphical representation of each individual’s average heart rate throughout the day led to several observations.

Firstly, several of the individuals have pronounced peaks in their data, likely representing times of either vigorous labor or deliberate exercise. Additionally, most of the individuals have periods of markedly lower heart rates, whether compared to the average or the 25th percentile of users’ data, likely indicating sleep. This data could be used to generate notifications that prompt users either to consider getting a workout in or to begin winding down for bed, based on their historic average heart rate data.

Secondly, there are gaps in some of the individuals’ data, in most cases occurring sometime during the morning. Some of the individuals appear not to wear the device when they sleep, while others have small gaps, perhaps during morning routines such as showering and eating breakfast, and others still have only sporadic spurts of averaged data. If the intent is that the device is worn continuously in order to provide accurate data, then notifications could be triggered when a user’s activity is noted to be missing, encouraging them to wear the device during those hours and explaining the benefits of doing so.

Lastly, while the population is small (n = 14), there appear to be a few “archetypes” of user based on the overall appearance of their average daily heartbeat data. The first archetype appears as a trend that is relatively level, but then has obvious peaks, sometimes several, at particular times of the day. This type of data likely reflects a user who is not particularly active during most of the day, but then makes a concerted effort to perform workouts and get in routine exercise. A second archetype appears as a trend line with a definitive, sustained period of increased heart rate throughout the day, but without peaks as dramatic as those of the first archetype. This data likely reflects a user who performs some form of physical labor throughout their day, likely as part of a job. One last archetype to consider is a user whose data reflects neither peaks nor periods of sustained elevated heart rate throughout the day. This would likely indicate someone who is largely sedentary during the day and does not routinely exercise.

While further data would be necessary to better flesh out these categories, understand the underlying assumptions, and explore whether others exist, this preliminary analysis could prove valuable for Bellabeat’s marketing strategy, as each archetype could be targeted in a unique fashion to provide better support and services.

Next, the daily_activity_sleep data frame was analyzed.

summary(daily_activity_sleep)

##        Id             ActivityDate          TotalSteps    TotalDistance   
##  Min.   :1.504e+09   Min.   :2016-04-12   Min.   : 1202   Min.   : 0.780  
##  1st Qu.:3.977e+09   1st Qu.:2016-04-19   1st Qu.: 5450   1st Qu.: 3.723  
##  Median :4.703e+09   Median :2016-04-27   Median : 9105   Median : 6.380  
##  Mean   :5.047e+09   Mean   :2016-04-26   Mean   : 8720   Mean   : 6.166  
##  3rd Qu.:6.962e+09   3rd Qu.:2016-05-04   3rd Qu.:11420   3rd Qu.: 8.063  
##  Max.   :8.792e+09   Max.   :2016-05-12   Max.   :22770   Max.   :17.540  
##  TrackerDistance  LoggedActivitiesDistance VeryActiveDistance
##  Min.   : 0.780   Min.   :0.0000           Min.   : 0.000    
##  1st Qu.: 3.723   1st Qu.:0.0000           1st Qu.: 0.000    
##  Median : 6.380   Median :0.0000           Median : 0.630    
##  Mean   : 6.160   Mean   :0.1157           Mean   : 1.483    
##  3rd Qu.: 8.053   3rd Qu.:0.0000           3rd Qu.: 2.430    
##  Max.   :17.540   Max.   :4.0817           Max.   :12.540    
##  ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
##  Min.   :0.000            Min.   :0.350       Min.   :0.0000000      
##  1st Qu.:0.000            1st Qu.:2.625       1st Qu.:0.0000000      
##  Median :0.440            Median :3.760       Median :0.0000000      
##  Mean   :0.767            Mean   :3.884       Mean   :0.0009406      
##  3rd Qu.:1.052            3rd Qu.:5.000       3rd Qu.:0.0000000      
##  Max.   :6.480            Max.   :9.480       Max.   :0.1100000      
##  VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
##  Min.   :  0.00    Min.   :  0.00      Min.   : 34.0        Min.   : 125.0  
##  1st Qu.:  0.00    1st Qu.:  0.00      1st Qu.:161.8        1st Qu.: 631.8  
##  Median : 10.00    Median : 12.00      Median :210.5        Median : 716.5  
##  Mean   : 25.75    Mean   : 18.44      Mean   :220.9        Mean   : 713.1  
##  3rd Qu.: 38.25    3rd Qu.: 28.00      3rd Qu.:264.2        3rd Qu.: 781.0  
##  Max.   :210.00    Max.   :143.00      Max.   :518.0        Max.   :1265.0  
##     Calories    TotalSleepRecords TotalMinutesAsleep TotalTimeInBed 
##  Min.   : 741   Min.   :1.000     Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:1860   1st Qu.:1.000     1st Qu.:362.5      1st Qu.:405.2  
##  Median :2225   Median :1.000     Median :433.0      Median :463.0  
##  Mean   :2421   Mean   :1.114     Mean   :419.0      Mean   :458.3  
##  3rd Qu.:2933   3rd Qu.:1.000     3rd Qu.:492.0      3rd Qu.:527.8  
##  Max.   :4900   Max.   :3.000     Max.   :796.0      Max.   :961.0

Firstly, the average daily total steps taken was 8720, which is less than the commonly cited 10,000-step goal from the CDC; however, research has indicated there are significant health benefits to taking even 8000 steps daily (as compared to only taking 4000 steps). If this data is indicative of the average Leaf user as well, then marketing can use this research in promotional materials to state that the average user of Leaf devices can receive these health benefits.

Next, the average total distance of 6.1 is much greater than what the average American walks (1.5 - 2 miles daily), and even the 1st quartile is 3.7. This comparison assumes the distances are recorded in miles, as the data set itself does not state a unit. Again, marketing materials could promote that the average user of Leaf devices covers a much greater distance than the typical American, and that even casual users still cover a significantly greater distance.
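The unit of the distance columns can be sanity-checked from the data itself by computing the stride length implied by one row. The sketch below uses the first row of daily_activity as shown in the str() output (13162 steps, distance 8.5). The conversion factors are standard, but the interpretation is an inference and not something the data set documents.

```r
# First row of daily_activity as shown in the str() output
total_steps <- 13162
total_distance <- 8.5

# Implied stride length in meters if the distance is in kilometers
stride_if_km <- total_distance * 1000 / total_steps

# Implied stride length in meters if the distance is in miles
stride_if_miles <- total_distance * 1609.344 / total_steps

round(c(km = stride_if_km, miles = stride_if_miles), 2)
```

A stride above one meter would be unusually long for walking, so the kilometer reading looks more plausible. If that is right, the comparison against miles walked by the average American would need the distances converted first.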

Continuing, the average very active, fairly active, and lightly active minutes were 25.75, 18.44, and 220.9 per day, respectively. The CDC recommends 150 minutes of “moderate-intensity” or 75 minutes of “vigorous-intensity” activity each week, or an equivalent combination. Assuming very active is vigorous-intensity and fairly active is moderate-intensity, the average user from this data set gets approximately 180 minutes of vigorous-intensity and 130 minutes of moderate-intensity activity each week (the daily means multiplied by seven). The vigorous-intensity total alone is more than double the CDC’s recommendation, which is yet another boast that could be used in marketing promotions.
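The weekly figures follow directly from the daily means in the summary() output. The short calculation below reproduces them and also sketches the "equivalent combination" idea. The 2-to-1 weighting of vigorous to moderate minutes is an assumption made here for illustration, not something stated in this analysis.

```r
# Mean daily minutes from the summary() output above
very_active_daily <- 25.75
fairly_active_daily <- 18.44

# Scale daily means to a seven-day week
vigorous_weekly <- very_active_daily * 7
moderate_weekly <- fairly_active_daily * 7

# Assumption: 1 vigorous minute counts as 2 moderate minutes
moderate_equivalent_weekly <- moderate_weekly + 2 * vigorous_weekly

c(vigorous = vigorous_weekly,
  moderate = moderate_weekly,
  moderate_equivalent = moderate_equivalent_weekly)
```

Under that assumed weighting, the combined total can be compared against the single 150-minute moderate-intensity target.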

A final metric to consider is sleep, of which the average user receives 419 minutes daily. This equates to 6.98 hours of sleep every night on average, which is just shy of the recommended 7+ hours of sleep for most adults. There are several very low observations in the data (entries with as little as 1, 2, or 3 hours of sleep) that could arguably be outliers. Additionally, with more data specific to Leaf users, the average could very well be over the 7-hour mark the CDC recommends. If that proves to be the case, then stating that the average Leaf user gets the recommended amount of sleep each night could be used for marketing purposes.

On examining the medians of these variables, however, the very active and fairly active minutes columns have much lower medians than means. This was further explored in the following box plots:

daily_activity_sleep_m <- daily_activity_sleep %>%
  
  #Converting the data into a long format to facilitate plotting
  melt(id.vars = "Id", measure.vars = c("TotalSteps", "TotalDistance", "VeryActiveMinutes", "FairlyActiveMinutes", "LightlyActiveMinutes", "TotalMinutesAsleep"))

daily_activity_sleep_m %>%
  ggplot(aes(x = variable, y = value, fill = variable)) +
  geom_boxplot() +
  facet_wrap(~variable, scales = "free") +
  labs(title = "Daily Activity Box Plots", subtitle = "The breakdown of the activity data into box plots reveals many outliers for very active and fairly active minutes.") +
  theme(plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 10),
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.title.y = element_blank()) +
  guides(fill = "none")

The box plots revealed that there appear to be many outliers in the very active and fairly active minutes data. As such, these were further broken down by individual and examined:

daily_activity_sleep %>%
  ggplot(aes(x = reorder(as.factor(Id), -VeryActiveMinutes) , y = VeryActiveMinutes, color = as.factor(Id))) +
  guides(color = "none") +
  theme(axis.text.x = element_text(angle = 30),
          plot.title = element_text(size = 20),
          plot.subtitle = element_text(size = 10),
          axis.title.x = element_blank()) +
  labs(title = "Very Active Minutes in a Day by Id", 
       subtitle = "Only a few individuals had days with greater than 100 very active minutes.",
       caption = "The dotted lines represent the 25th and 75th percentiles and the red dashed line represents the median."
       ) +
  geom_point() +
  geom_hline(aes(yintercept=quantile(VeryActiveMinutes, 0.75)), linetype = 'dotted') +
  geom_hline(aes(yintercept=median(VeryActiveMinutes)), color = 'red', linetype = 'dashed') +
  geom_hline(aes(yintercept=quantile(VeryActiveMinutes, 0.25)), linetype = 'dotted')

daily_activity_sleep %>%
  ggplot(aes(x = reorder(as.factor(Id), -FairlyActiveMinutes) , y = FairlyActiveMinutes, color = as.factor(Id))) +
  guides(color = "none") +
  theme(axis.text.x = element_text(angle = 30),
          plot.title = element_text(size = 20),
          plot.subtitle = element_text(size = 10),
          axis.title.x = element_blank()) +
  labs(title = "Fairly Active Minutes in a Day by Id", 
       subtitle = "Only a few individuals had days with greater than 75 fairly active minutes.",
       caption = "The dotted lines represent the 25th and 75th percentiles and the red dashed line represents the median."
       ) +
  geom_point() +
  geom_hline(aes(yintercept=quantile(FairlyActiveMinutes, 0.75)), linetype = 'dotted') +
  geom_hline(aes(yintercept=median(FairlyActiveMinutes)), color = 'red', linetype = 'dashed') +
  geom_hline(aes(yintercept=quantile(FairlyActiveMinutes, 0.25)), linetype = 'dotted')

For both variables, very active and fairly active minutes, the dot plots demonstrate that only a few individuals contributed the outliers shown in their respective box plots. These individuals would therefore account for much of the discrepancy between the median and mean for these variables, as their days with a significantly higher number of very active or fairly active minutes pull the mean higher relative to the median.
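The effect of a few extreme days on the mean can be seen in a tiny made-up example (minutes_demo is hypothetical and its values are not drawn from the data set).

```r
# Hypothetical very active minutes: mostly low days plus two extreme days
minutes_demo <- c(0, 0, 5, 10, 12, 15, 120, 210)

mean_minutes <- mean(minutes_demo)
median_minutes <- median(minutes_demo)

# Removing the two extreme days pulls the mean back toward the median
mean_without_extremes <- mean(minutes_demo[minutes_demo < 100])

c(mean = mean_minutes,
  median = median_minutes,
  mean_without_extremes = mean_without_extremes)
```

The median barely depends on the two extreme days, while the mean is several times larger with them included than without.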

The variation between individuals’ very active and fairly active minutes in the data set highlights the differences that each person has in their exercise and activity habits. While marketing efforts can still make claims based on averages across all users, tailoring notifications, messages, or analytics to each user would provide personalized feedback that can improve their health. For example, based on the user’s very active and fairly active minutes data, a message or notification could indicate whether the user is reaching the CDC-recommended 150 minutes of “moderate-intensity” or 75 minutes of “vigorous-intensity” activity, along with a comparison of their activity against this threshold.

The relationships between variables in the activity data frame and calories burned were investigated next:

totalsteps_calories <- daily_activity_sleep %>%
  ggplot(aes(TotalSteps, Calories)) +
  geom_point() +
  geom_smooth(method = "lm") +
  stat_cor()

totalsteps_calories_f <- daily_activity_sleep %>%
  filter(TotalSteps < quantile(TotalSteps, 0.975)) %>%
  ggplot(aes(TotalSteps, Calories)) +
  geom_point() +
  geom_smooth(method = "lm") +
  stat_cor()

trackerdistance_calories <- daily_activity_sleep %>%
  filter(TrackerDistance < quantile(TrackerDistance, 0.975)) %>%
  ggplot(aes(TrackerDistance, Calories)) +
  geom_point() +
  geom_smooth(method = "lm") +
  stat_cor()

veryactivedist_calories <- daily_activity_sleep %>%
  filter(VeryActiveDistance < quantile(VeryActiveDistance, 0.975)) %>%
  ggplot(aes(VeryActiveDistance, Calories)) +
  geom_point() +
  geom_smooth(method = "lm") +
  stat_cor()

modactivedist_calories <- daily_activity_sleep %>%
  filter(ModeratelyActiveDistance < quantile(ModeratelyActiveDistance, 0.975)) %>%
  ggplot(aes(ModeratelyActiveDistance, Calories)) +
  geom_point() +
  geom_smooth(method = "lm") +
  stat_cor()

lightactivedist_calories <- daily_activity_sleep %>%
  filter(LightActiveDistance < quantile(LightActiveDistance, 0.975)) %>%
  ggplot(aes(LightActiveDistance, Calories)) +
  geom_point() +
  geom_smooth(method = "lm") +
  stat_cor()

veryactivemin_calories <- daily_activity_sleep %>%
  filter(VeryActiveMinutes < quantile(VeryActiveMinutes, 0.975)) %>%
  ggplot(aes(VeryActiveMinutes, Calories)) +
  geom_point() +
  geom_smooth(method = "lm") +
  stat_cor()

grid.arrange(totalsteps_calories, totalsteps_calories_f, trackerdistance_calories, 
             veryactivedist_calories, modactivedist_calories, lightactivedist_calories, 
             veryactivemin_calories, 
             ncol = 2, 
             top = text_grob("Daily Activity vs. Calories Scatter Plots", hjust = -0.125, vjust = 0.5, x = 0, size = 20)
             )

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
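The seven plot definitions above differ only in the variable plotted and in whether the upper tail is filtered, so they could be generated by one helper. The sketch below is a possible refactor rather than the code used in the report: `plot_vs_calories` and its `upper_quantile` argument are hypothetical names, and the toy data frame only stands in for `daily_activity_sleep`. Passing `formula = y ~ x` explicitly also suppresses the repeated `geom_smooth()` message.

```r
library(ggplot2)
library(ggpubr)

# Hypothetical helper: scatter plot of one activity variable against Calories,
# optionally keeping only rows below the given upper quantile of that variable.
plot_vs_calories <- function(data, var, upper_quantile = NULL) {
  if (!is.null(upper_quantile)) {
    cutoff <- quantile(data[[var]], upper_quantile)
    data <- data[data[[var]] < cutoff, ]
  }
  ggplot(data, aes(.data[[var]], Calories)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x) +  # explicit formula, no message
    stat_cor()                                     # annotates R and p-value
}

# Toy stand-in for daily_activity_sleep (invented values)
set.seed(1)
toy <- data.frame(TotalSteps = runif(100, 0, 20000),
                  VeryActiveMinutes = runif(100, 0, 90))
toy$Calories <- 1800 + 0.04 * toy$TotalSteps + rnorm(100, sd = 300)

# One filtered plot per variable of interest
vars <- c("TotalSteps", "VeryActiveMinutes")
plots <- lapply(vars, function(v) plot_vs_calories(toy, v, upper_quantile = 0.975))
```

The resulting list can be handed to `grid.arrange()` through its `grobs` argument instead of naming each plot individually.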

The first total steps vs. calories plot reveals a weak correlation between total steps taken and calories; however, there appear to be outliers (roughly >15,000 steps) that may be skewing the correlation. Filtering out observations at or beyond the 0.975 quantile makes the correlation even weaker. So while a positive relationship exists between total steps and calories burned, it is not very strong and therefore would not be useful for further consideration.
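The before-and-after comparison can also be made numerically rather than read off the plots. The sketch below uses simulated data, not the Fitbit data, and relies on `stat_cor()` reporting Pearson's R by default, which is what `cor()` also computes. Trimming the upper tail narrows the range of the predictor, which is one reason the coefficient can shrink after filtering.

```r
set.seed(42)  # simulated data for illustration only

# Mostly typical days plus a few very high step counts
steps <- c(rnorm(200, mean = 8000, sd = 2500), 22000, 25000, 28000)
calories <- 1800 + 0.05 * steps + rnorm(length(steps), sd = 400)
toy <- data.frame(TotalSteps = steps, Calories = calories)

# Pearson correlation on all rows
r_all <- cor(toy$TotalSteps, toy$Calories)

# Same filter as the report: keep rows below the 0.975 quantile
toy_f <- toy[toy$TotalSteps < quantile(toy$TotalSteps, 0.975), ]
r_filtered <- cor(toy_f$TotalSteps, toy_f$Calories)

c(all = r_all, filtered = r_filtered,
  rows_removed = nrow(toy) - nrow(toy_f))
```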

A stronger relationship exists between daily tracked distance and calories burned; however, the correlation is still relatively weak.

Additionally, only poor correlations could be demonstrated between the active distance measurements and calories burned.

A moderate correlation exists between calories burned and very active minutes; this could be used to notify users of the Bellabeat app that they are burning calories at a high rate when they begin logging very active minutes.

As exploring scatter plots one-by-one was inefficient, a correlation matrix was generated to quickly understand the relationship between variables:

#Picking the columns in the daily_activity_sleep data frame to correlate
daily_activity_sleep_numerics <- daily_activity_sleep[,c(3,4,5,6,7,8,9,11,12,13,14,15,17,18)]

#Generating the correlation matrix
activity_cor_mat <- cor(daily_activity_sleep_numerics)

#Converting the matrix into a data frame and transforming its arrangement such that each correlation variable and the correlation itself has its own column
activity_cor_table <- activity_cor_mat %>%
  as.data.frame %>%
  rownames_to_column("Row_Variable") %>%
  pivot_longer(-"Row_Variable")

#Renaming columns
activity_cor_table <- activity_cor_table %>%
  rename("Column_Variable" = "name") %>%
  rename("Correlation" = "value")

#Removing the diagonal (i.e., where the same variable was correlated with itself)
activity_cor_table_f <- filter(activity_cor_table, Column_Variable != Row_Variable)

#Removing instances where two of the same correlation existed (i.e., Row_var A with Column Var B and then Row_Var B with Column Var A, etc.)
activity_cor_table_f <- activity_cor_table_f %>%
  distinct(Correlation, .keep_all = TRUE)

#Removing any weak correlations
activity_cor_table_f <- filter(activity_cor_table_f, Correlation >= 0.40 | Correlation <= -0.40)

activity_cor_table_f <- arrange(activity_cor_table_f, -Correlation)
activity_cor_table_f

activity_cor_table %>%
  ggplot(aes(Row_Variable, Column_Variable, fill = Correlation)) +
  geom_tile() +
  
  #Sets the colors of the heat map and forces the limits of color gradient to match correlations of -1 to 1. 
  scale_fill_gradient2(low = "blue", mid = "white", high = "orange", limits = c(-1, 1)) +
  
  theme(
        
        #Adjusts the angle, margin, and position of the x-axis texts
        axis.text.x = element_text(angle = -30,
                                   margin = margin(t = -0), 
                                   hjust = 0
                                   ),
        axis.ticks.length.x = unit(0.5, "cm"),
        axis.title = element_blank(),
        
        #Adjusting the position of the title and subtitle
        plot.title = element_text(hjust = -2),
        plot.subtitle = element_text(hjust = 0.935)
        ) +
  
  labs(title="Daily Activity, Calories, and Sleep Correlations", subtitle = "The correlation heat map reveals that many of the activity variables have weak correlations.")
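The de-duplication step above keys on the correlation value itself, so two different variable pairs that happened to share exactly the same coefficient would be collapsed into one row. An alternative, sketched here in base R on toy data, is to take only the cells strictly above the diagonal of the matrix, which yields each pair exactly once and drops the diagonal in the same step. The `toy`, `cor_mat`, and `cor_pairs` names are illustrative and not part of the original analysis.

```r
# Toy numeric data standing in for daily_activity_sleep_numerics
set.seed(1)
toy <- data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50))
toy$d <- toy$a + rnorm(50, sd = 0.1)  # d is built to correlate strongly with a

cor_mat <- cor(toy)

# Row/column positions strictly above the diagonal: each pair appears once
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)

cor_pairs <- data.frame(
  Row_Variable    = rownames(cor_mat)[idx[, "row"]],
  Column_Variable = colnames(cor_mat)[idx[, "col"]],
  Correlation     = cor_mat[idx]
)

# Keep |R| >= 0.40 and sort from the largest coefficient downward
cor_pairs <- cor_pairs[abs(cor_pairs$Correlation) >= 0.40, ]
cor_pairs <- cor_pairs[order(-cor_pairs$Correlation), ]
cor_pairs
```

For k variables this produces k(k - 1)/2 candidate pairs before the threshold is applied.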

The three highest correlations (R > 0.98) were among TotalDistance, TrackerDistance, and TotalSteps. The strong correlation between these variables highlights how rarely users enter logged activities that contribute to the total distance, as opposed to relying on the tracked distance their device calculates. In fact, of the original 831 observations in the raw daily_activity data frame, only ~30 entries contain non-zero values in the LoggedActivitiesDistance column. Users evidently prefer to let the device track their numbers rather than enter them manually. Marketing can use this preference by playing up the ease of using the Leaf or Time devices to track user activity.

The remaining positive correlations that were at least weak (R > 0.40) were combinations of the distances or minutes for the four distinct categories (Very Active, Moderately Active, Light Active, and Sedentary), the three aforementioned highly correlated variables (TotalDistance, TrackerDistance, and TotalSteps), and Calories. Of these correlations, Calories was most highly correlated with VeryActiveMinutes (R > 0.61). This correlation was previously shown to weaken upon removing outliers (R > 0.51), but it still constitutes a moderate correlation. While additional analysis with weight logging data would be ideal, users viewing their very active minutes in the app could be encouraged by a note that, of the variables examined, this one is most strongly correlated with calories burned in a day.

Summary of Findings and Recommendations

Trend: Smart device users are not logging their weight consistently.

Application to Bellabeat customers: Encourage users to be diligent and consistent with weight tracking through notifications on the Bellabeat app.
Marketing insight: Highlight the ease of tracking health data with the Bellabeat app, emphasizing the simplicity of logging weight entries.
Future analysis: Explore if those who track their activity, sleep, weight, and other health data more consistently tend to lose more weight. Demonstrating this relationship would provide a marketing strategy for selling Bellabeat products (i.e., using the app and the smart wellness products leads to losing weight). It would also be an effective means of encouraging users to continue using the Bellabeat app and the Leaf and Time products.

Trend: Over the course of a month, several users’ heart rate data demonstrated peaks representing times of vigorous exertion, possibly deliberate exercise, as well as periods of markedly lower heart rates, whether compared to the average or the 25th percentile of users’ data, likely indicating sleep.

Application to Bellabeat customers: This data could be used to generate notifications on the Bellabeat app to users that either prompt them to consider getting a workout in or to begin winding down for bed based on their historic average heart rate data.
Marketing insight: Advertise that users of wearable Bellabeat smart wellness products (i.e., Leaf and Time) will receive personalized and customizable recommendations from the Bellabeat app based on their typical day to day activity levels.
Future analysis: Investigate clinical studies that involve health data tracked by Bellabeat products for correlations on different outcomes. With an understanding of how this health data connects to clinical outcomes, users’ health data could be used to help inform medical decision making in collaboration with their health care providers and be used as a selling point to help improve health. A possible obstacle, however, could be regulations regarding medical devices.

Trend: Some users’ heart rate data shows consistent times where no data was collected.

Application to Bellabeat customers: Depending on the circumstances, prompt or encourage users to wear their wearable smart wellness product during times of noted missing heart rate data.
Marketing insight: Emphasize how the Leaf and Time products are fashionable to wear for any occasion throughout the day and their durability to withstand dust, sand, dirt, and water.
Future analysis: Probe the data to determine if a relationship exists between consistently wearing smart wellness products (i.e., % worn on a day-to-day basis) and calories burned. If a positive relationship does exist, then this data could both motivate additional verbiage in notifications encouraging users to wear their device in periods of noted lack of heart rate data and supply further marketing material (i.e., users who consistently wear Bellabeat products burn more calories).

Trend: Users’ heart rate data reveals several patterns that could be considered “archetypes” of different types of users.

Application to Bellabeat customers: The type of pattern that a user exhibits over an averaged period of time could be used to assign the user to a particular “archetype” that allows for personalized recommendations and advice.
Marketing insight: Advertise that the Bellabeat app learns and understands the habits of users of the Leaf and Time products and can create personalized recommendations based on the health data that they generate.
Future analysis: Machine learning algorithms could be trained to detect these patterns and possibly identify new ones. Marketing could then boast that the Bellabeat app uses artificial intelligence (AI) to create a truly customized experience for those who use it with the Time and Leaf products. A drawback of the current data is that the population for the heart rate data was small (n = 14). While several “archetypes” were speculated, analyzing a larger data set and using more advanced data science techniques to generate categories based on the trends in the data would be a significant next step.

Trend: Average daily total steps taken was 8720.

Application to Bellabeat customers: Encourage users with motivating messages as they approach various milestones that have associated positive health outcomes (e.g., studies have demonstrated significantly better health outcomes for those who average at least 8000 steps daily as compared to 4000 steps).
Marketing insight: Promote that users of the Time and Leaf products on average are associated with significant health benefits (average is 8720, which is greater than the 8000 steps in the study referenced above).
Future analysis: Research if other step milestones are associated with health benefits or positive outcomes to provide encouragement to users and marketing material.

Trend: Average daily total distance (6.1) is much greater than that of the average American (1.5 – 2).

Application to Bellabeat customers: Provide users with encouraging messages with notifications or when analyzing their health data in the app as they reach various milestones associated with distance.
Marketing insight: Play up that users of wearable smart wellness products travel up to four times the daily distance on average as compared to the average American.
Future analysis: Search for additional distance benchmarks that are either prominent (e.g., the comparison to average Americans) or associated with health benefits.

Trend: Average very active and fairly active minutes were 25.75 and 18.44 respectively.

Application to Bellabeat customers: The CDC recommends 150 minutes of “moderate-intensity” or 75 minutes of “vigorous-intensity” activity each week, or an equivalent combination. With the previously stated averages, assuming very active is vigorous-intensity and fairly active is moderate-intensity, the average user of wearable smart wellness products gets approximately 180 and 130 minutes of vigorous-intensity and moderate-intensity activity each week, respectively. As with the previous activity data, notifications or insights could be provided to Bellabeat users indicating how their activity compares to these CDC thresholds.
Marketing insight: Call attention to the fact that using Bellabeat smart wellness products helps one meet the CDC recommendations for physical activity.
Future analysis: The daily averages were extrapolated to weekly averages by multiplying them by the days in a week; however, it would be more accurate to compute weekly totals directly than to multiply the daily figures. Unfortunately, with only one month of data, this would only provide 4 weeks of data. Ideally, data would be collected from a larger population of users over a longer span of time.
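The extrapolation, and the weekly aggregation proposed as an improvement, can be sketched as follows. The first part uses only the two daily averages quoted in this report. The second part runs on invented data, and its `Date` and `Week` columns and the 7-day blocking from the first recorded date are assumptions.

```r
# Daily averages quoted in the report
avg_very_active   <- 25.75
avg_fairly_active <- 18.44

# Extrapolation used in the report: daily mean multiplied by 7
weekly_estimate <- c(vigorous = avg_very_active,
                     moderate = avg_fairly_active) * 7
round(weekly_estimate)  # vigorous 180, moderate 129

# Alternative sketch: sum minutes per user per 7-day block, then average.
# Hypothetical toy data covering two full weeks for one user.
toy <- data.frame(
  Id = "A",
  Date = as.Date("2016-04-11") + 0:13,
  VeryActiveMinutes = c(rep(30, 7), rep(10, 7))
)

# Week index: 0 for the first seven days, 1 for the next seven, and so on
toy$Week <- as.numeric(toy$Date - min(toy$Date)) %/% 7

weekly_totals <- aggregate(VeryActiveMinutes ~ Id + Week, data = toy, FUN = sum)
mean(weekly_totals$VeryActiveMinutes)  # 140
```

With real data, incomplete weeks at the start or end of the collection period would need to be dropped or rescaled before averaging.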

Trend: Average daily minutes of sleep was 419 minutes.

Application to Bellabeat customers: Provide analytics to Bellabeat app users regarding their sleep in comparison to CDC recommendations based on age demographic. Allow for customizable alerts that encourage users to begin preparing for bed to help them meet sleep goals.
Marketing insight: Spotlight that using Bellabeat smart wellness products promotes getting adequate sleep every night and that the app can assist those who have trouble getting adequate sleep.
Future analysis: Establish a better understanding around sleep log entries that are low (i.e., 1, 2, or 3 hours of sleep at night) to determine what data could be considered erroneous and excluded from analysis. This would subsequently help improve averages and assist in marketing efforts.

Trend: Most users do not elect to log activity entries manually and instead rely on the tracked distances that their device calculates.

Application to Bellabeat customers: Prompt users to provide logged activity entries for time periods during which no heart rate data was recorded (i.e., for times that the smart wellness device was not being worn).
Marketing insight: Similar to the recommendation for logging weight, highlight the ease of tracking health data with the Bellabeat app, emphasizing that the Leaf or Time device automatically captures and calculates the distance that the user has traveled each day.
Future analysis: For the subset of users who do not log activity entries and who show periods of device non-use (as indicated by gaps in heart rate data), explore correlations with other health data.

Trend: Very active minutes is moderately correlated with burned calories.

Application to Bellabeat customers: Provide positive reinforcement via the Bellabeat app through the Time smart wellness device for users when their heart rate activity demonstrates they are “very active” by stating that they are now burning calories at a high rate.
Marketing insight: Focus on how the Leaf and Time smart wellness devices promote burning calories and weight loss by automatically tracking how much time you spend being active and allowing consumers to then analyze that data and set and adjust their exercise goals accordingly.
Future analysis: Break out the correlation data by user and explore the differences as compared to the average to identify if any confounding variables exist that are weakening the overall average correlations.
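A per-user breakdown of this kind could start from something like the sketch below. It runs on simulated data in which each user has a different calorie baseline and slope, so the numbers say nothing about the Fitbit users; `pooled_r` and `per_user_r` are illustrative names. It shows how pooling users with different baselines can yield a weaker overall correlation than some individuals show on their own.

```r
set.seed(7)  # simulated data for illustration only

toy <- data.frame(
  Id = rep(c("A", "B", "C"), each = 30),
  VeryActiveMinutes = runif(90, 0, 60)
)

# Each simulated user has a different baseline and a different slope
baseline <- c(A = 1600, B = 2200, C = 2800)[toy$Id]
slope    <- c(A = 12,   B = 6,    C = 1)[toy$Id]
toy$Calories <- as.numeric(baseline + slope * toy$VeryActiveMinutes +
                             rnorm(90, sd = 80))

# Correlation across all rows, ignoring who the rows belong to
pooled_r <- cor(toy$VeryActiveMinutes, toy$Calories)

# Correlation computed separately within each user
per_user_r <- sapply(split(toy, toy$Id),
                     function(d) cor(d$VeryActiveMinutes, d$Calories))

list(pooled = pooled_r, per_user = per_user_r)
```

Large gaps between the per-user values and the pooled value would point to user-level factors, such as body weight, worth examining as possible confounders.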

Limitations

There were several limitations with the data used in the analysis as well as other hurdles:

The data sets had relatively few individual users: n = 33 at most in the activity data set, n = 14 in the heart rate data set, and effectively n = 2 at worst in the weight log data set. With so few people, the data may not be representative of smart wellness device user trends as a whole.
All of the data sets covered only a single month and therefore may not be representative of user trends over longer periods of time.
The data sets were generated in 2016 and therefore may not be representative of current user trends as of this report in 2023.
There is a general difficulty in obtaining publicly available data generated from smart wellness devices. The All of Us Research Hub has public Fitbit data; however, access requires affiliation with an institution (primarily academic) and agreement to specific terms.