1 Introduction

1.1 Framework

The case study “How Can a Wellness Technology Company Play It Smart?” is an optional capstone project that is part of the Google Data Analytics Professional Certificate program. The data analytics program developed by Google focuses on job-ready skills that are in high demand, such as how to analyze and process data to gain key business insights, and how to communicate such insights effectively to empower informed decision-making.

The focal point of this case study is the high-tech company Bellabeat, which is a manufacturer of health-focused products for women. Bellabeat’s leading product is a wellness tracker that can be worn as an elegant bracelet, necklace, or clip. By connecting to the Bellabeat app, the device has the ability to track activity, sleep, and stress levels.

With the use of data analytics and informed decision-making, Bellabeat has the potential to become a larger player in the global market of smart devices, and the present analysis provides some insights and recommendations to help achieve that goal.

1.2 Business Task

The objective of this project is to analyze activity tracker data to gain insight into how consumers are using their non-Bellabeat smart devices for their health, and to use these insights to provide high-level recommendations that help guide the company’s marketing strategy.

Stakeholders

  • Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer

  • Sando Mur: mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team

  • Bellabeat marketing analytics team: data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy

2 Description of the Data

2.1 Data Source

Initially, as suggested by one of the stakeholders, this analysis was going to be based on the Fitbit Fitness Tracker Data made available by the Kaggle user Mobius at https://www.kaggle.com/datasets/arashnic/Fitbit. However, during the initial exploratory phase we found that Mobius’s datasets were incomplete relative to the original source. This explains why the time frame of the CSV files available on Kaggle does not match the description of the datasets: the Kaggle files were missing the data from March 12, 2016 to April 11, 2016.

To address this issue, we obtained the data from the original source, Furberg et al. (2016), which makes the complete datasets available under a Creative Commons license (Attribution 4.0 International) on the open repository Zenodo at https://zenodo.org/record/53894. By using the complete datasets, this analysis was based on twice as many observations as it would have been with the Kaggle datasets. This not only gave us a greater level of confidence, but also helped identify additional patterns, since the analysis covered two months instead of just one.

2.2 Description of the Datasets

The original crowd-sourced Fitbit datasets were collected from thirty-five Fitbit users via Amazon Mechanical Turk and included two months of data; that is, one month of retrospective data extending from March 12 to April 11, 2016, and one month of prospective data, from April 12 to May 12, 2016. The datasets included minute-level output for physical activity, heart rate, and sleep monitoring.

The complete datasets consisted of 29 CSV files divided into two folders. The first folder contained 11 CSV files with retrospective data, while the second folder had 18 CSV files with prospective data. Some of these CSV files were just the aggregated values of the more detailed data contained in the same folder.

It must also be noted that many of the variables included in those files were calculated as functions of only a few independent variables that were collected either via the sensors of the Fitbit device or through data logged by the user (Fitbit MyHelp 2016). For instance, the values for daily calories burned are obtained from a function that uses as inputs the user’s daily steps, heart rate, and logged activity. In that sense, to avoid circular reference and achieve meaningful results that are relevant to the business task, this analysis was focused only on the variables that meet one of the following conditions:

  • Independent variables
  • Variables relevant to the services provided by Bellabeat’s Leaf device, which tracks activity, sleep, and stress
  • Dependent variables that could serve an illustrative purpose for key insights

In that respect, the variables of the Fitbit datasets that this analysis was more concerned with were: features usage, number of steps, intensity level, sleep data, and heart rate.
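
The circular-reference concern described above can be illustrated with a minimal sketch on simulated data. The formula and all values below are hypothetical stand-ins (they are not Fitbit's actual calorie equation); the point is only that a variable computed from another variable correlates with it by construction.

```r
# Toy illustration with simulated data (not Fitbit values).
set.seed(42)
daily_steps <- round(runif(200, min = 1000, max = 15000))

# Hypothetical formula standing in for a proprietary calorie equation:
# calories are defined as a function of steps plus some noise.
calories <- 1500 + 0.04 * daily_steps + rnorm(200, mean = 0, sd = 50)

# The correlation is high simply because calories were derived from steps,
# so it would not be an independent insight about user behavior.
step_calorie_cor <- cor(daily_steps, calories)
print(step_calorie_cor)
```

A strong correlation of this kind reflects the underlying equation rather than a pattern in how users behave, which is why derived variables were used here only for illustrative purposes.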

2.3 Data Limitations and Strengths

The Fitbit Fitness Tracker Data presented some limitations that are explained below. Nevertheless, the data could still be acceptable for the purpose of this analysis if the right measures are put in place and if the stakeholders take these limitations into consideration during the decision-making process.

Data reliability: It must be noted that the source of the data is neither an organization nor a company; the data was collected by Furberg et al. (2016) via Amazon Mechanical Turk. Although MTurk is a reliable tool for generating sample responses comparable to more conventional means (see Mortensen and Hughes 2018, 533), there were some issues with the accuracy and consistency of the data (see Section 4) that we had to address before using it for this analysis. After making the necessary corrections discussed in later sections, the data was reliable enough for this analysis.

However, when making consequential decisions based on this analysis, the stakeholders should be aware of another limiting factor of the data: the files do not provide any information on gender or age (see “data comprehensiveness” paragraph below).

Original source: At the beginning of this analysis there were some concerns with respect to the datasets that were made available through Kaggle since it was not clear how the user Mobius was involved in the collection of the original data, but this issue was resolved by finding the original source of the datasets (see Furberg et al. 2016), which was available on Zenodo.

Data comprehensiveness: As mentioned before, the files were missing information on gender and age, and, in the case of the Kaggle files, they were also missing the data from the retrospective period. The latter issue was resolved by using the original source of the datasets. The former must be addressed through sound and prudent decision-making practices, because the missing information on gender and age makes it impossible to determine whether the sample was representative of the larger population. In addition, the files only covered two months of data, which is a very short time frame for analyzing behavioral consistency and patterns.

Timeliness (not current): The data was collected seven years prior to the date of this analysis, which could be considered a long time in the tech industry, as trends tend to change rapidly in this market. For instance, users today might experience a different level of motivation in relation to activity trackers. New and more advanced devices may have a more positive effect on users’ motivation, but it must also be noted that the basic concept has been in the market for some time already, and thus the excitement among users could be different in some respects. These unknowns were not within the scope of this analysis, but they should be considered when designing a robust marketing strategy.

Cited source: The original datasets have been cited by at least six academic publications according to both Google Scholar and Dimensions databases. Therefore, we may conclude that these datasets are reliable enough for this analysis provided that the stakeholders understand their limitations.

3 Data Cleaning

This analysis used the original non-aggregated files (minute-by-minute data) as the starting point due to some inconsistencies that were found in the aggregated files (daily and hourly data) that were created by the original source (see Section 4). Using the more detailed files was appropriate for two reasons: first, this approach ensured a higher degree of accuracy and consistency; and second, our analysis of resting heart rate required these smaller time intervals.

It must be noted that this analysis was not concerned with variables such as calories burned by the user or distance, considering that these values were obtained from Fitbit equations that were already based on the independent variables analyzed herein, and hence any positive correlation would be due to the underlying equation from which these values were calculated in the first place.

3.1 Environment: R Packages

The environment for the analysis was set up by loading the following R packages.

library(janitor)    # Data cleaning: clean_names, make_clean_names
library(tidyverse)  # Core packages: dplyr, tibble, readr, tidyr, ggplot2
library(dplyr)      # Manipulation: filter, mutate, summarize,
                    # group_by, distinct, n_distinct
library(readr)      # Fast way to read CSV and TSV files: read_csv
library(lubridate)  # To manage date variables: is.Date, is.POSIXct
library(knitr)      # Dynamic report generation
library(kableExtra) # To create tables: kbl, kable_classic
library(stringr)    # To work with strings
library(scales)     # To display percentages: percent
library(devtools)   # Package development tools
library(ggrepel)    # To separate graph labels from data

3.2 Data Cleaning Methodology

The original source grouped the datasets into two different folders, corresponding to retrospective and prospective data; however, the file names did not change across such folders. To make the data frames distinguishable from one another, we started by creating a process that renamed all the relevant data frames according to the folder in which they were located (“Fitbit_data_retro” or “Fitbit_data_prosp”), so that the “retro” suffix would be added to the names of the retrospective data frames (2016-03-12 to 2016-04-11), while the “prosp” suffix would be added to the names of the prospective data frames (2016-04-12 to 2016-05-12).

To automate a large part of the data cleaning process, all the relevant Fitbit files that were used for this analysis were imported into two master lists of data frames: dflist_retro_orig and dflist_prosp_orig. By creating these two master lists, many cleaning tasks could be done automatically for all the data frames through the use of the lapply function and loops. This also means that the same global cleaning process could be applied to any number of Fitbit files by simply adding or deleting files from the previously specified folder.

This approach proved to be advantageous for three reasons:

  1. The same code can be reused to clean future Fitbit data.
  2. The number of files can be changed to broaden or narrow the scope of the analysis without affecting the global cleaning process.
  3. This approach reduces the amount of code and makes the code blocks easier to read.
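
As a minimal sketch of the master-list idea, the following toy example applies one cleaning step to every data frame in a list with lapply. The list, data frame names, and values are hypothetical and are only meant to show the pattern, not the actual Fitbit files.

```r
# Toy master list of data frames (hypothetical names and values)
dflist_demo <- list(
  steps_demo = data.frame(Id = c(1, 2), TotalSteps = c(100, 200)),
  sleep_demo = data.frame(Id = c(1, 2), MinutesAsleep = c(400, 380))
)

# Apply the same cleaning step to every data frame at once
dflist_demo_clean <- lapply(dflist_demo, function(df) {
  names(df) <- tolower(names(df))
  df
})

# Summarize every data frame without writing a separate call for each one
rows_per_df <- sapply(dflist_demo_clean, nrow)
print(rows_per_df)
```

Adding or removing an element from the list changes the scope of the cleaning step without any change to the code that performs it.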

Importing the relevant files and assigning names to data frames

To import all the relevant files at once, the function lapply was used in conjunction with read_csv. The datasets were then placed into the master lists of data frames to automate the cleaning steps that could be carried out for the datasets as a whole.

The following code also created and assigned the corresponding names to all the data frames based on the original file names.

# Obtain all the relevant file names
# saved in the retrospective and prospective data folders
# and assign them to the data frames

# Escape the dot so that the pattern only matches a literal ".csv" extension
filepath_retro_orig <-
  list.files("Fitbit_data_retro",
             pattern = "\\.csv$",
             full.names = TRUE)

filepath_prosp_orig <-
  list.files("Fitbit_data_prosp",
             pattern = "\\.csv$",
             full.names = TRUE)

# Load all the relevant csv files at once to master lists of data frames
dflist_retro_orig <- lapply(filepath_retro_orig, read_csv)
dflist_prosp_orig <- lapply(filepath_prosp_orig, read_csv)

# Generate names for the original data based on the file names.
dfnames_retro_orig <- basename(filepath_retro_orig) %>%
  stringr::str_replace_all(c(
    "_merged" = "",
    "Narrow" = "",
    "narrow" = "",
    "LogInfo" = "",
    "\\.csv$" = ""   # escaped dot: remove only the literal file extension
  )) %>%
  make_clean_names() %>%
  paste("retro_orig", sep = "_")

dfnames_prosp_orig <- basename(filepath_prosp_orig) %>%
  stringr::str_replace_all(c(
    "_merged" = "",
    "Narrow" = "",
    "narrow" = "",
    "LogInfo" = "",
    "\\.csv$" = ""   # escaped dot: remove only the literal file extension
  )) %>%
  make_clean_names() %>%
  paste("prosp_orig", sep = "_")

# For flexibility, create individual data frames by assigning names
for (i in seq_along(dflist_retro_orig)) {
  assign(dfnames_retro_orig[[i]], dflist_retro_orig[[i]])
}

for (i in seq_along(dflist_prosp_orig)) {
  assign(dfnames_prosp_orig[[i]], dflist_prosp_orig[[i]])
}

# Assign names to elements in list of data frames
names(dflist_retro_orig) <- dfnames_retro_orig
names(dflist_prosp_orig) <- dfnames_prosp_orig

For illustrative purposes, two reference tables (Table 3.2.1 and Table 3.2.2) were generated containing the files that were selected for the analysis and their corresponding data frames. The information contained in these tables is generated automatically, depending on which files are saved in the data folders previously specified.

Table 3.2.1 Retrospective data used for the analysis
2016-03-12 to 2016-04-11

List Index  Data Frames                    Original File Names
1           daily_activity_retro_orig      dailyActivity_merged.csv
2           heartrate_seconds_retro_orig   heartrate_seconds_merged.csv
3           minute_intensities_retro_orig  minuteIntensitiesNarrow_merged.csv
4           minute_sleep_retro_orig        minuteSleep_merged.csv
5           minute_steps_retro_orig        minuteStepsNarrow_merged.csv
6           weight_retro_orig              weightLogInfo_merged.csv

Table 3.2.2 Prospective data used for the analysis
2016-04-12 to 2016-05-12

List Index  Data Frames                    Original File Names
1           daily_activity_prosp_orig      dailyActivity_merged.csv
2           heartrate_seconds_prosp_orig   heartrate_seconds_merged.csv
3           minute_intensities_prosp_orig  minuteIntensitiesNarrow_merged.csv
4           minute_sleep_prosp_orig        minuteSleep_merged.csv
5           minute_steps_prosp_orig        minuteStepsNarrow_merged.csv
6           sleep_day_prosp_orig           sleepDay_merged.csv
7           weight_prosp_orig              weightLogInfo_merged.csv

The column “Data Frames” in Table 3.2.1 and Table 3.2.2 also represents the various data frames contained within the master lists dflist_retro_orig and dflist_prosp_orig, which, as mentioned previously, were used to automate several data cleaning steps.
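
The code that produced these reference tables is not shown in this report. A minimal sketch of how such a table could be built from the file paths and the generated names is given below; the vectors filepath_demo and dfnames_demo and the column names are hypothetical stand-ins for the objects created earlier.

```r
# Hypothetical file paths standing in for the output of list.files()
filepath_demo <- c(
  "Fitbit_data_retro/dailyActivity_merged.csv",
  "Fitbit_data_retro/minuteSleep_merged.csv"
)

# Hypothetical data frame names standing in for the generated names
dfnames_demo <- c("daily_activity_retro_orig", "minute_sleep_retro_orig")

# Build the reference table: list index, data frame name, original file name
reference_table <- data.frame(
  list_index = seq_along(filepath_demo),
  data_frames = dfnames_demo,
  original_file_names = basename(filepath_demo)
)
print(reference_table)
```

Because the table is derived from the file paths, it updates itself whenever files are added to or removed from the data folder.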

3.3 Initial Exploration of the Data

In order to create table previews that are not only useful but also consistent and visually appealing, we created a custom configuration for all the tables that were used for such previews.

# Create a custom configuration for Kable tables
kable_custom <- function(kable_input) {
  kable_classic(
    kable_input,
    "striped",
    font_size = 14,
    full_width = FALSE,
    position = "left"
  )
}

We created the df_preview function to facilitate and automate the generation of table reports highlighting the data that was needed prior to data cleaning. This function can take as input either a single data frame or an entire master list of data frames to process all the information at the same time.

# Function to create table preview:
# dfs can be a single data frame or a list of data frames
# tbl_num and preview_name are used to create the table caption
df_preview <-
  function(dfs, compact = FALSE, tbl_num ="", preview_name =""){
    
  prev <<- data.frame()
  
  # inner function to be used by outer conditions 1 or 2
  sub_preview <- function(df, df_name, tbl_num) {
    df <- eval(as.data.frame(df))
    
    # Determine the class type for each column of the input data frame(s).
    # Only the first class is kept for columns with several classes.
    dfcol_types <- ""
    for (i in seq_len(ncol(df))) {
      dfcol_types <-
        paste(dfcol_types,
              paste0(colnames(df)[i], " <", class(df[[i]])[1], ">"),
              sep = " | ")
    }
    
    # POSIXct columns have two classes. Use [1] to extract only one class
    dfcol_types <- dfcol_types[1]
    
    # Identify if date-time column has date, POSIXct or character format
    if (length(which(sapply(df, is.Date))) != 0L) {
      date_col_index <- which(sapply(df, is.Date))
      
      start_date <-
        df %>% summarise_at(.vars = names(date_col_index), min)
      start_date <- start_date[[1]]
      
      end_date <-
        df %>% summarise_at(.vars = names(date_col_index), max)
      end_date <- end_date[[1]]
      
    } else if (length(which(sapply(df, is.POSIXct))) != 0L) {
      date_col_index <- which(sapply(df, is.POSIXct))
      
      start_date <-
        df %>% summarise_at(.vars = names(date_col_index), min)
      start_date <- as.POSIXct(start_date[[1]])
      
      end_date <-
        df %>% summarise_at(.vars = names(date_col_index), max)
      end_date <- as.POSIXct(end_date[[1]])
      
    } else {
      date_col_index <- which(str_count(df[1, ], "/") == 2)
      start_date <- "convert format"
      end_date <- "convert format"
    }
    
    # Get the relevant information for each data frame
    df_cols <- ncol(df)
    df_rows <- nrow(df)
    id_index <-  which(colnames(df) == "id" | colnames(df) == "Id")
    df_ids <- n_distinct(df[, id_index])
    df_empty <- sum(df[, -date_col_index] == "", na.rm = TRUE)
    df_na <- sum(is.na(df))
    df_dupl <- sum(duplicated(df))
    
    # If the option compact (default = FALSE) is FALSE,
    # create individual preview table for each data frame
    if (compact == FALSE) {
      print(
        df[1:4, ] %>%
          mutate(across(-1,~format(.x, big.mark = ",", digits = 2))) %>%
          kbl(caption = paste0(
            "<left>Table ", tbl_num,
            " Data preview: ", df_name, "</i>"),
          escape = F
        ) %>% kable_custom() %>%
          scroll_box(width = "100%")%>% footnote(
            general = paste0(
              "\nStart date: ", start_date[[1]],
              "\nEnd date: ", end_date[[1]],
              "\nColumns: ", df_cols,
              "\nRows: ", formatC(df_rows, big.mark = ","),
              "\nUnique IDs: ", df_ids,
              "\nEmpty cells: ", df_empty,
              "\nN/A cells: ", df_na,
              "\nDuplicate rows: ", df_dupl,
              "\n",
              "\nColumn types: ", dfcol_types
            ),
            general_title = ""
          )
      )
    
      cat("___\n")
    
    # If the option compact is TRUE,
    # create one single preview table for all data frames
    } else if (compact == TRUE) {
      prev <<-
        rbind(
          prev,
          data.frame(
            data_frame = c(df_name),
            columns = c(df_cols),
            rows = c(df_rows),
            unique_ids = c(df_ids),
            empty_cells = c(df_empty),
            na_cells = c(df_na),
            duplicate_rows = c(df_dupl),
            start_date = c(start_date),
            end_date = c(end_date)
          )
        )
      
      return(prev)
      
    }
  }
  
  # OUTER CONDITION 1: input is a single data frame
  # Apply inner function sub_preview
  if (is.data.frame(dfs) == TRUE) {
    sub_preview(dfs, deparse(substitute(dfs)), tbl_num)
    
  }
  
  # OUTER CONDITION 2: input is a list of data frames
  # Apply inner function sub_preview
  else{
    df_overview <-
      lapply(seq_along(dfs), function(i) {
        sub_preview(dfs[[i]],
                    names(dfs)[i],
                    tbl_num = paste0(tbl_num, " (", letters[i], ")"))
        
      })
  }
  
  # If the option compact is TRUE,
  # assign this table format to the preview table
  if (compact == TRUE) {
    print(
      prev %>% kbl(
        format.args = list(big.mark = ","),
        caption = paste0(
          "<left>Table ", tbl_num,
          " Data preview: ", preview_name),
        escape = F
      ) %>% kable_custom() %>%
        column_spec(1:ncol(prev), width_min = "100px") %>%
        row_spec(0, font_size = 14) %>% scroll_box(width = "100%")
    )
    cat("___\n")
  }
}
# Generate a table preview for one of the data frames
# Set results = 'asis' to display HTML table from function
df_preview(daily_activity_retro_orig, tbl_num ="3.3.1")
Table 3.3.1 Data preview: daily_activity_retro_orig
Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
1503960366 3/25/2016 11,004 7.1 7.1 0 2.6 0.46 4.1 0 33 12 205 804 1,819
1503960366 3/26/2016 17,609 11.6 11.6 0 6.9 0.73 3.9 0 89 17 274 588 2,154
1503960366 3/27/2016 12,736 8.5 8.5 0 4.7 0.16 3.7 0 56 5 268 605 1,944
1503960366 3/28/2016 13,231 8.9 8.9 0 3.2 0.79 4.9 0 39 20 224 1,080 1,932

Start date: convert format
End date: convert format
Columns: 15
Rows: 457
Unique IDs: 35
Empty cells: 0
N/A cells: 0
Duplicate rows: 0

Column types: | Id <numeric> | ActivityDate <character> | TotalSteps <numeric> | TotalDistance <numeric> | TrackerDistance <numeric> | LoggedActivitiesDistance <numeric> | VeryActiveDistance <numeric> | ModeratelyActiveDistance <numeric> | LightActiveDistance <numeric> | SedentaryActiveDistance <numeric> | VeryActiveMinutes <numeric> | FairlyActiveMinutes <numeric> | LightlyActiveMinutes <numeric> | SedentaryMinutes <numeric> | Calories <numeric>

The preview of the daily_activity_retro_orig data frame showed that the column names were written in camelCase. This is less than ideal, not only because many databases do not allow column names with capital letters, but also because snake_case is the recommended convention in R, given the success of tidyverse and the fact that many other packages follow the same naming convention (Rawat 2022). Therefore, during the cleaning process we converted the column names to snake_case.

The preview of the data also revealed that the date column was formatted as character. This was the case for all the original files after they were imported into R. Thus, before further cleaning or manipulation of the data, we formatted the date columns as date or POSIXct to be able to link the data across files more appropriately and to have a more universal and code-friendly format.

To accomplish the cleaning tasks mentioned in the previous paragraphs, we created the following function to clean the column names and convert the date column to the correct format (date or POSIXct). This function can be used for either a single data frame or an entire list of data frames.

# Function to carry out several cleaning steps at once 
clean_dflist <- function(dflist){

# Clean the column names
dflist_clean <- lapply(dflist, clean_names)

# Create data frames' names by adding the "clean" suffix
names(dflist_clean) <- names(dflist) %>%
  stringr::str_replace("orig", "clean")

# Sequence through each data frame to clean them all
for (i in seq_along(dflist_clean)) {
  df_clean <- as.data.frame(dflist_clean[[i]])
  
  # Change data type of ID columns to "character"
  # (contains() ignores case by default, so "id" also matches "Id")
  df_clean <- df_clean %>%
    mutate(across(contains("id"), as.character))
  
  # Check if date column is formatted appropriately
  date_test <- which(sapply(df_clean, is.Date))
  posixct_test <- which(sapply(df_clean, is.POSIXct))
  
  # If no column is formatted as date or POSIXct,
  # find the index of column that needs to be formatted
  if (length(date_test) == 0L && length(posixct_test) == 0L) {
    date_col_index <- which(str_count(df_clean[1, ], "/") == 2)
    
    # If the date column includes time, convert to POSIXct
    if (str_detect(df_clean[1, date_col_index], ":") == TRUE) {
      df_clean[, date_col_index] <-
        as.POSIXct(
          df_clean[, date_col_index],
          tz = "UTC",
          tryFormats = c(
            "%m/%d/%Y %I:%M:%S %p",
            "%m/%d/%Y %I:%M %p",
            "%m/%d/%Y %I %p",
            "%m/%d/%Y"
          )
        )
      
      # else, convert to just date
    } else {
      df_clean[, date_col_index] <-
        as.Date(df_clean[, date_col_index], format = "%m/%d/%Y")
      
    }
  }
  
  dflist_clean[[i]] <- as_tibble(df_clean)
  
  # For flexibility, create individual data frames by assigning names
  assign(names(dflist)[[i]] %>%
           stringr::str_replace("orig", "clean"),
         dflist_clean[[i]], envir = .GlobalEnv)
  
}
return(dflist_clean)
}
dflist_retro_clean <- clean_dflist(dflist_retro_orig)
dflist_prosp_clean <- clean_dflist(dflist_prosp_orig)
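
As a worked example of the date conversion performed by this function, the sketch below parses two sample strings laid out like the raw date columns. The specific values are illustrative, and the AM/PM specifier (%p) is locale-dependent, so the result assumes an English or C locale.

```r
# Sample strings in the layout of the raw CSV files (illustrative values)
date_only <- "3/25/2016"
date_time <- "4/12/2016 2:47:30 AM"

# A date without a time component is converted to class Date
parsed_date <- as.Date(date_only, format = "%m/%d/%Y")

# A date with a 12-hour clock time is converted to POSIXct in UTC
parsed_datetime <- as.POSIXct(date_time,
                              tz = "UTC",
                              format = "%m/%d/%Y %I:%M:%S %p")

print(parsed_date)
print(format(parsed_datetime, "%Y-%m-%d %H:%M:%S"))
```

Once the columns share these classes, records from different files can be compared or joined by date without string manipulation.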

3.4 Global Preview and Additional Cleaning Steps

Additional cleaning requirements for several data frames were identified by enabling the “compact” option of the df_preview function that we created for this analysis.

# Set results = 'asis' to display HTML table from function
df_preview(dflist_retro_clean,
           compact = TRUE,
           tbl_num = "3.4.1",
           preview_name = "retrospective data")
Table 3.4.1 Data preview: retrospective data

data_frame                      columns  rows       unique_ids  empty_cells  na_cells  duplicate_rows  start_date  end_date
daily_activity_retro_clean      15       457        35          0            0         0               2016-03-12  2016-04-12
heartrate_seconds_retro_clean   3        1,154,681  14          0            0         0               2016-03-29  2016-04-12
minute_intensities_retro_clean  3        1,445,040  34          0            0         0               2016-03-12  2016-04-12
minute_sleep_retro_clean        4        198,559    23          0            0         525             2016-03-11  2016-04-12
minute_steps_retro_clean        3        1,445,040  34          0            0         0               2016-03-12  2016-04-12
weight_retro_clean              8        33         11          0            31        0               2016-03-30  2016-04-12

df_preview(dflist_prosp_clean,
           compact = TRUE,
           tbl_num = "3.4.2",
           preview_name = "prospective data")
Table 3.4.2 Data preview: prospective data

data_frame                      columns  rows       unique_ids  empty_cells  na_cells  duplicate_rows  start_date  end_date
daily_activity_prosp_clean      15       940        33          0            0         0               2016-04-12  2016-05-12
heartrate_seconds_prosp_clean   3        2,483,658  14          0            0         0               2016-04-12  2016-05-12
minute_intensities_prosp_clean  3        1,325,580  33          0            0         0               2016-04-12  2016-05-12
minute_sleep_prosp_clean        4        188,521    24          0            0         543             2016-04-11  2016-05-12
minute_steps_prosp_clean        3        1,325,580  33          0            0         0               2016-04-12  2016-05-12
sleep_day_prosp_clean           5        413        24          0            0         3               2016-04-12  2016-05-12
weight_prosp_clean              8        67         8           0            65        0               2016-04-12  2016-05-12

The global preview of the datasets revealed other minor cleaning issues that needed to be examined before performing the analysis: duplicate rows in the minute_sleep_retro and minute_sleep_prosp data frames; N/A values in the weight_retro and weight_prosp data frames; and differences in the number of IDs across files that should have listed the same users, as was the case for minute_steps_retro (34 IDs) versus daily_activity_retro (35 IDs) and minute_steps_prosp (33 IDs), from which daily_activity_prosp was also derived. In addition, the date ranges of the retrospective and prospective datasets overlapped, since both included April 12, 2016, in their records, and this could become a source of duplicates after merging the data. To reduce the number of cleaning steps, the duplicates arising from this overlap were removed in Section 5 (“Data Manipulation and Transformation”), after merging the retrospective and prospective data.

The daily_activity_retro data frame needed to be re-aggregated because the original file showed a major discrepancy in the number of active minutes when compared to the minute_intensities_retro data frame (see Section 4). For this reason, the inconsistencies in user IDs were also addressed during the data manipulation stage. For more details, see Section 5.
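
The two checks discussed above (mismatched user IDs across files and overlapping dates across periods) can be sketched in a few lines. The ID vectors below are hypothetical, while the two date ranges follow the start and end dates reported in Tables 3.4.1 and 3.4.2 for the daily activity data.

```r
# Hypothetical ID vectors standing in for the id columns of two files
ids_daily  <- c("1001", "1002", "1003")
ids_minute <- c("1001", "1003")

# IDs that appear in the daily file but not in the minute-level file
ids_only_in_daily <- setdiff(ids_daily, ids_minute)
print(ids_only_in_daily)

# Date ranges as reported in the preview tables for daily activity
dates_retro <- seq(as.Date("2016-03-12"), as.Date("2016-04-12"), by = "day")
dates_prosp <- seq(as.Date("2016-04-12"), as.Date("2016-05-12"), by = "day")

# Dates present in both periods, which could yield duplicates after merging
overlap_dates <- dates_retro[dates_retro %in% dates_prosp]
print(overlap_dates)
```

Running the same comparison on the actual id columns would list exactly which users are missing from a given file.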

For the daily aggregated data, another cleaning measure taken during the data manipulation stage (Section 5) was to remove the rows in which the daily steps were equal to zero, since under normal circumstances it would be very unlikely for someone not to take a single step during an entire day. When the daily steps were equal to zero, it was probably because the user did not wear the tracker on that day, and keeping those rows could therefore misrepresent the data. This reasoning can only be applied to daily steps, not to daily activity minutes, because the latter are measured differently and it is normal to have days in which activity minutes are equal to zero, considering that the user must do at least 10 minutes of continuous moderate-to-intense activity for the Fitbit device to take them into account (see “What Are Active Zone Minutes or Active Minutes on My Fitbit Device?” n.d.).
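
A minimal sketch of this zero-step rule is shown below on a toy data frame. The values are invented, and base R subset() is used here only to keep the example free of package dependencies; the actual step is carried out in Section 5.

```r
# Toy daily data (hypothetical values)
daily_demo <- data.frame(
  id = c("1001", "1001", "1002"),
  total_steps = c(8500, 0, 12000),
  very_active_minutes = c(0, 0, 35)
)

# Keep only the days on which the tracker recorded at least one step.
# Days with zero active minutes are kept as long as steps were recorded.
daily_demo_worn <- subset(daily_demo, total_steps > 0)
print(daily_demo_worn)
```

Note that the first row is retained even though its active minutes are zero, which matches the reasoning that zero active minutes alone do not indicate a day without the device.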

In the case of the weight data frames, further exploration showed that all the N/A values corresponded to the column named “fat”, which is a variable that is not part of the analysis; consequently, the column “fat” was removed during the data manipulation phase as well.

The following code confirmed which column in the weight data frame contained all the N/A values.

# Determine which columns contain the N/A values
cbind(
  lapply(lapply(weight_retro_clean, is.na), sum))
##                  [,1]
## id               0   
## date             0   
## weight_kg        0   
## weight_pounds    0   
## fat              31  
## bmi              0   
## is_manual_report 0   
## log_id           0
cbind(
  lapply(lapply(weight_prosp_clean, is.na), sum))
##                  [,1]
## id               0   
## date             0   
## weight_kg        0   
## weight_pounds    0   
## fat              65  
## bmi              0   
## is_manual_report 0   
## log_id           0

While the above observations were addressed during the manipulation stage, the minute_sleep data required closer attention at this point, since it had 525 duplicate rows in the retrospective set and 543 duplicate rows in the prospective set.

The following code was used to remove all the duplicate rows from the minute_sleep_retro and minute_sleep_prosp data frames.

# Remove duplicate rows
minute_sleep_retro_clean <- minute_sleep_retro_clean %>% 
  distinct(.keep_all = TRUE)

minute_sleep_prosp_clean <- minute_sleep_prosp_clean %>% 
  distinct(.keep_all = TRUE)

# Replace the data frames in the master list of data frames  
dflist_retro_clean[["minute_sleep_retro_clean"]] <-
  minute_sleep_retro_clean 
dflist_prosp_clean[["minute_sleep_prosp_clean"]] <-
  minute_sleep_prosp_clean 
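
A quick way to confirm that a deduplication step of this kind worked is to count duplicate rows before and after. The sketch below uses a toy data frame with invented values and base R unique(), which, like distinct() applied to all columns, keeps one copy of each fully identical row.

```r
# Toy sleep log with one fully duplicated row (hypothetical values)
sleep_demo <- data.frame(
  id = c("1001", "1001", "1001", "1002"),
  date = c("2016-04-12 01:00:00", "2016-04-12 01:00:00",
           "2016-04-12 01:01:00", "2016-04-12 01:00:00"),
  value = c(1, 1, 1, 2)
)

# Count fully duplicated rows before cleaning
duplicates_before <- sum(duplicated(sleep_demo))

# Remove duplicate rows and count again
sleep_demo_unique <- unique(sleep_demo)
duplicates_after <- sum(duplicated(sleep_demo_unique))

print(c(before = duplicates_before, after = duplicates_after))
```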

4 Consistency Evaluation of the Original Datasets

4.1 Discrepancy in Active Minutes within the Original Aggregated Data

As discussed earlier, the original source generated the “daily_activity” files for the retrospective and prospective sets by aggregating the minute-by-minute data. In order to evaluate the accuracy of the original aggregated data within the same time frame, we compared the total active minutes of the “daily_activity” data frame with the total active minutes of the “minute_intensities” data frame for both the retrospective and prospective sets.

# Compare active minutes: original non-aggregated vs aggregated files

# Calculate total active minutes in retrospective non-aggregated data
total_minute_intensities_retro <-
  minute_intensities_retro_clean %>% summarize(
    very_active_minutes = sum(intensity == 3),
    fairly_active_minutes = sum(intensity == 2),
    lightly_active_minutes = sum(intensity == 1)
  )

# Transpose to place 3 categories in 1 column for later comparison
total_minute_intensities_retro <-
  data.frame(
    intensity = names(total_minute_intensities_retro),
    t(total_minute_intensities_retro),
    row.names = NULL
  )

# Calculate total active minutes from the aggregated data
total_intensities_daily_activity_retro <-
  daily_activity_retro_clean %>% summarize(
    very_active_minutes = sum(very_active_minutes),
    fairly_active_minutes = sum(fairly_active_minutes),
    lightly_active_minutes = sum(lightly_active_minutes)
  )

# Transpose to place 3 categories in 1 column for later comparison
total_intensities_daily_activity_retro <-
  data.frame(
    intensity = names(total_intensities_daily_activity_retro),
    t(total_intensities_daily_activity_retro),
    row.names = NULL
  )

# Merge totals from non-aggregated and aggregated data for comparison
active_minutes_diff_retro  <-
  inner_join(total_intensities_daily_activity_retro,
             total_minute_intensities_retro,
             by = "intensity")

names(active_minutes_diff_retro)[2:3] <-
  c("daily_activity_file", "minute_intensities_file")

# Calculate percent error in the retrospective data
# by using the non-aggregated file as the base
active_minutes_diff_retro  <-
  active_minutes_diff_retro  %>%
  mutate(percent_error = percent(
    (daily_activity_file - minute_intensities_file)/minute_intensities_file)
  )
                                                        
# Calculate total active minutes from the prospective non-aggregated data
total_minute_intensities_prosp <-
  minute_intensities_prosp_clean %>% summarize(
    very_active_minutes = sum(intensity == 3),
    fairly_active_minutes = sum(intensity == 2),
    lightly_active_minutes = sum(intensity == 1)
  )

# Transpose to place 3 categories in 1 column for later comparison
total_minute_intensities_prosp <-
  data.frame(
    intensity = names(total_minute_intensities_prosp),
    t(total_minute_intensities_prosp),
    row.names = NULL
  )

# Calculate total active minutes from the aggregated data
total_intensities_daily_activity_prosp <-
  daily_activity_prosp_clean %>% summarize(
    very_active_minutes = sum(very_active_minutes),
    fairly_active_minutes = sum(fairly_active_minutes),
    lightly_active_minutes = sum(lightly_active_minutes)
  )

# Transpose to place 3 categories in 1 column for later comparison
total_intensities_daily_activity_prosp <-
  data.frame(
    intensity = names(total_intensities_daily_activity_prosp),
    t(total_intensities_daily_activity_prosp),
    row.names = NULL
  )

# Merge totals from non-aggregated and aggregated data for comparison
active_minutes_diff_prosp  <-
  inner_join(total_intensities_daily_activity_prosp,
             total_minute_intensities_prosp,
             by = "intensity")

names(active_minutes_diff_prosp)[2:3] <-
  c("daily_activity_file", "minute_intensities_file")

# Calculate percent error in the prospective data
# by using the non-aggregated file as the base
active_minutes_diff_prosp  <-
  active_minutes_diff_prosp  %>%
  mutate(percent_error = percent(
    (daily_activity_file - minute_intensities_file)/minute_intensities_file)
  )

# Tables. Discrepancy in total active minutes: daily_activity vs minute_intensities
kbl(
  active_minutes_diff_retro,
  format.args = list(big.mark = ","),
  caption = "<left>Table 4.1.1 Discrepancy in total active minutes: retrospective data</left>"
) %>%
  kable_classic(full_width = F,
                position = "left")
Table 4.1.1 Discrepancy in total active minutes: retrospective data
intensity daily_activity_file minute_intensities_file percent_error
very_active_minutes 7,597 19,098 -60.2%
fairly_active_minutes 5,973 12,337 -51.6%
lightly_active_minutes 77,722 178,773 -56.5%
kbl(
  active_minutes_diff_prosp,
  format.args = list(big.mark = ","),
  caption = "<left>Table 4.1.2 Discrepancy in total active minutes: prospective data</left>"
) %>%
  kable_classic(full_width = F,
                position = "left")
Table 4.1.2 Discrepancy in total active minutes: prospective data
intensity daily_activity_file minute_intensities_file percent_error
very_active_minutes 19,895 19,838 0.287%
fairly_active_minutes 12,751 12,749 0.016%
lightly_active_minutes 181,244 180,891 0.195%

As shown in Tables 4.1.1 and 4.1.2, the percent error in the original aggregated file was small for the prospective data (below 0.3%), whereas for the retrospective data it ranged from -51.6% to -60.2%. These results revealed a large discrepancy between the aggregated and the non-aggregated data in the retrospective set that needed to be addressed before proceeding with our analysis. The inconsistency in the daily_activity_retro_orig data frame is also apparent from how much its total active minutes differ from those of its prospective counterpart, daily_activity_prosp_orig (Table 4.1.1 versus Table 4.1.2).
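The percent errors reported for the retrospective data can be re-derived by hand from the totals printed in Table 4.1.1. The following base-R sketch copies those totals into plain vectors (the vector names are chosen here for illustration) and applies the same formula, with the non-aggregated file as the base.

```r
# Totals copied from Table 4.1.1 (retrospective data)
daily_activity_file     <- c(very = 7597,  fairly = 5973,  lightly = 77722)
minute_intensities_file <- c(very = 19098, fairly = 12337, lightly = 178773)

# Percent error relative to the non-aggregated (minute-level) totals
percent_error <- 100 *
  (daily_activity_file - minute_intensities_file) / minute_intensities_file

round(percent_error, 1)
# very: -60.2, fairly: -51.6, lightly: -56.5
```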

Further exploration showed that the discrepancy within the retrospective aggregated file arose because the daily_activity_retro data frame contained considerably fewer days for most users than the minute_intensities_retro data frame. Table 4.1.3 compares the number of days recorded for each ID in both files of the retrospective set.

# Calculate user's number of days in the minute_intensities data frame
days_minute_intensities_retro <-
  minute_intensities_retro_clean %>%
  mutate (activity_date = date(activity_minute)) %>%
  group_by(id, activity_date) %>% summarize(cur_group()) %>%
  count(id, name = "days_minute_intensities")

# Calculate user's number of days in the daily_activity data frame
days_daily_activity_retro <-
  daily_activity_retro_clean %>% 
  group_by(id, activity_date) %>%
  summarize(cur_group()) %>%
  count(id, name = "days_daily_activity")

days_diff_retro <-
  merge(
    x = days_minute_intensities_retro,
    y = days_daily_activity_retro,
    by.x = "id",
    by.y = "id",
    all = TRUE
  )

kbl(days_diff_retro, align = "c",
    caption = "<left>Table 4.1.3 Discrepancy in number of days: retrospective data</left>") %>%
  kable_custom %>%
  add_header_above(c(" " = 1, "Number of days" = 2)) %>%
  scroll_box(width = "100%", height = "200px")
Table 4.1.3 Discrepancy in number of days: retrospective data
Number of days
id days_minute_intensities days_daily_activity
1503960366 31 19
1624580081 32 19
1644430081 30 10
1844505072 32 12
1927972279 32 12
2022484408 32 12
2026352035 32 12
2320127002 32 12
2347167796 32 15
2873212765 31 12
2891001357 1 8
3372868164 30 10
3977333714 32 12
4020332650 32 32
4057192912 32 32
4319703577 28 12
4388161847 NA 8
4445114986 32 15
4558609924 32 12
4702921684 31 15
5553957443 32 12
5577150313 31 11
6117666160 29 10
6290855005 22 10
6391747486 29 9
6775888955 29 9
6962181067 32 14
7007744171 32 12
7086361926 32 12
8053475328 32 11
8253242879 31 12
8378563200 32 12
8583815059 28 8
8792009665 32 12
8877689391 32 12
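The day counts in Table 4.1.3 amount to counting the distinct calendar dates recorded for each user. A minimal base-R sketch of that idea on made-up minute-level records follows. toy_minutes and its values are hypothetical, and this is an alternative formulation rather than the code used above.

```r
# Hypothetical minute-level records for two users
toy_minutes <- data.frame(
  id = c("A", "A", "A", "B", "B"),
  activity_minute = as.POSIXct(
    c("2016-03-12 00:00:00", "2016-03-12 00:01:00", "2016-03-13 00:00:00",
      "2016-03-12 00:00:00", "2016-03-12 00:01:00"),
    tz = "UTC"
  )
)

# Reduce each timestamp to its calendar date
toy_minutes$activity_date <- format(toy_minutes$activity_minute, "%Y-%m-%d")

# Keep one row per user-date pair, then count the dates per user
user_dates <- unique(toy_minutes[, c("id", "activity_date")])
days_per_user <- lengths(split(user_dates$activity_date, user_dates$id))
```

Here user A has records on two dates and user B on one, so days_per_user holds 2 and 1.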


To confirm that the days that were missing in the retrospective aggregated file were the source of the discrepancy, we identified the exact dates that were missing for each user and calculated the total active minutes for those users during such dates.
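Identifying the missing dates is essentially an anti-join: keep the user-date pairs that appear in the minute-level data but not in the aggregated file. The base-R sketch below shows the idea on made-up data. The data frames in_minutes and in_daily are hypothetical, and the key-based approach is an illustration rather than the method used in this report.

```r
# Hypothetical user-date pairs found in the minute-level data
in_minutes <- data.frame(
  id            = c("A", "A", "B"),
  activity_date = c("2016-03-12", "2016-03-13", "2016-03-12")
)

# Hypothetical user-date pairs found in the aggregated daily file
in_daily <- data.frame(
  id            = "A",
  activity_date = "2016-03-13"
)

# Build one key per user-date pair and keep pairs absent from the daily file
key_minutes <- paste(in_minutes$id, in_minutes$activity_date)
key_daily   <- paste(in_daily$id, in_daily$activity_date)
missing_pairs <- in_minutes[!(key_minutes %in% key_daily), ]
```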

# Obtain daily totals for each user from the minute_intensities_retro data frame
daily_intensities_retro <-
  minute_intensities_retro_clean %>%
  mutate (activity_date = date(activity_minute)) %>%
  group_by(id, activity_date) %>%
  summarize (
    very_active_minutes = sum(intensity == 3),
    fairly_active_minutes = sum(intensity == 2),
    lightly_active_minutes = sum(intensity == 1),
    sedentary_minutes = sum(intensity == 0)
  )

# Identify the combinations of user and date that are missing
# in the aggregated file (daily_activity_retro)
missing_dates_retro <-
  data.frame(
    setdiff(daily_intensities_retro[, 1:2],
            daily_activity_retro_clean[, 1:2]))

# Create data frame with the intensities for the missing dates
intensities_missing_retro <-
  merge(
    x = missing_dates_retro,
    y = daily_intensities_retro,
    by.x = c("id", "activity_date"),
    by.y = c("id", "activity_date")
  )

# Calculate total intensities
total_intensities_missing_retro <-
  intensities_missing_retro %>% summarize(
    very_active_minutes = sum(very_active_minutes),
    fairly_active_minutes = sum(fairly_active_minutes),
    lightly_active_minutes = sum(lightly_active_minutes)
  )

# Transpose total intensities to add values to comparison table
total_intensities_missing_retro <-
  data.frame(
    intensity = names(total_intensities_missing_retro),
    excluded_values = t(total_intensities_missing_retro),
    row.names = NULL
  )

active_minutes_diff_revised_retro <-
  merge(x = active_minutes_diff_retro,
        y = total_intensities_missing_retro,
        by.x = "intensity",
        by.y = "intensity")

active_minutes_diff_revised_retro <-
  active_minutes_diff_revised_retro %>% mutate(
    daily_activity_and_excluded = daily_activity_file + excluded_values) %>%
  relocate(new_percent_error = percent_error, .after = last_col()) %>%
  relocate(minute_intensities_file, .before = new_percent_error) %>%
  mutate(new_percent_error = percent(
    (daily_activity_and_excluded - minute_intensities_file)/minute_intensities_file
  ))


kbl(
  active_minutes_diff_revised_retro,
  caption = "<left>Table 4.1.4 Percent error after reintegration of excluded values: retrospective data</left>",
  format.args = list(big.mark = ",")
) %>% kable_custom
Table 4.1.4 Percent error after reintegration of excluded values: retrospective data
intensity daily_activity_file excluded_values daily_activity_and_excluded minute_intensities_file new_percent_error
fairly_active_minutes 5,973 7,096 13,069 12,337 5.93%
lightly_active_minutes 77,722 102,392 180,114 178,773 0.75%
very_active_minutes 7,597 11,605 19,202 19,098 0.54%


When the missing days were added to the original aggregated data (daily_activity_retro) using the values from the non-aggregated data frame, the percent errors between the two data frames for very active, fairly active, and lightly active minutes were reduced in magnitude from 60.2%, 51.6%, and 56.5% (Table 4.1.1) to 0.54%, 5.93%, and 0.75% (Table 4.1.4), respectively, which confirms that the discrepancy originated from the days that were not included in the aggregated file. It was therefore considered appropriate to re-aggregate the retrospective data to resolve these inconsistencies (see Section 5).
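The figures in Table 4.1.4 can be checked with simple arithmetic. The sketch below copies the printed totals into base-R vectors (the vector names are illustrative), adds the excluded values back to the aggregated totals, and recomputes the percent error against the minute-level totals.

```r
# Totals copied from Table 4.1.4 (retrospective data)
daily_activity_file     <- c(very = 7597,  fairly = 5973,  lightly = 77722)
excluded_values         <- c(very = 11605, fairly = 7096,  lightly = 102392)
minute_intensities_file <- c(very = 19098, fairly = 12337, lightly = 178773)

# Reintegrate the excluded days into the aggregated totals
daily_activity_and_excluded <- daily_activity_file + excluded_values

# Percent error relative to the non-aggregated (minute-level) totals
new_percent_error <- 100 *
  (daily_activity_and_excluded - minute_intensities_file) /
  minute_intensities_file

round(new_percent_error, 2)
# very: 0.54, fairly: 5.93, lightly: 0.75
```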

Furthermore, it must be emphasized that the large negative discrepancy found in the original aggregated retrospective data did not arise from any effort to homogenize the number of users among the different files. The daily_activity_retro data frame had more ID numbers than any of the other data frames (see Table 3.4.1) and included all 34 IDs present in the minute_intensities_retro data frame; and yet, as illustrated in Table 4.1.1, the total active minutes in the daily_activity_retro data frame were less than half of those in the minute_intensities_retro data frame.

It is worth mentioning that the validity of the procedure we used to calculate the number of days for each user (Table 4.1.3) could be confirmed by applying the same method to the case of the prospective data. As shown in Table 4.1.5, with regard to the prospective data, the number of days corresponding to each ID of the daily_activity_prosp was mostly the same as the values calculated for the minute_intensities_prosp, which not only confirmed that the prospective aggregated file was much more consistent than the retrospective file, but it also validated the method that we used to calculate the number of days for each user.

# Calculate user's number of days in the minute_intensities data frame
days_minute_intensities_prosp <- 
  minute_intensities_prosp_clean %>% 
  mutate (activity_date = date(activity_minute)) %>%
  group_by(id, activity_date) %>%
  summarize(cur_group()) %>%
  count(id, name = "days_minute_intensities")

# Calculate user's number of days in the daily_activity data frame
days_daily_activity_prosp <-
  daily_activity_prosp_clean %>% 
  group_by(id, activity_date) %>% 
  summarize(cur_group()) %>%
  count(id, name = "days_daily_activity")

days_diff_prosp <-
  merge(
    x = days_minute_intensities_prosp,
    y = days_daily_activity_prosp,
    by.x = "id",
    by.y = "id",
    all = TRUE
  )

kbl(days_diff_prosp, align = "c",
    caption = "<left>Table 4.1.5 Discrepancy in number of days: prospective data</left>") %>%
  kable_custom %>%
  add_header_above(c(" " = 1, "Number of days" = 2)) %>%
  scroll_box(width = "100%", height = "200px")
Table 4.1.5 Discrepancy in number of days: prospective data
Number of days
id days_minute_intensities days_daily_activity
1503960366 30 31
1624580081 31 31
1644430081 30 30
1844505072 31 31
1927972279 31 31
2022484408 31 31
2026352035 31 31
2320127002 31 31
2347167796 18 18
2873212765 31 31
3372868164 20 20
3977333714 29 30
4020332650 31 31
4057192912 4 4
4319703577 31 31
4388161847 31 31
4445114986 31 31
4558609924 31 31
4702921684 31 31
5553957443 31 31
5577150313 30 30
6117666160 28 28
6290855005 28 29
6775888955 26 26
6962181067 31 31
7007744171 26 26
7086361926 31 31
8053475328 31 31
8253242879 18 19
8378563200 31 31
8583815059 30 31
8792009665 28 29
8877689391 31 31

Even though the discrepancy was much smaller in the case of the aggregated prospective data (daily_activity_prosp), it was determined that, since its retrospective counterpart (daily_activity_retro) already needed to be re-aggregated to address the large discrepancy that was found in the original file, it was appropriate to also re-aggregate the prospective data in order to make the procedure consistent across the different time frames and ensure that no discrepancy was found in the analysis irrespective of how small it might be.

4.2 Inconsistency in Sedentary Minutes within the Original Aggregated Files

The original aggregated files also had inconsistencies in the total number of sedentary minutes, but since sedentary minutes were factored into the original daily aggregated data in a different way than active minutes, it was appropriate to evaluate this variable separately from the others.

During the data manipulation stage of this analysis (Section 5), it was shown that the daily sedentary minutes were factored into the original daily activity data by first aggregating the sedentary minutes for each user, and then subtracting the time in bed to get the actual daily sedentary minutes, which is a reasonable step to take; however, this became an inconsistency issue because only 23 out of 35 users had sleep data in the retrospective set, and only 24 out of 33 users in the prospective data.

Consequently, the users who did not have sleep data ended up displaying unrealistic and considerably larger numbers for daily sedentary minutes in the aggregated data because their time in bed, since it was missing, was not subtracted from the unadjusted values.

To highlight how the original source subtracted the time in bed from the initial daily sedentary minutes to get the actual daily sedentary minutes, we calculated the difference in daily sedentary minutes between the daily_activity_prosp and minute_intensities data frames for the users who provided sleep data, and then compared this amount to their corresponding time in bed found in the sleep_day_prosp data frame.

# Subset IDs in daily_activity that also have sleep data
ids_with_sleep_data <-
  subset(daily_activity_prosp_clean,
         (id %in% sleep_day_prosp_clean$id))

# Calculate daily sedentary minutes in the non-aggregated data frame
daily_sedentary_minutes_prosp_clean <-
  minute_intensities_prosp_clean %>%
  mutate (activity_date = date(activity_minute)) %>%
  group_by(id, activity_date) %>%
  summarize (sedentary_minutes_base = sum(intensity == 0))

# Add daily sedentary_minutes_base from non-aggregated data frame
# to daily_activity data frame and isolate sedentary variables
sedentary_minutes_comparison_01 <-
  merge(
    x = daily_sedentary_minutes_prosp_clean[, c(
      "id", "activity_date", "sedentary_minutes_base")],
    y = ids_with_sleep_data[, c(
      "id", "activity_date", "sedentary_minutes")],
    by.x = c("id", "activity_date"),
    by.y = c("id", "activity_date")
  )

# Calculate the difference in daily sedentary minutes
# between non-aggregated file and aggregated file
sedentary_minutes_comparison_01 <-
  sedentary_minutes_comparison_01 %>%
  mutate(
    sedentary_diff = sedentary_minutes_base - sedentary_minutes)

# Add the time_in_bed variable from the sleep-day data frame
sedentary_minutes_comparison_01 <-
  merge(
    x = sedentary_minutes_comparison_01,
    y = sleep_day_prosp_clean[, c(
      "id", "sleep_day", "total_time_in_bed")],
    by.x = c("id", "activity_date"),
    by.y = c("id", "sleep_day")
  )

names(sedentary_minutes_comparison_01)[3:4] <-
  c("minutes_intensities_file", "daily_activity_file")

kbl(sedentary_minutes_comparison_01[1:7,],
    align = "c",
    caption = "<left>Table 4.2.1 Daily sedentary minutes in original files for users with sleep data: prospective</left>") %>%
  kable_custom %>% add_header_above(c(
    " " = 1,
    " " = 1,
    "Daily sedentary minutes" = 2,
    " " = 1,
    " " = 1
  )) %>% scroll_box(width = "100%", height = "200px")
Table 4.2.1 Daily sedentary minutes in original files for users with sleep data: prospective
Daily sedentary minutes
id activity_date minutes_intensities_file daily_activity_file sedentary_diff total_time_in_bed
1503960366 2016-04-12 1074 728 346 346
1503960366 2016-04-13 1183 776 407 407
1503960366 2016-04-15 1168 726 442 442
1503960366 2016-04-16 1173 773 400 367
1503960366 2016-04-17 1218 539 679 712
1503960366 2016-04-19 1095 775 320 320
1503960366 2016-04-20 1195 818 377 377
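The relationship shown in Table 4.2.1 can be verified directly from the printed rows. The base-R sketch below copies the seven rows into a data frame (named table_4_2_1 here for illustration), recomputes the difference, and flags the days on which it equals the time in bed.

```r
# Rows copied from Table 4.2.1 (user 1503960366, prospective data)
table_4_2_1 <- data.frame(
  activity_date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-15",
                            "2016-04-16", "2016-04-17", "2016-04-19",
                            "2016-04-20")),
  minutes_intensities_file = c(1074, 1183, 1168, 1173, 1218, 1095, 1195),
  daily_activity_file      = c(728, 776, 726, 773, 539, 775, 818),
  total_time_in_bed        = c(346, 407, 442, 367, 712, 320, 377)
)

# Difference between non-aggregated and aggregated sedentary minutes
table_4_2_1$sedentary_diff <-
  table_4_2_1$minutes_intensities_file - table_4_2_1$daily_activity_file

# TRUE where the difference equals the time in bed for that day
table_4_2_1$matches <-
  table_4_2_1$sedentary_diff == table_4_2_1$total_time_in_bed

# Totals over the two days that do not match individually
mismatch <- table_4_2_1[!table_4_2_1$matches, ]
c(sum(mismatch$sedentary_diff), sum(mismatch$total_time_in_bed))
```

Five of the seven days match exactly. The two that do not (2016-04-16 and 2016-04-17) have the same combined total in both columns, 1,079 minutes, which would be consistent with time in bed being attributed to a different date on those days, although the original source does not document this.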


A similar procedure was followed to compare the daily sedentary minutes in the case of the users who did not have sleep data.

# Subset IDs in daily_activity that do not have sleep data
ids_no_sleep_data <-
  subset(daily_activity_prosp_clean,
         !(id %in% sleep_day_prosp_clean$id))

# Add daily sedentary_minutes_base from non-aggregated data frame
# to daily_activity data frame and isolate sedentary variables
sedentary_minutes_comparison_02 <-
  merge(
    x = daily_sedentary_minutes_prosp_clean[, c(
      "id", "activity_date", "sedentary_minutes_base")],
    y = ids_no_sleep_data[, c(
      "id", "activity_date", "sedentary_minutes")],
    by.x = c("id", "activity_date"),
    by.y = c("id", "activity_date")
  )

# Calculate the difference in daily sedentary minutes
# between non-aggregated file and aggregated file
sedentary_minutes_comparison_02 <-
  sedentary_minutes_comparison_02 %>%
  mutate(
    sedentary_diff = sedentary_minutes_base - sedentary_minutes)

names(sedentary_minutes_comparison_02)[3:4] <-
  c("minutes_intensities_file", "daily_activity_file")

kbl(sedentary_minutes_comparison_02[1:7, ],
    align = "c",
    caption = "<left>Table 4.2.2 Daily sedentary minutes in original files for users with no sleep data: prospective</left>") %>%
  kable_custom %>% add_header_above(c(
    " " = 1,
    " " = 1,
    "Daily sedentary minutes" = 2,
    " " = 1
  )) %>% scroll_box(width = "100%", height = "200px")
Table 4.2.2 Daily sedentary minutes in original files for users with no sleep data: prospective
Daily sedentary minutes
id activity_date minutes_intensities_file daily_activity_file sedentary_diff
1624580081 2016-04-12 1294 1294 0
1624580081 2016-04-13 1292 1292 0
1624580081 2016-04-14 1204 1204 0
1624580081 2016-04-15 1344 1344 0
1624580081 2016-04-16 1264 1264 0
1624580081 2016-04-17 1276 1276 0
1624580081 2016-04-18 1214 1214 0


When tables 4.2.1 and 4.2.2 are compared with each other, it becomes clear that the values for daily sedentary minutes in the original aggregated file were considerably greater in the case of users who did not have sleep data, which is an inconsistency that was resolved during the manipulation stage of our analysis.

This point can be illustrated further by comparing the average daily sedentary minutes in the daily_activity data frame between the users who provided sleep data and the users for whom this kind of information was missing.

# Calculate original daily average for sedentary minutes of users with sleep data
avg_sedentary_min_with_sleep <-
  subset(daily_activity_prosp_clean,
         (id %in% sleep_day_prosp_clean$id)) %>% 
  summarize(avg_sedentary_minutes = mean(sedentary_minutes)) %>%
  round(., digits = 0)

# Calculate original daily average for sedentary minutes of users with no sleep data
avg_sedentary_min_no_sleep <-
  subset(daily_activity_prosp_clean,
         !(id %in% sleep_day_prosp_clean$id)) %>%
  summarize(avg_sedentary_minutes = mean(sedentary_minutes)) %>%
  round(., digits = 0)

cat(
  paste(
    "**Average daily sedentary minutes in the original aggregated data**",
    "\n \nUsers with sleep data: ",
    avg_sedentary_min_with_sleep,
    " minutes",
    "\nUsers with no sleep data: ",
    avg_sedentary_min_no_sleep,
    " minutes"
  )
)
## **Average daily sedentary minutes in the original aggregated data** 
##  
## Users with sleep data:  933  minutes 
## Users with no sleep data:  1175  minutes

For the daily aggregated data to be consistent in terms of sedentary minutes, the following two conditions should be taken into consideration when performing the analysis:

  1. When the number of minutes for time in bed is available and is therefore subtracted from the daily sedentary minutes to obtain more realistic values (as was done by the original source), the daily activity data must be limited to the participants who provided sleep data: 23 in the retrospective data and 24 in the prospective group.

  2. When all the 35 participants are taken into account for an analysis, the daily sedentary minutes variable should be excluded from the data, as it will have either unrealistic values or inconsistencies since only a fraction of the group provided sleep data.
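The two conditions above translate into two differently shaped data frames. A minimal base-R sketch on made-up data follows. toy_daily, ids_with_sleep, and the resulting object names are hypothetical and only illustrate the subsetting logic.

```r
# Hypothetical daily activity records for three users
toy_daily <- data.frame(
  id                = c("A", "B", "C"),
  total_steps       = c(10500, 4200, 8000),
  sedentary_minutes = c(720, 1290, 760),
  stringsAsFactors  = FALSE
)

# Hypothetical set of users who provided sleep data
ids_with_sleep <- c("A", "C")

# Condition 1: keep sedentary minutes, restrict to users with sleep data
daily_sleep_users <- toy_daily[toy_daily$id %in% ids_with_sleep, ]

# Condition 2: keep every user, drop the sedentary minutes variable
daily_all_users <-
  toy_daily[, setdiff(names(toy_daily), "sedentary_minutes")]
```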

5 Data Manipulation and Transformation

5.1 Merging Retrospective and Prospective Data

To reduce the number of steps in the data manipulation phase, it was decided to merge the retrospective data with its prospective counterpart prior to any aggregation procedures. In this manner, aggregation steps do not have to be carried out twice for each one of the relevant variables.

It is worth mentioning that additional cleaning steps were taken during the data manipulation stage to make our analysis more consistent and comparable to real-world observations (see Section 5.6).

Merging retrospective and prospective data on steps and intensities

# Remove date 2016-04-12 because it was included in the prospective data
minute_steps_retro_clean <-
  minute_steps_retro_clean %>%
  filter(!str_detect(activity_minute, "2016-04-12"))

minute_intensities_retro_clean <-
  minute_intensities_retro_clean %>%
  filter(!str_detect(activity_minute, "2016-04-12"))

#Merge retrospective and prospective steps and intensities
minute_steps_merged_clean <-
  rbind(minute_steps_retro_clean, minute_steps_prosp_clean)

minute_intensities_merged_clean <-
  rbind(minute_intensities_retro_clean,
        minute_intensities_prosp_clean)
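The reason for dropping the overlapping date before stacking the two sets can be seen on a small example. The base-R sketch below uses made-up records (retro, prosp, and their values are hypothetical) and compares calendar dates with format() instead of str_detect(); it then confirms that the merged data has no repeated user-minute pairs.

```r
# Hypothetical retrospective records; the last one falls on 2016-04-12
retro <- data.frame(
  id = "A",
  activity_minute = as.POSIXct(
    c("2016-04-11 23:59:00", "2016-04-12 00:00:00"), tz = "UTC"),
  steps = c(0, 5)
)

# Hypothetical prospective records, which start on 2016-04-12
prosp <- data.frame(
  id = "A",
  activity_minute = as.POSIXct(
    c("2016-04-12 00:00:00", "2016-04-12 00:01:00"), tz = "UTC"),
  steps = c(5, 7)
)

# Drop the overlapping date from the retrospective set, then stack
retro_trimmed <-
  retro[format(retro$activity_minute, "%Y-%m-%d") != "2016-04-12", ]
merged <- rbind(retro_trimmed, prosp)

# 0 means no user-minute pair occurs more than once
n_repeated <- anyDuplicated(merged[, c("id", "activity_minute")])
```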

Merging retrospective and prospective sleep data

It is important to note that although the prospective data included an aggregated file for daily sleep, that was not the case for the retrospective dataset. Therefore, the aggregation of the minute_sleep_retro_clean data frame was a necessary step; however, to be consistent throughout our analysis, this data frame was merged with its prospective counterpart prior to aggregation.

# Remove 2016-04-12 in retrospective data,
# since it is included in prospective data
minute_sleep_retro_clean <-
  minute_sleep_retro_clean %>%
  filter(!str_detect(date, "2016-04-12"))

# Remove 2016-04-11 in prospective data,
# since it is included in retrospective data
minute_sleep_prosp_clean <-
  minute_sleep_prosp_clean %>%
  filter(!str_detect(date, "2016-04-11"))

minute_sleep_merged_clean <-
  rbind(minute_sleep_retro_clean, minute_sleep_prosp_clean)

Merging retrospective and prospective heart rate data

# Remove date 2016-04-12
# because it was included in the prospective data
heartrate_seconds_retro_clean <-
  heartrate_seconds_retro_clean %>%
  filter(!str_detect(time, "2016-04-12"))

heartrate_seconds_merged_clean <-
  rbind(heartrate_seconds_retro_clean,
        heartrate_seconds_prosp_clean)

Merging retrospective and prospective weight data

Concerning the data on weight, even though only a few users logged their weight (see unique IDs in Tables 3.4.1 and 3.4.2), and despite the fact that no major differences in weight were found for any particular user across time, the data on weight was still useful to analyze how the participants used the different features available to them.

# Remove date 2016-04-12 because it was included in the prospective data
weight_retro_clean <- weight_retro_clean %>%
  filter(!str_detect(date, "2016-04-12"))

weight_merged_clean <- rbind(weight_retro_clean, weight_prosp_clean)

5.2 Daily Aggregation of Steps, Intensities, and Sleep Data

The following paragraphs illustrate the methods that were used in this analysis for re-aggregating the steps and intensities datasets. When the data was re-aggregated to create a corrected version of the daily_activity data frame, the complementary data on sleep and weight were also added to the main data frame, while omitting the variables that were not relevant to this analysis for reasons that were explained in previous sections.

The data frames and their corresponding variables that were aggregated to generate the new daily_activity data frame are listed below:

  • minute_sleep: minutes asleep and minutes in bed
  • minute_steps: total steps
  • minute_intensities: very active minutes, fairly active minutes, lightly active minutes, and sedentary minutes
  • weight: weight in pounds

Daily aggregation of steps and intensities

# Aggregate minute_steps to obtain daily steps by user
daily_steps_clean <- minute_steps_merged_clean %>%
  mutate (date = date(activity_minute)) %>%
  group_by(id, date) %>% summarize(total_steps = sum(steps))

# Aggregate minute_intensities_merged_clean
# to obtain daily intensities by user
# 0 -> sedentary;
# 1 -> lightly_active;
# 2 -> fairly_active;
# 3 -> very_active
daily_intensities_clean <- minute_intensities_merged_clean %>%
  mutate (date = date(activity_minute)) %>%
  group_by(id, date) %>% summarize (
    very_active_minutes = sum(intensity == 3),
    fairly_active_minutes = sum(intensity == 2),
    lightly_active_minutes = sum(intensity == 1),
    sedentary_minutes = sum(intensity == 0),
  )
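The aggregation relies on the fact that summing a logical test such as intensity == 3 counts the minutes at that level. The base-R sketch below reproduces the idea on made-up records (toy_intensity and its values are hypothetical) using aggregate() instead of dplyr.

```r
# Hypothetical minute-level intensity records for two users on one day
toy_intensity <- data.frame(
  id        = c("A", "A", "A", "A", "B", "B"),
  date      = as.Date(rep("2016-04-12", 6)),
  intensity = c(0, 1, 1, 3, 0, 2)
)

# One 0/1 indicator per intensity level; summing an indicator counts minutes
toy_intensity$very_active_minutes    <- as.integer(toy_intensity$intensity == 3)
toy_intensity$fairly_active_minutes  <- as.integer(toy_intensity$intensity == 2)
toy_intensity$lightly_active_minutes <- as.integer(toy_intensity$intensity == 1)
toy_intensity$sedentary_minutes      <- as.integer(toy_intensity$intensity == 0)

# Daily totals per user
toy_daily_intensities <- aggregate(
  cbind(very_active_minutes, fairly_active_minutes,
        lightly_active_minutes, sedentary_minutes) ~ id + date,
  data = toy_intensity,
  FUN  = sum
)

# The four categories partition the records, so each row total
# equals the number of minutes recorded for that user and day
minutes_recorded <- rowSums(toy_daily_intensities[, 3:6])
```

For a user-day with a record for every minute, the four counts would add up to 1,440.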

Aggregation test for the daily sleep data

As mentioned earlier, there were no records of the methods used by the original source to create the files, and the only column in the original minute_sleep data frames, apart from ID and date, was named “value”, which could take a number from 1 to 3. We were able to determine what these values represented by running an aggregation test with the prospective data and comparing its results with the aggregated daily sleep data that was included in the prospective files.

The test was performed as follows:

# First aggregation of minute_sleep_prosp_clean by date
daily_sleep_prosp_test <-
  minute_sleep_prosp_clean %>% mutate (date = date(date)) %>%
  group_by(id, date, log_id) %>% summarize (
    subtotal_1 = sum(value == 1),
    subtotal_2 = sum(value == 2),
    subtotal_3 = (sum(value == 3))
  )

# If sleep log_id spans two dates (night-morning),
# assign all minutes to a.m. date
daily_sleep_prosp_test$date <-
  if_else(
    daily_sleep_prosp_test$log_id !=
      lead(daily_sleep_prosp_test$log_id) |
      is.na(lead(daily_sleep_prosp_test$log_id)),
    daily_sleep_prosp_test$date,
    lead(daily_sleep_prosp_test$date)
  )

# Ensure the column keeps the Date class (base ifelse() would drop it)
class(daily_sleep_prosp_test$date) <- "Date"

# Aggregate sleep to evaluate values 1, 2, and 3
daily_sleep_prosp_test <-
  daily_sleep_prosp_test %>% group_by(id, date) %>%
  summarize(
    value_1 = sum(subtotal_1),
    value_2 = sum(subtotal_2),
    value_3 = sum(subtotal_3)
  )

# Compare test with original aggregated sleep data
daily_sleep_prosp_test[1:4,] %>% arrange(id, date) %>%
  kbl(caption = "Table 5.2.1 Aggregation test: prospective daily sleep") %>%
  kable_custom()
Table 5.2.1 Aggregation test: prospective daily sleep
id date value_1 value_2 value_3
1503960366 2016-04-12 327 13 6
1503960366 2016-04-13 384 11 12
1503960366 2016-04-15 412 22 8
1503960366 2016-04-16 340 19 8
sleep_day_prosp_clean[1:4,] %>% arrange(id, sleep_day)%>%
  kbl(caption = "Table 5.2.2 Actual data: prospective daily sleep") %>%
  kable_custom()
Table 5.2.2 Actual data: prospective daily sleep
id sleep_day total_sleep_records total_minutes_asleep total_time_in_bed
1503960366 2016-04-12 1 327 346
1503960366 2016-04-13 2 384 407
1503960366 2016-04-15 1 412 442
1503960366 2016-04-16 2 340 367

By comparing the test results shown in Table 5.2.1 with the values from the original aggregated file of the prospective set, listed in Table 5.2.2, it became evident that a value of 1 represented minutes asleep, while the combined count of values 1, 2, and 3 represented total minutes in bed.
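This interpretation can be confirmed from the four rows printed in the two tables. The base-R sketch below copies those rows into data frames (named test_rows and actual_rows here for illustration) and checks both relationships.

```r
# Rows copied from Table 5.2.1 (aggregation test)
test_rows <- data.frame(
  value_1 = c(327, 384, 412, 340),
  value_2 = c(13, 11, 22, 19),
  value_3 = c(6, 12, 8, 8)
)

# Rows copied from the original aggregated daily sleep file
actual_rows <- data.frame(
  total_minutes_asleep = c(327, 384, 412, 340),
  total_time_in_bed    = c(346, 407, 442, 367)
)

# Value 1 alone reproduces minutes asleep
asleep_matches <- all(test_rows$value_1 == actual_rows$total_minutes_asleep)

# Values 1, 2, and 3 together reproduce time in bed
in_bed_matches <- all(rowSums(test_rows) == actual_rows$total_time_in_bed)
```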

Daily aggregation of sleep data

Since the aggregation test on the prospective sleep data reproduced the values of the original aggregated file, the same process can be used to aggregate the merged sleep data.
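The date-reassignment rule, in which a sleep log that spans midnight is attributed entirely to its later date, can be illustrated on a small example. The base-R sketch below is an alternative formulation on made-up data (toy_logs and its values are hypothetical): it assigns every row of a log to that log's latest date and then sums by date. It is not the lead()-based code used in this report, and the two are only assumed to agree in simple cases such as this one.

```r
# Hypothetical per-date subtotals: log 1 spans two dates, log 2 does not
toy_logs <- data.frame(
  log_id          = c(1, 1, 2),
  date            = as.Date(c("2016-04-12", "2016-04-13", "2016-04-14")),
  subtotal_in_bed = c(60, 300, 420)
)

# Latest date of each log, computed on the numeric form of the dates
latest_date <- ave(as.numeric(toy_logs$date), toy_logs$log_id, FUN = max)
toy_logs$date <- as.Date(latest_date, origin = "1970-01-01")

# Sum the subtotals by the reassigned date
toy_daily_sleep <- aggregate(subtotal_in_bed ~ date,
                             data = toy_logs, FUN = sum)
```

All 360 minutes of log 1 end up on 2016-04-13, and the 420 minutes of log 2 stay on 2016-04-14.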

# First aggregation of minute_sleep_merged_clean by date
daily_sleep_clean <-
  minute_sleep_merged_clean %>% mutate (date = date(date)) %>%
  group_by(id, date, log_id) %>% summarize (
    subtotal_asleep = sum(value == 1),
    subtotal_in_bed = sum(value == 1) + sum(value == 2) +
      sum(value == 3)
  )

# If sleep log_id spans two dates (night-morning),
# assign all minutes to a.m. date
daily_sleep_clean$date <-
  if_else(
    daily_sleep_clean$log_id != lead(daily_sleep_clean$log_id) |
      is.na(lead(daily_sleep_clean$log_id)),
    daily_sleep_clean$date,
    lead(daily_sleep_clean$date)
  )

# Ensure the column keeps the Date class (base ifelse() would drop it)
class(daily_sleep_clean$date) <- "Date"

# Aggregate the subtotals for minutes asleep and time in bed
daily_sleep_clean <-
  daily_sleep_clean %>% group_by(id, date) %>%
  summarize(
    minutes_asleep = sum(subtotal_asleep),
    time_in_bed = sum(subtotal_in_bed)
  )

daily_sleep_clean <-
  daily_sleep_clean %>% arrange(id, date)

# Set results = 'asis' to display HTML table from function
df_preview(daily_sleep_clean, tbl_num = "5.2.3")
Table 5.2.3 Data preview: daily_sleep_clean
id date minutes_asleep time_in_bed
1503960366 2016-03-13 411 426
1503960366 2016-03-14 354 386
1503960366 2016-03-15 312 335
1503960366 2016-03-16 333 366

Start date: 2016-03-12
End date: 2016-05-12
Columns: 4
Rows: 832
Unique IDs: 25
Empty cells: 0
N/A cells: 0
Duplicate rows: 0

Column types: | id <character> | date <Date> | minutes_asleep <integer> | time_in_bed <integer>

5.3 Per-minute Aggregation of Heart Rate Data

Considering that Bellabeat’s Leaf device provides guidance on stress, this analysis addressed the subject by exploring the relationship between stress and Fitbit’s data on heart rate. However, to factor out the increases in heart rate due to intense activity levels, heart rate values were analyzed during periods of low or no activity (lightly active minutes or sedentary minutes). For this purpose, the heart rate data, which was recorded several times per minute, was averaged on a minute-by-minute basis to match the time intervals of the data on intensity levels.

Aggregation of heart rate data

# Minute aggregation of heart rate 
minute_heartrate_clean <-
  heartrate_seconds_merged_clean %>%
  mutate(time = floor_date(ymd_hms(time), unit = "1 min")) %>%
  group_by(id, time) %>%
  summarize(heartrate_minute_avg = mean(value)) %>%
  arrange(id, time)
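The per-minute averaging can be illustrated on a few made-up readings. The base-R sketch below (toy_hr and its values are hypothetical) truncates each timestamp to its minute with format() rather than lubridate's floor_date(), then averages the readings that fall in the same minute.

```r
# Hypothetical heart rate readings: three in one minute, one in the next
toy_hr <- data.frame(
  id   = "A",
  time = as.POSIXct(
    c("2016-04-12 07:21:00", "2016-04-12 07:21:05",
      "2016-04-12 07:21:10", "2016-04-12 07:22:00"),
    tz = "UTC"
  ),
  value = c(97, 102, 105, 103)
)

# Truncate each timestamp to the start of its minute (as a text key)
toy_hr$minute <- format(toy_hr$time, "%Y-%m-%d %H:%M:00")

# Average the readings within each user-minute
toy_minute_avg <- aggregate(value ~ id + minute, data = toy_hr, FUN = mean)
```

The three readings at 07:21 average to about 101.3, and the single reading at 07:22 stays at 103.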

5.4 Merging Heart Rate Per-minute Average and Intensities

After deriving the average heart rate for each minute from the heartrate_seconds_merged_clean data frame, the heart rate data could be merged with the data on intensity levels, minute_intensities, which tracked activity information on minute-by-minute intervals.

Merging heart rate and intensities

# Merge heart rate minute averages and minute intensities
heartrate_intensities <-
  merge(
    x = minute_heartrate_clean,
    y = minute_intensities_merged_clean,
    by.x = c("id", "time"),
    by.y = c("id", "activity_minute")
  )
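As a hedged illustration of what this merge does, the sketch below uses two small made-up data frames (hr_min_toy and intens_toy are hypothetical names and values). By default, merge() keeps only the rows whose keys appear in both inputs, so minutes without a heart rate reading drop out, and by.x and by.y allow the key columns to carry different names in each input.

```r
# Made-up per-minute heart rate averages
hr_min_toy <- data.frame(
  id = c("A", "A", "B"),
  time = c("07:21", "07:22", "07:21"),
  heartrate_minute_avg = c(100, 95, 88)
)

# Made-up per-minute intensities; user A has one extra minute (07:23)
intens_toy <- data.frame(
  id = c("A", "A", "A", "B"),
  activity_minute = c("07:21", "07:22", "07:23", "07:21"),
  intensity = c(0, 1, 0, 0)
)

# Inner join on (id, minute); the key columns keep the names used in x
merged_toy <- merge(
  x = hr_min_toy,
  y = intens_toy,
  by.x = c("id", "time"),
  by.y = c("id", "activity_minute")
)
merged_toy
```

The minute 07:23 for user A has no heart rate average, so it does not appear in the result.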

After merging the heart rate and intensity data, resting heart rate, which is a more appropriate indicator for health and stress, was derived by keeping only the minutes recorded at the lowest intensity level (intensity == 0 in the code below), which filters out periods of moderate or high physical activity.

# Calculate daily resting heart rate
daily_resting_heartrate <-
  heartrate_intensities %>% mutate(date = as.Date(time)) %>%
  filter(intensity == 0) %>% group_by(id, date) %>%
  summarize(avg_resting_heartrate = mean(heartrate_minute_avg))

# Set results = 'asis' to display HTML table from function
df_preview(daily_resting_heartrate, tbl_num = "5.4.1")
Table 5.4.1 Data preview: daily_resting_heartrate
id date avg_resting_heartrate
2022484408 2016-04-01 73
2022484408 2016-04-02 67
2022484408 2016-04-03 65
2022484408 2016-04-04 67

Start date: 2016-03-29
End date: 2016-05-12
Columns: 3
Rows: 469
Unique IDs: 15
Empty cells: 0
N/A cells: 0
Duplicate rows: 0

Column types: | id <character> | date <Date> | avg_resting_heartrate <numeric>

5.5 New Daily Activity: Merging Steps, Intensities, and Sleep Data

As discussed earlier, the original aggregated data was inconsistent in both the retrospective and prospective sets because the revised sedentary minutes were calculated only for the 24 participants who provided sleep data, whereas the users without sleep data had unrevised values for this variable.

To ensure that the data was consistent, we created two different daily activity data frames for distinct purposes: one included all the participants but excluded the sedentary minutes variable, while the second data frame included only the users who had sleep data and, therefore, had the revised (actual) values for sedentary minutes. The former data frame is appropriate for categorization analysis concerning the device features used by all participants, whereas the latter is ideal for a more precise analysis on sedentary behaviors that could be compared to real-world numbers.

Merging daily activity for users with sleep data

# Merge sleep, steps, and intensities for users with sleep data
daily_activity_sleep_users <-
  Reduce(
    function(x, y)
      merge(
        x = x,
        y = y,
        by = c("id", "date")
      ),
    list(
      daily_steps_clean,
      daily_intensities_clean,
      daily_sleep_clean
    )
  )
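The Reduce() and merge() pattern above can be illustrated with a small sketch on made-up data (all *_toy names and values are assumptions). Reduce() applies the two-argument merge function from left to right across the list. With the default inner join, only the user-days present in every data frame survive; with all.x = TRUE, as in the all-users merge later in this section, every row of the first data frame is kept and missing values become NA.

```r
# Made-up daily data; only user A on 2016-04-12 has sleep data
steps_toy <- data.frame(
  id = c("A", "A", "B"),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  total_steps = c(9000, 7000, 4000)
)
intens_toy <- data.frame(
  id = c("A", "A", "B"),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  very_active_minutes = c(30, 10, 5)
)
sleep_toy <- data.frame(
  id = "A",
  date = as.Date("2016-04-12"),
  minutes_asleep = 400
)
frames <- list(steps_toy, intens_toy, sleep_toy)

# Inner join: keeps only user-days found in all three data frames
inner_toy <- Reduce(
  function(x, y) merge(x, y, by = c("id", "date")),
  frames
)

# Left join: keeps every user-day of the first data frame
left_toy <- Reduce(
  function(x, y) merge(x, y, by = c("id", "date"), all.x = TRUE),
  frames
)

nrow(inner_toy)
nrow(left_toy)
```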

Sleep efficiency and actual sedentary minutes

# Calculate actual sedentary minutes and sleep efficiency
daily_activity_sleep_users <-
  daily_activity_sleep_users %>%
  mutate(
    sedentary_minutes = sedentary_minutes - time_in_bed,
    sleep_efficiency = minutes_asleep / time_in_bed) %>%
  arrange(id, date)
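As a quick check of the two formulas, the sketch below reuses the values shown for 2016-03-13 in Table 5.6.1. The raw sedentary value is back-calculated from that table (adjusted sedentary minutes plus time in bed) and is therefore an assumption, not a figure quoted from the original file.

```r
# Values for 2016-03-13 as shown in Table 5.6.1
minutes_asleep <- 411
time_in_bed <- 426
sedentary_adjusted <- 680

# Sleep efficiency: share of the time in bed actually spent asleep
sleep_efficiency <- minutes_asleep / time_in_bed
round(sleep_efficiency, 2)

# Assumed raw sedentary minutes, back-calculated from the table
sedentary_raw <- sedentary_adjusted + time_in_bed

# The adjustment subtracts the time in bed from the raw value
sedentary_raw - time_in_bed
```

The rounded efficiency matches the 0.96 shown in the table for that day.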

Merging daily activity for all users

# Merge steps and intensities for all users (including users with no sleep data)
daily_activity_all_users <-
  Reduce(
    function(x, y)
      merge(
        x = x,
        y = y,
        by = c("id", "date"),
        all.x = TRUE),
    list(daily_steps_clean, daily_intensities_clean))

# When users without sleep data are included,
# sedentary minutes are not realistic, remove values
daily_activity_all_users <- daily_activity_all_users %>%
  select(!sedentary_minutes) %>% arrange(id, date)

5.6 Additional Cleaning Steps: Unrealistic Data and IDs Inconsistencies

Under normal circumstances, it is highly unlikely for a participant to take zero steps during an entire day. The rows in which the daily steps were equal to zero were therefore removed to make the analysis more accurate, as these cases were probably the result of not wearing the tracking device on those days rather than an indication of sedentary behavior.

# Remove unrealistic values in daily activity for users with sleep data
daily_activity_sleep_users <-
  daily_activity_sleep_users[daily_activity_sleep_users$total_steps != 0, ]

# Remove unrealistic values in daily activity for all users
daily_activity_all_users <-
  daily_activity_all_users[daily_activity_all_users$total_steps != 0, ]

To ensure the integrity of the analysis, a closer look was taken at key data frames to make sure that the data was collected from the same users across the two different periods (retrospective and prospective). In order to determine this, we evaluated the retrospective and prospective data frames to identify the ID numbers that were not common between the two sets of data.

# Find IDs in the retrospective data that are not in the prospective data
uncommon_ids_retro <-
  distinct(anti_join(
    minute_steps_retro_clean, minute_steps_prosp_clean,
    by = "id"), id)

# Find IDs in the prospective data that are not in the retrospective data
uncommon_ids_prosp <-
  distinct(anti_join(
    minute_steps_prosp_clean, minute_steps_retro_clean,
    by = "id"), id)

uncommon_ids <-
  merge(uncommon_ids_retro, uncommon_ids_prosp, all = TRUE)

uncommon_ids
##           id
## 1 2891001357
## 2 4388161847
## 3 6391747486

# Remove the three IDs that are not common across the time frames
daily_activity_sleep_users <- daily_activity_sleep_users[
      daily_activity_sleep_users$id != "2891001357" &
      daily_activity_sleep_users$id != "4388161847" &
      daily_activity_sleep_users$id != "6391747486",]

# Remove the three IDs that are not common across the time frames
daily_activity_all_users <- daily_activity_all_users[
    daily_activity_all_users$id != "2891001357" &
    daily_activity_all_users$id != "4388161847" &
    daily_activity_all_users$id != "6391747486",]
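The same two steps, finding the IDs that are not shared and then dropping them, can also be sketched in base R with setdiff() and %in%. This is only an alternative illustration on made-up IDs (retro_ids, prosp_ids, and activity_toy are hypothetical), not the code used to produce the results in this report.

```r
# Made-up IDs for the two periods
retro_ids <- c("1001", "1002", "1003")
prosp_ids <- c("1002", "1003", "1004")

# IDs that appear in only one of the two periods
uncommon_toy <- union(
  setdiff(retro_ids, prosp_ids),
  setdiff(prosp_ids, retro_ids)
)

# Made-up activity data covering all four IDs
activity_toy <- data.frame(
  id = c("1001", "1002", "1003", "1004"),
  total_steps = c(5000, 8000, 12000, 3000)
)

# Keep only the rows whose ID is common to both periods
activity_common <- activity_toy[!(activity_toy$id %in% uncommon_toy), ]
activity_common
```

Filtering with %in% avoids hard-coding each ID, which may help if the list of uncommon IDs changes.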

df_preview(daily_activity_sleep_users, tbl_num = "5.6.1")
Table 5.6.1 Data preview: daily_activity_sleep_users
id date total_steps very_active_minutes fairly_active_minutes lightly_active_minutes sedentary_minutes minutes_asleep time_in_bed sleep_efficiency
1503960366 2016-03-13 17,106 105 13 216 680 411 426 0.96
1503960366 2016-03-14 10,023 33 5 260 756 354 386 0.92
1503960366 2016-03-15 15,384 37 18 348 702 312 335 0.93
1503960366 2016-03-16 13,498 44 28 246 756 333 366 0.91

Start date: 2016-03-12
End date: 2016-05-12
Columns: 10
Rows: 808
Unique IDs: 24
Empty cells: 0
N/A cells: 0
Duplicate rows: 0

Column types: | id <character> | date <Date> | total_steps <numeric> | very_active_minutes <integer> | fairly_active_minutes <integer> | lightly_active_minutes <integer> | sedentary_minutes <integer> | minutes_asleep <integer> | time_in_bed <integer> | sleep_efficiency <numeric>

df_preview(daily_activity_all_users, tbl_num = "5.6.2")
Table 5.6.2 Data preview: daily_activity_all_users
id date total_steps very_active_minutes fairly_active_minutes lightly_active_minutes
1503960366 2016-03-12 19,675 94 24 266
1503960366 2016-03-13 17,106 105 13 216
1503960366 2016-03-14 10,023 33 5 260
1503960366 2016-03-15 15,384 37 18 348

Start date: 2016-03-12
End date: 2016-05-12
Columns: 6
Rows: 1,651
Unique IDs: 32
Empty cells: 0
N/A cells: 0
Duplicate rows: 0

Column types: | id <character> | date <Date> | total_steps <numeric> | very_active_minutes <integer> | fairly_active_minutes <integer> | lightly_active_minutes <integer>

5.7 Merging Device Additional Features

The usage of the various device features, which could be a key component in the development of a marketing strategy, was obtained by merging the data from the following data frames:

  • Sleep monitoring
  • Heart monitoring
  • Distance log
  • Weight log

The above data frames were merged to obtain the usage according to both the number of users and the number of days used.

Merging daily device features according to number of users

# Get logged activities distance feature from original data
daily_logged_distance <-
  rbind(daily_activity_retro_clean[, c(
    "id",
    "activity_date",
    "logged_activities_distance")], daily_activity_prosp_clean[, c(
    "id", "activity_date",
    "logged_activities_distance")])

daily_logged_distance <-
  daily_logged_distance %>% rename(date = activity_date)

# Select only relevant variables from the weight data
daily_weight <-
  weight_merged_clean %>% mutate(date = as.Date(date)) %>%
  select(id, date, weight_pounds)


# Merge remaining device features
daily_features_all_users <-
  Reduce(
    function(x, y)
      merge(
        x = x,
        y = y,
        by = c("id", "date"),
        all.x = TRUE
      ),
    list(
      daily_activity_all_users,
      daily_sleep_clean,
      daily_resting_heartrate,
      daily_weight,
      daily_logged_distance
    )
  )

# time_in_bed is not needed as a feature indicator;
# minutes_asleep alone flags sleep monitoring. Remove the column
daily_features_all_users <-
  daily_features_all_users %>% select(-time_in_bed)

# Flag feature usage per user and day: 0 = not used, 1 = used
feature_usage_check <-
  daily_features_all_users %>% rename(
    sleep_monitoring = minutes_asleep,
    heart_rate_monitoring = avg_resting_heartrate,
    distance_log = logged_activities_distance,
    weight_log = weight_pounds) %>%
  mutate(
    sleep_monitoring = ifelse(is.na(sleep_monitoring), 0, 1),
    heart_rate_monitoring = ifelse(
      is.na(heart_rate_monitoring), 0, 1),
    distance_log = ifelse(
      is.na(distance_log) | distance_log == 0, 0, 1),
    weight_log = ifelse(
      is.na(weight_log), 0, 1)
  )

users_per_feature <-
  feature_usage_check %>% group_by(date) %>% summarize(
    sleep_monitoring = sum(sleep_monitoring),
    heart_rate_monitoring = sum(heart_rate_monitoring),
    weight_log = sum(weight_log),
    distance_log = sum(distance_log)
  )
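The flag-and-sum logic above can be sketched on a small made-up data frame (features_toy and its values are assumptions). A feature counts as used on a given day when its column is not missing; for the distance log, a logged value of zero is also treated as not used. Summing the 0/1 flags by date then gives the number of users per feature and day.

```r
# Made-up user-days: NA means the feature produced no record that day
features_toy <- data.frame(
  id = c("A", "B", "C", "A", "B", "C"),
  date = as.Date(rep(c("2016-04-12", "2016-04-13"), each = 3)),
  minutes_asleep = c(400, NA, 380, NA, NA, 410),
  logged_activities_distance = c(0, 2.5, NA, 1.2, 0, 0)
)

# 1 = used, 0 = not used
features_toy$sleep_monitoring <-
  as.integer(!is.na(features_toy$minutes_asleep))

dist <- features_toy$logged_activities_distance
features_toy$distance_log <- as.integer(!is.na(dist) & dist != 0)

# Number of users of each feature per day
users_toy <- aggregate(
  cbind(sleep_monitoring, distance_log) ~ date,
  data = features_toy,
  FUN = sum
)
users_toy
```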

kbl(users_per_feature,
    align = "c",
    caption = "<left>Table 5.7.1 Number of users of each device feature</left>") %>%
  kable_custom %>% scroll_box(width = "100%", height = "200px")
Table 5.7.1 Number of users of each device feature
date sleep_monitoring heart_rate_monitoring weight_log distance_log
2016-03-12 14 0 0 0
2016-03-13 14 0 0 0
2016-03-14 15 0 0 0
2016-03-15 16 0 0 0
2016-03-16 14 0 0 0
2016-03-17 14 0 0 0
2016-03-18 12 0 0 0
2016-03-19 11 0 0 0
2016-03-20 11 0 0 0
2016-03-21 13 0 0 0
2016-03-22 9 0 0 0
2016-03-23 12 0 0 0
2016-03-24 13 0 0 0
2016-03-25 13 0 0 0
2016-03-26 14 0 0 0
2016-03-27 13 0 0 0
2016-03-28 14 0 0 0
2016-03-29 15 1 0 0
2016-03-30 16 2 2 1
2016-03-31 15 2 1 0
2016-04-01 16 12 2 1
2016-04-02 17 13 1 0
2016-04-03 14 11 2 0
2016-04-04 13 11 3 2
2016-04-05 15 12 3 2
2016-04-06 16 12 3 3
2016-04-07 14 12 4 3
2016-04-08 11 12 3 3
2016-04-09 12 12 2 0
2016-04-10 12 10 2 0
2016-04-11 13 10 2 2
2016-04-12 25 19 4 4
2016-04-13 14 10 3 2
2016-04-14 13 11 2 2
2016-04-15 16 12 1 0
2016-04-16 13 11 2 0
2016-04-17 11 12 3 0
2016-04-18 9 10 3 2
2016-04-19 13 9 2 2
2016-04-20 14 11 2 2
2016-04-21 14 10 3 2
2016-04-22 12 11 1 1
2016-04-23 14 10 2 0
2016-04-24 12 11 2 0
2016-04-25 12 10 3 3
2016-04-26 13 11 1 1
2016-04-27 13 10 2 1
2016-04-28 15 11 2 0
2016-04-29 15 10 2 1
2016-04-30 14 10 2 0
2016-05-01 15 10 3 0
2016-05-02 12 11 3 2
2016-05-03 12 9 3 2
2016-05-04 11 9 3 0
2016-05-05 11 9 1 2
2016-05-06 12 10 2 1
2016-05-07 12 8 1 0
2016-05-08 13 7 2 0
2016-05-09 10 9 3 2
2016-05-10 11 7 1 1
2016-05-11 10 7 2 1
2016-05-12 8 6 3 0

Participants’ Daily Activity Average

# Calculate activity daily average for all users
avg_activity_all_users <-
  daily_features_all_users %>% group_by(id) %>%
  summarize(
    daily_steps_avg = mean(total_steps),
    very_active_minutes_avg = mean(very_active_minutes),
    fairly_active_minutes_avg = mean(fairly_active_minutes),
    lightly_active_minutes_avg = mean(lightly_active_minutes))

# Calculate activity daily average for users with sleep data
avg_activity_sleep_users <-
  daily_activity_sleep_users %>% group_by(id) %>%
  summarize(
    daily_steps_avg = mean(total_steps),
    very_active_minutes_avg = mean(very_active_minutes),
    fairly_active_minutes_avg = mean(fairly_active_minutes),
    lightly_active_minutes_avg = mean(lightly_active_minutes),
    sedentary_minutes_avg = mean(sedentary_minutes),
    minutes_asleep_avg = mean(minutes_asleep),
    time_in_bed_avg = mean(time_in_bed))

Days of Usage of the Additional Features

# Calculate days of usage for each additional feature
feature_days_per_user <- feature_usage_check %>% group_by(id) %>%
  summarize(
    sleep_monitoring = sum(sleep_monitoring),
    heart_rate_monitoring = sum(heart_rate_monitoring),
    weight_log = sum(weight_log),
    distance_log = sum(distance_log))

avg_and_features <- merge(
  x = avg_activity_all_users,
  y = feature_days_per_user,
  by = "id")

6 Analysis and Visualizations

6.1 Popularity of Additional Features

The main device features, such as the pedometer and the intensity levels, may be considered basic functionalities, as they were largely used by all participants in a similar manner, irrespective of their lifestyles. On the other hand, a measurement that could be useful for the development of a marketing strategy is the popularity of the additional features.

In that respect, a large part of this analysis consisted in measuring the usage of the additional features not only in a general sense but also across different user lifestyles or health goals. The additional features that were analyzed in this manner were: sleep monitoring, heart rate monitoring, distance log, and weight log.

Additional features used during the retrospective period

theme_set(theme_classic())

# Select retrospective data and convert to long for bar chart
feature_users_retrospective_long <- users_per_feature %>%
  gather(additional_feature, users,
         sleep_monitoring:distance_log,
         factor_key = TRUE) %>%
  filter(date < "2016-04-12", users != 0)

# Order features by level to be consistent across different plots
feature_users_retrospective_long$additional_feature <-
  factor(
    feature_users_retrospective_long$additional_feature,
    levels = c(
      "sleep_monitoring",
      "heart_rate_monitoring",
      "weight_log",
      "distance_log"
    )
  )

ggplot(data = feature_users_retrospective_long,
       aes(
         x = date,
         y = users,
         fill = additional_feature,
         label = users
       )) +
  geom_bar(stat = "identity") +
  geom_text(size = 2, position = position_stack(vjust = 0.5)) +
  ggtitle("Figure 6.1.1 Retrospective: Additional Features Used by the Participants") +
  scale_x_date(date_breaks = "6 days" , date_labels = "%b %d")+
  ylab("Number of users") +
  theme(
    plot.title = element_text(size = 11),
    axis.title.x = element_blank()) +
  scale_fill_discrete(
    name = "Additional Features",
    labels = c(
      "Sleep monitoring",
      "Heart rate monitoring",
      "Weight log",
      "Distance log"
    )
  )
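The wide-to-long conversion performed before plotting can be sketched in base R as follows. The two rows reuse the sleep and heart rate counts shown for April 1 and 2 in Table 5.7.1; the object names (users_wide_toy, users_long_toy) are hypothetical, and this is only an illustration of the reshaping idea, not a replacement for the gather() call above.

```r
# Two days and two features taken from Table 5.7.1 (wide format)
users_wide_toy <- data.frame(
  date = as.Date(c("2016-04-01", "2016-04-02")),
  sleep_monitoring = c(16, 17),
  heart_rate_monitoring = c(12, 13)
)

feature_cols <- c("sleep_monitoring", "heart_rate_monitoring")

# Long format: one row per date and feature
users_long_toy <- data.frame(
  date = rep(users_wide_toy$date, times = length(feature_cols)),
  additional_feature = factor(
    rep(feature_cols, each = nrow(users_wide_toy)),
    levels = feature_cols
  ),
  users = unlist(users_wide_toy[feature_cols], use.names = FALSE)
)
users_long_toy
```

Keeping the feature as a factor with explicit levels is what fixes the stacking and legend order across plots.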

During the first half of the retrospective period (see Figure 6.1.1), sleep monitoring was the only additional feature in use. Over the whole retrospective period, its daily user count peaked at 17 (see Table 5.7.1). From the 18th day (March 29) onward, many of the participants started using the other device features, especially heart rate monitoring, which suggests that they were probably unaware of the extra functionalities, or did not take the time to learn how to use them, until they received further instructions.

Based on the total number of users, heart rate monitoring was the second-most used additional feature, while the weight log and distance log came in third and fourth place, respectively. This pattern continued in the prospective period with no major changes (see Figure 6.1.2).

Additional features used during the prospective period

# Select prospective data and convert to long for bar chart
feature_users_prospective_long <- users_per_feature %>%
  gather(additional_feature, users,
         sleep_monitoring:distance_log,
         factor_key = TRUE) %>%
  filter(date >= "2016-04-12", users != 0)

ggplot(data = feature_users_prospective_long,
       aes(
         x = date,
         y = users,
         fill = additional_feature,
         label = users
       )) +
  geom_bar(stat = "identity") +
  geom_text(size = 2, position = position_stack(vjust = 0.5)) +
  ggtitle("Figure 6.1.2 Prospective: Additional Features Used by the Participants") +
  scale_x_date(date_breaks = "7 days" , date_labels = "%b %d")+
  ylab("Number of users") +
  theme(
    plot.title = element_text(size = 11),
    axis.title.x = element_blank()) +
  scale_fill_discrete(
    name = "Additional features",
    labels = c(
      'Sleep monitoring',
      'Heart rate monitoring',
      'Weight log',
      'Distance log'
    )
  )

Note that the participants who used heart rate monitoring were mostly the same users from one day to the next, whereas the weight log users were spread across different dates, as expected for a feature that is used only occasionally. This is why, despite the large difference in the number of participants who used these two features on any given day, the total number of users over the entire period was very similar: 12 participants used the weight log, while 13 used heart rate monitoring (see Figure 6.5.1).

6.2 User Categorization Based on Health Targets

Throughout the years, many organizations and researchers have established various measures to categorize individuals based on their physical activity habits; however, such definitions do not necessarily agree with each other and are often contradictory. For instance, even though past studies have concluded that 10,000 steps per day were required to obtain significant health benefits or that intensity levels were more important than steps, new findings imply that such recommendations must be revised.

A recent research study (see Saint-Maurice et al. 2020), conducted by investigators from the National Institutes of Health (NIH), National Cancer Institute (NCI), National Institute on Aging (NIA), and the Centers for Disease Control and Prevention (CDC), showed not only that 8,000 daily steps were enough to achieve significant health benefits but also that the intensity level did not influence the risk of death once the number of daily steps is accounted for.

Consequently, the number of daily steps is a good measure to classify users according to the magnitude of the health benefits they can derive from physical activity; nevertheless, there are circumstances in which it is advantageous to classify participants based on the intensity level because this measure reflects users’ preferences for different types of physical activities and could be used not only as a health indicator but also for marketing purposes.

On that account, this analysis categorized users based on two criteria, daily steps on the one hand, and the preference for different intensity levels on the other.

User categorization based on daily steps recommendations

According to Saint-Maurice et al. (2020), 8,000 daily steps are enough to achieve significant health benefits, so this figure was taken as the baseline for what could be considered an active person based on the number of steps. The same study reported that users with over 12,000 daily steps obtained further benefits: the mortality rate fell from 6.9 per 1,000 person-years at 8,000 daily steps to 4.8 for those who walked more than 12,000 steps per day. By contrast, users with 4,000 to 7,999 steps per day had a mortality rate of 21.4. Since the latter is still considered a high mortality rate, the participants with fewer than 8,000 daily steps were grouped in the same category as the sedentary participants, and this category was named sedentary-low active.

Therefore, the user categorization based on steps was as follows:

  • Sedentary-low active: fewer than 8,000 steps daily
  • Active: 8,000 to 12,000 steps daily
  • Highly active: more than 12,000 steps daily

categories_steps <- avg_and_features %>%
  mutate(
    lifestyle = case_when(
      daily_steps_avg > 12000 ~ "Highly active",
      daily_steps_avg >= 8000 & daily_steps_avg <= 12000 ~ "Active",
      daily_steps_avg < 8000 ~ "Sedentary-low active"
    )
  )

categories_steps$lifestyle <-
  factor(
    categories_steps$lifestyle,
    levels = c("Highly active", "Active", "Sedentary-low active")
  )

steps_distribution <-
  categories_steps %>% count(lifestyle, name = "users") %>%
  mutate(
    users_percent = percent(users / sum(users)),
    distribution_type = "steps guidelines")
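To show how the boundary values fall into each category, the following sketch applies the same thresholds with nested ifelse() calls in base R. The helper classify_steps() is a hypothetical name introduced only for this illustration.

```r
# Hypothetical helper mirroring the step thresholds used above
classify_steps <- function(daily_steps_avg) {
  ifelse(
    daily_steps_avg > 12000, "Highly active",
    ifelse(daily_steps_avg >= 8000, "Active", "Sedentary-low active")
  )
}

# Check the values on each side of the two thresholds
boundary_steps <- c(7999, 8000, 12000, 12001)
boundary_labels <- classify_steps(boundary_steps)
data.frame(daily_steps_avg = boundary_steps, lifestyle = boundary_labels)
```

An average of exactly 8,000 or exactly 12,000 steps falls in the active category, consistent with the list above.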

User categorization based on intensity level guidelines

As discussed previously, there are cases in which it is more appropriate to classify users based on their preference for a certain type of exercise or intensity level.

In that sense, and considering that both Fitbit algorithms and the U.S. Department of Health and Human Services (HHS) use the same criteria in terms of metabolic equivalent of task (MET) when referencing different intensity levels, we classified the participants based on the recommendations specified by the HHS in its Physical Activity Guidelines for Americans, which state that, “for substantial health benefits, adults should do 150 to 300 minutes a week of moderate-intensity physical activity, or 75 to 150 minutes a week of vigorous-intensity physical activity”. The guidelines also affirm that individuals can achieve additional health benefits by “engaging in physical activity beyond the equivalent of 300 minutes of moderate-intensity physical activity a week”.

Thus, we also classified the participants based on those recommendations, but we grouped inactive and insufficiently active users under the same category because the health outcome would still be undesirable in both of those cases. Hence, the user categorization based on intensity level recommendations was as follows:

  • Sedentary-low active: less than 150 minutes of moderate-intensity physical activity per week (less than 21 minutes per day)
  • Active: 150 to 300 minutes of moderate-intensity physical activity per week (21 to 43 minutes per day) or 75 to 150 minutes of vigorous-intensity physical activity per week (11 to 21 minutes per day)
  • Highly active: more than 300 minutes of moderate-intensity physical activity per week (more than 43 minutes per day) or more than 150 minutes of vigorous-intensity physical activity per week (more than 21 minutes per day)

categories_intensities <- avg_and_features %>%
  mutate(
    lifestyle = case_when(
      very_active_minutes_avg > 21 |
        fairly_active_minutes_avg > 43
      ~ "Highly active",
      (very_active_minutes_avg >= 11 &
         very_active_minutes_avg <= 21) |
        (fairly_active_minutes_avg >= 21 &
           fairly_active_minutes_avg <= 43)
      ~ "Active",
      TRUE
      ~ "Sedentary-low active")
  )
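The daily thresholds used in this classification follow from dividing the weekly recommendations by seven and rounding to whole minutes. The sketch below reproduces that arithmetic; the vector names are hypothetical labels chosen for this illustration.

```r
# Weekly recommendations (minutes per week) quoted above
weekly_minutes <- c(
  vigorous_lower = 75,
  moderate_lower = 150,  # also the upper bound for vigorous activity
  moderate_upper = 300
)

# Convert to minutes per day
daily_exact <- weekly_minutes / 7
round(daily_exact, 2)

# Whole-minute thresholds used in the classification
daily_rounded <- round(daily_exact)
daily_rounded
```

Rounding gives 11, 21, and 43 minutes per day, the cut-offs that appear in the case_when() conditions.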

categories_intensities$lifestyle <-
  factor(categories_intensities$lifestyle,
         levels = c(
           "Highly active",
           "Active",
           "Sedentary-low active"))

intensities_distribution <- categories_intensities %>%
  count(lifestyle, name = "users") %>%
  mutate(
    users_percent = percent(users/sum(users)),
    distribution_type = "intensity guidelines")

recommendations_distribution <-
  rbind(steps_distribution, intensities_distribution)

# Create labels for facet wrap,
# use back ticks when numbers or special characters
labl_guideline <- as_labeller(
  c("steps guidelines" = "Based on daily steps guidelines",
    "intensity guidelines"= "Based on intensity/duration guidelines"))

ggplot(recommendations_distribution,
       aes(x = "", y = users, fill = lifestyle)) +
  geom_bar(stat = "identity", colour="white", width = 1) +
  coord_polar("y", start = 0) +
  geom_text(
    aes(label = users_percent),position = position_stack(vjust = 0.5)) +
  labs(title = "Figure 6.2.1 Participants' Lifestyles Based on Official Guidelines") +
  scale_fill_discrete(name = "Lifestyle") +
  facet_wrap( ~ distribution_type, labeller = labl_guideline) +
  theme_bw() +
  theme(
    plot.title = element_text(size = 11),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    panel.grid  = element_blank()
  )

Figure 6.2.1 shows that when the participants were grouped according to the benefits they could achieve by following the daily steps recommendations specified by Saint-Maurice et al. (2020), most of them, 53%, would fall below the threshold of 8,000 steps daily. However, when the users were classified according to the guidelines given by the HHS, which makes recommendations based on the duration and intensity of physical activities instead of daily steps, most of the participants were within the recommended target.

In addition, under the HHS criteria the share of participants considered to be highly active was much larger: 31.2% (left) compared to 16% (right). These results imply that, if well-informed, most of the users would be more likely to follow the guidelines that are based on duration and intensity levels rather than just daily steps, especially when the time dedicated to exercise is constrained by other factors such as work, household duties, or family care responsibilities.

6.3 User Categorization Based on Exercise Preferences

With the objective of categorizing users in a way that is more in alignment with real-world scenarios, we also classified the participants according to their preference for different types of exercise. These kinds of criteria enabled us to define a classification of users that could be useful for marketing purposes and corporate strategies.

In that sense, individuals who on average spent more than 20 minutes per day performing vigorous physical activity were classified as “runner-aerobics”, while those who spent fewer minutes at that intensity level but averaged at least 8,000 steps per day were classified as “walkers”. The rest were grouped under the “little exercise” classification.

# Create categories based on type of exercise
categories_preference <- avg_and_features %>%
  mutate (exercise_type = case_when (
    very_active_minutes_avg > 20 ~ "Runner-aerobics",
    very_active_minutes_avg <= 20 &
      daily_steps_avg >= 8000  ~ "Walker",
    TRUE ~ "Little exercise"))

categories_preference$exercise_type <-
  factor(
    categories_preference$exercise_type,
    levels = c("Runner-aerobics", "Walker", "Little exercise")
  )

preference_distribution <-
  categories_preference %>%
  count(exercise_type, name = "user_preference") %>%
  mutate(users_percent = percent(user_preference / sum(user_preference)))

ggplot(preference_distribution, 
       aes(x="", y=user_preference, fill=exercise_type)) +
  geom_bar(stat="identity", colour="white", width=1) +
  coord_polar("y", start=0)+
  theme_void()+
  geom_text(aes(label = users_percent),
            position = position_stack(vjust = 0.5))+
  labs(title="Figure 6.3.1 Exercise Preferences")+
  theme(plot.title = element_text(size=11))+
  scale_fill_discrete(name = "Lifestyle")

Although the percentage distribution of the runner-aerobics and walker groups was very similar to that of the highly active and active groups from Figure 6.2.1 (left), there were differences in the way these classifications used the additional device features (see Figures 6.4.1 and 6.5.1).

6.4 Additional Features Usage According to Exercise Preferences

# Convert to long for bar chart
feature_preference <- categories_preference %>%
  gather(tracking_features, days, sleep_monitoring:distance_log) %>%
  filter(days != 0) %>% arrange(id)

# Arrange order to match previous plot
feature_preference$tracking_features <-
  factor(
    feature_preference$tracking_features,
    levels = c(
      "sleep_monitoring",
      "heart_rate_monitoring",
      "weight_log",
      "distance_log"
    )
  )

positions <- c("Runner-aerobics", "Walker", "Little exercise")

ggplot(data = feature_preference) +
  geom_bar(
    mapping = aes(x = exercise_type, fill = tracking_features),
    position = position_dodge(width = 0.6),
    width = 0.50,
    colour = "black"
  ) +
  ggtitle("Figure 6.4.1 Additional Features Usage According to Exercise Preferences") +
  ylab("Number of users") +
  theme(
    plot.title = element_text(size = 12),
    axis.title.x = element_blank()) +
  scale_x_discrete(limits = positions) +
  scale_y_continuous(breaks = seq(0, 20, by = 1), limits = c(0, 17)) +
  scale_fill_discrete(
    name = "Additional features",
    labels = c(
      'Sleep monitoring',
      'Heart rate monitoring',
      'Weight log',
      'Distance log'
    )
  )

Sleep monitoring was the most used additional feature regardless of exercise preference, and it was considerably more popular among the participants who exercised less.

The heart rate monitoring feature ranked second in the runner-aerobics group, while in the walker and little exercise categories it was tied with the weight log feature in second place based on the number of participants who used these functionalities. The least used feature in all the groups was the distance log, which was used by only three participants in the runner-aerobics classification and only two in the little exercise group.

6.5 Additional Features Usage According to Lifestyle

# Convert to long for bar chart
feature_intensities <- categories_intensities %>%
  gather(tracking_features, days, sleep_monitoring:distance_log) %>%
  filter(days != 0) %>% arrange(id)

# Arrange order to match previous plot
feature_intensities$tracking_features <-
  factor(
    feature_intensities$tracking_features,
    levels = c(
      "sleep_monitoring",
      "heart_rate_monitoring",
      "weight_log",
      "distance_log"
    )
  )

positions <- c("Highly active", "Active", "Sedentary-low active")

ggplot(data = feature_intensities) +
  geom_bar(
    mapping = aes(x = lifestyle, fill = tracking_features),
    position = position_dodge(width = 0.6),
    width = 0.50,
    colour = "black"
  ) +
  ggtitle("Figure 6.5.1 Additional Features Usage According to Lifestyle") +
  ylab("Number of users") +
  theme(
    plot.title = element_text(size = 12),
    axis.title.x = element_blank()) +
  scale_x_discrete(limits = positions) +
  scale_y_continuous(breaks = seq(0, 20, by = 1), limits = c(0, 17)) +
  scale_fill_discrete(
    name = "Additional features",
    labels = c(
      'Sleep monitoring',
      'Heart rate monitoring',
      'Weight log',
      'Distance log'
    )
  )

Figures 6.5.1 and 6.4.1 are similar in nature, but they differ in that their user classifications are based on different criteria. The runner-aerobics category of Figure 6.4.1 included only the participants who, on average, had more than 20 minutes of vigorous physical activity per day, while the highly active classification of Figure 6.5.1 included participants who had either more than 21 very active minutes per day or more than 43 fairly active minutes per day, as per previous specifications.

6.6 Days of Usage of Additional Features

# Calculate the average number of days for each additional feature
usage_days_per_user <-
  feature_intensities %>%
  group_by(lifestyle, tracking_features) %>%
  summarize(days_per_user = round(mean(days), 1))

positions <- c("Highly active", "Active", "Sedentary-low active")

ggplot(data = usage_days_per_user,
  aes(
    x = lifestyle,
    y = days_per_user,
    fill = tracking_features,
    label = days_per_user))+
  geom_bar(stat = "identity") +
  ggtitle("Figure 6.6.1 Days of Usage of Additional Features") +
  geom_text(size = 2.5, position = position_stack(vjust = 0.5)) +
  ylab("Average number of days per user") +
  theme(
    plot.title = element_text(size = 11),
    axis.title.x = element_blank())+
  scale_fill_discrete(
    name = "Additional features",
    labels = c(
      'Sleep monitoring',
      'Heart rate monitoring',
      'Weight log',
      'Distance log'))
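Because the rows with zero days of usage were filtered out before this step, the averages in Figure 6.6.1 describe only the participants who used each feature at least once. The following sketch on made-up data (usage_toy and its values are assumptions) shows how that choice changes the result compared with averaging over all participants.

```r
# Made-up days of weight log usage per participant
usage_toy <- data.frame(
  id = c("A", "B", "C", "D"),
  lifestyle = c("Active", "Active", "Active", "Highly active"),
  weight_log = c(0, 2, 30, 10)
)

# Average among participants who used the feature at least once
used_toy <- usage_toy[usage_toy$weight_log != 0, ]
avg_among_users <- aggregate(weight_log ~ lifestyle, data = used_toy, FUN = mean)

# Average over all participants, including non-users
avg_all <- aggregate(weight_log ~ lifestyle, data = usage_toy, FUN = mean)

avg_among_users
avg_all
```

In the toy active group, excluding the non-user raises the average from about 10.7 to 16 days, which is why the figure should be read as frequency of use among users rather than across the whole group.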

Figure 6.6.1 sheds more light on the usage of the additional features by illustrating how often such functionalities were used. The average number of days per user for each additional feature is an indicator of how popular these features were among those participants who used them; however, we should be more flexible when evaluating the usage of the weight log because, under normal circumstances, such a feature is not expected to be used with the same frequency as the other ones. Nevertheless, Figure 6.6.1 shows that the highly active group used all the additional features regularly, while the active and sedentary-low active groups only used the sleep and heart rate monitoring features in a consistent manner.

It is important to highlight that, even though the weight and distance log were used by very few participants (see Figure 6.5.1), Figure 6.6.1 suggests that those participants who were highly active and had the opportunity to interact with such features used them fairly frequently.

6.7 Sleep Efficiency and Vigorous Physical Activity

# convert to long to use facet_wrap on activity type
daily_activity_sleep_users_long <-
  daily_activity_sleep_users %>% select(
    -c(
      total_steps,
      lightly_active_minutes,
      sedentary_minutes,
      minutes_asleep,
      time_in_bed
    )
  ) %>% pivot_longer(-c(id, date, sleep_efficiency),
                     names_to = 'activity_type',
                     values_to = 'activity_minutes')

activity_sleep <-
  merge(x = daily_activity_sleep_users_long,
        y = categories_intensities[, c(1, 10)],
        by = "id",
        all.x = TRUE)

# Create labels for facet wrap
labl_activity_type <- as_labeller(
  c("very_active_minutes" = "Very active minutes",
    "fairly_active_minutes"= "Fairly active minutes"))
  

ggplot(data = activity_sleep) +
  geom_smooth(
    mapping = aes(x = activity_minutes, y = sleep_efficiency))+
  geom_point(
    mapping = aes(x = activity_minutes, y = sleep_efficiency))+
  ggtitle("Figure 6.7.1 Sleep Efficiency and Physical Activity")+
  labs(x = "Activity minutes") +
  scale_y_continuous(name = "Sleep efficiency",
                     breaks = seq(0, 1, by = .1),
                     limits = c(0, 1)) +
  facet_wrap( ~ activity_type, labeller = labl_activity_type)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Figure 6.7.1 suggests that vigorous physical activity (very active minutes) is associated with good sleep: none of the participants who exceeded the 50-minute mark of vigorous activity on a particular day experienced a sleep efficiency below 0.85. That was not the case, however, with fairly active minutes, which suggests that a high intensity level and its duration are the key indicators that should be encouraged among users with sleep disorders.

Therefore, although Figure 6.2.1 indicates that the total number of steps is the more appropriate guideline for better overall health, the activity intensity level plays the dominant role when it comes to addressing sleep disorders.
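The check behind the 50-minute observation can be sketched as follows. This is a minimal base R illustration with made-up records; it assumes that sleep efficiency is the ratio of minutes asleep to time in bed, and the data frame `daily` and its values are hypothetical.

```r
# Hypothetical daily records (illustrative values, not the study's data)
daily <- data.frame(
  id = c(1, 1, 2, 2, 3),
  very_active_minutes = c(10, 60, 55, 0, 30),
  minutes_asleep = c(300, 420, 400, 350, 380),
  time_in_bed = c(400, 450, 430, 450, 400)
)

# Assumption: sleep efficiency = minutes asleep / time in bed
daily$sleep_efficiency <- daily$minutes_asleep / daily$time_in_bed

# Days with more than 50 minutes of vigorous activity
above_50 <- daily$very_active_minutes > 50

# Lowest sleep efficiency observed on those days
min_efficiency_above_50 <- min(daily$sleep_efficiency[above_50])

# TRUE if any of those days fell below the 0.85 threshold
any_below_threshold <- any(daily$sleep_efficiency[above_50] < 0.85)
```

A result of `FALSE` for `any_below_threshold` describes what was observed in the sample; it does not by itself establish that vigorous activity causes better sleep.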

6.8 Hourly Activity Trend Based on the Number of Steps

# Set hourly intervals
total_hourly_steps <-
  minute_steps_merged_clean %>%
  mutate(time = floor_date(activity_minute, unit = "hour")) %>%
  group_by(id, time) %>% summarize(hourly_steps = sum(steps))

# Calculate hourly steps average
hourly_steps_trend <- total_hourly_steps %>%
  mutate(hour = tolower(format(time, "%I %p"))) %>% group_by(hour) %>%
  summarize(avg_hourly_steps = mean(hourly_steps))

hourly_steps_trend$hour <- factor(hourly_steps_trend$hour, levels = c(
  "01 am","02 am","03 am","04 am","05 am","06 am","07 am","08 am",
  "09 am","10 am","11 am","12 pm","01 pm","02 pm","03 pm","04 pm",
  "05 pm","06 pm","07 pm","08 pm","09 pm","10 pm","11 pm", "12 am"))

hourly_steps_trend <- hourly_steps_trend %>% arrange(hour)

# Flag the local maxima and minima (turning points) of the hourly steps
# average, then keep the ten turning points with the highest averages
peaks_n_valleys <- hourly_steps_trend %>%
  mutate(
    slope_dir = avg_hourly_steps - lag(avg_hourly_steps),
    inflection_point = case_when(
      (slope_dir >= 0 & lead(slope_dir) < 0) |
        (slope_dir <= 0 & lead(slope_dir) > 0) ~ TRUE,
      TRUE ~ FALSE)) %>%
  filter(inflection_point) %>%
  arrange(desc(avg_hourly_steps)) %>% slice_head(n = 10)

hourly_steps_trend$peak_label <-
  ifelse(
    hourly_steps_trend$hour %in% peaks_n_valleys$hour,
    as.character(hourly_steps_trend$hour),
    NA
  )

spline.hourly <- as.data.frame(spline(hourly_steps_trend$hour, hourly_steps_trend$avg_hourly_steps))

ggplot(data = hourly_steps_trend,
       aes(x = hour, y = avg_hourly_steps, group = 1)) +
  labs(x = "Hour", y = "Hourly steps average") +
  geom_point() +
  ggtitle("Figure 6.8.1 Hourly Activity Trend Based on Steps")+
  geom_line(data=spline.hourly, aes(x = x, y = y), color = "blue") +
  theme(axis.text.x = element_text(
    angle = 70,
    hjust = 0.95,
    size = 8
  )) +
  geom_label_repel(
    aes(x = hour, y = avg_hourly_steps, label = peak_label),
    size = 3,
    fill = "darkblue",
    colour="white",
    point.size = 0.1,
    hjust = 0.7,
    label.size = .1
  )

The peaks and valleys shown in Figure 6.8.1 not only serve to illustrate the activity patterns of most participants in a general sense, but they also help to identify the best times of the day to promote or advertise activity trackers among potential clients.

In the recommendations section we explain in further detail why the valleys in the data (except during sleeping hours) and the transition period from active to sedentary are the best times to run an advertisement campaign.
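The way peaks and valleys are detected can be illustrated on a small vector. The sketch below uses base R and strict inequalities on hypothetical hourly averages; it is a simplified illustration of the idea of comparing consecutive slopes, not the exact code used for Figure 6.8.1.

```r
# Hypothetical hourly step averages (illustrative values only)
avg_steps <- c(100, 300, 500, 400, 350, 600, 200)

# Change relative to the previous hour, and change leading into the next hour
slope <- c(NA, diff(avg_steps))
next_slope <- c(slope[-1], NA)

# Both slopes must be known for a point to qualify
both_known <- !is.na(slope) & !is.na(next_slope)

# Peak: rising into the point, falling after it
is_peak <- both_known & slope > 0 & next_slope < 0

# Valley: falling into the point, rising after it
is_valley <- both_known & slope < 0 & next_slope > 0

peak_positions <- which(is_peak)
valley_positions <- which(is_valley)
```

The first and last points are never flagged because one of their neighbouring slopes is undefined.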

6.9 Resting Heart Rate Based on Lifestyles

To develop an effective marketing strategy for Bellabeat’s products, it is important to understand how the capabilities of tracking devices can be utilized to provide practical guidance or encouragement to attain tangible health benefits. For this reason, we decided to analyze the heart rate data that was available in the Fitbit files, especially because such a variable is both objective and easy to monitor, and hence it could enhance user experience in the context of achieving health goals.

According to Harvard Health Publishing, a normal resting heart rate usually lies between 60 and 90 beats per minute, and numbers above 90 are considered high (see LeWine 2011). In fact, Aune et al. (2017) showed that higher resting heart rates are associated with an increased risk of cancer, coronary heart disease, heart failure, sudden cardiac death, stroke, and cardiovascular disease.

It must be noted that at the lower end of the spectrum, resting heart rates that are below 60 beats per minute are considered common among athletes and desirable in general terms unless they are caused by a specific underlying condition such as bradycardia. For instance, guidelines from the American Heart Association state that the resting heart rate of active people could be as low as 40 beats per minute (Heart Association n.d.). The same organization indicates that these lower resting heart rates are usually indicators of an efficient heart function and better cardiovascular fitness.

Accordingly, we proceeded to analyze the heart rate data by filtering out the values that corresponded to periods of moderate or high physical activity. In that context, by calculating the average resting heart rate for different lifestyles, we were in a better position to derive meaningful results from such an important variable.
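The filtering step described above can be sketched as follows. This is a minimal base R illustration with made-up minute-level records; it assumes an intensity code in which 0 is sedentary, 1 is light, 2 is moderate, and 3 is very active, and the data frame `minute_hr` is hypothetical. The preprocessing actually applied earlier in this report may differ in its details.

```r
# Hypothetical minute-level records (illustrative values only)
# Assumed intensity codes: 0 = sedentary, 1 = light, 2 = moderate, 3 = very active
minute_hr <- data.frame(
  id = c(1, 1, 1, 1, 2, 2, 2),
  heart_rate = c(62, 64, 120, 66, 75, 140, 77),
  intensity = c(0, 0, 3, 1, 0, 2, 0)
)

# Keep only the minutes without moderate or high physical activity
resting <- minute_hr[minute_hr$intensity <= 1, ]

# Average the remaining heart rate values per user
avg_resting <- aggregate(heart_rate ~ id, data = resting, FUN = mean)
print(avg_resting)
```

Dropping the high-intensity minutes keeps exercise-related spikes (120 and 140 in this toy data) from inflating the per-user average.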

Resting heart rate

# Merge resting heart rate and users' categories data
resting_heartrate_dist <- merge(
    x = daily_resting_heartrate,
    y = categories_intensities[, c("id", "lifestyle")],
    by = "id")

# Calculate resting heart rate average for different lifestyles
resting_heartrate_dist %>% group_by(lifestyle) %>%
  summarise(very_active_mean = mean(avg_resting_heartrate))
## # A tibble: 3 × 2
##   lifestyle            very_active_mean
##   <fct>                           <dbl>
## 1 Highly active                    68.9
## 2 Active                           71.4
## 3 Sedentary-low active             76.0
# Overlay a normal curve on the resting heart rate histogram; the density is
# scaled by the bin width (2.5) times the number of observations so that it
# is expressed in counts, like the histogram
resting_heartrate_dist[, 3:4] %>%
  group_by(lifestyle) %>%
  nest(data = c(avg_resting_heartrate)) %>%
  mutate(y = map(data, ~ dnorm(
      .$avg_resting_heartrate,
      mean = mean(.$avg_resting_heartrate),
      sd = sd(.$avg_resting_heartrate)
    ) * 2.5 * sum(!is.na(.$avg_resting_heartrate)))) %>%
  unnest(c(data, y)) %>% drop_na(lifestyle) %>% 
  ggplot(aes(x = avg_resting_heartrate)) +
  geom_histogram(
    data = resting_heartrate_dist,
    binwidth = 2.5,
    colour = "black",
    fill = "blue"
  ) +
  geom_line(aes(y = y), color = "black") +
  facet_wrap( ~ lifestyle) +
  geom_vline(
    xintercept = 90,
    linetype = "dashed",
    size = .75,
    colour = "red"
  ) +
  geom_vline(
    xintercept = 60,
    linetype = "dashed",
    size = .75,
    colour = "darkgreen"
  ) +
  labs(x = "Resting heart rate (beats per minute)", y = "Frequency")+
  ggtitle("Figure 6.9.1 Resting Heart Rate According to Lifestyles")

The normal curves shown in Figure 6.9.1, in which the green and red lines mark the lower and upper ends of what is considered a normal resting heart rate, illustrate two major findings. First, even though previous calculations showed that the heart rate monitoring feature was used by the same number of participants in the highly active and sedentary-low active groups (see Figure 6.5.1), the most active participants used this feature much more frequently. Second, and most importantly, Figure 6.9.1 confirms that active people tend to have lower resting heart rates than sedentary individuals: highly active participants had an average resting heart rate of 69 beats per minute, whereas the sedentary-low active group had an average of 76 beats per minute. These results are consistent with the findings of the scientific community and, in turn, support the utility of personal tracking devices for monitoring heart rate in health-related matters.

Considering that our analysis is in agreement with the findings of different organizations on this subject, we may conclude that tracking devices can be promoted as a pragmatic source of encouragement for users, since resting heart rate is easy to track and could be used to present clear and tangible results arising from adopting an active lifestyle.
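For readers who wish to reproduce a normal curve over a count histogram such as the one in Figure 6.9.1, the usual scaling is density times bin width times number of observations. The sketch below illustrates that relationship in base R; the helper `expected_counts` and all parameter values are hypothetical and are not the study's estimates.

```r
# Expected histogram counts under a normal model:
# density at the bin center x bin width x number of observations
expected_counts <- function(centers, mu, sigma, binwidth, n) {
  dnorm(centers, mean = mu, sd = sigma) * binwidth * n
}

# Hypothetical parameters (illustrative values only)
binwidth <- 2.5
centers <- seq(40, 100, by = binwidth)
counts <- expected_counts(centers, mu = 70, sigma = 8,
                          binwidth = binwidth, n = 200)

# Summed over all bins, the scaled curve accounts for
# approximately all n observations
total_expected <- sum(counts)
```

If the multiplier does not match the histogram's bin width, the curve will sit systematically above or below the bars.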

7 Recommendations

7.1 Summary of Recommendations

  • Develop two sets of advertisements to target different audiences’ lifestyles based on the device features they are more likely to use according to the insights presented in Section 7.2.

  • Run the promotional campaign during the periods of 1) 12:00 p.m. to 1:00 p.m., 2) 2:00 p.m. to 3:00 p.m., and 3) 7:00 p.m. to 10:00 p.m.

  • Spark the interest of the audience by promoting engaging, yet overlooked, facts about how a low resting heart rate and a high sleep efficiency attenuate the detrimental effects of stress and disease, and how these two indicators are correlated with the level of physical activity, which is a fact supported by the findings of this analysis.

  • Make the weight log feature more appealing to potential customers by featuring Bellabeat app’s compatibility with smart scales and by partnering with manufacturers to offer promotional prices for bundles that include both a tracking device and a smart scale.

The criteria behind the above recommendations are explained in detail in the following sections.

7.2 Customized Advertisement According to Audience’s Lifestyle

Since it was shown that participants with different lifestyles and exercise preferences used the additional device features in a different manner, it is recommended to customize the advertisements by prioritizing different features according to the lifestyle of the audience. The most practical way to accomplish this is by running various kinds of advertisements on the internet depending on the keywords typed by the users in websites such as Google, YouTube, Bing, and social media in general.

To that effect, we recommend running two kinds of advertisements, one for the highly active and moderately active audience, and another type of advertisement for the sedentary-low active audience. According to our findings on the usage of the additional device features by different lifestyles, the advertisements should be developed based on the following criteria.

Advertisement A: It would target the highly active and moderately active audiences, and it would promote all the additional features, that is, heart rate monitoring, sleep monitoring, weight log, and distance log; however, a greater emphasis should be placed on the first two. And in the case of the weight log feature, it should be promoted following the recommendations given in Section 7.5. This advertisement would be displayed for the users who have performed online searches with keywords such as running, climbing, running shoes, energy drinks, marathon, electrolyte drinks, hydration drink, soccer, basketball, and sports in general, among other words suggested by the marketing team.

Advertisement B: It would target the sedentary-low active audience, and it would place a greater emphasis on promoting the sleep monitoring and heart rate monitoring features, but it would prioritize the former. This advertisement would be displayed for the users who have performed online searches with keywords such as sleep monitoring, sleep disorders, stress, anxiety, insomnia, cardiovascular health, among other words suggested by the marketing team.

7.3 Best Times to Run Promotional Campaign

By analyzing trends in the hourly activity of the participants, as illustrated by Figure 6.8.1, we were able to identify the best times of the day to run the advertisement campaign.

Since people are less likely to pay attention to the internet or social media during periods of high physical activity, the advertisements should not be run during the peaks of the hourly activity data. Likewise, although people would be a little more likely to browse the internet or check social media while transitioning from their sedentary hours to their active routines, such periods would still not be optimal for advertisement purposes because the gradually increasing level of physical activity prevents individuals from using the internet to some extent.

Consequently, the best times for advertisement or promotion through the internet or social media are either the most sedentary periods or the periods when people are finishing their exercise routine, which are illustrated by the valleys and the negative slopes in Figure 6.8.1, respectively.

Therefore, the best times to run the promotional campaign are:

  • 12:00 p.m. - 1:00 p.m.
  • 2:00 p.m. - 3:00 p.m.
  • 7:00 p.m. - 10:00 p.m.
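The selection rule described in this section can be sketched as a simple filter: keep the waking hours in which the hourly steps average fell relative to the previous hour, which covers both the negative slopes and the valleys they lead into. The values below are hypothetical, and the 8 a.m. cut-off for waking hours is an assumption made only for this illustration.

```r
# Hypothetical hourly step averages for hours 6 to 13 (illustrative values only)
hours <- 6:13
avg_steps <- c(50, 30, 300, 500, 450, 600, 400, 480)

# Change relative to the previous hour (undefined for the first hour)
slope <- c(NA, diff(avg_steps))

# Assumption: hours before 8 a.m. are treated as sleeping hours and excluded
waking <- hours >= 8

# Candidate hours: waking hours in which activity declined
candidate_hours <- hours[waking & !is.na(slope) & slope < 0]
```

In practice, the resulting candidates would still need to be reviewed against Figure 6.8.1 and grouped into continuous time windows before scheduling a campaign.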

7.4 Findings and Observations that Make Tracking Devices More Appealing

Our analysis also provided some key insights that can be used to strengthen the device’s image and credibility among the potential users. These insights are useful in guiding a marketing strategy that is reinforced by promoting interesting and, at the same time, straightforward capabilities of the tracking devices.

In that sense, and considering that our findings showed that resting heart rate was lower among the active participants, the marketing strategy could use such a fact to educate the public about the importance of measuring resting heart rate and the practicality of tracking devices for this purpose. This could be a highly motivating factor for users because a lower resting heart rate is an indicator of improved health even when there are no visible changes in weight or muscular appearance.

Another observation that could make a marketing strategy more appealing is that long sessions of vigorous exercise were associated with good sleep efficiency. Our findings showed that none of the participants who had more than 50 minutes of vigorous exercise in one day experienced a sleep efficiency below 0.85. These preliminary results suggest that the activity and sleep monitoring features can work in conjunction to provide guidance or tips to individuals who are developing unhealthy sleep patterns or are already struggling with stress-related sleep disorders.

7.5 Improving the Appeal of the Weight Log Feature

Despite the fact that “weight loss” is a widely searched term on the web, only 12 participants used the weight log feature. This could have been due to several factors, such as having a pool of participants in which most users were not interested in weight monitoring because they were already at their ideal weight, or simply because of the inconvenience of having to enter this information manually.

Nevertheless, considering that women are a segment of the population that is particularly concerned with weight and that they make up the totality of Bellabeat’s clients, it would be appropriate to provide recommendations on this matter. In that regard, we concluded that in order to make the weight log feature more appealing to potential customers, the marketing strategy should take into consideration the following suggestions:

  • Highlight Bellabeat app’s compatibility with smart scales

  • Partner with a manufacturer of smart scales to offer promotional prices for bundles that include both an activity tracking device and a smart scale.

  • Promote the user-friendly functionalities of the weight log

  • Educate the potential customers on how weight is correlated with both stress and overall health by featuring quick tips using the tracking device.

8 Conclusion

By shedding light not only on how the participants were using their tracking devices but also on how these devices proved to be useful both as indicators of overall health and as a source of motivation, the present analysis allowed us to provide practical insights that can be used to develop a marketing strategy that is effective and appealing to potential Bellabeat users.

The recommendations presented in the previous section can be implemented without major budget implications, and they lay out a clear and pragmatic approach to positively influence the audience’s perception of activity trackers. In light of this, we are confident that if such recommendations are implemented properly, they will become a valuable tool in the development of a more robust marketing strategy for Bellabeat devices.

References

Aune, D., A. Sen, B. ó’Hartaigh, I. Janszky, P. R. Romundstad, S. Tonstad, and L. J. Vatten. 2017. “Resting Heart Rate and the Risk of Cardiovascular Disease, Total Cancer, and All-Cause Mortality – a Systematic Review and Dose–Response Meta-Analysis of Prospective Studies.” Nutrition, Metabolism and Cardiovascular Diseases 27 (6): 504–17. https://doi.org/10.1016/j.numecd.2017.04.004.
Firke, Sam. 2021. Janitor: Simple Tools for Examining and Cleaning Dirty Data. https://github.com/sfirke/janitor.
Fitbit MyHelp. 2016. “Fitbit Help.” Fitbit LLC. April 28, 2016. https://help.fitbit.com/articles/en_US/Help_article/1141.htm.
Furberg, Robert, Julia Brinton, Michael Keating, and Alexa Ortiz. 2016. “Crowd-Sourced Fitbit Datasets 03.12.2016-05.12.2016.” Zenodo. https://doi.org/10.5281/ZENODO.53894.
Grolemund, Garrett, and Hadley Wickham. 2011. “Dates and Times Made Easy with lubridate.” Journal of Statistical Software 40 (3): 1–25. https://www.jstatsoft.org/v40/i03/.
Heart Association. n.d. “Target Heart Rates Chart. Www.heart.org.” Accessed October 28, 2022. https://www.heart.org/en/healthy-living/fitness/fitness-basics/target-heart-rates.
Henry, Lionel, and Hadley Wickham. 2020. Purrr: Functional Programming Tools. https://CRAN.R-project.org/package=purrr.
LeWine, Howard E. 2011. “Increase in Resting Heart Rate Is a Signal Worth Watching. Harvard Health.” December 21, 2011. https://www.health.harvard.edu/blog/increase-in-resting-heart-rate-is-a-signal-worth-watching-201112214013.
Mortensen, Karoline, and Taylor L. Hughes. 2018. “Comparing Amazon’s Mechanical Turk Platform to Conventional Data Collection Methods in the Health and Medical Research Literature.” Journal of General Internal Medicine 33 (4): 533–38. https://doi.org/10.1007/s11606-017-4246-0.
Müller, Kirill, and Hadley Wickham. 2022. Tibble: Simple Data Frames. https://CRAN.R-project.org/package=tibble.
R Core Team. 2022. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Ranjan, Chitta, and Devleena Banerjee. 2022. Nlcor: Compute Nonlinear Correlations.
Rawat, Vikram Singh. 2022. Best Coding Practices for R. https://bookdown.org/content/d1e53ac9-28ce-472f-bc2c-f499f18264a3/names.html.
Saint-Maurice, Pedro F., Richard P. Troiano, David R. Bassett, Barry I. Graubard, Susan A. Carlson, Eric J. Shiroma, Janet E. Fulton, and Charles E. Matthews. 2020. “Association of Daily Step Count and Step Intensity with Mortality Among US Adults.” JAMA 323 (12): 1151. https://doi.org/10.1001/jama.2020.1382.
Slowikowski, Kamil. 2021. Ggrepel: Automatically Position Non-Overlapping Text Labels with Ggplot2. https://github.com/slowkow/ggrepel.
Spinu, Vitalie, Garrett Grolemund, and Hadley Wickham. 2021. Lubridate: Make Dealing with Dates a Little Easier. https://CRAN.R-project.org/package=lubridate.
“What Are Active Zone Minutes or Active Minutes on My Fitbit Device?” n.d. Accessed October 11, 2022. https://help.fitbit.com/articles/en_US/Help_article/1379.htm.
Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.
———. 2019. Stringr: Simple, Consistent Wrappers for Common String Operations. https://CRAN.R-project.org/package=stringr.
———. 2021. Forcats: Tools for Working with Categorical Variables (Factors). https://CRAN.R-project.org/package=forcats.
———. 2022. Tidyverse: Easily Install and Load the Tidyverse. https://CRAN.R-project.org/package=tidyverse.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wickham, Hadley, Jennifer Bryan, and Malcolm Barrett. 2022. Usethis: Automate Package and Project Setup. https://CRAN.R-project.org/package=usethis.
Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, Hiroaki Yutani, and Dewey Dunnington. 2022. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2022. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.
Wickham, Hadley, and Maximilian Girlich. 2022. Tidyr: Tidy Messy Data. https://CRAN.R-project.org/package=tidyr.
Wickham, Hadley, Jim Hester, and Jennifer Bryan. 2022. Readr: Read Rectangular Text Data. https://CRAN.R-project.org/package=readr.
Wickham, Hadley, Jim Hester, Winston Chang, and Jennifer Bryan. 2022. Devtools: Tools to Make Developing R Packages Easier. https://CRAN.R-project.org/package=devtools.
Wickham, Hadley, and Dana Seidel. 2022. Scales: Scale Functions for Visualization. https://CRAN.R-project.org/package=scales.
Xie, Yihui. 2014. “Knitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Chapman; Hall/CRC. http://www.crcpress.com/product/isbn/9781466561595.
———. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman; Hall/CRC. https://yihui.org/knitr/.
———. 2022. Knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.
Zhu, Hao. 2021. kableExtra: Construct Complex Table with Kable and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.