How Can a Wellness Technology Company Play It Smart?

A Case Study by Arturo Cardiel

1 Introduction/Business Task

Bellabeat is a technology company that specializes in creating health-focused tech products specifically designed for women. Among its products are the Bellabeat app, the Leaf wearable wellness tracker, the Time wellness watch, the Spring smart water bottle, and the Bellabeat membership. For the purpose of this study, the focus will be on the Leaf product, which tracks activity, sleep, and stress levels.

The dataset used for this study is sourced from Kaggle and contains data on the FitBit Fitness Tracker, a similar wellness tracker product to Bellabeat’s Leaf. It is important to note that this dataset is listed under public domain, which means that it is free to use for research purposes.

The goal of this study is to explore the data to identify trends in smart device usage and provide insights on how these trends may be relevant to Bellabeat customers. By analyzing these trends, it is hoped that this study can inform Bellabeat’s marketing strategy, leading to more effective targeting of potential customers and a better understanding of their needs and preferences.


2 Preparing Data

2.1 Setting up the Environment

Installing necessary packages and loading them.

#Loading packages
library(tidyverse)
library(janitor)
library(lubridate)
library(ggplot2)
2.2 Importing Data

Load the data

filepaths <- list.files(path = "~/R/case_studies/fitabase_data", pattern ="*.csv", full.names=TRUE)

filenames <- list.files(path = "~/R/case_studies/fitabase_data", pattern ="*.csv") %>% 
  str_sub(1, -12) 

filecontents <- filepaths %>%
  # Using the path supplied by each element in filecontents, read in csvs
  map(read_csv) %>% 
   # Rename each element using names from filenames vector
  set_names(filenames)
# Add each list item from filecontents into the Global environment 
list2env(filecontents, envir = .GlobalEnv)
## <environment: R_GlobalEnv>

Running a for loop to identify all the unique ids to tell us which datasets are missing entries

for (i in names(filecontents)) {
  df <- filecontents[[i]]
  unique_ids <- n_distinct(df$Id)
  print(paste(i, "has", unique_ids, "unique IDs"))
}
## [1] "dailyActivity has 33 unique IDs"
## [1] "dailyCalories has 33 unique IDs"
## [1] "dailyIntensities has 33 unique IDs"
## [1] "dailySteps has 33 unique IDs"
## [1] "heartrate_seconds has 14 unique IDs"
## [1] "hourlyCalories has 33 unique IDs"
## [1] "hourlyIntensities has 33 unique IDs"
## [1] "hourlySteps has 33 unique IDs"
## [1] "minuteCaloriesNarrow has 33 unique IDs"
## [1] "minuteCaloriesWide has 33 unique IDs"
## [1] "minuteIntensitiesNarrow has 33 unique IDs"
## [1] "minuteIntensitiesWide has 33 unique IDs"
## [1] "minuteMETsNarrow has 33 unique IDs"
## [1] "minuteSleep has 24 unique IDs"
## [1] "minuteStepsNarrow has 33 unique IDs"
## [1] "minuteStepsWide has 33 unique IDs"
## [1] "sleepDay has 24 unique IDs"
## [1] "weightLogInfo has 8 unique IDs"

This output shows that a few data frames are missing some unique ids because there should be 33 unique ids since the data is from 33 Fitbit users. Given this information we will be focusing on daily activity, hourly calories, hourly steps, and sleep data. The sleep data only had 24 unique ids but since Bellabeat has products that also track sleep its important we at least explore the data.

We will now take a peak of the data frames

head(dailyActivity)
## # A tibble: 6 × 15
##       Id Activ…¹ Total…² Total…³ Track…⁴ Logge…⁵ VeryA…⁶ Moder…⁷ Light…⁸ Seden…⁹
##    <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 1.50e9 4/12/2…   13162    8.5     8.5        0    1.88   0.550    6.06       0
## 2 1.50e9 4/13/2…   10735    6.97    6.97       0    1.57   0.690    4.71       0
## 3 1.50e9 4/14/2…   10460    6.74    6.74       0    2.44   0.400    3.91       0
## 4 1.50e9 4/15/2…    9762    6.28    6.28       0    2.14   1.26     2.83       0
## 5 1.50e9 4/16/2…   12669    8.16    8.16       0    2.71   0.410    5.04       0
## 6 1.50e9 4/17/2…    9705    6.48    6.48       0    3.19   0.780    2.51       0
## # … with 5 more variables: VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## #   LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>, and
## #   abbreviated variable names ¹​ActivityDate, ²​TotalSteps, ³​TotalDistance,
## #   ⁴​TrackerDistance, ⁵​LoggedActivitiesDistance, ⁶​VeryActiveDistance,
## #   ⁷​ModeratelyActiveDistance, ⁸​LightActiveDistance, ⁹​SedentaryActiveDistance
head(hourlyCalories)
## # A tibble: 6 × 3
##           Id ActivityHour          Calories
##        <dbl> <chr>                    <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM       81
## 2 1503960366 4/12/2016 1:00:00 AM        61
## 3 1503960366 4/12/2016 2:00:00 AM        59
## 4 1503960366 4/12/2016 3:00:00 AM        47
## 5 1503960366 4/12/2016 4:00:00 AM        48
## 6 1503960366 4/12/2016 5:00:00 AM        48
head(hourlySteps)
## # A tibble: 6 × 3
##           Id ActivityHour          StepTotal
##        <dbl> <chr>                     <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM       373
## 2 1503960366 4/12/2016 1:00:00 AM        160
## 3 1503960366 4/12/2016 2:00:00 AM        151
## 4 1503960366 4/12/2016 3:00:00 AM          0
## 5 1503960366 4/12/2016 4:00:00 AM          0
## 6 1503960366 4/12/2016 5:00:00 AM          0
head(sleepDay)
## # A tibble: 6 × 5
##           Id SleepDay              TotalSleepRecords TotalMinutesAsleep TotalT…¹
##        <dbl> <chr>                             <dbl>              <dbl>    <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327      346
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384      407
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412      442
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340      367
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700      712
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304      320
## # … with abbreviated variable name ¹​TotalTimeInBed

3 Cleaning and Processing Data

Using a combination of tools from tidyverse and janitor to clean, process, filter, and sort the data. The lubridate package we also used to manipulate dates for easier analysis.

3.1 Daily Activity Data Frame

Reformatting dates, adding columns for week day, dropping unnecessary columns, and shorten the 10 digit id

dailyNewActivity<- dailyActivity %>%
  clean_names() %>% 
  mutate(activity_date = mdy(activity_date), day_week = weekdays(activity_date)) %>% 
  rename(date = activity_date) %>%
  select(-c(5:14))  
dailyNewActivity$id <- as.numeric(factor(dailyNewActivity$id, levels = unique(dailyNewActivity$id)))
head(dailyNewActivity)
## # A tibble: 6 × 6
##      id date       total_steps total_distance calories day_week 
##   <dbl> <date>           <dbl>          <dbl>    <dbl> <chr>    
## 1     1 2016-04-12       13162           8.5      1985 Tuesday  
## 2     1 2016-04-13       10735           6.97     1797 Wednesday
## 3     1 2016-04-14       10460           6.74     1776 Thursday 
## 4     1 2016-04-15        9762           6.28     1745 Friday   
## 5     1 2016-04-16       12669           8.16     1863 Saturday 
## 6     1 2016-04-17        9705           6.48     1728 Sunday
3.2 Merging hourly data frames

Merging both hourlySteps and hourly calories, reformatting dates, creating the weekday column, and shortening ids

hourlyNewActivity <- hourlySteps %>%
  left_join(hourlyCalories, by = c("Id", "ActivityHour")) %>%
  clean_names() %>%
  mutate(activity_hour = mdy_hms(activity_hour), day_week = weekdays(activity_hour)) %>%
  separate(col = activity_hour, into = c("date", "time"), sep = " ") %>%
  mutate(date = ymd(date))

hourlyNewActivity$id <- as.numeric(factor(hourlyNewActivity$id, levels = unique(hourlyNewActivity$id)))

head(hourlyNewActivity)
## # A tibble: 6 × 6
##      id date       time     step_total calories day_week
##   <dbl> <date>     <chr>         <dbl>    <dbl> <chr>   
## 1     1 2016-04-12 00:00:00        373       81 Tuesday 
## 2     1 2016-04-12 01:00:00        160       61 Tuesday 
## 3     1 2016-04-12 02:00:00        151       59 Tuesday 
## 4     1 2016-04-12 03:00:00          0       47 Tuesday 
## 5     1 2016-04-12 04:00:00          0       48 Tuesday 
## 6     1 2016-04-12 05:00:00          0       48 Tuesday
3.1 Sleep Data Frame

Reformatting dates, renaming columns, changing format of hours into minutes, dropping unnecessary columns, and shortening ids

newSleepDay <- sleepDay %>%
  clean_names() %>%
  mutate(sleep_day = mdy_hms(sleep_day), sleep_day = ymd(sleep_day)) %>%
  rename("sleep_records" = "total_sleep_records",
         "minutes_asleep" = "total_minutes_asleep",
         "total_bed_minutes" = "total_time_in_bed") %>%
  mutate(hours_asleep = floor(minutes_asleep / 60),
         minutes_asleep = minutes_asleep %% 60,
         total_bed_hours = floor(total_bed_minutes / 60),
         bed_minutes = total_bed_minutes %% 60) %>%
  mutate(sleep_time = paste0(hours_asleep, ":", sprintf("%02d", minutes_asleep)),
         total_bed_time = paste0(total_bed_hours, ":", sprintf("%02d", bed_minutes)))%>%
  select(-c("minutes_asleep", "hours_asleep", "total_bed_hours", "bed_minutes"))

newSleepDay$id <- as.numeric(factor(newSleepDay$id, levels = unique(newSleepDay$id))) 



head(newSleepDay)
## # A tibble: 6 × 6
##      id sleep_day  sleep_records total_bed_minutes sleep_time total_bed_time
##   <dbl> <date>             <dbl>             <dbl> <chr>      <chr>         
## 1     1 2016-04-12             1               346 5:27       5:46          
## 2     1 2016-04-13             2               407 6:24       6:47          
## 3     1 2016-04-15             1               442 6:52       7:22          
## 4     1 2016-04-16             2               367 5:40       6:07          
## 5     1 2016-04-17             1               712 11:40      11:52         
## 6     1 2016-04-19             1               320 5:04       5:20

4 Analyze Data

4.1 Daily Data Analysis

Created a new data frame as backup where only necessary columns are shown, deleting duplicates and entries with 0, and aggregating the data into a new data frame where averages are calculated for the steps.

dailyFinal <- dailyNewActivity %>% 
  select(id, date, day_week, total_steps, total_distance, calories)
dailyFinal <- subset(dailyNewActivity, total_steps != 0 & total_distance != 0 & calories != 0 & !duplicated(dailyFinal)) 

dailyFinal$day_week <- as.factor(dailyFinal$day_week)

head(dailyFinal)
## # A tibble: 6 × 6
##      id date       total_steps total_distance calories day_week 
##   <dbl> <date>           <dbl>          <dbl>    <dbl> <fct>    
## 1     1 2016-04-12       13162           8.5      1985 Tuesday  
## 2     1 2016-04-13       10735           6.97     1797 Wednesday
## 3     1 2016-04-14       10460           6.74     1776 Thursday 
## 4     1 2016-04-15        9762           6.28     1745 Friday   
## 5     1 2016-04-16       12669           8.16     1863 Saturday 
## 6     1 2016-04-17        9705           6.48     1728 Sunday
Daily Plot Showing Entries
ggplot(dailyFinal, aes(x = id)) + 
  geom_bar(stat = "count") +
  xlab("ID") + 
  ylab("Number of Entries") + 
  ggtitle("Number of Entries per ID") +
  scale_x_discrete(limits = factor(unique(dailyFinal$id)))

It is important to note that each data point in the plot represents a daily entry, which implies that 20 out of the 33 participants had 30 or more entries, while the remaining participants had around 15-25 entries. Ideally, it would be desirable to have complete data for all the days of the month from all participants. Nonetheless, the available data will be further explored, and conclusions will be drawn by calculating relevant averages.

In the next code chunk, an aggregated version of the dailyFinal data frame is created to allow for the generation of a plot that displays the average number of steps taken by day of the week. This will provide insight into whether certain days of the week exhibit distinct activity patterns, enabling the identification of potential trends or patterns that may inform further analysis.

dailyFinal_agg <- dailyFinal %>%
  group_by(id, day_week) %>%
  summarise(average_steps = mean(total_steps))

head(dailyFinal_agg)
## # A tibble: 6 × 3
## # Groups:   id [1]
##      id day_week average_steps
##   <dbl> <fct>            <dbl>
## 1     1 Friday          11466.
## 2     1 Monday          13781.
## 3     1 Saturday        13426.
## 4     1 Sunday          10102.
## 5     1 Thursday        11876.
## 6     1 Tuesday         13947.
4.2 Hourly Data Analysis

Creating aggregated data frame for the hourly data.

hourlyNewActivity_agg <- hourlyNewActivity%>%
  mutate(hour = as.numeric(format(strptime(paste(date, time), "%Y-%m-%d %H:%M:%S"), "%H"))) %>%
  group_by(id, hour) %>%
  summarize(step_total = sum(step_total))%>%
  group_by(id) %>%
  mutate(min_hour = hour[which.min(step_total)],
         max_hour = hour[which.max(step_total)])
head(hourlyNewActivity_agg)
## # A tibble: 6 × 5
## # Groups:   id [1]
##      id  hour step_total min_hour max_hour
##   <dbl> <dbl>      <dbl>    <dbl>    <dbl>
## 1     1     0       4280        5       18
## 2     1     1       1503        5       18
## 3     1     2        870        5       18
## 4     1     3        355        5       18
## 5     1     4        108        5       18
## 6     1     5         63        5       18

The current structure of the data frame suggests that each unique identifier is associated with 24 rows, corresponding to each hour of the day. For instance, the first row indicates that id 1 recorded a total of 4280 steps at 0 (12:00 AM). Furthermore, the min hour and max hour columns provide information on the specific hours during which the minimum and maximum number of steps were taken over the course of the month.

To build upon these initial observations, an additional data frame was constructed that aggregates the hourly activity data and calculates the average number of steps taken for each hour of the day across the entire month. This provides a more comprehensive view of how activity levels vary across different times of the day, allowing for the identification of patterns and trends that may be useful for further analysis and interpretation.

#Average steps taken at each hour of the day
hourly_average <- hourlyNewActivity_agg %>%
  group_by(hour) %>%
  summarize(avg_steps = mean(step_total))
head(hourly_average)
## # A tibble: 6 × 2
##    hour avg_steps
##   <dbl>     <dbl>
## 1     0     1194.
## 2     1      653.
## 3     2      484.
## 4     3      182.
## 5     4      359.
## 6     5     1239.
4.3 Sleep Data Analysis

In a prior analysis, it was observed that the sleep data frame contained only 24 unique identifiers, indicating the presence of missing data. Nonetheless, given the importance of sleep tracking for BellaBeat’s health and wellness monitoring, it was deemed necessary to explore this metric further. In order to achieve this without introducing biases in the analysis, the minutes in bed across all sleep records were aggregated to obtain a total figure. Subsequently, this total was transformed into a more meaningful metric, namely the total time spent in bed, through appropriate calculations. Additionally, the average time spent in bed was calculated to uncover potential trends and patterns. These exploratory findings will then be communicated through visualizations to enable effective data interpretation.

#Totaling the amount of sleep records and time by id
sleep_record_counts <- newSleepDay %>% 
  group_by(id) %>% 
  summarize(total_sleep_records = sum(sleep_records),
            total_bed_minutes = sum(total_bed_minutes))
sleep_record_counts$total_time <- sleep_record_counts$total_bed_minutes / 60
sleep_record_counts$average_time_in_bed <-  as.numeric(sprintf("%.2f", sleep_record_counts$total_time / sleep_record_counts$total_sleep_records))


head(sleep_record_counts)
## # A tibble: 6 × 5
##      id total_sleep_records total_bed_minutes total_time average_time_in_bed
##   <dbl>               <dbl>             <dbl>      <dbl>               <dbl>
## 1     1                  27              9580     160.                  5.91
## 2     2                   4              1384      23.1                 5.77
## 3     3                   3              2883      48.0                16.0 
## 4     4                   8              2189      36.5                 4.56
## 5     5                  28             15054     251.                  8.96
## 6     6                   1                69       1.15                1.15

4 Share/Data Visualization

4.1 Daily Data

This box plot provides some interesting insights into the average total steps taken by 33 users over the course of a month, broken down by each day of the week. One key observation we can make is that the difference between the medians in each box plot is relatively small, ranging from just over 6,250 average total steps to 8,750 steps. This suggests that, on average, users are fairly consistent in their activity levels throughout the week, with relatively little variation in the number of steps taken from day to day.

However, we can also see that there are some outliers in the data, particularly on Saturdays and Sundays, which have the largest interquartile range (IQR) of all the days of the week. This indicates that there is a wider range of activity levels on these days, with some users being much more active than others. This is not surprising, as many people have more free time on the weekends and may be more likely to engage in physical activity.

It’s important to keep in mind that this data is limited to just 33 users over a relatively short time period, so further research may be needed to confirm these trends and explore potential interventions in more detail.

ggplot(dailyFinal_agg, aes(x=day_week, y=average_steps, color = day_week)) + 
  geom_boxplot() +
  xlab("Day Of Week") + ylab("Average Total Steps") + 
  ggtitle("Average Total Steps by Day of Week") +
  scale_color_brewer(type="qual", palette="Set1") +
  theme(legend.position="none")

4.2 Hourly Data

By examining the hourly data and plotting the steps taken at each hour of the day, we gain a more detailed understanding of the activity patterns of these users. The resulting line plot reveals some interesting trends and insights into what the average day of a user might look like.

We observe that activity levels are quite low during the early morning hours, from midnight to 4:00 AM, and then begin to increase steadily from 5:00 AM to 10:00 AM. This may reflect the morning routines of these users, including waking up, getting ready for the day, and commuting to work or other activities.

One noteworthy observation is the significant drop in activity levels around 2:00 PM, followed by a spike in activity and then a gradual decline through the evening hours. This dip in activity around midday may suggest that users are taking a break, perhaps to have lunch or engage in other non-physical activities. The subsequent spike in activity may reflect a return to work or other tasks, followed by a tapering off in the evening as users wind down and prepare for sleep.

These patterns are consistent with what one might expect from a working person, and the observed activity levels align with the broader literature on physical activity patterns in adults. By better understanding these patterns, we can design interventions and strategies that encourage physical activity during key periods, such as midday when activity levels are typically lower.

It is important to keep in mind that the trends observed in this data are based on a limited sample of 33 users, and further research may be needed to confirm these patterns and explore their implications for health and well-being.

ggplot(hourly_average, aes(x = hour, y = avg_steps, color = "Average Steps")) + 
  geom_line() +
  scale_x_continuous(breaks = 0:23, labels = 0:23) +
  labs(x = "Hour", y = "Average Steps Total") +
  ggtitle("Hourly Average Steps") +
  scale_color_discrete(name = "Group") +
  theme_minimal()

4.3 Sleep Data

The analysis of the sleep data revealed some noteworthy observations. Notably, the dataset was incomplete as it lacked inputs from 9 of the 33 participants. Moreover, of the 24 participants who reported at least one sleep record, only 13 participants reported at least 20 sleep records. Given that the study was intended to be conducted monthly, it is reasonable to expect a minimum of 30 sleep record entries for each participant, which would have provided a more comprehensive dataset for drawing sound conclusions. These findings underscore the importance of collecting complete and consistent data to obtain meaningful insights from statistical analyses.

# Plot the number of sleep records for each person using a bar graph
ggplot(sleep_record_counts, aes(x = id, y = total_sleep_records, fill = total_sleep_records)) +
  geom_col() +
  xlab("Person ID") +
  ylab("Number of Sleep Records") +
  ggtitle("Number of Sleep Records per Person")+
  scale_x_discrete(limits = factor(sleep_record_counts$id))+
  scale_fill_gradient(low = "lightblue", high = "darkblue")

Although the sleep data was incomplete and inconsistent, I was able to analyze the reported average bed hours among users. Based on the visualization, it was observed that the range of average bed hours was not quite large for consistent users, with most users reporting an average between 5 and 10 hours. This observation is not surprising, as the commonly accepted healthy sleep duration is between 7 and 9 hours. However, it’s important to note that the limited and dirty nature of the data makes it difficult to draw definitive conclusions and any further inferences or generalizations should be avoided.

my_colors <- c("darkblue", "gray")

# Create a new column indicating if the average time in bed is between 5 and 10
sleep_record_counts$color <- ifelse(sleep_record_counts$average_time_in_bed >= 5 & sleep_record_counts$average_time_in_bed <= 10, "Between 5 and 10", "Other")

# Convert the color column to a factor
sleep_record_counts$color <- factor(sleep_record_counts$color, levels = c("Between 5 and 10", "Other"))

# Create the plot
ggplot(sleep_record_counts, aes(x = id, y = average_time_in_bed, fill = color)) +
  geom_col() +
  xlab("Person ID") +
  ylab("Average Bed Hours") +
  ggtitle("Average Bed Time per record") +
  scale_x_discrete(limits = factor(sleep_record_counts$id)) +
  scale_fill_manual(values = my_colors) +
  guides(fill = guide_legend(title = "Average Bed Hours"))