Bellabeat Case Study

Bellabeat Leaf tracker

A. Introduction & Business Task:

To gain insight into how consumers use non-Bellabeat smart devices, and apply these insights to one Bellabeat product.

Bellabeat makes digital health products for women
Following its past success, the company is ready to grow in the global smart device market
ASK: How are consumers using health-related smart devices?

Stakeholders:

Urška Sršen: Bellabeat's cofounder and Chief Creative Officer
Sando Mur: Bellabeat's cofounder and mathematician

Goals:

Explore the patterns in consumer usage of health-related smart devices.
Present findings directly to Bellabeat's cofounders
Recommend actions based on insights

B. ASK - Guiding Questions:

What are some trends in smart device usage?
How could these trends apply to Bellabeat customers?
How could these trends help influence Bellabeat marketing strategy?


C. PREPARE - Data Sources:

The FitBit Fitness Tracker Data, made available by Mobius, was recommended by the cofounders. The dataset is hosted on the Kaggle platform. It contains 17 CSV files (which can be opened in Excel) that are available for download in a variety of layouts, including narrow and wide formats for calories and steps per minute.

Licensing & Accessibility

The FitBit Fitness Tracker Data is:

a publicly available dataset with a CC0: Public Domain designation
hosted on Kaggle, a publicly accessible data-sharing platform

The data was obtained from respondents to a survey distributed via Amazon Mechanical Turk between March 12 and May 12, 2016. Thirty consenting users submitted their tracker data to this dataset.

Data Limitations using ROCCC ANALYSIS

Reliability: Without knowing how the 30 people who chose to submit their health data for this survey were selected, the reliability of the data is questionable. Random selection is important in data analysis; if participants were not randomly selected, the data may carry selection bias.
Originality: A link to the original dataset was provided on the Kaggle website and it showed the original data source on Zenodo.org https://zenodo.org/record/53894#.Y2v1l-TMJPZ
Comprehensiveness: Gender is not specified. Bellabeat's customers are women, but the Fitbit dataset does not specify the gender of its participants. A dataset of female users of health-related smart devices would be more applicable to the business task at hand.
Current: The data was collected in 2016.
Cited: The citation for the datasource is Furberg, R., Brinton, J., Keating, M., & Ortiz, A. (2016). Crowd-sourced Fitbit datasets 03.12.2016-05.12.2016 [Data set]. Zenodo. https://doi.org/10.5281/zenodo.53894

Overall, the dataset comes from an original, cited data source, but its reliability and comprehensiveness are limited.

Pros:

Original
Current
Cited

Cons:

lacks reliability (selection method)
lacks comprehensiveness (not specific to female consumers)

I will be using RStudio and Excel to conduct the analyses. The following packages were installed.

Installing packages: tidyverse, skimr, and janitor

install.packages("tidyverse")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("skimr")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("janitor")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

Loading packages

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(skimr)
library(janitor)

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

D. PROCESS:

R was used to process the data. The following data cleaning techniques were used: dataframes were created for a) minute-level sleep activity, b) daily sleep, c) weight log, d) hourly calories, and e) daily calories.

Creating dataframes

sleepactivity_df <- read_csv("minuteSleep_merged.csv")

## Rows: 188521 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): date
## dbl (3): Id, value, logId
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

sleepday <- read_csv("sleepDay_merged.csv")

## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

weightlog <- read_csv("weightLogInfo_merged.csv")

## Rows: 67 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
## lgl (1): IsManualReport
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

hourlycalories <- read_csv("hourlyCalories_merged.csv")

## Rows: 22099 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (2): Id, Calories
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

dailycalories <- read_csv("dailyCalories_merged.csv")

## Rows: 940 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, Calories
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
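The column specifications above show that every date column was read as character (chr), so date arithmetic and grouping by hour or day would require parsing first. The following is a minimal sketch of one way to do that in base R, not a step the original analysis performed. The sample strings follow the format of the ActivityHour column, and the UTC time zone is an assumption because the dataset does not state one.

```r
# Sample strings in the format used by the ActivityHour column
activity_hour <- c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM")

# %I is the 12-hour clock and %p the AM/PM marker (parsing %p assumes an
# English or C locale); tz = "UTC" is an assumption, not stated by the data
parsed <- as.POSIXct(activity_hour,
                     format = "%m/%d/%Y %I:%M:%S %p",
                     tz = "UTC")

# Timestamps in an unambiguous 24-hour format
format(parsed, "%Y-%m-%d %H:%M")

# Hour of day as an integer, useful for grouping calories by time of day
hour_of_day <- as.integer(format(parsed, "%H"))
hour_of_day
```

The same format string should apply to SleepDay and Date, which use the same month/day/year layout with a 12-hour clock.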

E. ANALYZE:

The head(), str(), colnames(), and glimpse() functions were used to inspect all 5 dataframes. Output from head(), colnames(), and glimpse() is displayed below.

Head summaries

head(dailycalories)

## # A tibble: 6 × 3
##           Id ActivityDay Calories
##        <dbl> <chr>          <dbl>
## 1 1503960366 4/12/2016       1985
## 2 1503960366 4/13/2016       1797
## 3 1503960366 4/14/2016       1776
## 4 1503960366 4/15/2016       1745
## 5 1503960366 4/16/2016       1863
## 6 1503960366 4/17/2016       1728

head(hourlycalories)

## # A tibble: 6 × 3
##           Id ActivityHour          Calories
##        <dbl> <chr>                    <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM       81
## 2 1503960366 4/12/2016 1:00:00 AM        61
## 3 1503960366 4/12/2016 2:00:00 AM        59
## 4 1503960366 4/12/2016 3:00:00 AM        47
## 5 1503960366 4/12/2016 4:00:00 AM        48
## 6 1503960366 4/12/2016 5:00:00 AM        48

head(sleepactivity_df)

## # A tibble: 6 × 4
##           Id date                 value       logId
##        <dbl> <chr>                <dbl>       <dbl>
## 1 1503960366 4/12/2016 2:47:30 AM     3 11380564589
## 2 1503960366 4/12/2016 2:48:30 AM     2 11380564589
## 3 1503960366 4/12/2016 2:49:30 AM     1 11380564589
## 4 1503960366 4/12/2016 2:50:30 AM     1 11380564589
## 5 1503960366 4/12/2016 2:51:30 AM     1 11380564589
## 6 1503960366 4/12/2016 2:52:30 AM     1 11380564589

head(sleepday)

## # A tibble: 6 × 5
##           Id SleepDay              TotalSleepRecords TotalMinutesAsleep TotalT…¹
##        <dbl> <chr>                             <dbl>              <dbl>    <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327      346
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384      407
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412      442
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340      367
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700      712
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304      320
## # … with abbreviated variable name ¹TotalTimeInBed

head(weightlog)

## # A tibble: 6 × 8
##           Id Date                  WeightKg Weight…¹   Fat   BMI IsMan…²   LogId
##        <dbl> <chr>                    <dbl>    <dbl> <dbl> <dbl> <lgl>     <dbl>
## 1 1503960366 5/2/2016 11:59:59 PM      52.6     116.    22  22.6 TRUE    1.46e12
## 2 1503960366 5/3/2016 11:59:59 PM      52.6     116.    NA  22.6 TRUE    1.46e12
## 3 1927972279 4/13/2016 1:08:52 AM     134.      294.    NA  47.5 FALSE   1.46e12
## 4 2873212765 4/21/2016 11:59:59 PM     56.7     125.    NA  21.5 TRUE    1.46e12
## 5 2873212765 5/12/2016 11:59:59 PM     57.3     126.    NA  21.7 TRUE    1.46e12
## 6 4319703577 4/17/2016 11:59:59 PM     72.4     160.    25  27.5 TRUE    1.46e12
## # … with abbreviated variable names ¹WeightPounds, ²IsManualReport

Column summaries

colnames(dailycalories)

## [1] "Id"          "ActivityDay" "Calories"

colnames(hourlycalories)

## [1] "Id"           "ActivityHour" "Calories"

colnames(sleepactivity_df)

## [1] "Id"    "date"  "value" "logId"

colnames(sleepday)

## [1] "Id"                 "SleepDay"           "TotalSleepRecords" 
## [4] "TotalMinutesAsleep" "TotalTimeInBed"

colnames(weightlog)

## [1] "Id"             "Date"           "WeightKg"       "WeightPounds"  
## [5] "Fat"            "BMI"            "IsManualReport" "LogId"

Glimpse summaries

glimpse(dailycalories)

## Rows: 940
## Columns: 3
## $ Id          <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960366…
## $ ActivityDay <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/2016", "4/16/…
## $ Calories    <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2035, 1786, 1775…

glimpse(hourlycalories)

## Rows: 22,099
## Columns: 3
## $ Id           <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityHour <chr> "4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM", "4/12/20…
## $ Calories     <dbl> 81, 61, 59, 47, 48, 48, 48, 47, 68, 141, 99, 76, 73, 66, …

glimpse(sleepactivity_df)

## Rows: 188,521
## Columns: 4
## $ Id    <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960366, 1503…
## $ date  <chr> "4/12/2016 2:47:30 AM", "4/12/2016 2:48:30 AM", "4/12/2016 2:49:…
## $ value <dbl> 3, 2, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 2, 1, 1, 1, 1, 1, 1…
## $ logId <dbl> 11380564589, 11380564589, 11380564589, 11380564589, 11380564589,…

glimpse(sleepday)

## Rows: 413
## Columns: 5
## $ Id                 <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150…
## $ SleepDay           <chr> "4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM", "…
## $ TotalSleepRecords  <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ TotalMinutesAsleep <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2…
## $ TotalTimeInBed     <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3…

glimpse(weightlog)

## Rows: 67
## Columns: 8
## $ Id             <dbl> 1503960366, 1503960366, 1927972279, 2873212765, 2873212…
## $ Date           <chr> "5/2/2016 11:59:59 PM", "5/3/2016 11:59:59 PM", "4/13/2…
## $ WeightKg       <dbl> 52.6, 52.6, 133.5, 56.7, 57.3, 72.4, 72.3, 69.7, 70.3, …
## $ WeightPounds   <dbl> 115.9631, 115.9631, 294.3171, 125.0021, 126.3249, 159.6…
## $ Fat            <dbl> 22, NA, NA, NA, NA, 25, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ BMI            <dbl> 22.65, 22.65, 47.54, 21.45, 21.69, 27.45, 27.38, 27.25,…
## $ IsManualReport <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
## $ LogId          <dbl> 1.462234e+12, 1.462320e+12, 1.460510e+12, 1.461283e+12,…

The weight log data has markedly fewer rows than the other data, so the number of distinct IDs in each dataset was explored.

n_distinct(dailycalories$Id)

## [1] 33

n_distinct(hourlycalories$Id)

## [1] 33

n_distinct(sleepactivity_df$Id)

## [1] 24

n_distinct(sleepday$Id)

## [1] 24

n_distinct(weightlog$Id)

## [1] 8

The weight log data was collected for 8 participants, while the other data was collected from 24 or 33 participants. This suggests that the weight-logging feature of the smart device is used far less than its other tracking features.
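The per-dataset ID counts can also be gathered in a single call instead of one n_distinct() call per dataframe. This is a small sketch using hypothetical miniature data frames in place of the five loaded above; only the pattern is meant to carry over.

```r
# Hypothetical miniature data frames standing in for the real ones
datasets <- list(
  dailycalories = data.frame(Id = c(1, 1, 2, 3)),
  sleepday      = data.frame(Id = c(1, 2, 2)),
  weightlog     = data.frame(Id = c(1))
)

# Number of distinct participant IDs in each data frame
id_counts <- sapply(datasets, function(df) length(unique(df$Id)))
id_counts
```

Putting the counts in one named vector makes gaps in feature usage easier to compare side by side.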

Exploring Relationship between Sleep & Weight

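Merged the sleep and weight datasets on Id. Because the merge uses Id only, each sleep record is paired with every weight entry for the same participant, so rows repeat in the output below.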

sleepday_weight <- merge(sleepday, weightlog, by=c('Id'))
head(sleepday_weight)

##           Id              SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366  5/8/2016 12:00:00 AM                 1                594
## 2 1503960366  5/8/2016 12:00:00 AM                 1                594
## 3 1503960366  5/7/2016 12:00:00 AM                 1                331
## 4 1503960366  5/7/2016 12:00:00 AM                 1                331
## 5 1503960366 4/26/2016 12:00:00 AM                 1                245
## 6 1503960366 4/26/2016 12:00:00 AM                 1                245
##   TotalTimeInBed                 Date WeightKg WeightPounds Fat   BMI
## 1            611 5/3/2016 11:59:59 PM     52.6     115.9631  NA 22.65
## 2            611 5/2/2016 11:59:59 PM     52.6     115.9631  22 22.65
## 3            349 5/3/2016 11:59:59 PM     52.6     115.9631  NA 22.65
## 4            349 5/2/2016 11:59:59 PM     52.6     115.9631  22 22.65
## 5            274 5/3/2016 11:59:59 PM     52.6     115.9631  NA 22.65
## 6            274 5/2/2016 11:59:59 PM     52.6     115.9631  22 22.65
##   IsManualReport        LogId
## 1           TRUE 1.462320e+12
## 2           TRUE 1.462234e+12
## 3           TRUE 1.462320e+12
## 4           TRUE 1.462234e+12
## 5           TRUE 1.462320e+12
## 6           TRUE 1.462234e+12
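In the output above, each sleep record appears once for every weight entry with the same Id. If the goal is to compare sleep and weight recorded on the same day, one option is to merge on both Id and calendar date. The sketch below uses hypothetical toy data shaped like sleepday and weightlog; the Day column is a helper introduced here for illustration.

```r
# Hypothetical toy data in the same shape as sleepday and weightlog
sleep_toy <- data.frame(
  Id = c(1, 1, 2),
  SleepDay = c("5/2/2016 12:00:00 AM", "5/3/2016 12:00:00 AM",
               "5/2/2016 12:00:00 AM"),
  TotalMinutesAsleep = c(300, 420, 390)
)
weight_toy <- data.frame(
  Id = c(1, 1),
  Date = c("5/2/2016 11:59:59 PM", "5/3/2016 11:59:59 PM"),
  WeightPounds = c(116, 116)
)

# Reduce both timestamps to a calendar date so they can be matched;
# the time-of-day part of each string is ignored by this format
sleep_toy$Day  <- as.Date(sleep_toy$SleepDay, format = "%m/%d/%Y")
weight_toy$Day <- as.Date(weight_toy$Date, format = "%m/%d/%Y")

# Merging on Id alone pairs every sleep row with every weight row per Id
by_id <- merge(sleep_toy, weight_toy, by = "Id")

# Merging on Id and Day keeps only same-day pairs
by_id_day <- merge(sleep_toy, weight_toy, by = c("Id", "Day"))

nrow(by_id)
nrow(by_id_day)
```

Whether same-day matching is the right choice depends on the question being asked; for per-participant averages, summarizing each table by Id before merging avoids the repeated rows as well.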

Distinct IDs in the merged sleep-weight dataset

n_distinct(sleepday_weight$Id)

## [1] 6

The weight log covered far fewer participants (8) than the other datasets (24 or 33)
Used Excel to create a pivot table for sleep and weight
Identified the IDs that appear in both datasets and created a new dataset with those IDs, their weight, and their minutes asleep
Imported that dataset

sleep_weight <- read_csv("sleep_weight_merged.csv")

## Rows: 6 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): ID, Weight, MinutesAsleep
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

colnames(sleep_weight)

## [1] "ID"            "Weight"        "MinutesAsleep"

glimpse(sleep_weight)

## Rows: 6
## Columns: 3
## $ ID            <dbl> 1503960366, 1927972279, 4319703577, 4558609924, 55771503…
## $ Weight        <dbl> 115.9631, 294.3171, 159.5045, 153.5299, 199.9593, 135.70…
## $ MinutesAsleep <dbl> 360.2800, 417.0000, 476.6538, 127.6000, 432.0000, 448.00…
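The Excel pivot-table step could also be reproduced in R, which keeps the whole workflow in one script. The sketch below is one possible approach, assuming the pivot table averaged weight and minutes asleep per participant (the decimal values in MinutesAsleep suggest averages, but the original workbook is not shown). The toy tibbles and their values are hypothetical.

```r
library(dplyr)

# Hypothetical miniature versions of sleepday and weightlog
sleep_toy <- tibble(
  Id = c(1, 1, 2, 3),
  TotalMinutesAsleep = c(300, 420, 390, 450)
)
weight_toy <- tibble(
  Id = c(1, 1, 3, 4),
  WeightPounds = c(116, 116, 160, 200)
)

# One row per participant: average minutes asleep
sleep_avg <- sleep_toy %>%
  group_by(Id) %>%
  summarize(MinutesAsleep = mean(TotalMinutesAsleep))

# One row per participant: average logged weight
weight_avg <- weight_toy %>%
  group_by(Id) %>%
  summarize(Weight = mean(WeightPounds))

# Keep only participants that appear in both tables
sleep_weight_toy <- inner_join(weight_avg, sleep_avg, by = "Id")
sleep_weight_toy
```

Summarizing before joining yields one row per participant, which avoids the repeated rows produced by merging the raw tables on Id.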

F. SHARE - Graphs:

WEIGHT

Summary statistics & Bar Graph of Weight (N=8)

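After drop_na(), the summary shows an average weight of 138 pounds, with a minimum of 116 pounds and a maximum of 160 pounds. The graph shows a different picture: drop_na() with no arguments removes every row that has a missing value in any column, and Fat is missing for most weight log entries, so the summary covers only the few rows with a recorded Fat value. The weight log includes weights up to 294 pounds, so these statistics need further processing before they describe all 8 participants.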

weightlog %>% drop_na() %>% summarize(min(WeightPounds), max(WeightPounds), mean(WeightPounds))

## # A tibble: 1 × 3
##   `min(WeightPounds)` `max(WeightPounds)` `mean(WeightPounds)`
##                 <dbl>               <dbl>                <dbl>
## 1                116.                160.                 138.

ggplot(data = weightlog) +
  geom_bar(mapping = aes(x = WeightPounds))
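The gap between the summary statistics and the graph can be traced to how drop_na() treats missing values. The sketch below contrasts drop_na() with no arguments against drop_na(WeightPounds) on a hypothetical toy tibble; the values are rounded illustrations patterned on head(weightlog), not the full dataset, so the results are not the corrected statistics for the real data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: weights are all present, Fat is mostly missing
weight_toy <- tibble(
  WeightPounds = c(116, 116, 294, 125, 126, 160),
  Fat          = c(22, NA, NA, NA, NA, 25)
)

# drop_na() with no arguments drops rows with NA in ANY column,
# so only the two rows with a recorded Fat value survive
all_columns <- weight_toy %>%
  drop_na() %>%
  summarize(n = n(), mean_lb = mean(WeightPounds), max_lb = max(WeightPounds))

# Naming the column limits the check to WeightPounds, so all rows are kept
weight_only <- weight_toy %>%
  drop_na(WeightPounds) %>%
  summarize(n = n(), mean_lb = mean(WeightPounds), max_lb = max(WeightPounds))

all_columns
weight_only
```

Applying drop_na(WeightPounds) to weightlog would be one way to bring the summary in line with the graph.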

Summary statistics & Bar Graph of Weight for Subsample (N=6)

The average weight of these 6 participants is 176 pounds with a minimum weight of 116 pounds and a maximum weight of 294 pounds.

sleep_weight %>% drop_na() %>% summarize(min(Weight), max(Weight), mean(Weight))

## # A tibble: 1 × 3
##   `min(Weight)` `max(Weight)` `mean(Weight)`
##           <dbl>         <dbl>          <dbl>
## 1          116.          294.           176.

ggplot(data = sleep_weight) +
  geom_bar(mapping = aes(x = Weight))

SLEEP

Summary Statistics & Bar Graph of Minutes Asleep (N=24)

Across the daily sleep records, the average time spent sleeping is approximately 7 hours, with a minimum of less than one hour of sleep (58 minutes) and a maximum of about 13.3 hours (796 minutes).

sleepday %>% drop_na() %>% summarize(min(TotalMinutesAsleep), max(TotalMinutesAsleep), mean(TotalMinutesAsleep))

## # A tibble: 1 × 3
##   `min(TotalMinutesAsleep)` `max(TotalMinutesAsleep)` `mean(TotalMinutesAsleep)`
##                       <dbl>                     <dbl>                      <dbl>
## 1                        58                       796                       419.

ggplot(data = sleepday) +
  geom_bar(mapping = aes(x = TotalMinutesAsleep))
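The hour figures quoted for sleep come from dividing the minute values in the summary output by 60. A quick check of that arithmetic, using the minimum, maximum, and mean as printed above (the mean is shown rounded in the output):

```r
# Minimum, maximum, and mean of TotalMinutesAsleep as printed in the summary
minutes <- c(min = 58, max = 796, mean = 419)

# Convert minutes to hours, rounded to one decimal place
hours <- round(minutes / 60, 1)
hours
```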

Summary Statistics & Bar Graph of Minutes Asleep for subsample (N = 6)

Average time spent sleeping is 6.3 hours, with a minimum of 2.1 hours and a maximum of 7.9 hours.

sleep_weight %>% drop_na() %>% summarize(min(MinutesAsleep), max(MinutesAsleep), mean(MinutesAsleep))

## # A tibble: 1 × 3
##   `min(MinutesAsleep)` `max(MinutesAsleep)` `mean(MinutesAsleep)`
##                  <dbl>                <dbl>                 <dbl>
## 1                 128.                 477.                  377.

ggplot(data = sleep_weight) +
  geom_bar(mapping = aes(x = MinutesAsleep))

G. ACT - Findings

Summary analyses showed that the weight log entries came from 8 participants, while the data for minutes of sleep per day and sleep activity came from 24 participants. Merging the data to include only participants who entered weight data and had sleep information recorded resulted in 6 participants. This sample is too small to support any generalizable conclusions about sleep and weight.

Daily and hourly calories data had 33 participants
Sleep data had 24 participants
Data is missing either because of a recording error or because participants made no entry

The main finding is that weight log data was gathered for fewer than 1/4 of the participants (8 of the 33 participant IDs found in the calorie data; the dataset description cites 30 participants). Why is this? Further research was conducted to see how the weight log data for FitBit is collected. I found that this data has to be manually entered by the consumer https://community.fitbit.com/t5/Flex-Flex-2/How-does-your-fitbit-track-your-weight/td-p/1811894.
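The participation gap can be expressed as a share of the largest participant pool. The sketch below uses the distinct-ID counts reported by n_distinct() earlier in this report; using the largest count (33) as the denominator is a choice made here for illustration.

```r
# Distinct-ID counts reported by n_distinct() earlier in this report
id_counts <- c(dailycalories = 33, hourlycalories = 33,
               sleepactivity = 24, sleepday = 24, weightlog = 8)

# Share of the largest participant pool that appears in each dataset
coverage <- round(id_counts / max(id_counts), 2)
coverage
```

By this measure, roughly a quarter of participants logged any weight, compared with about three quarters who recorded sleep.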

Key Findings

Manual entry of weight information leads to missing data with the FitBit tracker, most likely because few consumers use this feature. Manual entry may deter consumers, especially if it is time-consuming or they are not meeting their health goals. Weight measurement is often used as a key marker of health and fitness. This is debatable given the recent trend toward focusing on health instead of weight https://www.franchisehelp.com/industry-reports/weight-loss-industry-analysis-2020-cost-trends/. However, weight measurement remains a key part of health-based smart devices https://www.health.harvard.edu/staying-healthy/wearable-fitness-trackers-may-aid-weight-loss-efforts. Therefore, if consumers can link their activities to weight goals without manual entry, they may get more utility out of the devices. Automating weight measurement could make weight data more reliable and more useful for finding relationships among all the data measured by smart devices.

H. RECOMMENDATIONS

Bellabeat can include a weight measurement feature in its app. The following steps can be used to guide implementation.

Conduct focus groups with female consumers of fitness trackers to find out their preferences for a weight measurement feature on the tracker
Include a feature that would prompt the consumer to input their weight
Develop a tool to automatically upload consumers’ weights from a linked scale

Learning how sleep, activity, stress, water, and caloric intake relate to markers of health such as weight could help make Bellabeat a leader in the health-related smart device industry.

Bellabeat Capstone

Ja’Nya Jenoch, Ph.D.

2022-11-09