Bellabeats Market Research Report

Project Summary

The objective was to report, through a data-driven analysis, on how non-Bellabeats products are being used. Based on these results, a product from the Bellabeats product line was selected to assist the marketing team with their marketing strategy. The data was crowd-sourced FitBit Fitness Tracker data, downloaded from Kaggle. Unfortunately, multiple datasets were incomplete and critical information was missing, i.e. sex, age, weight, fitness level and height. Taking these limitations into consideration, this project focused on ‘daily data’, as these datasets were complete and allowed for the most comprehensive analysis of how the product was used. Based on the analysis, the fitness tracker was used primarily as an activity tracker, with all participants wearing the trackers during the day. Although the sleep data was incomplete, it allowed for insights into the participants’ sleeping behaviors. The relationship between each activity level and calories used was analyzed. A strong positive correlation was found between very active minutes and calories used, while a negative correlation was observed between sedentary minutes and calories used. According to the CDC, a person should exercise for a minimum of 22 minutes per day at moderate to vigorous intensity. The participants met this requirement; however, on closer observation, the majority of the activity minutes were of light intensity. The recommendation from the data is that the Bellabeats marketing team should focus on marketing the ‘Bellabeats Time’ and ‘Bellabeats Leaf’, as both of these products are able to perform activity tracking. Based on the observations, it is recommended that the app include goals and rewards specific to the user, e.g. if a user aims to lose weight, the app should reward the user each time they perform vigorous activities, as this would correspond to more calories burned.

Data Sources for Analysis

The data source used for this analysis was the FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius). According to its description, the dataset contains personal fitness information from thirty (30) eligible FitBit users who consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to identify users’ habits.

The metadata provided on Kaggle can be found at https://www.kaggle.com/arashnic/fitbit/metadata. To verify the source, I navigated to https://zenodo.org/record/53894#.X9oeh3Uzaao and noticed that the data was captured as part of a survey run on another platform, ‘Amazon Mechanical Turk’. Additional verification was not possible: the original source was unavailable and the eligibility criteria were not disclosed.

Critical Analysis of the data source

Unfortunately, the data provided and the information it contains are unreliable. Metadata for the original data is unavailable, which makes it difficult to verify the data’s quality. It cannot be established whether the data is biased, as the data source only states that there are thirty eligible FitBit users. Critical information needed for a more comprehensive analysis, i.e. sex, age, fitness level, weight and height, is absent. Furthermore, multiple data tables are incomplete. The data appears to be pre-processed, as the file names contain ‘merged’ (e.g. activitydata_merged). The data is sourced from a third party and cannot be validated, as the original data is unavailable. The data was collected in 2016, and from my understanding of the fitness device market there have been multiple improvements over the past five (5) years. The data source is cited.

The reference for the sourced data is as follows:

Furberg, R., Brinton, J., Keating, M., & Ortiz, A. (2016). Crowd-sourced Fitbit datasets 03.12.2016-05.12.2016 [Data set]. Zenodo. https://doi.org/10.5281/zenodo.53894

Because there is missing information, the credibility of the data cannot be verified. This makes it difficult to draw informed conclusions, as the data has glaring weaknesses. If these limitations are not acknowledged, decisions based on the analysis can be incorrect and cost the company revenue.

Preparing the data for analysis

The data sets were downloaded from Kaggle at https://www.kaggle.com/arashnic/fitbit.

The CSV files were imported into R.

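# Packages used below. Assumption: plyr is attached after dplyr, since the rename(c("old" = "new"))
# calls and the count(n).x / count(n).freq output match plyr's functions rather than dplyr's.
library(dplyr)
library(tidyr)
library(janitor)
library(data.table)
library(plyr)
# Import data from csv to data frames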

activity_daily <- read.csv(file = "C:/Users/pc/Documents/Capstone_projects/fitness_project/dailyActivity_merged.csv", header = TRUE, sep = ",")
heartrate_seconds <- read.csv(file = "C:/Users/pc/Documents/Capstone_projects/fitness_project/heartrate_seconds_merged.csv", header = TRUE, sep = ",")
mets_minute <- read.csv(file = "C:/Users/pc/Documents/Capstone_projects/fitness_project/minuteMETsNarrow_merged.csv", header = TRUE, sep = ",")
sleep_daily <- read.csv(file = "C:/Users/pc/Documents/Capstone_projects/fitness_project/sleepDay_merged.csv", header = TRUE, sep = ",")
sleep_minute <- read.csv(file = "C:/Users/pc/Documents/Capstone_projects/fitness_project/minuteSleep_merged.csv", header = TRUE, sep = ",")
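
The absolute Windows paths above tie the script to one machine. A minimal sketch of a more portable import follows, assuming all CSV files sit in a single folder. The helper name read_fitbit_csv and the data_dir variable are hypothetical, and a temporary folder with a tiny stand-in file is used so that the example runs on its own.

```r
# Hypothetical helper: build the path from a folder and a file name,
# fail early with a clear message if the file is missing, then read it.
read_fitbit_csv <- function(data_dir, file_name) {
  path <- file.path(data_dir, file_name)
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  read.csv(path, header = TRUE, sep = ",", stringsAsFactors = FALSE)
}

# Stand-in data so the sketch is self-contained (three columns and two rows
# copied from the daily activity table shown in this report).
data_dir <- tempdir()
writeLines(
  c("Id,ActivityDate,TotalSteps",
    "1503960366,4/12/2016,13162",
    "1503960366,4/13/2016,10735"),
  file.path(data_dir, "dailyActivity_merged.csv")
)

activity_daily_demo <- read_fitbit_csv(data_dir, "dailyActivity_merged.csv")
str(activity_daily_demo)
```

With the real files, data_dir would point at the project folder and the helper would be called once per CSV file.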

The focus was on ‘daily data’, with the exception of the sleep, heart rate and METs data. The data was then viewed to ensure that it had been imported successfully.

# View Data to ensure import was successful. The focus will be on the daily data with the exception of the heart rate and sleep data.
tibble(activity_daily)
## # A tibble: 940 x 15
##        Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie~
##     <dbl> <chr>             <int>         <dbl>           <dbl>            <dbl>
##  1 1.50e9 4/12/2016         13162          8.5             8.5                 0
##  2 1.50e9 4/13/2016         10735          6.97            6.97                0
##  3 1.50e9 4/14/2016         10460          6.74            6.74                0
##  4 1.50e9 4/15/2016          9762          6.28            6.28                0
##  5 1.50e9 4/16/2016         12669          8.16            8.16                0
##  6 1.50e9 4/17/2016          9705          6.48            6.48                0
##  7 1.50e9 4/18/2016         13019          8.59            8.59                0
##  8 1.50e9 4/19/2016         15506          9.88            9.88                0
##  9 1.50e9 4/20/2016         10544          6.68            6.68                0
## 10 1.50e9 4/21/2016          9819          6.34            6.34                0
## # ... with 930 more rows, and 9 more variables: VeryActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <int>,
## #   FairlyActiveMinutes <int>, LightlyActiveMinutes <int>,
## #   SedentaryMinutes <int>, Calories <int>
tibble(heartrate_seconds)
## # A tibble: 2,483,658 x 3
##            Id Time                 Value
##         <dbl> <chr>                <int>
##  1 2022484408 4/12/2016 7:21:00 AM    97
##  2 2022484408 4/12/2016 7:21:05 AM   102
##  3 2022484408 4/12/2016 7:21:10 AM   105
##  4 2022484408 4/12/2016 7:21:20 AM   103
##  5 2022484408 4/12/2016 7:21:25 AM   101
##  6 2022484408 4/12/2016 7:22:05 AM    95
##  7 2022484408 4/12/2016 7:22:10 AM    91
##  8 2022484408 4/12/2016 7:22:15 AM    93
##  9 2022484408 4/12/2016 7:22:20 AM    94
## 10 2022484408 4/12/2016 7:22:25 AM    93
## # ... with 2,483,648 more rows
tibble(mets_minute)
## # A tibble: 1,325,580 x 3
##            Id ActivityMinute         METs
##         <dbl> <chr>                 <int>
##  1 1503960366 4/12/2016 12:00:00 AM    10
##  2 1503960366 4/12/2016 12:01:00 AM    10
##  3 1503960366 4/12/2016 12:02:00 AM    10
##  4 1503960366 4/12/2016 12:03:00 AM    10
##  5 1503960366 4/12/2016 12:04:00 AM    10
##  6 1503960366 4/12/2016 12:05:00 AM    12
##  7 1503960366 4/12/2016 12:06:00 AM    12
##  8 1503960366 4/12/2016 12:07:00 AM    12
##  9 1503960366 4/12/2016 12:08:00 AM    12
## 10 1503960366 4/12/2016 12:09:00 AM    12
## # ... with 1,325,570 more rows
tibble(sleep_daily)
## # A tibble: 413 x 5
##            Id SleepDay         TotalSleepRecor~ TotalMinutesAsle~ TotalTimeInBed
##         <dbl> <chr>                       <int>             <int>          <int>
##  1 1503960366 4/12/2016 12:00~                1               327            346
##  2 1503960366 4/13/2016 12:00~                2               384            407
##  3 1503960366 4/15/2016 12:00~                1               412            442
##  4 1503960366 4/16/2016 12:00~                2               340            367
##  5 1503960366 4/17/2016 12:00~                1               700            712
##  6 1503960366 4/19/2016 12:00~                1               304            320
##  7 1503960366 4/20/2016 12:00~                1               360            377
##  8 1503960366 4/21/2016 12:00~                1               325            364
##  9 1503960366 4/23/2016 12:00~                1               361            384
## 10 1503960366 4/24/2016 12:00~                1               430            449
## # ... with 403 more rows
tibble(sleep_minute)
## # A tibble: 188,521 x 4
##            Id date                 value       logId
##         <dbl> <chr>                <int>       <dbl>
##  1 1503960366 4/12/2016 2:47:30 AM     3 11380564589
##  2 1503960366 4/12/2016 2:48:30 AM     2 11380564589
##  3 1503960366 4/12/2016 2:49:30 AM     1 11380564589
##  4 1503960366 4/12/2016 2:50:30 AM     1 11380564589
##  5 1503960366 4/12/2016 2:51:30 AM     1 11380564589
##  6 1503960366 4/12/2016 2:52:30 AM     1 11380564589
##  7 1503960366 4/12/2016 2:53:30 AM     1 11380564589
##  8 1503960366 4/12/2016 2:54:30 AM     2 11380564589
##  9 1503960366 4/12/2016 2:55:30 AM     2 11380564589
## 10 1503960366 4/12/2016 2:56:30 AM     2 11380564589
## # ... with 188,511 more rows

Viewing the tibbles revealed inconsistent column names, for example ‘ActivityDate’ rather than just ‘Date’, and ‘Time’ for a column that in fact holds ‘Date_Time’. The date and time columns were therefore renamed, and the clean_names function was used to ensure that the headers were all in the same format.

# Cleaning process with janitor

# Based on the results, the column names for date and time are made consistent

activity_daily_cln <- activity_daily %>% 
  rename(c("ActivityDate" = "date")) %>% 
  clean_names()

# Heart rate data column to be changed from time to date time

heartrate_seconds_cln <- heartrate_seconds %>% 
  rename(c("Time" = "date_time")) %>% 
  clean_names()

# METs data column name change from activity minutes to date time 

mets_minute_cln <- mets_minute %>% 
  rename(c("ActivityMinute" = "date_time", "METs" = "mets")) %>% 
  clean_names()

# Sleep data column name to be changed from SleepDay to date

sleep_daily_cln <- sleep_daily %>% 
  rename(c("SleepDay" = "date")) %>% 
  clean_names()

# Sleep minute column name date to date time

sleep_minute_cln <- sleep_minute %>% 
  rename(c("date" = "date_time")) %>% 
  clean_names()
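
Both plyr and dplyr export a function called rename(), and they expect the old and new names in opposite order, so the calls above appear to rely on plyr's version being the one attached. A minimal base-R sketch that avoids this ambiguity is shown below. The helper rename_column and the toy data frame are hypothetical.

```r
# Hypothetical helper: rename one column by its old name, erroring if absent.
rename_column <- function(df, old, new) {
  if (!old %in% names(df)) {
    stop("Column not found: ", old)
  }
  names(df)[names(df) == old] <- new
  df
}

# Toy data frame using three of the original column names.
activity_demo <- data.frame(
  Id = c(1503960366, 1503960366),
  ActivityDate = c("4/12/2016", "4/13/2016"),
  TotalSteps = c(13162L, 10735L)
)

activity_demo <- rename_column(activity_demo, "ActivityDate", "date")
names(activity_demo)
```

The remaining headers could then be passed through clean_names() as in the report.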

The tabyl function was used to verify the completeness of the data by summarising the number of records per participant in each data frame.

# Produce a table summarizing the table based on id.

tabyl(activity_daily_cln, id)
##          id  n     percent
##  1503960366 31 0.032978723
##  1624580081 31 0.032978723
##  1644430081 30 0.031914894
##  1844505072 31 0.032978723
##  1927972279 31 0.032978723
##  2022484408 31 0.032978723
##  2026352035 31 0.032978723
##  2320127002 31 0.032978723
##  2347167796 18 0.019148936
##  2873212765 31 0.032978723
##  3372868164 20 0.021276596
##  3977333714 30 0.031914894
##  4020332650 31 0.032978723
##  4057192912  4 0.004255319
##  4319703577 31 0.032978723
##  4388161847 31 0.032978723
##  4445114986 31 0.032978723
##  4558609924 31 0.032978723
##  4702921684 31 0.032978723
##  5553957443 31 0.032978723
##  5577150313 30 0.031914894
##  6117666160 28 0.029787234
##  6290855005 29 0.030851064
##  6775888955 26 0.027659574
##  6962181067 31 0.032978723
##  7007744171 26 0.027659574
##  7086361926 31 0.032978723
##  8053475328 31 0.032978723
##  8253242879 19 0.020212766
##  8378563200 31 0.032978723
##  8583815059 31 0.032978723
##  8792009665 29 0.030851064
##  8877689391 31 0.032978723
# The outcome: only 21 of the 33 participants wore the fitness device for all 31 days

tabyl(activity_daily_cln, id) %>% 
  summarise(count(n))
##   count(n).x count(n).freq
## 1          4             1
## 2         18             1
## 3         19             1
## 4         20             1
## 5         26             2
## 6         28             1
## 7         29             2
## 8         30             3
## 9         31            21
# The heart rate data was less complete than the activity data, with only 14 participants

tabyl(heartrate_seconds_cln,id) %>% 
  summarise(count(n))
##    count(n).x count(n).freq
## 1        2490             1
## 2       32771             1
## 3      122841             1
## 4      133592             1
## 5      152683             1
## 6      154104             1
## 7      158899             1
## 8      192168             1
## 9      228841             1
## 10     248560             1
## 11     249748             1
## 12     255174             1
## 13     266326             1
## 14     285461             1
# There were 33 participants in total, with records of varying lengths

tabyl(mets_minute_cln, id) %>% 
  summarise(count(n))
##    count(n).x count(n).freq
## 1        5280             1
## 2       24840             1
## 3       25860             1
## 4       28320             1
## 5       36060             1
## 6       36600             1
## 7       39600             1
## 8       39900             1
## 9       40320             1
## 10      41760             1
## 11      42480             2
## 12      43020             1
## 13      43080             1
## 14      43440             1
## 15      43800             1
## 16      43860             2
## 17      43920             2
## 18      43980             3
## 19      44040             1
## 20      44100             4
## 21      44160             5
# There were 24 participants, tracking their sleep a minimum of 1 and a maximum of 32 days.

tabyl(sleep_daily_cln, id) %>% 
  summarise(count(n))
##    count(n).x count(n).freq
## 1           1             1
## 2           2             1
## 3           3             3
## 4           4             1
## 5           5             2
## 6           8             1
## 7          15             2
## 8          18             1
## 9          24             2
## 10         25             1
## 11         26             2
## 12         28             4
## 13         31             2
## 14         32             1
# There were 24 participants, tracking their sleep a minimum of 69 min and a maximum of 15682 min. This is consistent with the observations noted above.

tabyl(sleep_minute_cln, id) %>% 
  summarise(count(n))
##    count(n).x count(n).freq
## 1          69             1
## 2         143             1
## 3         700             1
## 4         905             1
## 5        1107             1
## 6        1384             1
## 7        2189             1
## 8        2883             1
## 9        3038             1
## 10       6807             1
## 11       7370             1
## 12       9183             1
## 13       9580             1
## 14       9734             1
## 15      11194             1
## 16      11671             1
## 17      11976             1
## 18      12375             1
## 19      12912             1
## 20      13051             1
## 21      14450             1
## 22      15054             1
## 23      15064             1
## 24      15682             1
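
The same completeness check can be sketched in base R: count the rows per participant, then count how many participants share each row count. The toy ids and dates below are made up for illustration.

```r
# Toy daily log: participant 1 has 3 days, participants 2 and 3 have 2 days each.
daily_demo <- data.frame(
  id = c(1, 1, 1, 2, 2, 3, 3),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-14",
                   "2016-04-12", "2016-04-13",
                   "2016-04-12", "2016-04-13"))
)

# Days recorded per participant.
days_per_id <- table(daily_demo$id)

# Number of participants for each distinct day count.
participants_per_day_count <- table(days_per_id)

# Share of participants who recorded the maximum number of days.
full_share <- mean(as.vector(days_per_id) == max(days_per_id))

days_per_id
participants_per_day_count
full_share
```

On the real tables, a low share of participants with the full number of days is a signal that per-participant averages should be preferred over raw totals.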

The next step was to convert the date and time columns from character format to date formats, as the focus was on ‘daily data’.

# To be able to plot the data, the date formats of all the data frames need to be changed

# Date is changed for the two dataframes below

activity_daily_cln[[2]] <- as.Date(activity_daily_cln[[2]], "%m/%d/%Y")
tibble(activity_daily_cln)
## # A tibble: 940 x 15
##            id date       total_steps total_distance tracker_distance
##         <dbl> <date>           <int>          <dbl>            <dbl>
##  1 1503960366 2016-04-12       13162           8.5              8.5 
##  2 1503960366 2016-04-13       10735           6.97             6.97
##  3 1503960366 2016-04-14       10460           6.74             6.74
##  4 1503960366 2016-04-15        9762           6.28             6.28
##  5 1503960366 2016-04-16       12669           8.16             8.16
##  6 1503960366 2016-04-17        9705           6.48             6.48
##  7 1503960366 2016-04-18       13019           8.59             8.59
##  8 1503960366 2016-04-19       15506           9.88             9.88
##  9 1503960366 2016-04-20       10544           6.68             6.68
## 10 1503960366 2016-04-21        9819           6.34             6.34
## # ... with 930 more rows, and 10 more variables:
## #   logged_activities_distance <dbl>, very_active_distance <dbl>,
## #   moderately_active_distance <dbl>, light_active_distance <dbl>,
## #   sedentary_active_distance <dbl>, very_active_minutes <int>,
## #   fairly_active_minutes <int>, lightly_active_minutes <int>,
## #   sedentary_minutes <int>, calories <int>
sleep_daily_cln[[2]] <- as.Date(sleep_daily_cln[[2]], "%m/%d/%Y")
tibble(sleep_daily_cln)
## # A tibble: 413 x 5
##            id date       total_sleep_records total_minutes_asl~ total_time_in_b~
##         <dbl> <date>                   <int>              <int>            <int>
##  1 1503960366 2016-04-12                   1                327              346
##  2 1503960366 2016-04-13                   2                384              407
##  3 1503960366 2016-04-15                   1                412              442
##  4 1503960366 2016-04-16                   2                340              367
##  5 1503960366 2016-04-17                   1                700              712
##  6 1503960366 2016-04-19                   1                304              320
##  7 1503960366 2016-04-20                   1                360              377
##  8 1503960366 2016-04-21                   1                325              364
##  9 1503960366 2016-04-23                   1                361              384
## 10 1503960366 2016-04-24                   1                430              449
## # ... with 403 more rows
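# The data frames below need a date-time format.
# Note: R's strptime documentation states that %p (AM/PM) is used with %I and not with %H.
# With %H the AM/PM indicator is not applied (12:00:00 AM prints as 12:00:00 below), so the
# time of day stays on the 12-hour clock; %I would be needed if the time of day matters.
# The date part is unaffected.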

heartrate_seconds_cln[[2]] <- as.POSIXct(heartrate_seconds_cln[[2]], format = "%m/%d/%Y %H:%M:%S %p")
tibble(heartrate_seconds_cln)
## # A tibble: 2,483,658 x 3
##            id date_time           value
##         <dbl> <dttm>              <int>
##  1 2022484408 2016-04-12 07:21:00    97
##  2 2022484408 2016-04-12 07:21:05   102
##  3 2022484408 2016-04-12 07:21:10   105
##  4 2022484408 2016-04-12 07:21:20   103
##  5 2022484408 2016-04-12 07:21:25   101
##  6 2022484408 2016-04-12 07:22:05    95
##  7 2022484408 2016-04-12 07:22:10    91
##  8 2022484408 2016-04-12 07:22:15    93
##  9 2022484408 2016-04-12 07:22:20    94
## 10 2022484408 2016-04-12 07:22:25    93
## # ... with 2,483,648 more rows
mets_minute_cln[[2]] <- as.POSIXct(mets_minute_cln[[2]], format = "%m/%d/%Y %H:%M:%S %p")
tibble(mets_minute_cln)
## # A tibble: 1,325,580 x 3
##            id date_time            mets
##         <dbl> <dttm>              <int>
##  1 1503960366 2016-04-12 12:00:00    10
##  2 1503960366 2016-04-12 12:01:00    10
##  3 1503960366 2016-04-12 12:02:00    10
##  4 1503960366 2016-04-12 12:03:00    10
##  5 1503960366 2016-04-12 12:04:00    10
##  6 1503960366 2016-04-12 12:05:00    12
##  7 1503960366 2016-04-12 12:06:00    12
##  8 1503960366 2016-04-12 12:07:00    12
##  9 1503960366 2016-04-12 12:08:00    12
## 10 1503960366 2016-04-12 12:09:00    12
## # ... with 1,325,570 more rows
# Separating column for the date time in tables for heart rate and METs

heartrate_seconds_cln_v2 <- separate(heartrate_seconds_cln, col = date_time, c("date","time"), sep = " ")
heartrate_seconds_cln_v2[[2]] <- as.Date(heartrate_seconds_cln_v2[[2]], "%Y-%m-%d")
tibble(heartrate_seconds_cln_v2)
## # A tibble: 2,483,658 x 4
##            id date       time     value
##         <dbl> <date>     <chr>    <int>
##  1 2022484408 2016-04-12 07:21:00    97
##  2 2022484408 2016-04-12 07:21:05   102
##  3 2022484408 2016-04-12 07:21:10   105
##  4 2022484408 2016-04-12 07:21:20   103
##  5 2022484408 2016-04-12 07:21:25   101
##  6 2022484408 2016-04-12 07:22:05    95
##  7 2022484408 2016-04-12 07:22:10    91
##  8 2022484408 2016-04-12 07:22:15    93
##  9 2022484408 2016-04-12 07:22:20    94
## 10 2022484408 2016-04-12 07:22:25    93
## # ... with 2,483,648 more rows
mets_minute_cln_v2 <- separate(mets_minute_cln, col = date_time, c("date","time"), sep = " ")
mets_minute_cln_v2[[2]] <- as.Date(mets_minute_cln_v2[[2]], "%Y-%m-%d")
tibble(mets_minute_cln_v2)
## # A tibble: 1,325,580 x 4
##            id date       time      mets
##         <dbl> <date>     <chr>    <int>
##  1 1503960366 2016-04-12 12:00:00    10
##  2 1503960366 2016-04-12 12:01:00    10
##  3 1503960366 2016-04-12 12:02:00    10
##  4 1503960366 2016-04-12 12:03:00    10
##  5 1503960366 2016-04-12 12:04:00    10
##  6 1503960366 2016-04-12 12:05:00    12
##  7 1503960366 2016-04-12 12:06:00    12
##  8 1503960366 2016-04-12 12:07:00    12
##  9 1503960366 2016-04-12 12:08:00    12
## 10 1503960366 2016-04-12 12:09:00    12
## # ... with 1,325,570 more rows
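
In R's strptime-style formats, %p (the AM/PM indicator) is documented as being used together with the 12-hour field %I and not with %H, which is why ‘12:00:00 AM’ appears as 12:00:00 in the METs output above. A minimal sketch of the %I variant on three illustrative timestamps in the dataset's format is shown below. The explicit tz = "UTC" is an assumption made so that the result does not depend on the local time zone, and AM/PM matching assumes an English-style locale.

```r
# Illustrative timestamps in the same layout as the raw files.
stamps <- c("4/12/2016 12:00:00 AM", "4/12/2016 7:21:05 AM", "4/12/2016 1:30:00 PM")

# %I = hour on the 12-hour clock, %p = AM/PM indicator.
parsed <- as.POSIXct(stamps, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# 24-hour time of day and calendar date, extracted without string splitting.
time_24h <- format(parsed, "%H:%M:%S")
day <- as.Date(parsed, tz = "UTC")

time_24h
day
```

Extracting the date with as.Date() also removes the need to split the date-time column on a space.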

Because the focus was on ‘daily data’, the heart rate and METs data had to be summarised per day.

# Summarize the minute and second data into daily data for the daily insights (repeated further down once duplicates are removed).

heartrate_seconds_cln_dt <- data.table(heartrate_seconds_cln_v2)
heartrate_seconds_cln_dt_2 <- heartrate_seconds_cln_dt[,list(mean_hrt = mean(value), max_hrt = max(value), min_hrt = min(value)), by = c("id,date")]
  
  
mets_minute_cln_dt <- data.table(mets_minute_cln_v2)
mets_minute_cln_dt_2 <- mets_minute_cln_dt[,list(mean_mets = mean(mets), max_mets = max(mets), min_mets = min(mets)), by = c("id,date")]
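
A minimal sketch of the same daily roll-up on a toy table is shown below, using data.table's .() shorthand for the grouping columns instead of the comma-separated string. The toy readings are made up for illustration.

```r
library(data.table)

# Toy second-level readings for one participant over two days.
hr_demo <- data.table(
  id = c(1, 1, 1, 1),
  date = as.Date(c("2016-04-12", "2016-04-12", "2016-04-13", "2016-04-13")),
  value = c(90L, 100L, 60L, 80L)
)

# One row per participant and day with mean, max and min heart rate.
hr_daily_demo <- hr_demo[, .(mean_hrt = mean(value),
                             max_hrt = max(value),
                             min_hrt = min(value)),
                         by = .(id, date)]

hr_daily_demo
```

Both spellings of by group on the same columns; the .() form simply avoids parsing a string.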

The data was then checked for duplicates, missing values and unusual values.

# Looking for any duplication or missing values

# No duplicates 
activity_daily_cln %>% 
  get_dupes() %>% 
  summarise(count(id))
## [1] count(id)
## <0 rows> (or 0-length row.names)
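# Multiple duplicates in the heart rate data.
# Caution: date_time was parsed with %H together with %p, which does not apply AM/PM, so AM and PM
# readings can end up with the same time string. Some rows flagged here and in the METs check below
# may therefore be distinct readings rather than true duplicates.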
heartrate_seconds_cln_v2 %>% 
  get_dupes() %>% 
  summarise(count(id))
##    count(id).x count(id).freq
## 1   2022484408            260
## 2   2347167796           1862
## 3   4020332650            876
## 4   4388161847           3218
## 5   4558609924            878
## 6   5553957443           4684
## 7   5577150313           1508
## 8   6117666160            940
## 9   6775888955             28
## 10  6962181067           2490
## 11  7007744171            462
## 12  8792009665           1142
## 13  8877689391            320
# Remove the duplicates from heart rate data

heartrate_seconds_cln_v3 <- heartrate_seconds_cln_v2 %>% 
  distinct()

# Duplicates observed and thus need to be removed
mets_minute_cln_v2 %>% 
  get_dupes() %>% 
  summarise(count(id))
##    count(id).x count(id).freq
## 1   1503960366          10572
## 2   1624580081          33966
## 3   1644430081          17290
## 4   1844505072          28874
## 5   1927972279          35982
## 6   2022484408          10120
## 7   2026352035          11290
## 8   2320127002          23562
## 9   2347167796           4908
## 10  2873212765          25368
## 11  3372868164          15662
## 12  3977333714          18014
## 13  4020332650          30258
## 14  4057192912           4142
## 15  4319703577          17532
## 16  4388161847          10206
## 17  4445114986          13766
## 18  4558609924           9598
## 19  4702921684          16180
## 20  5553957443          18544
## 21  5577150313           9424
## 22  6117666160          14316
## 23  6290855005          26456
## 24  6775888955          29664
## 25  6962181067          12778
## 26  7007744171           8978
## 27  7086361926          15902
## 28  8053475328          16742
## 29  8253242879          20226
## 30  8378563200          13924
## 31  8583815059          22228
## 32  8792009665          25810
## 33  8877689391           6882
# Remove duplicates from METs data

mets_minute_cln_v3 <- mets_minute_cln_v2 %>% 
  distinct() 

# Duplicates observed in the data frame below

sleep_daily_cln %>% 
  get_dupes() %>% 
  summarise(count(id))
##   count(id).x count(id).freq
## 1  4388161847              2
## 2  4702921684              2
## 3  8378563200              2
# Remove duplicates from sleep data

sleep_daily_cln_v2 <- sleep_daily_cln %>% 
  distinct()

# Final step of the cleaning process is to summarize the min and sec data into daily data in order to use for the daily insights.

heartrate_seconds_cln_dt <- data.table(heartrate_seconds_cln_v3)
heartrate_seconds_cln_dt_2 <- heartrate_seconds_cln_dt[,list(mean_hrt = mean(value), max_hrt = max(value), min_hrt = min(value)), by = c("id,date")]
  
  
mets_minute_cln_dt <- data.table(mets_minute_cln_v3)
mets_minute_cln_dt_2 <- mets_minute_cln_dt[,list(mean_mets = mean(mets), max_mets = max(mets), min_mets = min(mets)), by = c("id,date")]


# Using the summary function you can quickly identify if there are unusual numbers  
heartrate_seconds_cln_dt_2 %>% 
  summary()
##        id                 date               mean_hrt         max_hrt     
##  Min.   :2.022e+09   Min.   :2016-04-12   Min.   : 59.38   Min.   : 80.0  
##  1st Qu.:4.388e+09   1st Qu.:2016-04-19   1st Qu.: 70.49   1st Qu.:125.0  
##  Median :5.577e+09   Median :2016-04-26   Median : 77.50   Median :135.5  
##  Mean   :5.565e+09   Mean   :2016-04-26   Mean   : 78.63   Mean   :138.7  
##  3rd Qu.:6.962e+09   3rd Qu.:2016-05-03   3rd Qu.: 84.93   3rd Qu.:153.0  
##  Max.   :8.878e+09   Max.   :2016-05-12   Max.   :109.79   Max.   :203.0  
##     min_hrt     
##  Min.   :36.00  
##  1st Qu.:48.00  
##  Median :52.00  
##  Mean   :52.69  
##  3rd Qu.:56.00  
##  Max.   :71.00
mets_minute_cln_dt_2 %>% 
  summary()
##        id                 date              mean_mets        max_mets    
##  Min.   :1.504e+09   Min.   :2016-04-12   Min.   :10.00   Min.   : 10.0  
##  1st Qu.:2.320e+09   1st Qu.:2016-04-19   1st Qu.:13.54   1st Qu.: 58.0  
##  Median :4.445e+09   Median :2016-04-26   Median :15.73   Median : 74.0  
##  Mean   :4.847e+09   Mean   :2016-04-26   Mean   :15.53   Mean   : 73.6  
##  3rd Qu.:6.962e+09   3rd Qu.:2016-05-04   3rd Qu.:17.54   3rd Qu.: 93.0  
##  Max.   :8.878e+09   Max.   :2016-05-12   Max.   :26.85   Max.   :157.0  
##     min_mets     
##  Min.   : 0.000  
##  1st Qu.:10.000  
##  Median :10.000  
##  Mean   : 9.931  
##  3rd Qu.:10.000  
##  Max.   :10.000
activity_daily_cln %>% 
  summary()
##        id                 date             total_steps    total_distance  
##  Min.   :1.504e+09   Min.   :2016-04-12   Min.   :    0   Min.   : 0.000  
##  1st Qu.:2.320e+09   1st Qu.:2016-04-19   1st Qu.: 3790   1st Qu.: 2.620  
##  Median :4.445e+09   Median :2016-04-26   Median : 7406   Median : 5.245  
##  Mean   :4.855e+09   Mean   :2016-04-26   Mean   : 7638   Mean   : 5.490  
##  3rd Qu.:6.962e+09   3rd Qu.:2016-05-04   3rd Qu.:10727   3rd Qu.: 7.713  
##  Max.   :8.878e+09   Max.   :2016-05-12   Max.   :36019   Max.   :28.030  
##  tracker_distance logged_activities_distance very_active_distance
##  Min.   : 0.000   Min.   :0.0000             Min.   : 0.000      
##  1st Qu.: 2.620   1st Qu.:0.0000             1st Qu.: 0.000      
##  Median : 5.245   Median :0.0000             Median : 0.210      
##  Mean   : 5.475   Mean   :0.1082             Mean   : 1.503      
##  3rd Qu.: 7.710   3rd Qu.:0.0000             3rd Qu.: 2.053      
##  Max.   :28.030   Max.   :4.9421             Max.   :21.920      
##  moderately_active_distance light_active_distance sedentary_active_distance
##  Min.   :0.0000             Min.   : 0.000        Min.   :0.000000         
##  1st Qu.:0.0000             1st Qu.: 1.945        1st Qu.:0.000000         
##  Median :0.2400             Median : 3.365        Median :0.000000         
##  Mean   :0.5675             Mean   : 3.341        Mean   :0.001606         
##  3rd Qu.:0.8000             3rd Qu.: 4.782        3rd Qu.:0.000000         
##  Max.   :6.4800             Max.   :10.710        Max.   :0.110000         
##  very_active_minutes fairly_active_minutes lightly_active_minutes
##  Min.   :  0.00      Min.   :  0.00        Min.   :  0.0         
##  1st Qu.:  0.00      1st Qu.:  0.00        1st Qu.:127.0         
##  Median :  4.00      Median :  6.00        Median :199.0         
##  Mean   : 21.16      Mean   : 13.56        Mean   :192.8         
##  3rd Qu.: 32.00      3rd Qu.: 19.00        3rd Qu.:264.0         
##  Max.   :210.00      Max.   :143.00        Max.   :518.0         
##  sedentary_minutes    calories   
##  Min.   :   0.0    Min.   :   0  
##  1st Qu.: 729.8    1st Qu.:1828  
##  Median :1057.5    Median :2134  
##  Mean   : 991.2    Mean   :2304  
##  3rd Qu.:1229.5    3rd Qu.:2793  
##  Max.   :1440.0    Max.   :4900
sleep_daily_cln_v2 %>% 
  summary()
##        id                 date            total_sleep_records
##  Min.   :1.504e+09   Min.   :2016-04-12   Min.   :1.00       
##  1st Qu.:3.977e+09   1st Qu.:2016-04-19   1st Qu.:1.00       
##  Median :4.703e+09   Median :2016-04-27   Median :1.00       
##  Mean   :4.995e+09   Mean   :2016-04-26   Mean   :1.12       
##  3rd Qu.:6.962e+09   3rd Qu.:2016-05-04   3rd Qu.:1.00       
##  Max.   :8.792e+09   Max.   :2016-05-12   Max.   :3.00       
##  total_minutes_asleep total_time_in_bed
##  Min.   : 58.0        Min.   : 61.0    
##  1st Qu.:361.0        1st Qu.:403.8    
##  Median :432.5        Median :463.0    
##  Mean   :419.2        Mean   :458.5    
##  3rd Qu.:490.0        3rd Qu.:526.0    
##  Max.   :796.0        Max.   :961.0
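
The duplicate and missing-value checks can also be sketched in base R on a toy data frame. The values below are made up, and unique() is used here as the base-R counterpart of distinct().

```r
# Toy sleep log with one exact duplicate row and one missing value.
sleep_demo <- data.frame(
  id = c(1, 1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-12", "2016-04-13", "2016-04-12")),
  total_minutes_asleep = c(327L, 327L, NA, 412L)
)

# Rows that repeat an earlier row exactly.
n_duplicates <- sum(duplicated(sleep_demo))

# Missing values per column.
n_missing <- colSums(is.na(sleep_demo))

# Keep the first copy of each row.
sleep_demo_unique <- unique(sleep_demo)

n_duplicates
n_missing
nrow(sleep_demo_unique)
```

An explicit count of missing values per column complements summary(), which only reports NA counts for columns that contain them.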

With the cleaning process completed, the data was ready for analysis.

Analyzing the data for key insights

Descriptive statistics of the Daily Activity and Sleep Data:

# Analyze phase of the analysis 

# Starting with descriptive statistics for daily activity data

activity_daily_cln_dt <- data.table(activity_daily_cln)
  
activity_daily_cln_dt_2 <- activity_daily_cln_dt[,list(total_active_minutes = sum(very_active_minutes, fairly_active_minutes, lightly_active_minutes)), by = c("id,date")]

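activity_daily_cln_dt_3 <- activity_daily_cln_dt %>% 
  inner_join(activity_daily_cln_dt_2, by = c("id", "date"))

# Take the weekday from the joined table's own date column so the values stay aligned with its rows
activity_daily_cln_dt_3$weekday <- weekdays(activity_daily_cln_dt_3$date)

activity_daily_mets_cln_merge <- activity_daily_cln_dt_3 %>% 
  inner_join(mets_minute_cln_dt_2, by = c("id", "date"))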

tibble(activity_daily_mets_cln_merge)
## # A tibble: 934 x 20
##            id date       total_steps total_distance tracker_distance
##         <dbl> <date>           <int>          <dbl>            <dbl>
##  1 1503960366 2016-04-12       13162           8.5              8.5 
##  2 1503960366 2016-04-13       10735           6.97             6.97
##  3 1503960366 2016-04-14       10460           6.74             6.74
##  4 1503960366 2016-04-15        9762           6.28             6.28
##  5 1503960366 2016-04-16       12669           8.16             8.16
##  6 1503960366 2016-04-17        9705           6.48             6.48
##  7 1503960366 2016-04-18       13019           8.59             8.59
##  8 1503960366 2016-04-19       15506           9.88             9.88
##  9 1503960366 2016-04-20       10544           6.68             6.68
## 10 1503960366 2016-04-21        9819           6.34             6.34
## # ... with 924 more rows, and 15 more variables:
## #   logged_activities_distance <dbl>, very_active_distance <dbl>,
## #   moderately_active_distance <dbl>, light_active_distance <dbl>,
## #   sedentary_active_distance <dbl>, very_active_minutes <int>,
## #   fairly_active_minutes <int>, lightly_active_minutes <int>,
## #   sedentary_minutes <int>, calories <int>, total_active_minutes <int>,
## #   weekday <chr>, mean_mets <dbl>, max_mets <int>, min_mets <int>
activity_daily_mets_cln_merge %>% 
  select(total_steps, total_distance, tracker_distance, logged_activities_distance, very_active_distance, moderately_active_distance, light_active_distance, sedentary_active_distance, very_active_minutes, fairly_active_minutes, lightly_active_minutes, sedentary_minutes, calories, total_active_minutes) %>% 
  summary()
##   total_steps    total_distance   tracker_distance logged_activities_distance
##  Min.   :    0   Min.   : 0.000   Min.   : 0.000   Min.   :0.0000            
##  1st Qu.: 3843   1st Qu.: 2.655   1st Qu.: 2.655   1st Qu.:0.0000            
##  Median : 7447   Median : 5.275   Median : 5.275   Median :0.0000            
##  Mean   : 7686   Mean   : 5.524   Mean   : 5.510   Mean   :0.1089            
##  3rd Qu.:10734   3rd Qu.: 7.720   3rd Qu.: 7.718   3rd Qu.:0.0000            
##  Max.   :36019   Max.   :28.030   Max.   :28.030   Max.   :4.9421            
##  very_active_distance moderately_active_distance light_active_distance
##  Min.   : 0.000       Min.   :0.0000             Min.   : 0.000       
##  1st Qu.: 0.000       1st Qu.:0.0000             1st Qu.: 1.962       
##  Median : 0.220       Median :0.2450             Median : 3.385       
##  Mean   : 1.512       Mean   :0.5712             Mean   : 3.362       
##  3rd Qu.: 2.090       3rd Qu.:0.8000             3rd Qu.: 4.790       
##  Max.   :21.920       Max.   :6.4800             Max.   :10.710       
##  sedentary_active_distance very_active_minutes fairly_active_minutes
##  Min.   :0.000000          Min.   :  0.0       Min.   :  0.00       
##  1st Qu.:0.000000          1st Qu.:  0.0       1st Qu.:  0.00       
##  Median :0.000000          Median :  4.0       Median :  7.00       
##  Mean   :0.001617          Mean   : 21.3       Mean   : 13.65       
##  3rd Qu.:0.000000          3rd Qu.: 32.0       3rd Qu.: 19.00       
##  Max.   :0.110000          Max.   :210.0       Max.   :143.00       
##  lightly_active_minutes sedentary_minutes    calories    total_active_minutes
##  Min.   :  0.0          Min.   :   0.0    Min.   : 120   Min.   :  0.0       
##  1st Qu.:129.0          1st Qu.: 730.0    1st Qu.:1837   1st Qu.:149.2       
##  Median :199.0          Median :1057.0    Median :2148   Median :248.5       
##  Mean   :194.0          Mean   : 991.3    Mean   :2318   Mean   :229.0       
##  3rd Qu.:264.8          3rd Qu.:1226.0    3rd Qu.:2796   3rd Qu.:318.0       
##  Max.   :518.0          Max.   :1440.0    Max.   :4900   Max.   :552.0
# Descriptive statistics for sleep data 

tibble(sleep_daily_cln_v2)
## # A tibble: 410 x 5
##            id date       total_sleep_records total_minutes_asl~ total_time_in_b~
##         <dbl> <date>                   <int>              <int>            <int>
##  1 1503960366 2016-04-12                   1                327              346
##  2 1503960366 2016-04-13                   2                384              407
##  3 1503960366 2016-04-15                   1                412              442
##  4 1503960366 2016-04-16                   2                340              367
##  5 1503960366 2016-04-17                   1                700              712
##  6 1503960366 2016-04-19                   1                304              320
##  7 1503960366 2016-04-20                   1                360              377
##  8 1503960366 2016-04-21                   1                325              364
##  9 1503960366 2016-04-23                   1                361              384
## 10 1503960366 2016-04-24                   1                430              449
## # ... with 400 more rows
sleep_daily_cln_dt <- data.table(sleep_daily_cln_v2)

sleep_daily_cln_dt_2 <- sleep_daily_cln_dt[,list(mean_minutes_asleep = mean(total_minutes_asleep), mean_minutes_in_bed = mean(total_time_in_bed)), by = c("id")]

sleep_daily_cln_dt_3 <- sleep_daily_cln_dt_2[,list(mean_minutes_not_sleeping = mean_minutes_in_bed - mean_minutes_asleep, mean_hour_asleep = mean_minutes_asleep / 60, mean_hour_in_bed = mean_minutes_in_bed / 60), by = c("id")]

sleep_daily_cln_dt_4 <- sleep_daily_cln_dt_2 %>% 
  inner_join(sleep_daily_cln_dt_3)

# ID 1844505072 is suspicious: they averaged about 16 hours in bed and almost 11 hours asleep, which is unusual. ID 2320127002 averaged only about 1 hour asleep, which also seems unusual. These two anomalies may be due to the users' devices being incorrectly configured.

tibble(sleep_daily_cln_dt_4)
## # A tibble: 24 x 6
##         id mean_minutes_asl~ mean_minutes_in~ mean_minutes_not~ mean_hour_asleep
##      <dbl>             <dbl>            <dbl>             <dbl>            <dbl>
##  1  1.50e9              360.             383.              22.9             6.00
##  2  1.64e9              294              346               52               4.9 
##  3  1.84e9              652              961              309              10.9 
##  4  1.93e9              417              438.              20.8             6.95
##  5  2.03e9              506.             538.              31.5             8.44
##  6  2.32e9               61               69                8               1.02
##  7  2.35e9              447.             491.              44.5             7.45
##  8  3.98e9              294.             461.             168.              4.89
##  9  4.02e9              349.             380.              30.4             5.82
## 10  4.32e9              477.             502.              25.3             7.94
## # ... with 14 more rows, and 1 more variable: mean_hour_in_bed <dbl>
sleep_daily_cln_v2 %>% 
  select(total_sleep_records, total_minutes_asleep, total_time_in_bed) %>% 
  summary()
##  total_sleep_records total_minutes_asleep total_time_in_bed
##  Min.   :1.00        Min.   : 58.0        Min.   : 61.0    
##  1st Qu.:1.00        1st Qu.:361.0        1st Qu.:403.8    
##  Median :1.00        Median :432.5        Median :463.0    
##  Mean   :1.12        Mean   :419.2        Mean   :458.5    
##  3rd Qu.:1.00        3rd Qu.:490.0        3rd Qu.:526.0    
##  Max.   :3.00        Max.   :796.0        Max.   :961.0
sleep_daily_cln_dt_4 %>% 
  select(mean_minutes_asleep, mean_minutes_in_bed, mean_minutes_not_sleeping, mean_hour_asleep, mean_hour_in_bed) %>% 
  summary()
##  mean_minutes_asleep mean_minutes_in_bed mean_minutes_not_sleeping
##  Min.   : 61.0       Min.   : 69.0       Min.   :  3.00           
##  1st Qu.:336.3       1st Qu.:377.1       1st Qu.: 18.13           
##  Median :417.2       Median :446.0       Median : 24.18           
##  Mean   :377.4       Mean   :419.9       Mean   : 42.48           
##  3rd Qu.:449.3       3rd Qu.:487.3       3rd Qu.: 33.93           
##  Max.   :652.0       Max.   :961.0       Max.   :309.00           
##  mean_hour_asleep mean_hour_in_bed
##  Min.   : 1.017   Min.   : 1.150  
##  1st Qu.: 5.605   1st Qu.: 6.284  
##  Median : 6.954   Median : 7.434  
##  Mean   : 6.291   Mean   : 6.999  
##  3rd Qu.: 7.488   3rd Qu.: 8.121  
##  Max.   :10.867   Max.   :16.017
# The negative correlation between sedentary minutes and calories burned is expected; the strong correlation between very active minutes and calories used is the more interesting observation

correlation_activity <- activity_daily_mets_cln_merge %>% 
  summarise(cor(total_active_minutes, calories), cor(very_active_minutes, calories), cor(fairly_active_minutes, calories), cor(lightly_active_minutes, calories), cor(sedentary_minutes, calories), cor(mean_mets, calories), cor(max_mets, calories), cor(min_mets, calories))

weekday_total_act_min <- activity_daily_cln_dt_3[,list(weekday_mean_total_act_min = mean(total_active_minutes), weekday_max_total_act_min = max(total_active_minutes), weekday_min_total_act_min = min(total_active_minutes)), by = c("weekday")] %>% 
  arrange(weekday_mean_total_act_min)

The observed trends and relationships were as follows:

1. Very active minutes and calories used were strongly correlated.
2. Sedentary minutes and calories burned were negatively correlated.
3. The heart rate and sleep data were incomplete, which suggests that users use these functions only occasionally.
4. Saturday was the most active day of the week.
5. The data established that activity tracking is the primary use; therefore both the 'Bellabeats Leaf' and 'Bellabeats Time' would be sufficient.

Visualizing the data insights

Visualizing the most active weekdays illustrated the days of the week on which activity levels were generally higher. Saturday was the most active day and Sunday the least active. Traditionally, people work from Monday to Friday and have more free time on Saturday and Sunday. A plausible explanation is that people are more active on Saturdays and rest on Sundays before the next work week starts.
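The weekday comparison can be reproduced with a small aggregation. The following is a minimal base-R sketch on made-up numbers, not the report's data: `toy_activity` stands in for `activity_daily_cln_dt_3`, and `weekday_num`, `day_labels` and `weekday_means` are illustrative names. It uses the ISO weekday number from `format(date, "%u")` rather than `weekdays()`, so the result does not depend on the system locale and the days stay in calendar order.

```r
# Toy data standing in for activity_daily_cln_dt_3 (values are made up).
toy_activity <- data.frame(
  date = as.Date("2016-04-11") + 0:13,  # two full weeks, starting on a Monday
  total_active_minutes = c(230, 250, 240, 220, 235, 300, 180,
                           210, 260, 245, 225, 240, 310, 170)
)

# ISO weekday number (1 = Monday ... 7 = Sunday) is independent of the locale.
toy_activity$weekday_num <- as.integer(format(toy_activity$date, "%u"))
day_labels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")
toy_activity$weekday <- factor(day_labels[toy_activity$weekday_num],
                               levels = day_labels)

# Mean total active minutes per weekday, kept in calendar order.
weekday_means <- aggregate(total_active_minutes ~ weekday,
                           data = toy_activity, FUN = mean)

# Most and least active day in the toy data.
most_active <- as.character(
  weekday_means$weekday[which.max(weekday_means$total_active_minutes)])
least_active <- as.character(
  weekday_means$weekday[which.min(weekday_means$total_active_minutes)])
```

Storing the weekday as an ordered factor also keeps any later bar chart in Monday-to-Sunday order instead of alphabetical order.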

There is a strong correlation between very active minutes and calories used in a day, as seen in the graph below:

Total active minutes and calorie usage are also correlated, although it is interesting to note that the correlation is not as strong as with very active minutes.
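To compare how strongly each activity measure tracks calories, the correlations can be computed side by side and sorted. The sketch below is an illustration under assumptions: `cor_with_calories` is a hypothetical helper, and the `toy` data frame contains made-up values chosen only to show a positive and a negative relationship, so the coefficients it produces say nothing about the FitBit data.

```r
# Hypothetical helper: Pearson correlation of every other column with `target`,
# returned as a named vector sorted from most positive to most negative.
cor_with_calories <- function(df, target = "calories") {
  predictors <- setdiff(names(df), target)
  out <- sapply(predictors, function(col) cor(df[[col]], df[[target]]))
  sort(out, decreasing = TRUE)
}

# Made-up values for illustration only.
toy <- data.frame(
  very_active_minutes = c(0, 5, 10, 20, 30, 45, 60),
  sedentary_minutes   = c(1300, 1200, 1150, 1000, 950, 800, 700),
  calories            = c(1700, 1850, 1900, 2200, 2400, 2800, 3100)
)

cors <- cor_with_calories(toy)
print(round(cors, 2))
```

Applied to the merged activity table, the same helper would need the non-numeric columns (such as `date` and `weekday`) dropped first, because `cor()` only accepts numeric input.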

FitBit makes use of METs (metabolic equivalents of task), a metric of activity intensity. The graph below illustrates that their algorithm awards more MET points to more vigorous activities, and thus the correlation between METs and calories is stronger than the correlation between total active minutes and calories:

Total steps show a strong correlation with calories burned during the day, which is intuitive: the more you walk, the more calories you burn.

With a mean of 6.3 hours of sleep per day, the group slept less than the 7 to 9 hours recommended by the CDC for adults.
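The comparison against the 7 to 9 hour range can also be made per participant rather than on the group mean. The sketch below is only an illustration: it uses the ten `mean_hour_asleep` values printed in the participant table earlier (the remaining 14 participants are not shown there), and `sleep_band` is an illustrative name.

```r
# Mean hours asleep for the first ten participants, as printed earlier.
mean_hour_asleep <- c(6.00, 4.90, 10.9, 6.95, 8.44, 1.02, 7.45, 4.89, 5.82, 7.94)

# Classify each participant against the 7 to 9 hour recommendation.
sleep_band <- ifelse(mean_hour_asleep < 7, "below 7 h",
              ifelse(mean_hour_asleep <= 9, "7 to 9 h", "above 9 h"))

# Count participants per band: 6 below, 3 within, 1 above for these ten values.
band_counts <- table(sleep_band)
print(band_counts)
```

Run on the full `sleep_daily_cln_dt_4` table, the same classification would show how many of the 24 participants with sleep records fall short, which is more informative than a single mean that the two anomalous IDs can distort.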

Recommendation based on the Analysis

The CDC recommends 22 minutes per day of moderate activity. The group analyzed had a mean of 13.65 fairly active (moderate) minutes per day and a mean of 21.3 very active minutes per day, as shown in the summary above. Combined, this means that the group achieved the goal of 22 minutes set out by the CDC.
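The arithmetic behind this comparison can be written out as a short check. The sketch below uses the means printed in the summary output earlier (13.65 fairly active, 21.3 very active and 194.0 lightly active minutes) and assumes that the 22-minute figure is the CDC's 150 minutes per week spread over seven days and rounded up; the variable names are illustrative.

```r
# Assumption: the daily target is the 150 min/week guideline divided by 7, rounded up.
weekly_target <- 150
daily_target  <- ceiling(weekly_target / 7)            # 22 minutes per day

# Group means taken from the summary output of the merged activity data.
mean_fairly_active  <- 13.65
mean_very_active    <- 21.3
mean_lightly_active <- 194.0

# Moderate-to-vigorous minutes per day and the comparison with the target.
mean_mod_vig <- mean_fairly_active + mean_very_active   # 34.95
meets_target <- mean_mod_vig >= daily_target            # TRUE

# Share of all active minutes that is light intensity (about 0.85).
share_light <- mean_lightly_active / (mean_lightly_active + mean_mod_vig)
```

The last line puts a number on the observation from the project summary: although the target is met on average, roughly 85% of the group's active minutes are light intensity.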

The sleep data provides an opportunity for the app developers at Bellabeats: they can add functionality that informs users when they have had less than the 7 to 9 hours of sleep recommended by the CDC. Additionally, a follow-up survey could be sent to the participants to understand why they did not consistently sleep with their fitness device.

Based on the analysis, the fitness devices used during data collection were primarily used for daily activity tracking, and the two products that the Bellabeats marketing team can focus on are the ‘Bellabeats Leaf’ and ‘Bellabeats Time’. Based on my experience with fitness devices, one of the best features is personalizing your fitness goals with your device. With that in mind, the app can focus on rewarding vigorous activities, especially if the product is being used for weight loss. Similarly, Bellabeats can implement an activity metric such as METs or activity minutes that aligns with the CDC recommendations, rewarding users once they have achieved the minimum of 22 minutes of moderate activity and increasing the reward points as they go past the 22-minute mark.
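One way such a reward rule could look is sketched below. Everything in it is hypothetical: the function name `reward_points`, the base of 10 points, the one point per extra minute and the double weight for very active minutes are illustrative choices, not part of any Bellabeats product or of the analysis above.

```r
# Hypothetical scoring rule; thresholds and point values are illustrative only.
reward_points <- function(fairly_active_minutes, very_active_minutes,
                          daily_target = 22, base_points = 10,
                          bonus_per_minute = 1, vigorous_weight = 2) {
  # Count very active minutes more heavily than fairly active minutes.
  weighted_minutes <- fairly_active_minutes + vigorous_weight * very_active_minutes

  # No reward until the daily target is reached.
  if (weighted_minutes < daily_target) {
    return(0)
  }

  # Base reward at the target, plus a bonus for every weighted minute beyond it.
  base_points + bonus_per_minute * (weighted_minutes - daily_target)
}

reward_points(10, 5)    # 20 weighted minutes, below target: 0 points
reward_points(22, 0)    # exactly on target: 10 points
reward_points(10, 20)   # 50 weighted minutes: 10 + 28 = 38 points
```

Weighting vigorous minutes more heavily follows the finding that very active minutes correlate most strongly with calories used, which matters most for users whose goal is weight loss.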