Google Data Analytics Certificate Capstone: Exploratory Analysis of Smart Fitness Tracker Data


Scenario

You are a junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. You have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights you discover will then help guide marketing strategy for the company. You will present your analysis to the Bellabeat executive team along with your high-level recommendations for Bellabeat’s marketing strategy.

Sršen, Bellabeat’s cofounder and Chief Creative Officer, knows that an analysis of Bellabeat’s available consumer data would reveal more opportunities for growth. She has asked the marketing analytics team to focus on a Bellabeat product and analyze smart device usage data in order to gain insight into how people are already using their smart devices. Then, using this information, she would like high-level recommendations for how these trends can inform Bellabeat marketing strategy.

Stakeholders

  1. Primary
    • Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
    • Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team
  2. Secondary
    • Bellabeat marketing analytics team

Bellabeat Products

  • Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.

  • Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.

  • Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.

  • Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.

  • Bellabeat membership: Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.

Deliverables

I will produce a report with the following deliverables:

  1. A clear summary of the business task 
  2. A description of all data sources used 
  3. Documentation of any cleaning or manipulation of data 
  4. A summary of the analysis 
  5. Supporting visualizations and key findings 
  6. Top high-level content recommendations based on the analysis 

ASK

Business Task

  1. Analyze smart device usage data in order to gain insight into how consumers use smart devices 
  2. Apply insights obtained to improve future marketing strategies for a Bellabeat product 

Guiding questions

  1. What are some trends in smart device usage? 
  2. How could these trends apply to Bellabeat customers? 
  3. How could these trends help influence Bellabeat marketing strategy? 
  4. What metrics will you use to measure the data to achieve the objective? 
  5. Who are the stakeholders? 
  6. Who is your audience for this analysis and how does this affect your analysis process and presentation? 
  7. How will the insights obtained from this analysis help Bellabeat stakeholders improve their marketing strategy? 

PREPARE

Dataset

FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius) is a public dataset that explores smart device users’ daily habits. It contains personal fitness tracker data from thirty (30) eligible Fitbit users who consented to the submission of their personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits. A third-party data service provider, Fitabase LLC (San Diego, California), aggregated the self-tracker data.

The data set is cited as:
Furberg, R., Brinton, J., Keating, M., & Ortiz, A. (2016). Crowd-sourced Fitbit datasets 03.12.2016-05.12.2016 [Data set]. Zenodo. https://doi.org/10.5281/zenodo.53894

Limitations of the Dataset

  1. The specific population sampled may not be generalizable to other populations. However, this population is familiar with the online environment and therefore may be more adept at performing tasks with technology, thus making feasibility of the protocol administration more likely to be successful. 
  2. Individuals who can afford and use Fitbit devices are more likely to be younger (between the ages of 18 and 34 years) and affluent, thus impacting generalizability. 
  3. The data was collected in 2016, so it may not reflect current smart device usage. 

Setting Up Environment in R

RStudio will be used for data exploration, data cleaning, transformation, analysis and visualisation. 

The required R packages are installed and loaded:

install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("here")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("skimr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("janitor")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("reshape2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library("tidyverse")
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.0 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library("here")
## here() starts at /cloud/project
library("skimr")
library("janitor")
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library("lubridate")
## Loading required package: timechange
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library("reshape2")
## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths

The datasets were downloaded and stored appropriately.

For this analysis, the following datasets were imported and used:

● the 'dailyActivity_merged' dataset
● the 'sleepDay_merged' dataset
● the 'weightLogInfo_merged' dataset

I have excluded datasets whose data are already present in the “dailyActivity_merged” table.

Loading the datasets:
setwd("/cloud/project/Google Capstone")

dailyActivity_DF <-
  read.csv("dataset/dailyActivity_merged.csv")

sleepDay_DF <-
  read.csv("dataset/sleepDay_merged.csv")

weightLogInfo_DF <-
  read.csv("dataset/weightLogInfo_merged.csv")

PROCESS

Each dataset was previewed and checked for errors, missing values, and consistent naming.

Details Observed For Each Dataset

● dailyActivity_DF
head(dailyActivity_DF)
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366    4/12/2016      13162          8.50            8.50
## 2 1503960366    4/13/2016      10735          6.97            6.97
## 3 1503960366    4/14/2016      10460          6.74            6.74
## 4 1503960366    4/15/2016       9762          6.28            6.28
## 5 1503960366    4/16/2016      12669          8.16            8.16
## 6 1503960366    4/17/2016       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728
skim_without_charts(dailyActivity_DF)
Data summary
Name dailyActivity_DF
Number of rows 940
Number of columns 15
_______________________
Column type frequency:
character 1
numeric 14
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ActivityDate 0 1 8 9 0 31 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
Id 0 1 4.855407e+09 2.424805e+09 1503960366 2.320127e+09 4.445115e+09 6.962181e+09 8.877689e+09
TotalSteps 0 1 7.637910e+03 5.087150e+03 0 3.789750e+03 7.405500e+03 1.072700e+04 3.601900e+04
TotalDistance 0 1 5.490000e+00 3.920000e+00 0 2.620000e+00 5.240000e+00 7.710000e+00 2.803000e+01
TrackerDistance 0 1 5.480000e+00 3.910000e+00 0 2.620000e+00 5.240000e+00 7.710000e+00 2.803000e+01
LoggedActivitiesDistance 0 1 1.100000e-01 6.200000e-01 0 0.000000e+00 0.000000e+00 0.000000e+00 4.940000e+00
VeryActiveDistance 0 1 1.500000e+00 2.660000e+00 0 0.000000e+00 2.100000e-01 2.050000e+00 2.192000e+01
ModeratelyActiveDistance 0 1 5.700000e-01 8.800000e-01 0 0.000000e+00 2.400000e-01 8.000000e-01 6.480000e+00
LightActiveDistance 0 1 3.340000e+00 2.040000e+00 0 1.950000e+00 3.360000e+00 4.780000e+00 1.071000e+01
SedentaryActiveDistance 0 1 0.000000e+00 1.000000e-02 0 0.000000e+00 0.000000e+00 0.000000e+00 1.100000e-01
VeryActiveMinutes 0 1 2.116000e+01 3.284000e+01 0 0.000000e+00 4.000000e+00 3.200000e+01 2.100000e+02
FairlyActiveMinutes 0 1 1.356000e+01 1.999000e+01 0 0.000000e+00 6.000000e+00 1.900000e+01 1.430000e+02
LightlyActiveMinutes 0 1 1.928100e+02 1.091700e+02 0 1.270000e+02 1.990000e+02 2.640000e+02 5.180000e+02
SedentaryMinutes 0 1 9.912100e+02 3.012700e+02 0 7.297500e+02 1.057500e+03 1.229500e+03 1.440000e+03
Calories 0 1 2.303610e+03 7.181700e+02 0 1.828500e+03 2.134000e+03 2.793250e+03 4.900000e+03
sum(duplicated(dailyActivity_DF))
## [1] 0
n_distinct(dailyActivity_DF$Id)
## [1] 33
sapply(dailyActivity_DF, function(x) n_distinct(x))
##                       Id             ActivityDate               TotalSteps 
##                       33                       31                      842 
##            TotalDistance          TrackerDistance LoggedActivitiesDistance 
##                      615                      613                       19 
##       VeryActiveDistance ModeratelyActiveDistance      LightActiveDistance 
##                      333                      211                      491 
##  SedentaryActiveDistance        VeryActiveMinutes      FairlyActiveMinutes 
##                        9                      122                       81 
##     LightlyActiveMinutes         SedentaryMinutes                 Calories 
##                      335                      549                      734

This dataframe is in long format with 940 rows and 15 columns. Variable names are consistent and meaningful.
There are no missing values and no duplicate entries in this dataframe.
The “ActivityDate” column is stored as character and will be converted to the Date datatype. This dataframe has records of 33 different participants (three more than the thirty stated in the dataset description), collected over a maximum of 31 days.

● sleepDay_DF
head(sleepDay_DF)
##           Id              SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304
##   TotalTimeInBed
## 1            346
## 2            407
## 3            442
## 4            367
## 5            712
## 6            320
skim_without_charts(sleepDay_DF)
Data summary
Name sleepDay_DF
Number of rows 413
Number of columns 5
_______________________
Column type frequency:
character 1
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
SleepDay 0 1 20 21 0 31 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
Id 0 1 5.000979e+09 2.06036e+09 1503960366 3977333714 4702921684 6962181067 8792009665
TotalSleepRecords 0 1 1.120000e+00 3.50000e-01 1 1 1 1 3
TotalMinutesAsleep 0 1 4.194700e+02 1.18340e+02 58 361 433 490 796
TotalTimeInBed 0 1 4.586400e+02 1.27100e+02 61 403 463 526 961
sum(duplicated(sleepDay_DF))
## [1] 3
n_distinct(sleepDay_DF$Id)
## [1] 24
sapply(sleepDay_DF, function(x) n_distinct(x))
##                 Id           SleepDay  TotalSleepRecords TotalMinutesAsleep 
##                 24                 31                  3                256 
##     TotalTimeInBed 
##                242

This dataframe is in long format with 413 rows and 5 columns. Variable names are consistent and meaningful.
There are no missing values, but 3 duplicate entries were observed (these duplicates will be removed).
Sleep data was recorded for 24 participants over a maximum of 31 days. The “SleepDay” column is stored as character and will be converted to the Date datatype.

● weightLogInfo_DF
head(weightLogInfo_DF)
##           Id                  Date WeightKg WeightPounds Fat   BMI
## 1 1503960366  5/2/2016 11:59:59 PM     52.6     115.9631  22 22.65
## 2 1503960366  5/3/2016 11:59:59 PM     52.6     115.9631  NA 22.65
## 3 1927972279  4/13/2016 1:08:52 AM    133.5     294.3171  NA 47.54
## 4 2873212765 4/21/2016 11:59:59 PM     56.7     125.0021  NA 21.45
## 5 2873212765 5/12/2016 11:59:59 PM     57.3     126.3249  NA 21.69
## 6 4319703577 4/17/2016 11:59:59 PM     72.4     159.6147  25 27.45
##   IsManualReport        LogId
## 1           True 1.462234e+12
## 2           True 1.462320e+12
## 3          False 1.460510e+12
## 4           True 1.461283e+12
## 5           True 1.463098e+12
## 6           True 1.460938e+12
skim_without_charts(weightLogInfo_DF)
Data summary
Name weightLogInfo_DF
Number of rows 67
Number of columns 8
_______________________
Column type frequency:
character 2
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Date 0 1 19 21 0 56 0
IsManualReport 0 1 4 5 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
Id 0 1.00 7.009282e+09 1.950322e+09 1.503960e+09 6.962181e+09 6.962181e+09 8.877689e+09 8.877689e+09
WeightKg 0 1.00 7.204000e+01 1.392000e+01 5.260000e+01 6.140000e+01 6.250000e+01 8.505000e+01 1.335000e+02
WeightPounds 0 1.00 1.588100e+02 3.070000e+01 1.159600e+02 1.353600e+02 1.377900e+02 1.875000e+02 2.943200e+02
Fat 65 0.03 2.350000e+01 2.120000e+00 2.200000e+01 2.275000e+01 2.350000e+01 2.425000e+01 2.500000e+01
BMI 0 1.00 2.519000e+01 3.070000e+00 2.145000e+01 2.396000e+01 2.439000e+01 2.556000e+01 4.754000e+01
LogId 0 1.00 1.461772e+12 7.829948e+08 1.460444e+12 1.461079e+12 1.461802e+12 1.462375e+12 1.463098e+12
sum(duplicated(weightLogInfo_DF))
## [1] 0
n_distinct(weightLogInfo_DF$Id)
## [1] 8
sapply(weightLogInfo_DF, function(x) n_distinct(x))
##             Id           Date       WeightKg   WeightPounds            Fat 
##              8             56             34             34              3 
##            BMI IsManualReport          LogId 
##             36              2             56

This dataframe is in long format with 67 rows and 8 columns. Variable names are consistent and meaningful.
The “Fat” column has 65 missing values out of 67 rows (a complete rate of just 2.99%), so this column will be dropped.
No duplicate entries were observed.
The “Date” column is stored as character and will be converted to the Date datatype.
Weight data was recorded for only 8 participants.
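
The complete rate reported by skim can also be checked directly from the share of missing values per column. The following is a minimal sketch, assuming a small made-up data frame; `toy_weight`, its values, and the 50% cut-off are illustrative assumptions, not part of the dataset or of the analysis above.

```r
# Toy stand-in for a weight log; the values are made up for illustration.
toy_weight <- data.frame(
  WeightKg = c(52.6, 52.6, 133.5, 56.7),
  Fat      = c(22, NA, NA, NA)
)

# Share of missing values per column (0 = complete, 1 = entirely missing).
missing_share <- colMeans(is.na(toy_weight))
print(missing_share)

# Keep only the columns whose missing share is at most 50%
# (the 50% cut-off is an arbitrary choice for this sketch).
kept <- toy_weight[, missing_share <= 0.5, drop = FALSE]
print(names(kept))
```

A rule like this makes the decision to drop a sparse column explicit and repeatable across datasets.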


Transforming the Data

The sleepDay_DF
  • I made a copy (sleepData) and removed duplicates. 
  • I separated the “SleepDay” column into “Date”, “SleepTime” and “DayStatus” columns and then dropped the “SleepTime” and “DayStatus” columns. 
  • I changed the “Id” column datatype to character and the “Date” column datatype to Date. 
# remove duplicates 
sleepData <- sleepDay_DF[!duplicated(sleepDay_DF),]

# split the SleepDay column into Date, SleepTime and DayStatus, then drop the last two
sleepData <-
  sleepData %>%
  separate(SleepDay, into = c("Date", "SleepTime", "DayStatus"), sep = " ") %>%
  subset(select = -c(SleepTime,DayStatus))

# Change datatype
class(sleepData$Id) = "character"
sleepData$Date <- mdy(sleepData$Date)
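
As an alternative to splitting the column first, the date can be read straight from the original string. This is a minimal sketch using base R only, assuming strings in the same shape as the “SleepDay” values shown above; `sleep_day` and `sleep_date` are illustrative names, and this is not the method used in the analysis.

```r
# Example strings in the same shape as the SleepDay column.
sleep_day <- c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM")

# as.Date() reads only as much of each string as the format describes,
# so the trailing time-of-day text is ignored.
sleep_date <- as.Date(sleep_day, format = "%m/%d/%Y")
print(sleep_date)
print(class(sleep_date))
```

This only suits columns where the time of day carries no information, as is the case for “SleepDay”, where every value is midnight in the preview above.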
The weightLogInfo_DF
  • I made a copy (weightData). 
  • I dropped the “Fat” column. 
  • I separated the “Date” column into “Date”, “Time” and “DayStatus” columns and then dropped the “Time” and “DayStatus” columns. 
  • I changed the “Id” column datatype to character and the “Date” column datatype to Date. 
  • I created a new column “WeightStatus” that assigns each record to a weight group based on its BMI value: Underweight, Normal Weight, Overweight or Obese. 
weightData <-
  weightLogInfo_DF %>%
  subset(select = -Fat) %>%
  separate(Date, into = c("Date", "Time", "DayStatus"), sep = " ") %>%
  subset(select = -c(Time,DayStatus))%>%
  mutate(WeightStatus = case_when(BMI > 29.9 ~ 'Obese', 
                                  BMI > 24.9 ~ 'Overweight',
                                  BMI < 18.5 ~ 'Underweight',
                                  TRUE ~ 'Normal Weight'))

# Change datatype
class(weightData$Id) = "character"
weightData$Date <- mdy(weightData$Date)
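
The same kind of grouping can be written with explicit interval boundaries. The sketch below uses `cut()` and assumes the commonly cited BMI cut-offs of 18.5, 25 and 30; these differ slightly from the 24.9 and 29.9 thresholds in the code above for values that fall in between (for example 24.95). The vector `bmi` mixes values from the preview with made-up ones.

```r
# Example BMI values; 17.9 and 24.95 are made up to exercise the boundaries.
bmi <- c(17.9, 21.45, 24.95, 27.45, 47.54)

# right = FALSE makes each interval closed on the left, e.g. [18.5, 25).
weight_status <- cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Normal Weight", "Overweight", "Obese"),
  right  = FALSE
)

print(data.frame(bmi, weight_status))
```

Stating the breaks in one vector makes it easy to see exactly which interval a borderline value belongs to.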
The dailyActivity_DF
  • I made a copy (activityData). 
  • I changed the “Id” column datatype to character and the “ActivityDate” column datatype to Date. 
  • I renamed the “ActivityDate” column to “Date”. 
  • I created new columns: “Day” (the day of the week) and “TotalActiveMinutes” (the sum of the “VeryActiveMinutes”, “FairlyActiveMinutes” and “LightlyActiveMinutes” columns). 
  • I dropped observations with a zero value for “TotalSteps” or “TotalDistance”. 
  • I dropped the entire “LoggedActivitiesDistance” column, since only about 4% of observations have values other than zero. 
activityData <- dailyActivity_DF %>% 
  # Filter on the piped data's own columns so the row mask always matches the rows being filtered
  filter(TotalSteps != 0, TotalDistance != 0) %>%
  select(-LoggedActivitiesDistance)

# Change datatype
class(activityData$Id) = "character"
activityData$ActivityDate <- mdy(activityData$ActivityDate)


activityData <-
  activityData %>%
  rename("Date" = ActivityDate)

# Create new columns
activityData <- 
  activityData %>%
  mutate(Day = weekdays(Date),
         # columns are referred to by name; mutate() looks them up in the piped data
         TotalActiveMinutes = VeryActiveMinutes + 
                              FairlyActiveMinutes + 
                              LightlyActiveMinutes) %>%
  drop_na()
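
The filtering and column-creation steps can be tried on a small example before running them on the full table. This is a minimal sketch, assuming the dplyr package is installed; `toy_activity` is a made-up data frame with two real-looking rows and two all-zero rows.

```r
library(dplyr)

# Toy stand-in for the daily activity table; two rows have no recorded activity.
toy_activity <- data.frame(
  TotalSteps           = c(13162, 0, 10460, 0),
  TotalDistance        = c(8.50, 0, 6.74, 0),
  VeryActiveMinutes    = c(25, 0, 30, 0),
  FairlyActiveMinutes  = c(13, 0, 11, 0),
  LightlyActiveMinutes = c(328, 0, 181, 0)
)

cleaned <- toy_activity %>%
  # Both conditions are evaluated on the rows currently in the pipe.
  filter(TotalSteps != 0, TotalDistance != 0) %>%
  # Sum the three activity levels into one total per day.
  mutate(TotalActiveMinutes = VeryActiveMinutes +
                              FairlyActiveMinutes +
                              LightlyActiveMinutes)

print(cleaned)
```

Because `filter()` and `mutate()` look up column names in the data they receive, the conditions always line up with the rows being processed, whatever earlier steps removed.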
The fitBitData_merged
fitBitData_merged <-
  merge(sleepData,activityData, by=c("Id", "Date"), all = TRUE)

I created this dataframe by combining the activityData and the sleepData dataframes, based on the “Id” and “Date” columns.

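The summary below reports a long-format dataframe with 823 rows and 19 columns, with consistent and meaningful variable names. Because the merge keeps unmatched rows from both sides (all = TRUE), the three columns originating from the sleepData dataframe each have 413 missing values, and the columns originating from the activityData dataframe each have 27. There are no duplicate entries in this dataframe. It has records of 33 different participants, spanning 31 days (2016-04-12 to 2016-05-12).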
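
The effect of `all = TRUE` is easiest to see on a small example. The sketch below uses base R `merge()` on two made-up tables (`toy_sleep` and `toy_activity` are illustrative names and values) and compares a full join with the default inner join.

```r
# Toy tables keyed by Id and Date; only one (Id, Date) pair appears in both.
toy_sleep <- data.frame(
  Id   = c("a", "a", "b"),
  Date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  TotalMinutesAsleep = c(327, 384, 400)
)
toy_activity <- data.frame(
  Id   = c("a", "a", "c"),
  Date = as.Date(c("2016-04-12", "2016-04-14", "2016-04-12")),
  TotalSteps = c(13162, 10460, 5000)
)

# Full join: unmatched rows from either side are kept and padded with NA.
full_joined <- merge(toy_sleep, toy_activity, by = c("Id", "Date"), all = TRUE)

# Inner join (the default): only pairs present in both tables are kept.
inner_joined <- merge(toy_sleep, toy_activity, by = c("Id", "Date"))

print(full_joined)
print(colSums(is.na(full_joined)))
print(nrow(inner_joined))
```

Counting the NA values per column after a full join shows how many rows had no partner in the other table, which is a useful check before any imputation.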

head(fitBitData_merged)
##           Id       Date TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1 1503960366 2016-04-12                 1                327            346
## 2 1503960366 2016-04-13                 2                384            407
## 3 1503960366 2016-04-14                NA                 NA             NA
## 4 1503960366 2016-04-15                 1                412            442
## 5 1503960366 2016-04-16                 2                340            367
## 6 1503960366 2016-04-17                 1                700            712
##   TotalSteps TotalDistance TrackerDistance VeryActiveDistance
## 1      13162          8.50            8.50               1.88
## 2      10735          6.97            6.97               1.57
## 3      10460          6.74            6.74               2.44
## 4       9762          6.28            6.28               2.14
## 5      12669          8.16            8.16               2.71
## 6       9705          6.48            6.48               3.19
##   ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
## 1                     0.55                6.06                       0
## 2                     0.69                4.71                       0
## 3                     0.40                3.91                       0
## 4                     1.26                2.83                       0
## 5                     0.41                5.04                       0
## 6                     0.78                2.51                       0
##   VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
## 1                25                  13                  328              728
## 2                21                  19                  217              776
## 3                30                  11                  181             1218
## 4                29                  34                  209              726
## 5                36                  10                  221              773
## 6                38                  20                  164              539
##   Calories       Day TotalActiveMinutes
## 1     1985   Tuesday                366
## 2     1797 Wednesday                257
## 3     1776  Thursday                222
## 4     1745    Friday                272
## 5     1863  Saturday                267
## 6     1728    Sunday                222
skim_without_charts(fitBitData_merged)
Data summary
Name fitBitData_merged
Number of rows 823
Number of columns 19
_______________________
Column type frequency:
character 2
Date 1
numeric 16
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Id 0 1.00 10 10 0 33 0
Day 27 0.97 6 9 0 7 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
Date 0 1 2016-04-12 2016-05-12 2016-04-26 31

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
TotalSleepRecords 413 0.50 1.12 0.35 1 1.00 1.00 1.00 3.00
TotalMinutesAsleep 413 0.50 419.17 118.64 58 361.00 432.50 490.00 796.00
TotalTimeInBed 413 0.50 458.48 127.46 61 403.75 463.00 526.00 961.00
TotalSteps 27 0.97 8375.21 4787.27 4 4932.00 7974.50 11136.25 36019.00
TotalDistance 27 0.97 6.04 3.78 0 3.38 5.59 7.97 28.03
TrackerDistance 27 0.97 6.02 3.76 0 3.38 5.59 7.91 28.03
VeryActiveDistance 27 0.97 1.69 2.82 0 0.00 0.41 2.29 21.92
ModeratelyActiveDistance 27 0.97 0.63 0.93 0 0.00 0.31 0.87 6.48
LightActiveDistance 27 0.97 3.63 1.84 0 2.34 3.55 4.88 10.71
SedentaryActiveDistance 27 0.97 0.00 0.01 0 0.00 0.00 0.00 0.11
VeryActiveMinutes 27 0.97 23.63 34.42 0 0.00 7.00 36.00 210.00
FairlyActiveMinutes 27 0.97 14.96 20.73 0 0.00 8.00 21.00 143.00
LightlyActiveMinutes 27 0.97 209.27 95.77 0 147.00 206.00 268.25 518.00
SedentaryMinutes 27 0.97 953.12 280.83 0 721.75 1018.50 1190.25 1440.00
Calories 27 0.97 2366.88 721.52 52 1850.50 2195.50 2859.25 4900.00
TotalActiveMinutes 27 0.97 247.87 104.22 0 182.75 257.00 321.00 552.00
sum(duplicated(fitBitData_merged))
## [1] 0
n_distinct(fitBitData_merged$Id)
## [1] 33
sapply(fitBitData_merged, function(x) n_distinct(x))
##                       Id                     Date        TotalSleepRecords 
##                       33                       31                        4 
##       TotalMinutesAsleep           TotalTimeInBed               TotalSteps 
##                      257                      243                      779 
##            TotalDistance          TrackerDistance       VeryActiveDistance 
##                      588                      586                      321 
## ModeratelyActiveDistance      LightActiveDistance  SedentaryActiveDistance 
##                      207                      466                       10 
##        VeryActiveMinutes      FairlyActiveMinutes     LightlyActiveMinutes 
##                      123                       82                      325 
##         SedentaryMinutes                 Calories                      Day 
##                      523                      669                        8 
##       TotalActiveMinutes 
##                      356
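  • I replaced the missing values in the merged dataframe with 0. Most of them are in the “TotalSleepRecords”, “TotalMinutesAsleep”, & “TotalTimeInBed” columns; note that the call below replaces every missing value in the dataframe, so the 27 missing entries in each activity column and in “Day” are set to 0 as well.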
fitBitData_merged <-
  fitBitData_merged %>%
  replace(is.na(fitBitData_merged), 0)
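
Replacing missing values with 0 changes any average computed afterwards, because a day without a sleep record is then counted as a day with zero sleep. A minimal sketch of the difference, using a made-up vector (`minutes_asleep` is illustrative):

```r
# Three days of sleep minutes; the second day has no record.
minutes_asleep <- c(420, NA, 480)

# Option 1: treat the missing day as zero minutes of sleep.
zero_filled <- replace(minutes_asleep, is.na(minutes_asleep), 0)
mean_zero_filled <- mean(zero_filled)

# Option 2: keep the NA and skip it when averaging.
mean_na_removed <- mean(minutes_asleep, na.rm = TRUE)

print(mean_zero_filled)
print(mean_na_removed)
```

Which option is appropriate depends on the question being asked; for per-night sleep duration, skipping unrecorded nights avoids pulling the average towards zero.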

ANALYSE

Quick summary statistics of the data:
activityData %>%  
  select(TotalSteps,
         TotalDistance,
         VeryActiveDistance,
         ModeratelyActiveDistance,
         LightActiveDistance,
         SedentaryActiveDistance,
         VeryActiveMinutes,
         FairlyActiveMinutes,
         LightlyActiveMinutes,
         SedentaryMinutes,
         Calories) %>%
  summary()
##    TotalSteps    TotalDistance    VeryActiveDistance ModeratelyActiveDistance
##  Min.   :    4   Min.   : 0.000   Min.   : 0.000     Min.   :0.0000          
##  1st Qu.: 4932   1st Qu.: 3.380   1st Qu.: 0.000     1st Qu.:0.0000          
##  Median : 7974   Median : 5.590   Median : 0.415     Median :0.3100          
##  Mean   : 8375   Mean   : 6.036   Mean   : 1.687     Mean   :0.6288          
##  3rd Qu.:11136   3rd Qu.: 7.973   3rd Qu.: 2.292     3rd Qu.:0.8700          
##  Max.   :36019   Max.   :28.030   Max.   :21.920     Max.   :6.4800          
##  LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
##  Min.   : 0.000      Min.   :0.000000        Min.   :  0.00   
##  1st Qu.: 2.337      1st Qu.:0.000000        1st Qu.:  0.00   
##  Median : 3.550      Median :0.000000        Median :  7.00   
##  Mean   : 3.628      Mean   :0.001796        Mean   : 23.63   
##  3rd Qu.: 4.880      3rd Qu.:0.000000        3rd Qu.: 36.00   
##  Max.   :10.710      Max.   :0.110000        Max.   :210.00   
##  FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes    Calories   
##  Min.   :  0.00      Min.   :  0.0        Min.   :   0.0   Min.   :  52  
##  1st Qu.:  0.00      1st Qu.:147.0        1st Qu.: 721.8   1st Qu.:1850  
##  Median :  8.00      Median :206.0        Median :1018.5   Median :2196  
##  Mean   : 14.96      Mean   :209.3        Mean   : 953.1   Mean   :2367  
##  3rd Qu.: 21.00      3rd Qu.:268.2        3rd Qu.:1190.2   3rd Qu.:2859  
##  Max.   :143.00      Max.   :518.0        Max.   :1440.0   Max.   :4900
sleepData %>%  
  select(TotalSleepRecords,
         TotalMinutesAsleep,
         TotalTimeInBed) %>%
  summary()
##  TotalSleepRecords TotalMinutesAsleep TotalTimeInBed 
##  Min.   :1.00      Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:1.00      1st Qu.:361.0      1st Qu.:403.8  
##  Median :1.00      Median :432.5      Median :463.0  
##  Mean   :1.12      Mean   :419.2      Mean   :458.5  
##  3rd Qu.:1.00      3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :3.00      Max.   :796.0      Max.   :961.0
weightData %>%  
  select(WeightKg,
         WeightPounds,
         BMI) %>%
  summary()
##     WeightKg       WeightPounds        BMI       
##  Min.   : 52.60   Min.   :116.0   Min.   :21.45  
##  1st Qu.: 61.40   1st Qu.:135.4   1st Qu.:23.96  
##  Median : 62.50   Median :137.8   Median :24.39  
##  Mean   : 72.04   Mean   :158.8   Mean   :25.19  
##  3rd Qu.: 85.05   3rd Qu.:187.5   3rd Qu.:25.56  
##  Max.   :133.50   Max.   :294.3   Max.   :47.54
Aggregated Values

I aggregated the dataset to obtain average values for each participant.

weightAggregated <-
  weightData %>%
  group_by(Id) %>%
  summarise(BodyWeightKg = mean(WeightKg),
            BMIv = mean(BMI)) %>%
  mutate(WeightStatus = case_when(BMIv > 29.9 ~ 'Obese', 
                                  BMIv> 24.9 ~ 'Overweight',
                                  BMIv < 18.5 ~ 'Underweight',
                                  TRUE ~ 'Normal Weight'))


sleepAggregated <-
  sleepData %>%
  group_by(Id) %>%
  summarise(SleepRecord = sum(TotalSleepRecords),
            TotalHoursAsleep = (sum(TotalMinutesAsleep)/60),
            AveHoursAsleep = (mean(TotalMinutesAsleep)/60),
            AveTimeInBed = (mean(TotalTimeInBed)/60)) %>%
  mutate(NatureofSleep = case_when(AveHoursAsleep < 7 ~ 'Poor Sleep',
                                   TRUE ~ 'Good Sleep'))


activityAggregated <-
  activityData %>%
  group_by(Id) %>%
  summarise(AverageSteps = mean(TotalSteps),
            AverageDistance = mean(TotalDistance),
            VeryActiveHours = (mean(VeryActiveMinutes)/60),
            FairlyActiveHours = (mean(FairlyActiveMinutes)/60),
            LightlyActiveHours = (mean(LightlyActiveMinutes)/60),
            AveSedentaryHours = (mean(SedentaryMinutes)/60),
            TotalActiveHours = (VeryActiveHours + FairlyActiveHours + LightlyActiveHours),
            AveCaloriesSpent = mean(Calories))


aggregatedData <-
  list(sleepAggregated, weightAggregated, activityAggregated) %>%
  reduce(full_join, by="Id")

head(aggregatedData)
## # A tibble: 6 × 17
##   Id       Sleep…¹ Total…² AveHo…³ AveTi…⁴ Natur…⁵ BodyW…⁶  BMIv Weigh…⁷ Avera…⁸
##   <chr>      <int>   <dbl>   <dbl>   <dbl> <chr>     <dbl> <dbl> <chr>     <dbl>
## 1 1503960…      27  150.      6.00    6.39 Poor S…    52.6  22.6 Normal…  12521.
## 2 1644430…       4   19.6     4.9     5.77 Poor S…    NA    NA   <NA>      7283.
## 3 1844505…       3   32.6    10.9    16.0  Good S…    NA    NA   <NA>      3622.
## 4 1927972…       8   34.8     6.95    7.30 Poor S…   134.   47.5 Obese     1669.
## 5 2026352…      28  236.      8.44    8.96 Good S…    NA    NA   <NA>      5567.
## 6 2320127…       1    1.02    1.02    1.15 Poor S…    NA    NA   <NA>      4717.
## # … with 7 more variables: AverageDistance <dbl>, VeryActiveHours <dbl>,
## #   FairlyActiveHours <dbl>, LightlyActiveHours <dbl>, AveSedentaryHours <dbl>,
## #   TotalActiveHours <dbl>, AveCaloriesSpent <dbl>, and abbreviated variable
## #   names ¹​SleepRecord, ²​TotalHoursAsleep, ³​AveHoursAsleep, ⁴​AveTimeInBed,
## #   ⁵​NatureofSleep, ⁶​BodyWeightKg, ⁷​WeightStatus, ⁸​AverageSteps
  • The average number of daily steps was 8375, which is above the recommended 6000–8000 daily steps. About 70% of participants reached the recommended quota.
  • The average very active time was approximately 24 minutes per day. This falls short of the recommended 30 minutes of daily moderate-intensity aerobic activity. About 25% of participants reached this quota.
  • The average sedentary time was 953.1 minutes, i.e. about 16 hours.
  • The average time spent asleep was 419.2 minutes, i.e. about 7 hours, which sits at the lower end of the 7 to 9 hours of sleep recommended for adults. About 48% of the participants whose sleep records were captured had a good sleep duration.
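The percentage figures in the bullets above are of the form "share of participants at or above a threshold", which can be computed by averaging a logical comparison. The sketch below uses base R and an invented vector rather than the real `aggregatedData`. The name `avg_steps` and the 6000-step cut-off (the lower bound of the recommended range) are assumptions made for illustration.

```r
# Hypothetical per-participant average daily steps (invented values)
avg_steps <- c(12500, 7300, 3600, 1700, 5600, 4700, 9100, 10400, 6800, 8200)

# Assumed threshold: lower bound of the 6000-8000 recommended range
step_goal <- 6000

# The comparison yields TRUE/FALSE per participant;
# mean() of a logical vector is the proportion of TRUE values
share_meeting_goal <- mean(avg_steps >= step_goal, na.rm = TRUE)

# Express the share as a whole-number percentage
pct_meeting_goal <- round(share_meeting_goal * 100)
print(pct_meeting_goal)
```

On the real data the same pattern would be applied to a column such as `aggregatedData$AverageSteps`, with `na.rm = TRUE` guarding against participants who have no value.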
Checking for a relationship between day of the week and participants' activities:
ggplot(data = activityData) +
  aes(x = (Day), y = TotalSteps) +
  geom_col(fill =  'blue') +
  labs(x = 'Day of week', y = 'Total Steps', title = 'Total steps per Day')

ggplot(data = activityData) +
  aes(x = (Day), y = TotalActiveMinutes) +
  geom_col(fill =  'blue') +
  labs(x = 'Day of week',
       y = 'Time Active (Minutes)',
       title = 'Total Active Minutes per Day')

ggplot(data = activityData) +
  aes(x = (Day), y = Calories) +
  geom_col(fill =  'blue') +
  labs(x = 'Day of week',
       y = 'Calories Expended',
       title = 'Daily Calories Expended')

From the plots, the highest totals were recorded on the following days (in descending order): Tuesdays, Wednesdays, Thursdays and Saturdays. Sundays had the lowest totals.
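Because `geom_col()` stacks every row that shares an x value, the bar heights above are sums, so they depend on how many records fall on each day as well as on how active participants were. A minimal base R sketch of how the two effects could be separated is shown below. The data frame `toy_activity` and its values are invented; only the column names `Day` and `TotalSteps` mirror the ones used above.

```r
# Toy stand-in for activityData (values are invented for illustration)
toy_activity <- data.frame(
  Day        = c("Tue", "Tue", "Tue", "Sun", "Sun"),
  TotalSteps = c(8000, 9000, 7000, 10000, 12000)
)

# Sum of steps per day: this is what stacked geom_col() bars display
steps_sum <- aggregate(TotalSteps ~ Day, data = toy_activity, FUN = sum)

# Mean steps per day: does not grow just because a day has more records
steps_mean <- aggregate(TotalSteps ~ Day, data = toy_activity, FUN = mean)

# Number of records contributing to each day
records_per_day <- table(toy_activity$Day)

print(steps_sum)
print(steps_mean)
print(records_per_day)
```

In this toy data Tuesday has the larger sum while Sunday has the larger mean, which is the kind of difference worth checking before interpreting the day-of-week plots.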

Checking the relationship between calories expended and steps taken (TotalSteps), time active (TotalActiveMinutes) and time in a sedentary state (SedentaryMinutes).
# Showing relationships for Calorie expenditure

ggplot(data = activityData) +
  aes(x= TotalActiveMinutes, y = Calories) +
  geom_point(color = 'orange') +
  geom_smooth() +
  labs(x = 'Time Active (minutes)',
       y = 'Calories Expended',
       title = 'Calories Expended vs Time Active')
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(data = activityData) +
  aes(x= TotalSteps, y = Calories) +
  geom_point(color = 'orange') +
  geom_smooth() +
  labs(x = 'Total Steps',
       y = 'Calories Expended',
       title = 'Calories Expended vs Total Steps')
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(data = activityData) +
  aes(x= SedentaryMinutes, y = Calories) +
  geom_point(color = 'orange') +
  geom_smooth() +
  labs(x = 'Time Sedentary (minutes)',
       y = 'Calories Expended',
       title = 'Calories Expended vs Time Sedentary (minutes)')
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

In the plots of Calories against time active and steps taken, the trendline slopes upwards, indicating a positive relationship between calorie expenditure and physical activity.

In contrast, the trendline in the plot against time in a sedentary state slopes downwards, indicating a negative relationship: the more sedentary a participant tends to be, the fewer calories are expended.
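The direction of each trendline could also be summarised with a correlation coefficient. The sketch below is an illustration on invented data, not a result from the Fitbit dataset: `toy_activity` is a hypothetical data frame whose column names mirror `activityData`, and Pearson correlation via `cor()` is one possible choice of measure.

```r
# Toy stand-in for activityData (values are invented for illustration)
toy_activity <- data.frame(
  TotalSteps       = c(2000, 4000, 6000, 8000, 10000, 12000),
  SedentaryMinutes = c(1200, 1100, 1000, 900, 850, 700),
  Calories         = c(1700, 1900, 2100, 2400, 2600, 3000)
)

# Pearson correlation: the sign gives the direction of the linear
# relationship, the magnitude (0 to 1) gives its strength
cor_steps <- cor(toy_activity$TotalSteps, toy_activity$Calories,
                 use = "complete.obs")
cor_sedentary <- cor(toy_activity$SedentaryMinutes, toy_activity$Calories,
                     use = "complete.obs")

print(cor_steps)
print(cor_sedentary)
```

A positive coefficient corresponds to an upward trendline and a negative one to a downward trendline. The coefficient describes only linear association, so it complements the loess curves rather than replacing them.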

Showing the breakdown of participants' activities:
# analysing the participants' activities per time

VeryActiveMins <- sum(activityData$VeryActiveMinutes)
FairlyActiveMins <- sum(activityData$FairlyActiveMinutes)
LightlyActiveMins <- sum(activityData$LightlyActiveMinutes)
SedentaryMins <- sum(activityData$SedentaryMinutes)
total <- (VeryActiveMins + FairlyActiveMins + LightlyActiveMins + SedentaryMins)

VeryActive <- round((VeryActiveMins/total) * 100,0)
FairlyActive <- round((FairlyActiveMins / total)* 100,0)
LightlyActive <- round((LightlyActiveMins / total)* 100,0)
Sedentary <- round((SedentaryMins / total)* 100,0)

# One row per activity level, with total minutes across all records
DF <-
  data.frame(Activity = c("Very Active", "Fairly Active", "Lightly Active", "Sedentary"),
             Time = c(VeryActiveMins, FairlyActiveMins, LightlyActiveMins, SedentaryMins))

ggplot(data = DF) +
  geom_col() + 
  aes(x= (Time)/60, y= Activity) +
  labs(x = 'Time (Hours)',
       y = 'Activity Type',
       title = 'Time Spent per Activity Type')

The plot showed that participants spent about 79% of their time (approximately 12645 hours) in sedentary states, and only about 20% in active states.
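The percentage split quoted above can be derived in one step with `prop.table()`. The sketch below uses invented minute totals, so its output is not the dataset's actual split. The vector `minutes` and its names are hypothetical.

```r
# Hypothetical total minutes per activity level (invented for illustration)
minutes <- c(VeryActive = 200, FairlyActive = 100,
             LightlyActive = 1700, Sedentary = 8000)

# Share of total tracked time per level, as rounded percentages
pct <- round(prop.table(minutes) * 100)

# Combined share of the three active levels, summed before rounding
active_levels <- c("VeryActive", "FairlyActive", "LightlyActive")
active_pct <- round(sum(minutes[active_levels]) / sum(minutes) * 100)

print(pct)
print(active_pct)
```

Summing the active minutes before rounding, rather than adding three already-rounded percentages, keeps small rounding differences out of the combined figure.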

Analysing the sleep patterns of participants, and the sleep-time distribution
ggplot(data = sleepData) +
  geom_boxplot() +
  aes(y= TotalMinutesAsleep/60) +
  labs(y = 'Time Asleep (Hours)',
       title = 'Sleep Time Plot')

ggplot(data = sleepAggregated) +
  geom_bar() +
  aes(y= NatureofSleep) +
  labs(y = 'Nature of Sleep',
       x= 'Number of Participants',
       title = 'Nature of Sleep Plot')

These show a median (50th percentile) sleep duration of about 432 minutes (7.2 hours). Overall, 13 participants averaged less than 7 hours of sleep, and 11 participants averaged 7 hours or more.
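The two numbers read off the sleep plots, the median and the count per sleep category, can also be computed directly. The sketch below is a base R illustration on invented values. `avg_hours_asleep` is a hypothetical vector standing in for the `AveHoursAsleep` column, and the 7-hour rule is the same one used to build `NatureofSleep`.

```r
# Toy per-participant average sleep in hours (invented for illustration)
avg_hours_asleep <- c(6.0, 4.9, 10.9, 6.95, 8.44, 7.2, 7.0)

# The median is the 50th percentile
median_hours <- median(avg_hours_asleep)

# Same rule as NatureofSleep: under 7 hours is 'Poor Sleep'
sleep_label <- ifelse(avg_hours_asleep < 7, "Poor Sleep", "Good Sleep")

# Number of participants in each category
sleep_counts <- table(sleep_label)

print(median_hours)
print(sleep_counts)
```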

Analysing the weight distribution of participants
ggplot(data = weightData) +
  geom_boxplot() +
  aes(y= WeightKg) +
  labs(y = 'Weight (Kg)',
       title = 'Weight Plot')

ggplot(data = weightData) +
  geom_histogram() +
  aes(x= WeightKg) +
  labs(x = 'Weight (Kg)',
       y= 'Number of Participants',
       title = 'Weight Plot')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = weightData) +
  geom_bar() +
  aes(x= WeightStatus) +
  labs(x = 'Weight Status',
       y= 'Number of Participants',
       title = 'Weight Distribution Plot')

ggplot(data = weightAggregated) +
  geom_col() +
  aes(y= BodyWeightKg, x=Id) +
  labs(y = 'Weight (Kg)',
       x= 'Participants',
       title = 'Weight Distribution Plot')

ggplot(data = weightAggregated) +
  geom_col() +
  aes(y= BMIv, x=Id) +
  labs(x = 'Participants',
       y='Body Mass Index' ,
       title = 'Weight Distribution Plot')

These show a median (50th percentile) weight of about 62.5 kg.

Of the 8 participants who recorded their body weight:

3 had Body Mass Index (BMI) values between 18.5 and 24.9 (Normal weight);

4 had BMI values between 25 and 29.9 (Overweight);

1 had a BMI of 30 or above (Obese).
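The BMI bands listed above can also be expressed with base R's `cut()`, which leaves missing values as `NA` instead of assigning them to a default category. This is a sketch on invented values, not the author's method. It uses the conventional cut-offs of 18.5, 25 and 30, which can differ from the `> 24.9` and `> 29.9` comparisons in `case_when()` for values such as 24.95.

```r
# Hypothetical mean BMI values, including one missing value
bmi <- c(17.9, 22.6, 24.9, 25.0, 27.5, 29.9, 30.0, 47.5, NA)

# Bands: < 18.5 Underweight, 18.5 to < 25 Normal Weight,
#        25 to < 30 Overweight, >= 30 Obese
weight_status <- cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Normal Weight", "Overweight", "Obese"),
  right  = FALSE  # intervals are [lower, upper): 25 is Overweight, 30 is Obese
)

# Count participants per band, showing missing values separately
status_counts <- table(weight_status, useNA = "ifany")
print(status_counts)
```

By contrast, a `case_when()` whose final branch is `TRUE ~ 'Normal Weight'` would label a participant with a missing BMI as normal weight, so the `NA` handling is the main design difference here.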


Key Findings and Recommendations

Bellabeat has built its moat (i.e. competitive advantage) by developing technology and services that focus on women's health and well-being. It is therefore important that the company keeps upgrading and expanding its products to meet the needs of its users.

Out of the 30 users who took part in the survey, just 8 recorded their weight and 23 recorded sleep events. The poor compliance can be attributed to the following:

  1. Weight records: weight values have to be entered manually. To mitigate this, smart devices should be developed with a means of automated and flexible capture of users' body weight, for accurate and consistent data.
  2. Sleep records: users tend to charge their devices at bedtime. Improvements in power and battery specifications (e.g. a fast-charge feature, longer-lasting batteries) can reduce such gaps.

Programs aimed at empowering women with current information on health and wellness can be run via the Bellabeat app. The app can also provide information on how the various Bellabeat products can be used to achieve users' health and wellness goals.

The Leaf product can be marketed to users on the strength of its ability to track health status effectively (e.g. heart rate) and to highlight problem areas where lifestyle modifications or medical interventions might be needed (e.g. a poor level of activity, indicated by a low daily step count).

The Bellabeat membership product can be designed to encourage referrals among peers, with attractive incentives attached. It can also add a "Closest-Gyms-Near-You" feature to remind subscribers of the benefits of regular physical activity and exercise for general well-being.


References

  1. FitBit Fitness Tracker Data

  2. Furberg, R., Brinton, J., Keating, M., & Ortiz, A. (2016). Crowd-sourced Fitbit datasets 03.12.2016-05.12.2016 [Data set]. Zenodo. https://doi.org/10.5281/zenodo.53894

  3. Brinton, J., Keating, M., Ortiz, A., Evenson, K., & Furberg, R. (2017). Establishing Linkages Between Distributed Survey Responses and Consumer Wearable Device Datasets: A Pilot Protocol. JMIR Res Protoc, 6(4), e66. URL: https://www.researchprotocols.org/2017/4/e66 DOI: 10.2196/resprot.6513

  4. CDC sleep standards