Google Data Analytics Certificate Capstone: Exploratory Analysis of Smart Fitness Tracker Data

Scenario

You are a junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. You have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights you discover will then help guide marketing strategy for the company. You will present your analysis to the Bellabeat executive team along with your high-level recommendations for Bellabeat’s marketing strategy.

Sršen, Bellabeat’s cofounder and Chief Creative Officer, knows that an analysis of Bellabeat’s available consumer data would reveal more opportunities for growth. She has asked the marketing analytics team to focus on a Bellabeat product and analyze smart device usage data in order to gain insight into how people are already using their smart devices. Then, using this information, she would like high-level recommendations for how these trends can inform Bellabeat marketing strategy.

Stakeholders

Primary
- Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
- Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team
Secondary
- Bellabeat marketing analytics team

Bellabeat Products

Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.
Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.
Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.
Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.
Bellabeat membership: Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.

Deliverables

I will produce a report with the following deliverables:

A clear summary of the business task
A description of all data sources used
Documentation of any cleaning or manipulation of data
A summary of the analysis
Supporting visualizations and key findings
Top high-level content recommendations based on the analysis

ASK

Business Task

Analyze smart device usage data in order to gain insight into how consumers use smart devices
Apply insights obtained to improve future marketing strategies for a Bellabeat product

Guiding questions

What are some trends in smart device usage?
How could these trends apply to Bellabeat customers?
How could these trends help influence Bellabeat marketing strategy?
What metrics will you use to measure the data to achieve the objective?
Who are the stakeholders?
Who is your audience for this analysis and how does this affect your analysis process and presentation?
How will the insights obtained from this analysis help Bellabeat stakeholders improve their marketing strategy?

PREPARE

Dataset

FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius): a public dataset that explores smart device users’ daily habits. It contains personal fitness tracker data from thirty (30) Fitbit users who consented to the submission of their tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits. A third-party data service provider, Fitabase LLC (San Diego, California), aggregated the self-tracker data.

The data set is cited as:
Furberg, R., Brinton, J., Keating, M., & Ortiz, A. (2016). Crowd-sourced Fitbit datasets 03.12.2016-05.12.2016 [Data set]. Zenodo. https://doi.org/10.5281/zenodo.53894

Limitations of the Dataset

The specific population sampled may not be generalizable to other populations. However, this population is familiar with the online environment and therefore may be more adept at performing tasks with technology, thus making feasibility of the protocol administration more likely to be successful.
Individuals who can afford and use Fitbit devices are more likely to be younger (between the ages of 18 and 34 years) and affluent, thus impacting generalizability.
The data was generated in 2016, so it may not reflect current smart device usage.

Setting Up Environment in R

RStudio will be used for data exploration, data cleaning, transformation, analysis and visualisation.

All the required R packages are installed and loaded:

install.packages("tidyverse")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("here")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("skimr")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("janitor")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("reshape2")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

library("tidyverse")

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──

## ✔ ggplot2 3.4.0      ✔ purrr   1.0.0 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library("here")

## here() starts at /cloud/project

library("skimr")
library("janitor")

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library("lubridate")

## Loading required package: timechange
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library("reshape2")

## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths

The datasets were downloaded and stored appropriately.

For this analysis, the following datasets were imported and used:

● the 'dailyActivity_merged' dataset
● the 'sleepDay_merged' dataset
● the 'weightLogInfo_merged' dataset

I have excluded datasets whose data are already present in the “dailyActivity_merged” table.

Loading the datasets:

setwd("/cloud/project/Google Capstone")

dailyActivity_DF <-
  read.csv("dataset/dailyActivity_merged.csv")

sleepDay_DF <-
  read.csv("dataset/sleepDay_merged.csv")

weightLogInfo_DF <-
  read.csv("dataset/weightLogInfo_merged.csv")

PROCESS

To preview the data and check for errors, missing values and inconsistent naming:

Details Observed For Each Dataset

● dailyActivity_DF

head(dailyActivity_DF)

##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366    4/12/2016      13162          8.50            8.50
## 2 1503960366    4/13/2016      10735          6.97            6.97
## 3 1503960366    4/14/2016      10460          6.74            6.74
## 4 1503960366    4/15/2016       9762          6.28            6.28
## 5 1503960366    4/16/2016      12669          8.16            8.16
## 6 1503960366    4/17/2016       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728

skim_without_charts(dailyActivity_DF)

Data summary
Name	dailyActivity_DF
Number of rows	940
Number of columns	15
_______________________
Column type frequency:
character	1
numeric	14
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
ActivityDate	0	1	8	9	0	31	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100
Id	1	4.855407e+09	2.424805e+09	1503960366	2.320127e+09	4.445115e+09	6.962181e+09	8.877689e+09
TotalSteps	1	7.637910e+03	5.087150e+03	0	3.789750e+03	7.405500e+03	1.072700e+04	3.601900e+04
TotalDistance	1	5.490000e+00	3.920000e+00	0	2.620000e+00	5.240000e+00	7.710000e+00	2.803000e+01
TrackerDistance	1	5.480000e+00	3.910000e+00	0	2.620000e+00	5.240000e+00	7.710000e+00	2.803000e+01
LoggedActivitiesDistance	1	1.100000e-01	6.200000e-01	0	0.000000e+00	0.000000e+00	0.000000e+00	4.940000e+00
VeryActiveDistance	1	1.500000e+00	2.660000e+00	0	0.000000e+00	2.100000e-01	2.050000e+00	2.192000e+01
ModeratelyActiveDistance	1	5.700000e-01	8.800000e-01	0	0.000000e+00	2.400000e-01	8.000000e-01	6.480000e+00
LightActiveDistance	1	3.340000e+00	2.040000e+00	0	1.950000e+00	3.360000e+00	4.780000e+00	1.071000e+01
SedentaryActiveDistance	1	0.000000e+00	1.000000e-02	0	0.000000e+00	0.000000e+00	0.000000e+00	1.100000e-01
VeryActiveMinutes	1	2.116000e+01	3.284000e+01	0	0.000000e+00	4.000000e+00	3.200000e+01	2.100000e+02
FairlyActiveMinutes	1	1.356000e+01	1.999000e+01	0	0.000000e+00	6.000000e+00	1.900000e+01	1.430000e+02
LightlyActiveMinutes	1	1.928100e+02	1.091700e+02	0	1.270000e+02	1.990000e+02	2.640000e+02	5.180000e+02
SedentaryMinutes	1	9.912100e+02	3.012700e+02	0	7.297500e+02	1.057500e+03	1.229500e+03	1.440000e+03
Calories	1	2.303610e+03	7.181700e+02	0	1.828500e+03	2.134000e+03	2.793250e+03	4.900000e+03

sum(duplicated(dailyActivity_DF))

## [1] 0

n_distinct(dailyActivity_DF$Id)

## [1] 33

sapply(dailyActivity_DF, function(x) n_distinct(x))

##                       Id             ActivityDate               TotalSteps 
##                       33                       31                      842 
##            TotalDistance          TrackerDistance LoggedActivitiesDistance 
##                      615                      613                       19 
##       VeryActiveDistance ModeratelyActiveDistance      LightActiveDistance 
##                      333                      211                      491 
##  SedentaryActiveDistance        VeryActiveMinutes      FairlyActiveMinutes 
##                        9                      122                       81 
##     LightlyActiveMinutes         SedentaryMinutes                 Calories 
##                      335                      549                      734

This dataframe is in long format with 940 rows and 15 columns. Variable names are consistent and meaningful.
There are no missing values and no duplicate entries in this dataframe.
The “ActivityDate” column is stored as character and will be converted to the Date datatype. This dataframe has records of 33 different participants, and data was collected over a maximum of 31 days.
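The checks described above (duplicate rows, number of participants, length of the collection window) can be sketched in base R on a small toy data frame. The values and the name `activity_toy` are made up for illustration; only the column names follow the dataset.

```r
# Toy stand-in for dailyActivity_DF (hypothetical values)
activity_toy <- data.frame(
  Id           = c(1503960366, 1503960366, 1624580081, 1624580081),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016", "5/12/2016"),
  TotalSteps   = c(13162, 10735, 8163, 2971)
)

n_duplicates   <- sum(duplicated(activity_toy))    # fully repeated rows
n_participants <- length(unique(activity_toy$Id))  # distinct users
n_missing      <- colSums(is.na(activity_toy))     # NA count per column

# Convert the month/day/year strings to Date to measure the collection window
dates     <- as.Date(activity_toy$ActivityDate, format = "%m/%d/%Y")
span_days <- as.integer(max(dates) - min(dates)) + 1  # inclusive day count
```

Counting the span inclusively is what makes 12 April to 12 May come out as 31 days.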

● sleepDay_DF

head(sleepDay_DF)

##           Id              SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304
##   TotalTimeInBed
## 1            346
## 2            407
## 3            442
## 4            367
## 5            712
## 6            320

skim_without_charts(sleepDay_DF)

Data summary
Name	sleepDay_DF
Number of rows	413
Number of columns	5
_______________________
Column type frequency:
character	1
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
SleepDay	0	1	20	21	0	31	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100
Id	1	5.000979e+09	2.06036e+09	1503960366	3977333714	4702921684	6962181067	8792009665
TotalSleepRecords	1	1.120000e+00	3.50000e-01	1	1	1	1	3
TotalMinutesAsleep	1	4.194700e+02	1.18340e+02	58	361	433	490	796
TotalTimeInBed	1	4.586400e+02	1.27100e+02	61	403	463	526	961

sum(duplicated(sleepDay_DF))

## [1] 3

n_distinct(sleepDay_DF$Id)

## [1] 24

sapply(sleepDay_DF, function(x) n_distinct(x))

##                 Id           SleepDay  TotalSleepRecords TotalMinutesAsleep 
##                 24                 31                  3                256 
##     TotalTimeInBed 
##                242

This dataframe is in long format with 413 rows and 5 columns. Variable names are consistent and meaningful.
There are no missing values, but 3 duplicate entries were observed (these duplicates will be removed).
Here, data for 24 participants were recorded over a maximum of 31 days. The “SleepDay” column is stored as character and will be converted to the Date datatype.

● weightLogInfo_DF

head(weightLogInfo_DF)

##           Id                  Date WeightKg WeightPounds Fat   BMI
## 1 1503960366  5/2/2016 11:59:59 PM     52.6     115.9631  22 22.65
## 2 1503960366  5/3/2016 11:59:59 PM     52.6     115.9631  NA 22.65
## 3 1927972279  4/13/2016 1:08:52 AM    133.5     294.3171  NA 47.54
## 4 2873212765 4/21/2016 11:59:59 PM     56.7     125.0021  NA 21.45
## 5 2873212765 5/12/2016 11:59:59 PM     57.3     126.3249  NA 21.69
## 6 4319703577 4/17/2016 11:59:59 PM     72.4     159.6147  25 27.45
##   IsManualReport        LogId
## 1           True 1.462234e+12
## 2           True 1.462320e+12
## 3          False 1.460510e+12
## 4           True 1.461283e+12
## 5           True 1.463098e+12
## 6           True 1.460938e+12

skim_without_charts(weightLogInfo_DF)

Data summary
Name	weightLogInfo_DF
Number of rows	67
Number of columns	8
_______________________
Column type frequency:
character	2
numeric	6
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
Date	0	1	19	21	0	56	0
IsManualReport	0	1	4	5	0	2	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100
Id	0	1.00	7.009282e+09	1.950322e+09	1.503960e+09	6.962181e+09	6.962181e+09	8.877689e+09	8.877689e+09
WeightKg	0	1.00	7.204000e+01	1.392000e+01	5.260000e+01	6.140000e+01	6.250000e+01	8.505000e+01	1.335000e+02
WeightPounds	0	1.00	1.588100e+02	3.070000e+01	1.159600e+02	1.353600e+02	1.377900e+02	1.875000e+02	2.943200e+02
Fat	65	0.03	2.350000e+01	2.120000e+00	2.200000e+01	2.275000e+01	2.350000e+01	2.425000e+01	2.500000e+01
BMI	0	1.00	2.519000e+01	3.070000e+00	2.145000e+01	2.396000e+01	2.439000e+01	2.556000e+01	4.754000e+01
LogId	0	1.00	1.461772e+12	7.829948e+08	1.460444e+12	1.461079e+12	1.461802e+12	1.462375e+12	1.463098e+12

sum(duplicated(weightLogInfo_DF))

## [1] 0

n_distinct(weightLogInfo_DF$Id)

## [1] 8

sapply(weightLogInfo_DF, function(x) n_distinct(x))

##             Id           Date       WeightKg   WeightPounds            Fat 
##              8             56             34             34              3 
##            BMI IsManualReport          LogId 
##             36              2             56

This dataframe is in long format with 67 rows and 8 columns. Variable names are consistent and meaningful.
The “Fat” column has 65 missing values out of 67 rows (a complete rate of just 2.99%), so this column will be dropped.
No duplicate entries were observed.
The “Date” column is stored as character and will be converted to the Date datatype.
Weight data were recorded for only 8 participants.
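The complete rate reported by skimr is the share of non-missing values in a column, here 2 of 67 for “Fat”. A minimal sketch of that arithmetic, plus one way to drop sparse columns, is below. The 50% cutoff and the toy values are assumptions for illustration, not part of the original analysis.

```r
# Complete rate for "Fat": 67 rows, 65 of them missing
complete_rate_fat <- (67 - 65) / 67
round(100 * complete_rate_fat, 2)   # 2.99

# Toy data frame (hypothetical values) with one sparse column
weight_toy <- data.frame(
  WeightKg = c(52.6, 52.6, 133.5, 56.7),
  Fat      = c(22, NA, NA, NA),
  BMI      = c(22.65, 22.65, 47.54, 21.45)
)

# Share of non-missing values per column
complete_rate <- colMeans(!is.na(weight_toy))

# Keep only columns that are at least half complete (assumed threshold)
weight_kept <- weight_toy[, complete_rate >= 0.5, drop = FALSE]
names(weight_kept)
```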

To transform the data:

The sleepDay_DF

I made a copy (sleepData) and removed duplicates.
I separated the “SleepDay” column into individual Date, SleepTime and DayStatus columns and then dropped the SleepTime and DayStatus columns.
I changed the “Id” column datatype to character and the “Date” column datatype to Date.

# remove duplicates 
sleepData <- sleepDay_DF[!duplicated(sleepDay_DF),]

# split the SleepDay column into Date, SleepTime and DayStatus, then keep only Date
sleepData <-
  sleepData %>%
  separate(SleepDay, into = c("Date", "SleepTime", "DayStatus"), sep = " ") %>%
  subset(select = -c(SleepTime,DayStatus))

# Change datatype
class(sleepData$Id) = "character"
sleepData$Date <- mdy(sleepData$Date)
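As an alternative to `separate()`, the date can be pulled straight out of the timestamp string. The sketch below uses base R on toy rows (hypothetical values) and assumes every `SleepDay` value starts with a month/day/year date followed by a space.

```r
# Toy stand-in for sleepDay_DF with one exact duplicate row
sleep_toy <- data.frame(
  Id = c(1503960366, 1503960366, 1503960366),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/12/2016 12:00:00 AM",
               "4/13/2016 12:00:00 AM"),
  TotalMinutesAsleep = c(327, 327, 384)
)

# Drop rows that repeat an earlier row exactly
sleep_clean <- sleep_toy[!duplicated(sleep_toy), ]

# Keep the text before the first space, then parse it as a Date
date_text <- sub(" .*$", "", sleep_clean$SleepDay)
sleep_clean$Date <- as.Date(date_text, format = "%m/%d/%Y")
sleep_clean$SleepDay <- NULL

# Store Id as text so it is treated as a label, not a quantity
sleep_clean$Id <- as.character(sleep_clean$Id)
```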

The weightLogInfo_DF

I made a copy (weightData).
I dropped the “Fat” column.
I separated the Date column into Date, Time and DayStatus individual columns and then dropped the Time and DayStatus columns.
I changed the “Id” column datatype to character and the “Date” column datatype to Date.
I created a new column “WeightStatus” showing the weight groups based on BMI values ranging from Underweight, Normal Weight, Overweight and Obese.

weightData <-
  weightLogInfo_DF %>%
  subset(select = -Fat) %>%
  separate(Date, into = c("Date", "Time", "DayStatus"), sep = " ") %>%
  subset(select = -c(Time,DayStatus))%>%
  mutate(WeightStatus = case_when(BMI > 29.9 ~ 'Obese', 
                                  BMI > 24.9 ~ 'Overweight',
                                  BMI < 18.5 ~ 'Underweight',
                                  TRUE ~ 'Normal Weight'))

# Change datatype
class(weightData$Id) = "character"
weightData$Date <- mdy(weightData$Date)
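The BMI grouping in the `case_when()` call can also be written as a small base R helper, which makes the boundary handling easy to test. `classify_bmi` is a hypothetical name; the cutoffs are the ones used in the code above.

```r
# Hypothetical helper applying the same cutoffs as the case_when() call.
# Unlike case_when() with a TRUE fallback, a missing BMI stays NA here.
classify_bmi <- function(bmi) {
  ifelse(bmi > 29.9, "Obese",
         ifelse(bmi > 24.9, "Overweight",
                ifelse(bmi < 18.5, "Underweight", "Normal Weight")))
}

# BMI values taken from the weightLogInfo_DF preview, plus one low value
status <- classify_bmi(c(22.65, 47.54, 27.45, 17.0))
status
```

Checking the function against a few known values before applying it to the full column helps catch a misplaced cutoff early.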

The dailyActivity_DF

I made a copy (activityData).
I changed the “Id” column datatype to character and the “ActivityDate” column datatype to Date.
I renamed the “ActivityDate” column to “Date”.
I created new columns: “Day” (showing the day of the week) and “TotalActiveMinutes” (the sum of the “VeryActiveMinutes”, “FairlyActiveMinutes” and “LightlyActiveMinutes” columns).
I dropped rows/observations having a zero value for “TotalSteps” or “TotalDistance”.
I dropped the entire “LoggedActivitiesDistance” column since only about 4% of observations have values other than zero.

activityData <- dailyActivity_DF %>% 
  # Bare column names are evaluated on the piped rows, so each condition
  # stays aligned with the data frame it filters
  subset(TotalSteps != 0) %>%
  subset(TotalDistance != 0) %>%
  subset(select = -c(LoggedActivitiesDistance))

# Change datatype
class(activityData$Id) = "character"
activityData$ActivityDate <- mdy(activityData$ActivityDate)


activityData <-
  activityData %>%
  rename("Date" = ActivityDate)

# Creating new columns
activityData <- 
  activityData %>%
  mutate(Day = weekdays(Date)) %>%
  mutate(TotalActiveMinutes = (VeryActiveMinutes + 
                                 FairlyActiveMinutes + 
                                 LightlyActiveMinutes)) %>%
  drop_na()
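A compact way to check the filtering and the derived columns is to run the same steps on a toy data frame. This is a base R sketch with made-up rows. The filter conditions are evaluated on the data frame being filtered, so rows and conditions stay aligned, and `DayNumber` is a hypothetical, locale-independent stand-in for the weekday name.

```r
# Toy stand-in for dailyActivity_DF; the middle row is a day with no movement
activity_toy <- data.frame(
  ActivityDate         = c("4/12/2016", "4/13/2016", "4/14/2016"),
  TotalSteps           = c(13162, 0, 10460),
  TotalDistance        = c(8.50, 0, 6.74),
  VeryActiveMinutes    = c(25, 0, 30),
  FairlyActiveMinutes  = c(13, 0, 11),
  LightlyActiveMinutes = c(328, 0, 181)
)

# Keep days with recorded movement; conditions use the toy frame's own columns
active <- subset(activity_toy, TotalSteps != 0 & TotalDistance != 0)

active$Date <- as.Date(active$ActivityDate, format = "%m/%d/%Y")
# "%u" is the ISO weekday number (1 = Monday), independent of locale
active$DayNumber <- as.integer(format(active$Date, "%u"))
active$TotalActiveMinutes <- active$VeryActiveMinutes +
  active$FairlyActiveMinutes + active$LightlyActiveMinutes
```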

The fitBitData_merged

fitBitData_merged <-
  merge(sleepData,activityData, by=c("Id", "Date"), all = TRUE)

I created this dataframe by combining the activityData and the sleepData dataframes, based on the “Id” and “Date” columns.

This dataframe is in long format with 823 rows and 19 columns. Variable names are consistent and meaningful. The three columns originating from the sleepData dataframe have 413 missing values each, and the columns originating from the activityData dataframe have 27 missing values each. There are no duplicate entries in this dataframe. This dataframe has records of 33 different participants and data spanning 31 days.
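`merge(..., all = TRUE)` performs a full outer join, so a day that appears in only one table is kept and the other table's columns are filled with `NA`. A toy sketch of that behaviour, with hypothetical values:

```r
sleep_toy <- data.frame(
  Id   = c("A", "A"),
  Date = as.Date(c("2016-04-12", "2016-04-13")),
  TotalMinutesAsleep = c(327, 384)
)
activity_toy <- data.frame(
  Id   = c("A", "A"),
  Date = as.Date(c("2016-04-12", "2016-04-14")),
  TotalSteps = c(13162, 10460)
)

# Full outer join on the Id/Date pair
merged_toy <- merge(sleep_toy, activity_toy, by = c("Id", "Date"), all = TRUE)
merged_toy

# Rows present in only one input carry NA in the other input's columns
colSums(is.na(merged_toy))
```

Only 12 April appears in both inputs, so the result has three rows with one `NA` in each of the two measurement columns.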

head(fitBitData_merged)

##           Id       Date TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1 1503960366 2016-04-12                 1                327            346
## 2 1503960366 2016-04-13                 2                384            407
## 3 1503960366 2016-04-14                NA                 NA             NA
## 4 1503960366 2016-04-15                 1                412            442
## 5 1503960366 2016-04-16                 2                340            367
## 6 1503960366 2016-04-17                 1                700            712
##   TotalSteps TotalDistance TrackerDistance VeryActiveDistance
## 1      13162          8.50            8.50               1.88
## 2      10735          6.97            6.97               1.57
## 3      10460          6.74            6.74               2.44
## 4       9762          6.28            6.28               2.14
## 5      12669          8.16            8.16               2.71
## 6       9705          6.48            6.48               3.19
##   ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
## 1                     0.55                6.06                       0
## 2                     0.69                4.71                       0
## 3                     0.40                3.91                       0
## 4                     1.26                2.83                       0
## 5                     0.41                5.04                       0
## 6                     0.78                2.51                       0
##   VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
## 1                25                  13                  328              728
## 2                21                  19                  217              776
## 3                30                  11                  181             1218
## 4                29                  34                  209              726
## 5                36                  10                  221              773
## 6                38                  20                  164              539
##   Calories       Day TotalActiveMinutes
## 1     1985   Tuesday                366
## 2     1797 Wednesday                257
## 3     1776  Thursday                222
## 4     1745    Friday                272
## 5     1863  Saturday                267
## 6     1728    Sunday                222

skim_without_charts(fitBitData_merged)

Data summary
Name	fitBitData_merged
Number of rows	823
Number of columns	19
_______________________
Column type frequency:
character	2
Date	1
numeric	16
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
Id	0	1.00	10	10	0	33	0
Day	27	0.97	6	9	0	7	0

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
Date	0	1	2016-04-12	2016-05-12	2016-04-26	31

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100
TotalSleepRecords	413	0.50	1.12	0.35	1	1.00	1.00	1.00	3.00
TotalMinutesAsleep	413	0.50	419.17	118.64	58	361.00	432.50	490.00	796.00
TotalTimeInBed	413	0.50	458.48	127.46	61	403.75	463.00	526.00	961.00
TotalSteps	27	0.97	8375.21	4787.27	4	4932.00	7974.50	11136.25	36019.00
TotalDistance	27	0.97	6.04	3.78	0	3.38	5.59	7.97	28.03
TrackerDistance	27	0.97	6.02	3.76	0	3.38	5.59	7.91	28.03
VeryActiveDistance	27	0.97	1.69	2.82	0	0.00	0.41	2.29	21.92
ModeratelyActiveDistance	27	0.97	0.63	0.93	0	0.00	0.31	0.87	6.48
LightActiveDistance	27	0.97	3.63	1.84	0	2.34	3.55	4.88	10.71
SedentaryActiveDistance	27	0.97	0.00	0.01	0	0.00	0.00	0.00	0.11
VeryActiveMinutes	27	0.97	23.63	34.42	0	0.00	7.00	36.00	210.00
FairlyActiveMinutes	27	0.97	14.96	20.73	0	0.00	8.00	21.00	143.00
LightlyActiveMinutes	27	0.97	209.27	95.77	0	147.00	206.00	268.25	518.00
SedentaryMinutes	27	0.97	953.12	280.83	0	721.75	1018.50	1190.25	1440.00
Calories	27	0.97	2366.88	721.52	52	1850.50	2195.50	2859.25	4900.00
TotalActiveMinutes	27	0.97	247.87	104.22	0	182.75	257.00	321.00	552.00

sum(duplicated(fitBitData_merged))

## [1] 0

n_distinct(fitBitData_merged$Id)

## [1] 33

sapply(fitBitData_merged, function(x) n_distinct(x))

##                       Id                     Date        TotalSleepRecords 
##                       33                       31                        4 
##       TotalMinutesAsleep           TotalTimeInBed               TotalSteps 
##                      257                      243                      779 
##            TotalDistance          TrackerDistance       VeryActiveDistance 
##                      588                      586                      321 
## ModeratelyActiveDistance      LightActiveDistance  SedentaryActiveDistance 
##                      207                      466                       10 
##        VeryActiveMinutes      FairlyActiveMinutes     LightlyActiveMinutes 
##                      123                       82                      325 
##         SedentaryMinutes                 Calories                      Day 
##                      523                      669                        8 
##       TotalActiveMinutes 
##                      356

I replaced the missing values, most of which are in the “TotalSleepRecords”, “TotalMinutesAsleep” and “TotalTimeInBed” columns, with 0.

fitBitData_merged <-
  fitBitData_merged %>%
  replace(is.na(fitBitData_merged), 0)
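If the replacement should touch only the sleep columns, it can be restricted to a named set of columns. The sketch below uses base R on toy data; `sleep_cols` and the values are hypothetical. A 0 filled in this way cannot be told apart from a genuine zero in later averages.

```r
merged_toy <- data.frame(
  Id                 = c("A", "A", "B"),
  TotalMinutesAsleep = c(327, NA, NA),
  TotalTimeInBed     = c(346, NA, NA),
  Day                = c("Tuesday", "Wednesday", NA)
)

sleep_cols <- c("TotalMinutesAsleep", "TotalTimeInBed")

# Replace NA with 0 in the listed columns only; other columns keep their NA
for (col in sleep_cols) {
  merged_toy[[col]][is.na(merged_toy[[col]])] <- 0
}
```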

ANALYSE

Quick summary statistics of the data:

activityData %>%  
  select(TotalSteps,
         TotalDistance,
         VeryActiveDistance,
         ModeratelyActiveDistance,
         LightActiveDistance,
         SedentaryActiveDistance,
         VeryActiveMinutes,
         FairlyActiveMinutes,
         LightlyActiveMinutes,
         SedentaryMinutes,
         Calories) %>%
  summary()

##    TotalSteps    TotalDistance    VeryActiveDistance ModeratelyActiveDistance
##  Min.   :    4   Min.   : 0.000   Min.   : 0.000     Min.   :0.0000          
##  1st Qu.: 4932   1st Qu.: 3.380   1st Qu.: 0.000     1st Qu.:0.0000          
##  Median : 7974   Median : 5.590   Median : 0.415     Median :0.3100          
##  Mean   : 8375   Mean   : 6.036   Mean   : 1.687     Mean   :0.6288          
##  3rd Qu.:11136   3rd Qu.: 7.973   3rd Qu.: 2.292     3rd Qu.:0.8700          
##  Max.   :36019   Max.   :28.030   Max.   :21.920     Max.   :6.4800          
##  LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
##  Min.   : 0.000      Min.   :0.000000        Min.   :  0.00   
##  1st Qu.: 2.337      1st Qu.:0.000000        1st Qu.:  0.00   
##  Median : 3.550      Median :0.000000        Median :  7.00   
##  Mean   : 3.628      Mean   :0.001796        Mean   : 23.63   
##  3rd Qu.: 4.880      3rd Qu.:0.000000        3rd Qu.: 36.00   
##  Max.   :10.710      Max.   :0.110000        Max.   :210.00   
##  FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes    Calories   
##  Min.   :  0.00      Min.   :  0.0        Min.   :   0.0   Min.   :  52  
##  1st Qu.:  0.00      1st Qu.:147.0        1st Qu.: 721.8   1st Qu.:1850  
##  Median :  8.00      Median :206.0        Median :1018.5   Median :2196  
##  Mean   : 14.96      Mean   :209.3        Mean   : 953.1   Mean   :2367  
##  3rd Qu.: 21.00      3rd Qu.:268.2        3rd Qu.:1190.2   3rd Qu.:2859  
##  Max.   :143.00      Max.   :518.0        Max.   :1440.0   Max.   :4900

sleepData %>%  
  select(TotalSleepRecords,
         TotalMinutesAsleep,
         TotalTimeInBed) %>%
  summary()

##  TotalSleepRecords TotalMinutesAsleep TotalTimeInBed 
##  Min.   :1.00      Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:1.00      1st Qu.:361.0      1st Qu.:403.8  
##  Median :1.00      Median :432.5      Median :463.0  
##  Mean   :1.12      Mean   :419.2      Mean   :458.5  
##  3rd Qu.:1.00      3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :3.00      Max.   :796.0      Max.   :961.0

weightData %>%  
  select(WeightKg,
         WeightPounds,
         BMI) %>%
  summary()

##     WeightKg       WeightPounds        BMI       
##  Min.   : 52.60   Min.   :116.0   Min.   :21.45  
##  1st Qu.: 61.40   1st Qu.:135.4   1st Qu.:23.96  
##  Median : 62.50   Median :137.8   Median :24.39  
##  Mean   : 72.04   Mean   :158.8   Mean   :25.19  
##  3rd Qu.: 85.05   3rd Qu.:187.5   3rd Qu.:25.56  
##  Max.   :133.50   Max.   :294.3   Max.   :47.54

Aggregated Values

I aggregated the dataset to obtain average values for each participant.

weightAggregated <-
  weightData %>%
  group_by(Id) %>%
  summarise(BodyWeightKg = mean(WeightKg),
            BMIv = mean(BMI)) %>%
  mutate(WeightStatus = case_when(BMIv > 29.9 ~ 'Obese', 
                                  BMIv> 24.9 ~ 'Overweight',
                                  BMIv < 18.5 ~ 'Underweight',
                                  TRUE ~ 'Normal Weight'))


sleepAggregated <-
  sleepData %>%
  group_by(Id) %>%
  summarise(SleepRecord = sum(TotalSleepRecords),
            TotalHoursAsleep = (sum(TotalMinutesAsleep)/60),
            AveHoursAsleep = (mean(TotalMinutesAsleep)/60),
            AveTimeInBed = (mean(TotalTimeInBed)/60)) %>%
  mutate(NatureofSleep = case_when(AveHoursAsleep < 7 ~ 'Poor Sleep',
                                   TRUE ~ 'Good Sleep'))


activityAggregated <-
  activityData %>%
  group_by(Id) %>%
  summarise(AverageSteps = mean(TotalSteps),
            AverageDistance = mean(TotalDistance),
            VeryActiveHours = (mean(VeryActiveMinutes)/60),
            FairlyActiveHours = (mean(FairlyActiveMinutes)/60),
            LightlyActiveHours = (mean(LightlyActiveMinutes)/60),
            AveSedentaryHours = (mean(SedentaryMinutes)/60),
            TotalActiveHours = (VeryActiveHours + FairlyActiveHours + LightlyActiveHours),
            AveCaloriesSpent = mean(Calories))


aggregatedData <-
  list(sleepAggregated, weightAggregated, activityAggregated) %>%
  reduce(full_join, by="Id")

head(aggregatedData)

## # A tibble: 6 × 17
##   Id       Sleep…¹ Total…² AveHo…³ AveTi…⁴ Natur…⁵ BodyW…⁶  BMIv Weigh…⁷ Avera…⁸
##   <chr>      <int>   <dbl>   <dbl>   <dbl> <chr>     <dbl> <dbl> <chr>     <dbl>
## 1 1503960…      27  150.      6.00    6.39 Poor S…    52.6  22.6 Normal…  12521.
## 2 1644430…       4   19.6     4.9     5.77 Poor S…    NA    NA   <NA>      7283.
## 3 1844505…       3   32.6    10.9    16.0  Good S…    NA    NA   <NA>      3622.
## 4 1927972…       8   34.8     6.95    7.30 Poor S…   134.   47.5 Obese     1669.
## 5 2026352…      28  236.      8.44    8.96 Good S…    NA    NA   <NA>      5567.
## 6 2320127…       1    1.02    1.02    1.15 Poor S…    NA    NA   <NA>      4717.
## # … with 7 more variables: AverageDistance <dbl>, VeryActiveHours <dbl>,
## #   FairlyActiveHours <dbl>, LightlyActiveHours <dbl>, AveSedentaryHours <dbl>,
## #   TotalActiveHours <dbl>, AveCaloriesSpent <dbl>, and abbreviated variable
## #   names ¹SleepRecord, ²TotalHoursAsleep, ³AveHoursAsleep, ⁴AveTimeInBed,
## #   ⁵NatureofSleep, ⁶BodyWeightKg, ⁷WeightStatus, ⁸AverageSteps

The average number of daily steps was 8375. This is slightly above the 6000–8000 recommended daily steps. About 70% of participants reached the recommended quota.
The average very active time was approximately 24 minutes per day. This falls short of the recommended 30 minutes of daily moderate-intensity aerobic activity. About 25% of participants reached this quota.
The average sedentary time was 953.1 minutes, i.e. about 16 hours.
The average time spent asleep was 419.2 minutes, i.e. about 7 hours. This is at the lower end of the 7 to 9 hours of sleep recommended for adults. About 48% of the participants whose sleep records were captured had a good sleep duration.
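
The shares quoted above are proportions of participants whose per-participant average meets a threshold. A minimal sketch of that calculation follows. The five-row data frame is made up for illustration (it is not the Fitbit data), and the 6000-step and 30-minute cut-offs are assumptions based on the recommendations mentioned above.

```r
# Illustrative per-participant averages (made-up values, NOT the Fitbit data)
toy <- data.frame(
  Id = c("A", "B", "C", "D", "E"),
  AverageSteps = c(12500, 7300, 3600, 1700, 5600),
  VeryActiveMinutes = c(38, 9, 1, 0, 31)
)

# Assumed cut-offs: lower bound of the step range and the 30-minute guideline
step_quota <- 6000
active_quota <- 30

# mean() of a logical vector gives the proportion of TRUE values
pct_steps <- round(mean(toy$AverageSteps >= step_quota) * 100)
pct_active <- round(mean(toy$VeryActiveMinutes >= active_quota) * 100)

pct_steps   # 40 (2 of 5 toy participants)
pct_active  # 40 (2 of 5 toy participants)
```

Applied to the real data, the same pattern would be run on the per-participant columns of activityAggregated.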

Checking for a relationship between day of the week and participants' activities:

ggplot(data = activityData) +
  aes(x = (Day), y = TotalSteps) +
  geom_col(fill =  'blue') +
  labs(x = 'Day of week', y = 'Total Steps', title = 'Total steps per Day')

ggplot(data = activityData) +
  aes(x = (Day), y = TotalActiveMinutes) +
  geom_col(fill =  'blue') +
  labs(x = 'Day of week',
       y = 'Time Active (Minutes)',
       title = 'Total Active Minutes per Day')

ggplot(data = activityData) +
  aes(x = (Day), y = Calories) +
  geom_col(fill =  'blue') +
  labs(x = 'Day of week',
       y = 'Calories Expended',
       title = 'Daily Calories Expended')

From the plots, the highest totals were recorded on Tuesdays, Wednesdays, Thursdays and Saturdays (in descending order), while Sundays had the lowest.
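
Because geom_col() stacks every record that falls on a given day, the bar heights are sums, so a day with more logged records can look more active than it is. As a cross-check, the per-day mean can be compared with the per-day total. The sketch below uses a small made-up data frame with the same Day and TotalSteps column names; the values are illustrative only.

```r
# Made-up records (NOT the Fitbit data): Tuesday has the most rows
toy_activity <- data.frame(
  Day = c("Sunday", "Tuesday", "Tuesday", "Tuesday", "Sunday", "Saturday"),
  TotalSteps = c(9000, 6000, 6000, 6000, 7000, 8000)
)

# Total, mean and number of records per day of week
totals <- tapply(toy_activity$TotalSteps, toy_activity$Day, sum)
means <- tapply(toy_activity$TotalSteps, toy_activity$Day, mean)
counts <- table(toy_activity$Day)

day_summary <- data.frame(
  Day = names(totals),
  Records = as.vector(counts[names(totals)]),
  TotalSteps = as.vector(totals),
  MeanSteps = as.vector(means)
)
day_summary
```

In this toy example Tuesday has the highest total but the lowest mean, purely because it has the most records. Checking both columns on the real data would show whether the weekday ranking reflects activity or record counts.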

Checking relationship between Calories expended and steps taken (TotalSteps), time active (TotalActiveMinutes) and Time in Sedentary State (SedentaryMinutes).

# Showing relationships for Calorie expenditure

ggplot(data = activityData) +
  aes(x= TotalActiveMinutes, y = Calories) +
  geom_point(color = 'orange') +
  geom_smooth() +
  labs(x = 'Time Active (minutes)',
       y = 'Calories Expended',
       title = 'Calories Expended vs Time Active')

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(data = activityData) +
  aes(x= TotalSteps, y = Calories) +
  geom_point(color = 'orange') +
  geom_smooth() +
  labs(x = 'Total Steps',
       y = 'Calories Expended',
       title = 'Calories Expended vs Total Steps')

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(data = activityData) +
  aes(x= SedentaryMinutes, y = Calories) +
  geom_point(color = 'orange') +
  geom_smooth() +
  labs(x = 'Time Sedentary (minutes)',
       y = 'Calories Expended',
       title = 'Calories Expended vs Time Sedentary (minutes)')

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

In the plots of Calories against time active and steps taken, the trendline slopes upwards. This upward trend indicates a positive relationship between calorie expenditure and physical activity.

In the plot against time in a sedentary state, however, the trendline slopes downwards, indicating a negative relationship: the more sedentary a participant tends to be, the fewer calories are expended.
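
The direction and strength of these relationships can also be summarised numerically with a correlation coefficient. The following is a sketch using base R's cor() on a made-up data frame (the values are illustrative, not taken from the dataset). Applied to activityData, the same calls would use the Calories, TotalSteps and SedentaryMinutes columns.

```r
# Made-up values for illustration only (NOT the Fitbit data)
toy <- data.frame(
  TotalSteps = c(2000, 4000, 6000, 8000, 10000),
  SedentaryMinutes = c(1200, 1100, 1000, 950, 800),
  Calories = c(1600, 1900, 2100, 2400, 2800)
)

# Pearson correlation (the default method of cor())
r_steps <- cor(toy$TotalSteps, toy$Calories)
r_sedentary <- cor(toy$SedentaryMinutes, toy$Calories)

r_steps      # positive: calories rise with steps in the toy data
r_sedentary  # negative: calories fall as sedentary time rises in the toy data
```

A value near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates a weak one.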

Showing the breakdown of participants' activities:

# analysing the participants' activities per time

VeryActiveMins <- sum(activityData$VeryActiveMinutes)
FairlyActiveMins <- sum(activityData$FairlyActiveMinutes)
LightlyActiveMins <- sum(activityData$LightlyActiveMinutes)
SedentaryMins <- sum(activityData$SedentaryMinutes)
total <- (VeryActiveMins + FairlyActiveMins + LightlyActiveMins + SedentaryMins)

VeryActive <- round((VeryActiveMins/total) * 100,0)
FairlyActive <- round((FairlyActiveMins / total)* 100,0)
LightlyActive <- round((LightlyActiveMins / total)* 100,0)
Sedentary <- round((SedentaryMins / total)* 100,0)

DF <- 
  data.frame(Activity=c("Very Active", "Fairly", "Lightly", "Sedentary" ),
             Time = c(VeryActiveMins, FairlyActiveMins, LightlyActiveMins, SedentaryMins)
)

ggplot(data = DF) +
  geom_col() + 
  aes(x= (Time)/60, y= Activity) +
  labs(x = 'Time (Hours)',
       y = 'Activity Type',
       title = 'Time Spent per Activity Type')

The plot shows that participants spent about 79% of their recorded time (approximately 12645 hours) in sedentary states and about 20% in active states.

Analysing the sleep patterns of participants, and the sleep-time distribution

ggplot(data = sleepData) +
  geom_boxplot() +
  aes(y= TotalMinutesAsleep/60) +
  labs(y = 'Time Asleep (Hours)',
       title = 'Sleep Time Plot')

ggplot(data = sleepAggregated) +
  geom_bar() +
  aes(y= NatureofSleep) +
  labs(y = 'Nature of Sleep',
       x= 'Number of Participants',
       title = 'Nature of Sleep Plot')

These show that the median recorded sleep duration was about 432 minutes (7.2 hours). About 13 participants averaged less than 7 hours of sleep, and 11 participants averaged 7 hours or more.
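
The counts behind the bar chart can be tabulated directly. The sketch below applies the same 7-hour cut-off used for NatureofSleep to a made-up data frame (illustrative values only), then counts each category and converts the counts to percentages.

```r
# Made-up per-participant averages (NOT the Fitbit data)
toy_sleep <- data.frame(
  Id = c("A", "B", "C", "D"),
  AveHoursAsleep = c(6.0, 4.9, 10.9, 7.0)
)

# Same rule as in sleepAggregated: under 7 hours is 'Poor Sleep'
toy_sleep$NatureofSleep <- ifelse(toy_sleep$AveHoursAsleep < 7,
                                  "Poor Sleep", "Good Sleep")

# Number of participants per category, and their percentage share
sleep_counts <- table(toy_sleep$NatureofSleep)
sleep_share <- round(prop.table(sleep_counts) * 100)

sleep_counts
sleep_share
```

Running table(sleepAggregated$NatureofSleep) on the real data would give the exact counts read off the plot.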

Analysing the weight distribution of participants

ggplot(data = weightData) +
  geom_boxplot() +
  aes(y= WeightKg) +
  labs(y = 'Weight (Kg)',
       title = 'Weight Plot')

ggplot(data = weightData) +
  geom_histogram() +
  aes(x= WeightKg) +
  labs(x = 'Weight (Kg)',
       y= 'Number of Participants',
       title = 'Weight Plot')

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = weightData) +
  geom_bar() +
  aes(x= WeightStatus) +
  labs(x = 'Weight Status',
       y= 'Number of Participants',
       title = 'Weight Distribution Plot')

ggplot(data = weightAggregated) +
  geom_col() +
  aes(y= BodyWeightKg, x=Id) +
  labs(y = 'Weight (Kg)',
       x= 'Participants',
       title = 'Weight Distribution Plot')

ggplot(data = weightAggregated) +
  geom_col() +
  aes(y= BMIv, x=Id) +
  labs(x = 'Participants',
       y='Body Mass Index' ,
       title = 'Weight Distribution Plot')

These show that the median recorded weight was about 62.5 kg.

Of the 8 participants who recorded their body weights:

3 had Body Mass Index (BMI) values between 18.5 and 24.9 (Normal weight);

4 had BMI values between 25 and 29.9 (Overweight);

1 had a BMI above 29.9 (Obese).
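
As a worked example of these BMI bands, the sketch below mirrors the thresholds used in the WeightStatus case_when() earlier in the analysis, written in base R. The helper name classify_bmi is hypothetical, and the input values are illustrative.

```r
# Hypothetical helper mirroring the WeightStatus thresholds:
# > 29.9 Obese, > 24.9 Overweight, < 18.5 Underweight, otherwise Normal Weight
classify_bmi <- function(bmi) {
  ifelse(bmi > 29.9, "Obese",
         ifelse(bmi > 24.9, "Overweight",
                ifelse(bmi < 18.5, "Underweight", "Normal Weight")))
}

# Illustrative BMI values, one per band
example_status <- classify_bmi(c(17.0, 22.6, 27.3, 47.5))
example_status
# "Underweight" "Normal Weight" "Overweight" "Obese"
```

The conditions are checked from the highest threshold down, so each value falls into exactly one band; missing BMI values stay NA.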

Key Findings and Recommendations

Bellabeat has built its moat (i.e. competitive advantage) by developing technology and services that focus on women’s health and well-being. It is therefore important to keep upgrading and expanding its products to meet users’ needs and use cases.

Of the 30 users who took part in the survey, only 8 recorded their weights and 23 recorded sleep events. The poor compliance can be attributed to the following:
1. For weight records, the need to enter weight values manually. To mitigate this, smart devices should be developed with a means of automated and flexible capture of users’ body weights, for accurate and consistent data.
2. For sleep records, the habit of users charging their devices at bedtime. Improvements in power and battery specifications (e.g. a fast-charge feature or longer-lasting batteries) can reduce such gaps.

Programs aimed at empowering women with current information on health and wellness can be run via the Bellabeat app. The app can also provide information on how the various Bellabeat products can be used to achieve users’ health and wellness goals.

The Leaf product can be marketed to users on its benefits of effectively tracking health status (e.g. heart rate) and highlighting problem areas where lifestyle modifications and medical interventions might be needed (e.g. a poor level of activity, indicated by a low daily step count).

The Bellabeat membership product can be designed to encourage referrals among peers, with attractive incentives attached. It could also include a “Closest-Gyms-Near-You” feature to remind subscribers of the benefits of physical activity and exercise for general well-being.

References

FitBit Fitness Tracker Data
Furberg, R., Brinton, J., Keating, M., & Ortiz, A. (2016). Crowd-sourced Fitbit datasets 03.12.2016-05.12.2016 [Data set]. Zenodo. https://doi.org/10.5281/zenodo.53894
Brinton, J., Keating, M., Ortiz, A., Evenson, K., & Furberg, R. (2017). Establishing Linkages Between Distributed Survey Responses and Consumer Wearable Device Datasets: A Pilot Protocol. JMIR Res Protoc, 6(4), e66. https://www.researchprotocols.org/2017/4/e66 DOI: 10.2196/resprot.6513
CDC sleep standards

Bellabeat Case Study

Onyi

2023-01-09
