You are a junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. You have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights you discover will then help guide marketing strategy for the company. You will present your analysis to the Bellabeat executive team along with your high-level recommendations for Bellabeat’s marketing strategy.
Sršen, Bellabeat’s cofounder and Chief Creative Officer, knows that an analysis of Bellabeat’s available consumer data would reveal more opportunities for growth. She has asked the marketing analytics team to focus on a Bellabeat product and analyze smart device usage data in order to gain insight into how people are already using their smart devices. Then, using this information, she would like high-level recommendations for how these trends can inform Bellabeat marketing strategy.
Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.
Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.
Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.
Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.
Bellabeat membership: Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.
I will produce a report with the following deliverables:
FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius): a public dataset that explores smart device users’ daily habits. It contains personal fitness tracker data from thirty (30) Fitbit users who consented to the submission of their personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits. A third-party data service provider, Fitabase LLC (San Diego, California), aggregated the self-tracker data. Note that the daily activity file examined in this report contains 33 distinct participant Ids.
The data set is cited as:
Furberg, R., Brinton, J., Keating, M., & Ortiz, A. (2016).
Crowd-sourced Fitbit datasets 03.12.2016-05.12.2016 [Data set]. Zenodo.
https://doi.org/10.5281/zenodo.53894
RStudio will be used for data exploration, cleaning, transformation, analysis and visualisation.
All the required R packages were installed and loaded:
install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("here")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("skimr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("janitor")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("reshape2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library("tidyverse")
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.0
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library("here")
## here() starts at /cloud/project
library("skimr")
library("janitor")
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library("lubridate")
## Loading required package: timechange
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library("reshape2")
##
## Attaching package: 'reshape2'
##
## The following object is masked from 'package:tidyr':
##
## smiths
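As a hedged sketch, the installation step could be made repeatable by installing only the packages that are not yet present. The helper name `missing_packages` is hypothetical and is not part of the original analysis.

```r
# Hypothetical helper: return the packages in `required` that are not installed.
missing_packages <- function(required,
                             installed = rownames(installed.packages())) {
  setdiff(required, installed)
}

required <- c("tidyverse", "here", "skimr", "janitor", "reshape2")
to_install <- missing_packages(required)

# Install only what is missing (uncomment to run the installation):
# if (length(to_install) > 0) install.packages(to_install)
```

Keeping the check separate from the install call makes it easy to inspect `to_install` before anything is downloaded.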
The datasets were downloaded and stored appropriately.
For this analysis, the following datasets were imported and used:
● the 'dailyActivity_merged' dataset
● the 'sleepDay_merged' dataset
● the 'weightLogInfo_merged' dataset
I have excluded datasets whose data are already present in the “dailyActivity_merged” table.
setwd("/cloud/project/Google Capstone")
dailyActivity_DF <-
read.csv("dataset/dailyActivity_merged.csv")
sleepDay_DF <-
read.csv("dataset/sleepDay_merged.csv")
weightLogInfo_DF <-
read.csv("dataset/weightLogInfo_merged.csv")
To preview the data and check for errors, missing values, and consistent naming:
head(dailyActivity_DF)
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 4/12/2016 13162 8.50 8.50
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 1.88 0.55
## 2 0 1.57 0.69
## 3 0 2.44 0.40
## 4 0 2.14 1.26
## 5 0 2.71 0.41
## 6 0 3.19 0.78
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 6.06 0 25
## 2 4.71 0 21
## 3 3.91 0 30
## 4 2.83 0 29
## 5 5.04 0 36
## 6 2.51 0 38
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 13 328 728 1985
## 2 19 217 776 1797
## 3 11 181 1218 1776
## 4 34 209 726 1745
## 5 10 221 773 1863
## 6 20 164 539 1728
skim_without_charts(dailyActivity_DF)
| Name | dailyActivity_DF |
| Number of rows | 940 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 14 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ActivityDate | 0 | 1 | 8 | 9 | 0 | 31 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| Id | 0 | 1 | 4.855407e+09 | 2.424805e+09 | 1503960366 | 2.320127e+09 | 4.445115e+09 | 6.962181e+09 | 8.877689e+09 |
| TotalSteps | 0 | 1 | 7.637910e+03 | 5.087150e+03 | 0 | 3.789750e+03 | 7.405500e+03 | 1.072700e+04 | 3.601900e+04 |
| TotalDistance | 0 | 1 | 5.490000e+00 | 3.920000e+00 | 0 | 2.620000e+00 | 5.240000e+00 | 7.710000e+00 | 2.803000e+01 |
| TrackerDistance | 0 | 1 | 5.480000e+00 | 3.910000e+00 | 0 | 2.620000e+00 | 5.240000e+00 | 7.710000e+00 | 2.803000e+01 |
| LoggedActivitiesDistance | 0 | 1 | 1.100000e-01 | 6.200000e-01 | 0 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 4.940000e+00 |
| VeryActiveDistance | 0 | 1 | 1.500000e+00 | 2.660000e+00 | 0 | 0.000000e+00 | 2.100000e-01 | 2.050000e+00 | 2.192000e+01 |
| ModeratelyActiveDistance | 0 | 1 | 5.700000e-01 | 8.800000e-01 | 0 | 0.000000e+00 | 2.400000e-01 | 8.000000e-01 | 6.480000e+00 |
| LightActiveDistance | 0 | 1 | 3.340000e+00 | 2.040000e+00 | 0 | 1.950000e+00 | 3.360000e+00 | 4.780000e+00 | 1.071000e+01 |
| SedentaryActiveDistance | 0 | 1 | 0.000000e+00 | 1.000000e-02 | 0 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.100000e-01 |
| VeryActiveMinutes | 0 | 1 | 2.116000e+01 | 3.284000e+01 | 0 | 0.000000e+00 | 4.000000e+00 | 3.200000e+01 | 2.100000e+02 |
| FairlyActiveMinutes | 0 | 1 | 1.356000e+01 | 1.999000e+01 | 0 | 0.000000e+00 | 6.000000e+00 | 1.900000e+01 | 1.430000e+02 |
| LightlyActiveMinutes | 0 | 1 | 1.928100e+02 | 1.091700e+02 | 0 | 1.270000e+02 | 1.990000e+02 | 2.640000e+02 | 5.180000e+02 |
| SedentaryMinutes | 0 | 1 | 9.912100e+02 | 3.012700e+02 | 0 | 7.297500e+02 | 1.057500e+03 | 1.229500e+03 | 1.440000e+03 |
| Calories | 0 | 1 | 2.303610e+03 | 7.181700e+02 | 0 | 1.828500e+03 | 2.134000e+03 | 2.793250e+03 | 4.900000e+03 |
sum(duplicated(dailyActivity_DF))
## [1] 0
n_distinct(dailyActivity_DF$Id)
## [1] 33
sapply(dailyActivity_DF, function(x) n_distinct(x))
## Id ActivityDate TotalSteps
## 33 31 842
## TotalDistance TrackerDistance LoggedActivitiesDistance
## 615 613 19
## VeryActiveDistance ModeratelyActiveDistance LightActiveDistance
## 333 211 491
## SedentaryActiveDistance VeryActiveMinutes FairlyActiveMinutes
## 9 122 81
## LightlyActiveMinutes SedentaryMinutes Calories
## 335 549 734
This dataframe is in long format, with 940 rows and 15 columns.
Variable names are consistent and meaningful.
There are no missing values and no duplicate entries in this
dataframe.
The “ActivityDate” column is stored as character and will be
converted to the Date datatype. This dataframe has records of 33
different participants, collected over a maximum of 31
days.
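The date conversion noted above can be sketched with a few illustrative strings in the same month/day/year layout as “ActivityDate”. This uses base R's `as.Date()` as an assumed alternative; the analysis itself relies on `mdy()` from lubridate.

```r
# Illustrative values in the same month/day/year layout as ActivityDate
activity_date_raw <- c("4/12/2016", "4/13/2016", "5/1/2016")

# Base-R alternative (an assumption; the analysis itself uses lubridate::mdy)
activity_date <- as.Date(activity_date_raw, format = "%m/%d/%Y")

class(activity_date)   # "Date"
format(activity_date)  # ISO year-month-day strings
```

Once the column is a Date, functions such as `weekdays()` and date-based joins behave as expected.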
head(sleepDay_DF)
## Id SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM 1 327
## 2 1503960366 4/13/2016 12:00:00 AM 2 384
## 3 1503960366 4/15/2016 12:00:00 AM 1 412
## 4 1503960366 4/16/2016 12:00:00 AM 2 340
## 5 1503960366 4/17/2016 12:00:00 AM 1 700
## 6 1503960366 4/19/2016 12:00:00 AM 1 304
## TotalTimeInBed
## 1 346
## 2 407
## 3 442
## 4 367
## 5 712
## 6 320
skim_without_charts(sleepDay_DF)
| Name | sleepDay_DF |
| Number of rows | 413 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| SleepDay | 0 | 1 | 20 | 21 | 0 | 31 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| Id | 0 | 1 | 5.000979e+09 | 2.06036e+09 | 1503960366 | 3977333714 | 4702921684 | 6962181067 | 8792009665 |
| TotalSleepRecords | 0 | 1 | 1.120000e+00 | 3.50000e-01 | 1 | 1 | 1 | 1 | 3 |
| TotalMinutesAsleep | 0 | 1 | 4.194700e+02 | 1.18340e+02 | 58 | 361 | 433 | 490 | 796 |
| TotalTimeInBed | 0 | 1 | 4.586400e+02 | 1.27100e+02 | 61 | 403 | 463 | 526 | 961 |
sum(duplicated(sleepDay_DF))
## [1] 3
n_distinct(sleepDay_DF$Id)
## [1] 24
sapply(sleepDay_DF, function(x) n_distinct(x))
## Id SleepDay TotalSleepRecords TotalMinutesAsleep
## 24 31 3 256
## TotalTimeInBed
## 242
This dataframe is in long format, with 413 rows and 5 columns.
Variable names are consistent and meaningful.
There are no missing values, but 3 duplicate entries were observed
(these duplicates will be removed).
Sleep data were recorded for 24 participants over a maximum of
31 days. The “SleepDay” column is stored as character and will be
converted to the Date datatype.
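A minimal sketch of the duplicate check and removal, using a small made-up data frame (the values are illustrative, not taken from the dataset):

```r
# Toy sleep log with one fully duplicated row (rows 1 and 2 are identical)
sleep_toy <- data.frame(
  Id = c(1, 1, 1, 2),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/12/2016 12:00:00 AM",
               "4/13/2016 12:00:00 AM", "4/12/2016 12:00:00 AM"),
  TotalMinutesAsleep = c(327, 327, 384, 400)
)

# duplicated() flags every repeat of an earlier, identical row
n_dupes <- sum(duplicated(sleep_toy))

# Keep only the first occurrence of each row
sleep_unique <- sleep_toy[!duplicated(sleep_toy), ]
```

Rows count as duplicates only when every column matches, so two different nights for the same participant are kept.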
head(weightLogInfo_DF)
## Id Date WeightKg WeightPounds Fat BMI
## 1 1503960366 5/2/2016 11:59:59 PM 52.6 115.9631 22 22.65
## 2 1503960366 5/3/2016 11:59:59 PM 52.6 115.9631 NA 22.65
## 3 1927972279 4/13/2016 1:08:52 AM 133.5 294.3171 NA 47.54
## 4 2873212765 4/21/2016 11:59:59 PM 56.7 125.0021 NA 21.45
## 5 2873212765 5/12/2016 11:59:59 PM 57.3 126.3249 NA 21.69
## 6 4319703577 4/17/2016 11:59:59 PM 72.4 159.6147 25 27.45
## IsManualReport LogId
## 1 True 1.462234e+12
## 2 True 1.462320e+12
## 3 False 1.460510e+12
## 4 True 1.461283e+12
## 5 True 1.463098e+12
## 6 True 1.460938e+12
skim_without_charts(weightLogInfo_DF)
| Name | weightLogInfo_DF |
| Number of rows | 67 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Date | 0 | 1 | 19 | 21 | 0 | 56 | 0 |
| IsManualReport | 0 | 1 | 4 | 5 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| Id | 0 | 1.00 | 7.009282e+09 | 1.950322e+09 | 1.503960e+09 | 6.962181e+09 | 6.962181e+09 | 8.877689e+09 | 8.877689e+09 |
| WeightKg | 0 | 1.00 | 7.204000e+01 | 1.392000e+01 | 5.260000e+01 | 6.140000e+01 | 6.250000e+01 | 8.505000e+01 | 1.335000e+02 |
| WeightPounds | 0 | 1.00 | 1.588100e+02 | 3.070000e+01 | 1.159600e+02 | 1.353600e+02 | 1.377900e+02 | 1.875000e+02 | 2.943200e+02 |
| Fat | 65 | 0.03 | 2.350000e+01 | 2.120000e+00 | 2.200000e+01 | 2.275000e+01 | 2.350000e+01 | 2.425000e+01 | 2.500000e+01 |
| BMI | 0 | 1.00 | 2.519000e+01 | 3.070000e+00 | 2.145000e+01 | 2.396000e+01 | 2.439000e+01 | 2.556000e+01 | 4.754000e+01 |
| LogId | 0 | 1.00 | 1.461772e+12 | 7.829948e+08 | 1.460444e+12 | 1.461079e+12 | 1.461802e+12 | 1.462375e+12 | 1.463098e+12 |
sum(duplicated(weightLogInfo_DF))
## [1] 0
n_distinct(weightLogInfo_DF$Id)
## [1] 8
sapply(weightLogInfo_DF, function(x) n_distinct(x))
## Id Date WeightKg WeightPounds Fat
## 8 56 34 34 3
## BMI IsManualReport LogId
## 36 2 56
This dataframe is in long format, with 67 rows and 8 columns. Variable
names are consistent and meaningful.
The “Fat” column has 65 missing values out of 67 rows (a complete rate
of just 2.99%), so this column will be dropped.
No duplicate entries were observed.
The “Date” column is stored as character and will be converted to the
Date datatype.
Weight data were recorded for only 8 participants.
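The complete rate quoted above is the share of non-missing values in a column. A small sketch with made-up weights but the same missingness pattern (2 values present out of 67) shows the arithmetic; the 50% cut-off used to decide which columns to keep is a hypothetical threshold, not one stated in the analysis.

```r
# Toy data: 67 rows, "Fat" present in only 2 of them
weight_toy <- data.frame(
  WeightKg = rep(60, 67),
  Fat = c(22, 25, rep(NA, 65))
)

# Share of non-missing values per column
complete_rate <- colMeans(!is.na(weight_toy))
round(100 * complete_rate, 2)

# Hypothetical rule: keep columns that are at least 50% complete
keep <- names(complete_rate)[complete_rate >= 0.5]
weight_kept <- weight_toy[, keep, drop = FALSE]
```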
# remove duplicates
sleepData <- sleepDay_DF[!duplicated(sleepDay_DF),]
# split SleepDay into Date, SleepTime and DayStatus (AM/PM), then keep only Date
sleepData <-
  sleepData %>%
  separate(SleepDay, into = c("Date", "SleepTime", "DayStatus"), sep = " ") %>%
  subset(select = -c(SleepTime, DayStatus))
# Change datatype
sleepData$Id <- as.character(sleepData$Id)
sleepData$Date <- mdy(sleepData$Date)
weightData <-
weightLogInfo_DF %>%
subset(select = -Fat) %>%
separate(Date, into = c("Date", "Time", "DayStatus"), sep = " ") %>%
subset(select = -c(Time,DayStatus))%>%
mutate(WeightStatus = case_when(BMI > 29.9 ~ 'Obese',
BMI > 24.9 ~ 'Overweight',
BMI < 18.5 ~ 'Underweight',
TRUE ~ 'Normal Weight'))
# Change datatype
class(weightData$Id) = "character"
weightData$Date <- mdy(weightData$Date)
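The BMI thresholds used in the `case_when()` call can be sketched as a standalone base-R function. The name `classify_bmi` is hypothetical; the cut-offs are the ones in the code above, and the first three test values are BMI readings from the preview of the weight data.

```r
# Hypothetical helper mirroring the case_when() thresholds, checked in the same order
classify_bmi <- function(bmi) {
  ifelse(bmi > 29.9, "Obese",
  ifelse(bmi > 24.9, "Overweight",
  ifelse(bmi < 18.5, "Underweight", "Normal Weight")))
}

classify_bmi(c(22.65, 47.54, 27.45, 18.0))
# "Normal Weight" "Obese" "Overweight" "Underweight"
```

Because the conditions are evaluated in order, a value above 29.9 is labelled "Obese" before the "Overweight" test is reached.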
activityData <- dailyActivity_DF %>%
  # Refer to columns by name so each condition is evaluated on the rows still in the pipe
  subset(TotalSteps != 0) %>%
  subset(TotalDistance != 0) %>%
  subset(select = -c(LoggedActivitiesDistance))
# Change datatype
class(activityData$Id) = "character"
activityData$ActivityDate <- mdy(activityData$ActivityDate)
activityData <-
activityData %>%
rename("Date" = ActivityDate)
# Create new columns
activityData <-
  activityData %>%
  mutate(Day = weekdays(Date),
         TotalActiveMinutes = VeryActiveMinutes +
           FairlyActiveMinutes +
           LightlyActiveMinutes) %>%
  drop_na()
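When several row filters are chained, each condition needs to be evaluated against the rows that are still present. A condition written with `$` against the original data frame keeps the original length and can fall out of step with the filtered rows. A minimal sketch with made-up values, combining both conditions in one call:

```r
# Toy activity log: rows 1 and 5 have no steps, row 3 has steps but no distance
activity_toy <- data.frame(
  Id = c("a", "a", "b", "b", "c"),
  TotalSteps = c(0, 5200, 4, 8100, 0),
  TotalDistance = c(0, 3.9, 0, 6.1, 0)
)

# Column names are looked up inside the data frame being filtered,
# so the logical index always has one entry per remaining row
activity_used <- subset(activity_toy, TotalSteps != 0 & TotalDistance != 0)

nrow(activity_used)  # 2
```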
fitBitData_merged <-
merge(sleepData,activityData, by=c("Id", "Date"), all = TRUE)
I created this dataframe by combining the sleepData and activityData dataframes with a full join on the “Id” and “Date” columns.
This dataframe is in long format, with 823 rows and 19 columns. Variable names are consistent and meaningful. The three columns originating from sleepData are missing in 413 rows, and the columns originating from activityData are missing in 27 rows, because not every participant-day appears in both sources. There are no duplicate entries in this dataframe. It has records of 33 different participants, with data spanning 31 days.
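How `merge(..., all = TRUE)` produces missing values can be seen on a small made-up example: a participant-day that exists in only one source is kept, and the columns from the other source are filled with `NA`.

```r
# Toy sources: participant A has sleep on two days, activity on one;
# participant B has activity only
sleep_toy <- data.frame(
  Id = c("A", "A"),
  Date = as.Date(c("2016-04-12", "2016-04-13")),
  TotalMinutesAsleep = c(327, 384)
)
activity_toy <- data.frame(
  Id = c("A", "B"),
  Date = as.Date(c("2016-04-12", "2016-04-12")),
  TotalSteps = c(13162, 9000)
)

# Full outer join on Id and Date
merged_toy <- merge(sleep_toy, activity_toy, by = c("Id", "Date"), all = TRUE)

# One NA per source column: A on 04-13 has no activity, B has no sleep
colSums(is.na(merged_toy))
```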
head(fitBitData_merged)
## Id Date TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1 1503960366 2016-04-12 1 327 346
## 2 1503960366 2016-04-13 2 384 407
## 3 1503960366 2016-04-14 NA NA NA
## 4 1503960366 2016-04-15 1 412 442
## 5 1503960366 2016-04-16 2 340 367
## 6 1503960366 2016-04-17 1 700 712
## TotalSteps TotalDistance TrackerDistance VeryActiveDistance
## 1 13162 8.50 8.50 1.88
## 2 10735 6.97 6.97 1.57
## 3 10460 6.74 6.74 2.44
## 4 9762 6.28 6.28 2.14
## 5 12669 8.16 8.16 2.71
## 6 9705 6.48 6.48 3.19
## ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
## 1 0.55 6.06 0
## 2 0.69 4.71 0
## 3 0.40 3.91 0
## 4 1.26 2.83 0
## 5 0.41 5.04 0
## 6 0.78 2.51 0
## VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
## 1 25 13 328 728
## 2 21 19 217 776
## 3 30 11 181 1218
## 4 29 34 209 726
## 5 36 10 221 773
## 6 38 20 164 539
## Calories Day TotalActiveMinutes
## 1 1985 Tuesday 366
## 2 1797 Wednesday 257
## 3 1776 Thursday 222
## 4 1745 Friday 272
## 5 1863 Saturday 267
## 6 1728 Sunday 222
skim_without_charts(fitBitData_merged)
| Name | fitBitData_merged |
| Number of rows | 823 |
| Number of columns | 19 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| Date | 1 |
| numeric | 16 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Id | 0 | 1.00 | 10 | 10 | 0 | 33 | 0 |
| Day | 27 | 0.97 | 6 | 9 | 0 | 7 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| Date | 0 | 1 | 2016-04-12 | 2016-05-12 | 2016-04-26 | 31 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| TotalSleepRecords | 413 | 0.50 | 1.12 | 0.35 | 1 | 1.00 | 1.00 | 1.00 | 3.00 |
| TotalMinutesAsleep | 413 | 0.50 | 419.17 | 118.64 | 58 | 361.00 | 432.50 | 490.00 | 796.00 |
| TotalTimeInBed | 413 | 0.50 | 458.48 | 127.46 | 61 | 403.75 | 463.00 | 526.00 | 961.00 |
| TotalSteps | 27 | 0.97 | 8375.21 | 4787.27 | 4 | 4932.00 | 7974.50 | 11136.25 | 36019.00 |
| TotalDistance | 27 | 0.97 | 6.04 | 3.78 | 0 | 3.38 | 5.59 | 7.97 | 28.03 |
| TrackerDistance | 27 | 0.97 | 6.02 | 3.76 | 0 | 3.38 | 5.59 | 7.91 | 28.03 |
| VeryActiveDistance | 27 | 0.97 | 1.69 | 2.82 | 0 | 0.00 | 0.41 | 2.29 | 21.92 |
| ModeratelyActiveDistance | 27 | 0.97 | 0.63 | 0.93 | 0 | 0.00 | 0.31 | 0.87 | 6.48 |
| LightActiveDistance | 27 | 0.97 | 3.63 | 1.84 | 0 | 2.34 | 3.55 | 4.88 | 10.71 |
| SedentaryActiveDistance | 27 | 0.97 | 0.00 | 0.01 | 0 | 0.00 | 0.00 | 0.00 | 0.11 |
| VeryActiveMinutes | 27 | 0.97 | 23.63 | 34.42 | 0 | 0.00 | 7.00 | 36.00 | 210.00 |
| FairlyActiveMinutes | 27 | 0.97 | 14.96 | 20.73 | 0 | 0.00 | 8.00 | 21.00 | 143.00 |
| LightlyActiveMinutes | 27 | 0.97 | 209.27 | 95.77 | 0 | 147.00 | 206.00 | 268.25 | 518.00 |
| SedentaryMinutes | 27 | 0.97 | 953.12 | 280.83 | 0 | 721.75 | 1018.50 | 1190.25 | 1440.00 |
| Calories | 27 | 0.97 | 2366.88 | 721.52 | 52 | 1850.50 | 2195.50 | 2859.25 | 4900.00 |
| TotalActiveMinutes | 27 | 0.97 | 247.87 | 104.22 | 0 | 182.75 | 257.00 | 321.00 | 552.00 |
sum(duplicated(fitBitData_merged))
## [1] 0
n_distinct(fitBitData_merged$Id)
## [1] 33
sapply(fitBitData_merged, function(x) n_distinct(x))
## Id Date TotalSleepRecords
## 33 31 4
## TotalMinutesAsleep TotalTimeInBed TotalSteps
## 257 243 779
## TotalDistance TrackerDistance VeryActiveDistance
## 588 586 321
## ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
## 207 466 10
## VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes
## 123 82 325
## SedentaryMinutes Calories Day
## 523 669 8
## TotalActiveMinutes
## 356
fitBitData_merged <-
fitBitData_merged %>%
replace(is.na(fitBitData_merged), 0)
activityData %>%
select(TotalSteps,
TotalDistance,
VeryActiveDistance,
ModeratelyActiveDistance,
LightActiveDistance,
SedentaryActiveDistance,
VeryActiveMinutes,
FairlyActiveMinutes,
LightlyActiveMinutes,
SedentaryMinutes,
Calories) %>%
summary()
## TotalSteps TotalDistance VeryActiveDistance ModeratelyActiveDistance
## Min. : 4 Min. : 0.000 Min. : 0.000 Min. :0.0000
## 1st Qu.: 4932 1st Qu.: 3.380 1st Qu.: 0.000 1st Qu.:0.0000
## Median : 7974 Median : 5.590 Median : 0.415 Median :0.3100
## Mean : 8375 Mean : 6.036 Mean : 1.687 Mean :0.6288
## 3rd Qu.:11136 3rd Qu.: 7.973 3rd Qu.: 2.292 3rd Qu.:0.8700
## Max. :36019 Max. :28.030 Max. :21.920 Max. :6.4800
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## Min. : 0.000 Min. :0.000000 Min. : 0.00
## 1st Qu.: 2.337 1st Qu.:0.000000 1st Qu.: 0.00
## Median : 3.550 Median :0.000000 Median : 7.00
## Mean : 3.628 Mean :0.001796 Mean : 23.63
## 3rd Qu.: 4.880 3rd Qu.:0.000000 3rd Qu.: 36.00
## Max. :10.710 Max. :0.110000 Max. :210.00
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 52
## 1st Qu.: 0.00 1st Qu.:147.0 1st Qu.: 721.8 1st Qu.:1850
## Median : 8.00 Median :206.0 Median :1018.5 Median :2196
## Mean : 14.96 Mean :209.3 Mean : 953.1 Mean :2367
## 3rd Qu.: 21.00 3rd Qu.:268.2 3rd Qu.:1190.2 3rd Qu.:2859
## Max. :143.00 Max. :518.0 Max. :1440.0 Max. :4900
sleepData %>%
select(TotalSleepRecords,
TotalMinutesAsleep,
TotalTimeInBed) %>%
summary()
## TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## Min. :1.00 Min. : 58.0 Min. : 61.0
## 1st Qu.:1.00 1st Qu.:361.0 1st Qu.:403.8
## Median :1.00 Median :432.5 Median :463.0
## Mean :1.12 Mean :419.2 Mean :458.5
## 3rd Qu.:1.00 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :3.00 Max. :796.0 Max. :961.0
weightData %>%
select(WeightKg,
WeightPounds,
BMI) %>%
summary()
## WeightKg WeightPounds BMI
## Min. : 52.60 Min. :116.0 Min. :21.45
## 1st Qu.: 61.40 1st Qu.:135.4 1st Qu.:23.96
## Median : 62.50 Median :137.8 Median :24.39
## Mean : 72.04 Mean :158.8 Mean :25.19
## 3rd Qu.: 85.05 3rd Qu.:187.5 3rd Qu.:25.56
## Max. :133.50 Max. :294.3 Max. :47.54
I aggregated the dataset to obtain average values for each participant.
weightAggregated <-
weightData %>%
group_by(Id) %>%
summarise(BodyWeightKg = mean(WeightKg),
BMIv = mean(BMI)) %>%
mutate(WeightStatus = case_when(BMIv > 29.9 ~ 'Obese',
BMIv> 24.9 ~ 'Overweight',
BMIv < 18.5 ~ 'Underweight',
TRUE ~ 'Normal Weight'))
sleepAggregated <-
sleepData %>%
group_by(Id) %>%
summarise(SleepRecord = sum(TotalSleepRecords),
TotalHoursAsleep = (sum(TotalMinutesAsleep)/60),
AveHoursAsleep = (mean(TotalMinutesAsleep)/60),
AveTimeInBed = (mean(TotalTimeInBed)/60)) %>%
mutate(NatureofSleep = case_when(AveHoursAsleep < 7 ~ 'Poor Sleep',
TRUE ~ 'Good Sleep'))
activityAggregated <-
activityData %>%
group_by(Id) %>%
summarise(AverageSteps = mean(TotalSteps),
AverageDistance = mean(TotalDistance),
VeryActiveHours = (mean(VeryActiveMinutes)/60),
FairlyActiveHours = (mean(FairlyActiveMinutes)/60),
LightlyActiveHours = (mean(LightlyActiveMinutes)/60),
AveSedentaryHours = (mean(SedentaryMinutes)/60),
TotalActiveHours = (VeryActiveHours + FairlyActiveHours + LightlyActiveHours),
AveCaloriesSpent = mean(Calories))
aggregatedData <-
list(sleepAggregated, weightAggregated, activityAggregated) %>%
reduce(full_join, by="Id")
head(aggregatedData)
## # A tibble: 6 × 17
## Id Sleep…¹ Total…² AveHo…³ AveTi…⁴ Natur…⁵ BodyW…⁶ BMIv Weigh…⁷ Avera…⁸
## <chr> <int> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <dbl>
## 1 1503960… 27 150. 6.00 6.39 Poor S… 52.6 22.6 Normal… 12521.
## 2 1644430… 4 19.6 4.9 5.77 Poor S… NA NA <NA> 7283.
## 3 1844505… 3 32.6 10.9 16.0 Good S… NA NA <NA> 3622.
## 4 1927972… 8 34.8 6.95 7.30 Poor S… 134. 47.5 Obese 1669.
## 5 2026352… 28 236. 8.44 8.96 Good S… NA NA <NA> 5567.
## 6 2320127… 1 1.02 1.02 1.15 Poor S… NA NA <NA> 4717.
## # … with 7 more variables: AverageDistance <dbl>, VeryActiveHours <dbl>,
## # FairlyActiveHours <dbl>, LightlyActiveHours <dbl>, AveSedentaryHours <dbl>,
## # TotalActiveHours <dbl>, AveCaloriesSpent <dbl>, and abbreviated variable
## # names ¹SleepRecord, ²TotalHoursAsleep, ³AveHoursAsleep, ⁴AveTimeInBed,
## # ⁵NatureofSleep, ⁶BodyWeightKg, ⁷WeightStatus, ⁸AverageSteps
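The per-participant aggregation can also be sketched in base R. The values below are made up; the 7-hour cut-off and the "Poor Sleep" / "Good Sleep" labels follow the code above, while `aggregate()` is an assumed stand-in for the `group_by()` and `summarise()` calls.

```r
# Toy sleep records: two nights for each of two participants
sleep_toy <- data.frame(
  Id = c("A", "A", "B", "B"),
  TotalMinutesAsleep = c(360, 420, 480, 540)
)

# Mean minutes asleep per participant
sleep_avg <- aggregate(TotalMinutesAsleep ~ Id, data = sleep_toy, FUN = mean)

# Convert to hours and label with the same 7-hour threshold
sleep_avg$AveHoursAsleep <- sleep_avg$TotalMinutesAsleep / 60
sleep_avg$NatureofSleep <- ifelse(sleep_avg$AveHoursAsleep < 7,
                                  "Poor Sleep", "Good Sleep")
```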
ggplot(data = activityData) +
aes(x = (Day), y = TotalSteps) +
geom_col(fill = 'blue') +
labs(x = 'Day of week', y = 'Total Steps', title = 'Total steps per Day')
ggplot(data = activityData) +
aes(x = (Day), y = TotalActiveMinutes) +
geom_col(fill = 'blue') +
labs(x = 'Day of week',
y = 'Time Active (Minutes)',
title = 'Total Active Minutes per Day')
ggplot(data = activityData) +
aes(x = (Day), y = Calories) +
geom_col(fill = 'blue') +
labs(x = 'Day of week',
y = 'Calories Expended',
title = 'Daily Calories Expended')
From the plots, the highest totals were observed on Tuesdays, Wednesdays, Thursdays and Saturdays (in descending order), while Sundays had the lowest. Because geom_col() stacks every record for a given day, these totals also reflect how many records were logged on each day.
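A per-day total depends both on how active participants were and on how many records exist for that day. The sketch below, on made-up values, puts the weekdays in calendar order and compares totals with means; the ordering starting on Monday is an assumption.

```r
# Toy records: three Tuesday entries, two Sunday entries
steps_toy <- data.frame(
  Day = c("Tuesday", "Tuesday", "Tuesday", "Sunday", "Sunday"),
  TotalSteps = c(8000, 9000, 10000, 9500, 10500)
)

# Order weekdays by calendar position instead of alphabetically
day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")
steps_toy$Day <- factor(steps_toy$Day, levels = day_levels)

totals <- tapply(steps_toy$TotalSteps, steps_toy$Day, sum)   # what stacked bars show
means  <- tapply(steps_toy$TotalSteps, steps_toy$Day, mean)  # typical day
counts <- table(steps_toy$Day)                               # records per day
```

In this toy example Tuesday has the larger total only because it has more records, while Sunday has the higher mean, so it can help to inspect both.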
# Showing relationships for Calorie expenditure
ggplot(data = activityData) +
aes(x= TotalActiveMinutes, y = Calories) +
geom_point(color = 'orange') +
geom_smooth() +
labs(x = 'Time Active (minutes)',
y = 'Calories Expended',
title = 'Calories Expended vs Time Active')
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(data = activityData) +
aes(x= TotalSteps, y = Calories) +
geom_point(color = 'orange') +
geom_smooth() +
labs(x = 'Total Steps',
y = 'Calories Expended',
title = 'Calories Expended vs Total Steps')
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(data = activityData) +
aes(x= SedentaryMinutes, y = Calories) +
geom_point(color = 'orange') +
geom_smooth() +
labs(x = 'Time Sedentary (minutes)',
y = 'Calories Expended',
title = 'Calories Expended vs Time Sedentary (minutes)')
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
In the plots of calories against time active and steps taken, the trendline slopes upwards, indicating a positive relationship between calorie expenditure and physical activity.
In the plot against sedentary time, the trendline slopes downwards, indicating a negative relationship: the more sedentary a participant is, the fewer calories are expended.
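The direction of these relationships can also be summarised numerically with a correlation coefficient. The sketch below uses made-up values; Pearson correlation measures only linear association, so it complements rather than replaces the smoothed trendlines.

```r
# Toy daily records (illustrative values, not from the dataset)
toy <- data.frame(
  TotalActiveMinutes = c(100, 200, 300, 400),
  SedentaryMinutes   = c(1200, 1000, 900, 700),
  Calories           = c(1700, 2100, 2400, 2900)
)

# Positive value: calories rise with active time
r_active <- cor(toy$TotalActiveMinutes, toy$Calories)

# Negative value: calories fall as sedentary time rises
r_sedentary <- cor(toy$SedentaryMinutes, toy$Calories)
```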
# analysing the participants' activities per time
VeryActiveMins <- sum(activityData$VeryActiveMinutes)
FairlyActiveMins <- sum(activityData$FairlyActiveMinutes)
LightlyActiveMins <- sum(activityData$LightlyActiveMinutes)
SedentaryMins <- sum(activityData$SedentaryMinutes)
total <- (VeryActiveMins + FairlyActiveMins + LightlyActiveMins + SedentaryMins)
VeryActive <- round((VeryActiveMins/total) * 100,0)
FairlyActive <- round((FairlyActiveMins / total)* 100,0)
LightlyActive <- round((LightlyActiveMins / total)* 100,0)
Sedentary <- round((SedentaryMins / total)* 100,0)
DF <-
data.frame(Activity=c("Very Active", "Fairly", "Lightly", "Sedentary" ),
Time = c(VeryActiveMins, FairlyActiveMins, LightlyActiveMins, SedentaryMins)
)
ggplot(data = DF) +
geom_col() +
aes(x= (Time)/60, y= Activity) +
labs(x = 'Time (Hours)',
y = 'Activity Type',
title = 'Time Spent per Activity Type')
The plot shows that participants spent about 79% of their recorded time (approximately 12,645 hours) in a sedentary state, and only about 20% in active states.
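The percentage shares are each category's minutes divided by the total minutes across all categories. A worked sketch with made-up minutes (not the dataset totals):

```r
# Toy minutes per activity level
minutes <- c("Very Active" = 24, "Fairly Active" = 12,
             "Lightly Active" = 204, "Sedentary" = 960)

# Share of total tracked time, rounded to whole percent
share_pct <- round(100 * minutes / sum(minutes))

# Same quantities expressed in hours, as on the plot's x-axis
hours <- minutes / 60
```

With these toy values the sedentary share is 960 / 1200 = 80%, and the three active categories together account for the remaining 20%.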
ggplot(data = sleepData) +
geom_boxplot() +
aes(y= TotalMinutesAsleep/60) +
labs(y = 'Time Asleep (Hours)',
title = 'Sleep Time Plot')
ggplot(data = sleepAggregated) +
geom_bar() +
aes(y= NatureofSleep) +
labs(y = 'Nature of Sleep',
x= 'Number of Participants',
title = 'Nature of Sleep Plot')
These show that the median sleep record was about 432 minutes (7.2 hours). On average, 13 participants slept less than 7 hours per night, while 11 participants slept 7 hours or more.
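The counts behind such a summary can be sketched from per-participant averages. The six values below are the rounded `AveHoursAsleep` figures shown in the preview of `aggregatedData`, so the counts cover only those six participants, not all 24.

```r
# Average hours asleep for the six participants shown in the preview
ave_hours <- c(6.00, 4.9, 10.9, 6.95, 8.44, 1.02)

# Same 7-hour threshold as the NatureofSleep label
n_poor <- sum(ave_hours < 7)
n_good <- sum(ave_hours >= 7)

# Median of the per-participant averages
median_hours <- median(ave_hours)
```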
ggplot(data = weightData) +
geom_boxplot() +
aes(y= WeightKg) +
labs(y = 'Weight (Kg)',
title = 'Weight Plot')
ggplot(data = weightData) +
geom_histogram() +
aes(x= WeightKg) +
labs(x = 'Weight (Kg)',
y= 'Number of Participants',
title = 'Weight Plot')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = weightData) +
geom_bar() +
aes(x= WeightStatus) +
labs(x = 'Weight Status',
y= 'Number of Participants',
title = 'Weight Distribution Plot')
ggplot(data = weightAggregated) +
geom_col() +
aes(y= BodyWeightKg, x=Id) +
labs(y = 'Weight (Kg)',
x= 'Participants',
title = 'Weight Distribution Plot')
ggplot(data = weightAggregated) +
geom_col() +
aes(y= BMIv, x=Id) +
labs(x = 'Participants',
y='Body Mass Index' ,
title = 'Weight Distribution Plot')
These show a median recorded weight of about 62.5 kg.
Of the 8 participants who logged their body weight,
3 had Body Mass Index (BMI) values between 18.5 and 24.9 (Normal weight);
4 were between 25 and 29.9 (Overweight);
while 1 participant was above 30 (Obese).
Bellabeat has built its moat (i.e. competitive advantage) by developing technology and services that focus on women’s health and well-being. It is therefore important to keep upgrading and expanding its products to meet the needs of its users.
Of the 33 participants in the dataset, only 8 recorded their weight and 24 recorded sleep events. This poor compliance may be attributed to the following:
1. For weight records, the manual input of weight values. To mitigate this, smart devices should offer automated and flexible capture of users’ body weight, for accurate and consistent data.
2. For sleep records, the habit of charging devices at bedtime. Improvements in power and battery specifications (e.g. a fast-charge feature or longer-lasting batteries) can reduce such gaps.
Programs aimed at empowering women with current information on health and wellness can be run via the Bellabeat app. The app can also explain how the various Bellabeat products can be used to achieve users’ health and wellness goals.
The Leaf product can be marketed on its ability to track health status effectively (e.g. heart rate) and to highlight problem areas where lifestyle modifications or medical interventions might be needed (e.g. a poor level of activity, shown by a low daily step count).
The Bellabeat membership product can be designed to encourage referrals among peers, with attractive incentives attached. It can also add a “Closest-Gyms-Near-You” feature to remind subscribers of the benefits of physical activity and exercise for general well-being.
Furberg, R., Brinton, J., Keating, M., & Ortiz, A. (2016). Crowd-sourced Fitbit datasets 03.12.2016-05.12.2016 [Data set]. Zenodo. (https://doi.org/10.5281/zenodo.53894)
Brinton, J., Keating, M., Ortiz, A., Evenson, K., & Furberg, R. (2017). Establishing Linkages Between Distributed Survey Responses and Consumer Wearable Device Datasets: A Pilot Protocol. JMIR Research Protocols, 6(4), e66. https://www.researchprotocols.org/2017/4/e66 (DOI: 10.2196/resprot.6513)