Introduction

Bellabeat is a high-tech company founded in 2013 that manufactures health-focused smart products for women. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their health and daily habits. The company has 5 products: the Bellabeat app, Leaf, Time, Spring and the Bellabeat membership. Our team has been asked to analyse smart device data to gain insight into how consumers use their devices. The insights from this analysis will help guide Bellabeat’s marketing strategy.

Business Task

Analyse a given data set and present recommendations to improve Bellabeat’s marketing strategy.

Key Stakeholders

Urška Sršen – Co-founder and Chief Creative Officer
Sando Mur – Co-founder and key member of the Bellabeat executive team

Data

The data used for this analysis is the Fitbit Fitness Tracker Data available on Kaggle.
It contains personal tracker data from 30 users who consented to its submission.
The data was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016 (month-first notation, i.e. 12 March to 12 May 2016).

This data set has two main limitations: it covers only 30 users, and the data was collected over a short time frame, from March to May 2016.

Coding

Loading Packages

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.0 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(lubridate)

## Loading required package: timechange
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(dplyr)
library(ggplot2)
library(tidyr)

Importing Datasets

I have already set my working directory.

dailyActivity_merged <- read_csv("Fitbit_Fitness_Tracker/dailyActivity_merged.csv")

## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Activity <- dailyActivity_merged
HeartRate <- read_csv("Fitbit_Fitness_Tracker/heartrate_seconds_merged.csv")

## Rows: 2483658 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Time
## dbl (2): Id, Value
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Weight <- read_csv("Fitbit_Fitness_Tracker/weightLogInfo_merged.csv")

## Rows: 67 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
## lgl (1): IsManualReport
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Sleep <- read.csv("Fitbit_Fitness_Tracker/sleepDay_merged.csv")
Calories <- read.csv("Fitbit_Fitness_Tracker/dailyCalories_merged.csv")

Formatting Time

class(Sleep$SleepDay)

## [1] "character"

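# Parse the 12-hour timestamp strings, then keep a month/day/year key for merging
Sleep$SleepDay <- as.POSIXct(Sleep$SleepDay, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone())
Sleep$date <- format(Sleep$SleepDay, format = "%m/%d/%y")

# ActivityDate holds a date only, so the format needs no time fields
Activity$ActivityDate <- as.POSIXct(Activity$ActivityDate, format = "%m/%d/%Y", tz = Sys.timezone())
Activity$date <- format(Activity$ActivityDate, format = "%m/%d/%y")

# Assumes Time uses the same 12-hour layout as SleepDay; a date-only format
# would silently reset every reading to midnight
HeartRate$Time <- as.POSIXct(HeartRate$Time, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone())
HeartRate$date <- format(HeartRate$Time, format = "%m/%d/%y")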
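
The format string passed to `as.POSIXct()` decides how much of each timestamp survives parsing. The sketch below uses two made-up timestamps (not taken from the data set) and a fixed `"UTC"` time zone to compare a full 12-hour format with a date-only format. It assumes an English or C locale, where `%p` matches AM/PM.

```r
# Hypothetical timestamps in month/day/year 12-hour notation
raw <- c("4/12/2016 12:00:00 AM", "4/12/2016 7:21:05 PM")

# Full format: the time of day is preserved
full <- as.POSIXct(raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Date-only format: the trailing time text is ignored, giving midnight
date_only <- as.POSIXct(raw, format = "%m/%d/%Y", tz = "UTC")

full_time      <- format(full, "%H:%M:%S")
date_only_time <- format(date_only, "%H:%M:%S")
day_key        <- format(full, "%m/%d/%y")  # key used for daily merges

print(full_time)
print(date_only_time)
print(day_key)
```

Both formats yield the same daily key, so the difference only matters when the time of day is analysed.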

Exploring and summarising data

n_distinct(Activity$Id)

## [1] 33

n_distinct(Calories$Id)

## [1] 33

n_distinct(Sleep$Id)

## [1] 24

n_distinct(HeartRate$Id)

## [1] 14

n_distinct(Weight$Id)

## [1] 8

The calories and activity tables each cover 33 users (three more than the 30 stated in the data set description), the sleep table covers 24 and the heart rate table 14. The weight table has only 8 users, which is not enough to draw conclusions about weight and fat percentage.

Let’s see what the activity data frame tells us:
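
The per-table counts can also be gathered in one step, together with the users who appear in every table. This is a minimal sketch on hypothetical stand-in tables; the `tables` list and its contents are made up for illustration.

```r
# Hypothetical stand-ins for the imported tables; only the Id column matters here
tables <- list(
  Activity = data.frame(Id = c(1, 1, 2, 3)),
  Sleep    = data.frame(Id = c(1, 2, 2)),
  Weight   = data.frame(Id = c(2))
)

# Number of distinct users in each table
users_per_table <- vapply(tables, function(d) length(unique(d$Id)), integer(1))

# Users present in every table
common_ids <- Reduce(intersect, lapply(tables, function(d) unique(d$Id)))

print(users_per_table)
print(common_ids)
```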

Activity %>%
  select(TotalSteps,
         VeryActiveMinutes,
         LightlyActiveMinutes,
         SedentaryMinutes) %>%
  summary()

##    TotalSteps    VeryActiveMinutes LightlyActiveMinutes SedentaryMinutes
##  Min.   :    0   Min.   :  0.00    Min.   :  0.0        Min.   :   0.0  
##  1st Qu.: 3790   1st Qu.:  0.00    1st Qu.:127.0        1st Qu.: 729.8  
##  Median : 7406   Median :  4.00    Median :199.0        Median :1057.5  
##  Mean   : 7638   Mean   : 21.16    Mean   :192.8        Mean   : 991.2  
##  3rd Qu.:10727   3rd Qu.: 32.00    3rd Qu.:264.0        3rd Qu.:1229.5  
##  Max.   :36019   Max.   :210.00    Max.   :518.0        Max.   :1440.0

If we assume these users are more active than the average person, a median of 7,406 steps a day is lower than expected and could be improved. Although moderate and intense exercise may bring more health benefits than walking 10,000-12,000 steps a day, achieving both is something to strive for. The mean sedentary time is 991 minutes (about 16.5 hours) and the maximum sedentary time is 1,440 minutes (24 hours). This needs to be addressed.
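
A day with 1,440 sedentary minutes and zero steps may simply be a day on which the tracker was not worn. The sketch below shows one way to gauge how much such days affect the mean. The data frame is hypothetical, and the non-wear rule is an assumption, not something the data set documents.

```r
# Hypothetical daily records; the column names mirror the activity table
days <- data.frame(
  TotalSteps       = c(0, 8200, 4100, 0, 11050),
  SedentaryMinutes = c(1440, 730, 1100, 1440, 650)
)

# Assumption: zero steps plus a full day of sedentary time means "not worn"
not_worn <- days$TotalSteps == 0 & days$SedentaryMinutes == 1440

mean_all  <- mean(days$SedentaryMinutes)
mean_worn <- mean(days$SedentaryMinutes[!not_worn])

cat("Flagged days:", sum(not_worn), "\n")
cat("Mean sedentary hours, all days:", round(mean_all / 60, 2), "\n")
cat("Mean sedentary hours, worn days:", round(mean_worn / 60, 2), "\n")
```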

HeartRate %>%
  select(Value) %>%
  summary()

##      Value       
##  Min.   : 36.00  
##  1st Qu.: 63.00  
##  Median : 73.00  
##  Mean   : 77.33  
##  3rd Qu.: 88.00  
##  Max.   :203.00

These are typical values overall, but the data does not indicate whether a reading was taken during exercise, rest or sleep.

Sleep %>%
  select(.) %>%
  filter(!complete.cases(.))

## data frame with 0 columns and 0 rows
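
Note that `select(.)` names no columns, so the result above is empty whatever the data contains. A more direct check counts missing values per column and lists the incomplete rows. It is sketched here on a small hypothetical data frame, not the real `Sleep` table.

```r
# Hypothetical data with one missing value
sleep_demo <- data.frame(
  Id                 = c(1, 2, 3),
  TotalMinutesAsleep = c(327, NA, 412),
  TotalTimeInBed     = c(346, 407, 442)
)

# Missing values per column
na_per_column <- colSums(is.na(sleep_demo))

# Rows with at least one missing value
incomplete_rows <- sleep_demo[!complete.cases(sleep_demo), ]

print(na_per_column)
print(incomplete_rows)
```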

Sleep %>%
  select(TotalMinutesAsleep) %>%
  summarise(mean(TotalMinutesAsleep))

##   mean(TotalMinutesAsleep)
## 1                 419.4673

Sleep %>%
  select(TotalMinutesAsleep) %>%
  summarise(median(TotalMinutesAsleep))

##   median(TotalMinutesAsleep)
## 1                        433

The mean sleeping time is just under 7 hours (419 minutes) and the median sleeping time is 7 hours 13 minutes (433 minutes).
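
The conversion from minutes to hours and minutes can be sketched as below; `to_hm` is a hypothetical helper introduced here for illustration.

```r
# Convert minutes to an "h min" label (to_hm is a hypothetical helper)
to_hm <- function(minutes) {
  total <- round(minutes)  # round to whole minutes first
  sprintf("%d h %d min", as.integer(total %/% 60), as.integer(total %% 60))
}

# Mean and median of TotalMinutesAsleep reported above
labels <- to_hm(c(419.4673, 433))
print(labels)
```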

Merging Data

merged_data <- merge(Sleep, Activity, by=c('Id', 'date'))
head(merged_data)

##           Id     date   SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 04/12/16 2016-04-12                 1                327
## 2 1503960366 04/13/16 2016-04-13                 2                384
## 3 1503960366 04/15/16 2016-04-15                 1                412
## 4 1503960366 04/16/16 2016-04-16                 2                340
## 5 1503960366 04/17/16 2016-04-17                 1                700
## 6 1503960366 04/19/16 2016-04-19                 1                304
##   TotalTimeInBed ActivityDate TotalSteps TotalDistance TrackerDistance
## 1            346   2016-04-12      13162          8.50            8.50
## 2            407   2016-04-13      10735          6.97            6.97
## 3            442   2016-04-15       9762          6.28            6.28
## 4            367   2016-04-16      12669          8.16            8.16
## 5            712   2016-04-17       9705          6.48            6.48
## 6            320   2016-04-19      15506          9.88            9.88
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.14                     1.26
## 4                        0               2.71                     0.41
## 5                        0               3.19                     0.78
## 6                        0               3.53                     1.32
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                2.83                       0                29
## 4                5.04                       0                36
## 5                2.51                       0                38
## 6                5.03                       0                50
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  34                  209              726     1745
## 4                  10                  221              773     1863
## 5                  20                  164              539     1728
## 6                  31                  264              775     2035
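
By default `merge()` performs an inner join, so only `Id`/`date` pairs present in both tables are kept. The sketch below, on hypothetical tables, compares this with a left join and counts duplicated keys, which would multiply rows in the result.

```r
# Hypothetical tables sharing Id and date keys
sleep_demo <- data.frame(
  Id   = c(1, 1, 2),
  date = c("04/12/16", "04/13/16", "04/12/16"),
  TotalMinutesAsleep = c(327, 384, 400)
)
activity_demo <- data.frame(
  Id   = c(1, 1, 2, 2),
  date = c("04/12/16", "04/14/16", "04/12/16", "04/13/16"),
  TotalSteps = c(13162, 9000, 7000, 8000)
)

# Inner join (the default): only Id/date pairs found in both tables
inner <- merge(sleep_demo, activity_demo, by = c("Id", "date"))

# Left join: keep every sleep record, with NA where activity is missing
left <- merge(sleep_demo, activity_demo, by = c("Id", "date"), all.x = TRUE)

# Duplicated keys would multiply rows in the merge, so count them beforehand
dup_keys <- sum(duplicated(sleep_demo[, c("Id", "date")]))

print(nrow(inner))
print(nrow(left))
print(dup_keys)
```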

Visualization

ggplot(data = Activity, aes(x = TotalDistance, y = VeryActiveDistance)) +
  geom_point(color = 'darkgreen') +
  labs(title = "Very Active Distance in relation to Total Distance")

cor(Activity$TotalDistance, Activity$VeryActiveDistance)

## [1] 0.7945816

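There is a fairly strong positive correlation (r ≈ 0.79) between total distance and very active distance. Part of this is expected, because very active distance is included in total distance. The data is consistent with the idea that people exercise more vigorously when they increase their walking or moving time, but it cannot show this on its own.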
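
Because very active distance is one component of total distance, some positive correlation is expected by construction. A minimal simulation sketch, with made-up independent components and an arbitrary seed, illustrates the effect.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Two independent, made-up distance components
very_active <- runif(1000, min = 0, max = 5)
other       <- runif(1000, min = 0, max = 5)
total       <- very_active + other

r_independent <- cor(very_active, other)  # components are unrelated
r_part_whole  <- cor(very_active, total)  # a part against its own total

print(round(r_independent, 2))
print(round(r_part_whole, 2))
```

The second correlation is clearly positive even though the two components are unrelated, so a part-to-total correlation should be read with caution.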

plot( TotalMinutesAsleep ~ TotalTimeInBed,
      data = Sleep,
      main = "Do people sleep if they go to bed?", 
      col.main = 'navyblue', 
      fg = 'blue',
      col = 'darkorange',
      col.axis = 'navyblue',
      col.lab = 'darkorange',
      xlab = "TotalTimeInBed",
      ylab = "TotalMinutesAsleep"
      )

cor(Sleep$TotalTimeInBed, Sleep$TotalMinutesAsleep)

## [1] 0.9304575

There is a strong correlation (r ≈ 0.93) between time spent in bed and time asleep. Bellabeat could recommend going to bed half an hour earlier to increase sleeping time. Customers could receive points for good quality sleep and get discounts on other Bellabeat products.
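
A related measure is the share of time in bed that is spent asleep. The sketch below computes it for the first three rows of the `merged_data` preview shown earlier; the names `efficiency` and `awake_in_bed` are introduced here for illustration and are not columns of the data set.

```r
# First three rows from the merged_data preview
asleep <- c(327, 384, 412)  # TotalMinutesAsleep
in_bed <- c(346, 407, 442)  # TotalTimeInBed

# Share of time in bed spent asleep, and minutes awake in bed
efficiency   <- asleep / in_bed
awake_in_bed <- in_bed - asleep

print(round(efficiency, 3))
print(awake_in_bed)
```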

ggplot(data=merged_data, aes(x = TotalMinutesAsleep, y = LightlyActiveMinutes)) +
  geom_point( color = 'darkgreen') + geom_smooth( color = 'red') +
  labs(title = 'Sleep and Activity')

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(data=merged_data, aes(x = TotalMinutesAsleep, y = FairlyActiveMinutes)) +
  geom_point( color = 'darkgreen') + geom_smooth( color = 'red') +
  labs(title = 'Sleep and Activity')

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(data=merged_data, aes(x = TotalMinutesAsleep, y = VeryActiveMinutes)) +
  geom_point( color = 'darkgreen') + geom_smooth( color = 'red') +
  labs(title = 'Sleep and Activity')

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(data=merged_data, aes(x = TotalMinutesAsleep, y = SedentaryMinutes)) +
  geom_point( color = 'darkgreen') + geom_smooth( color = 'red') +
  labs(title = 'Sleep and Activity')

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

cor(merged_data$TotalMinutesAsleep, merged_data$SedentaryMinutes)

## [1] -0.599394

There seems to be no correlation between sleep time and active time. Sleep time has a moderate negative correlation with sedentary minutes (r ≈ -0.60). According to this data, however, sleep time should not be the focus when trying to encourage a more active lifestyle.
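
The same comparison can be made numerically for every activity column at once. The sketch below uses only the six rows from the `merged_data` preview, which all belong to one user, so the resulting values are illustrative and not representative of the full data set.

```r
# The six rows from the merged_data preview (one user; illustrative only)
preview <- data.frame(
  TotalMinutesAsleep   = c(327, 384, 412, 340, 700, 304),
  VeryActiveMinutes    = c(25, 21, 29, 36, 38, 50),
  FairlyActiveMinutes  = c(13, 19, 34, 10, 20, 31),
  LightlyActiveMinutes = c(328, 217, 209, 221, 164, 264),
  SedentaryMinutes     = c(728, 776, 726, 773, 539, 775)
)

activity_cols <- setdiff(names(preview), "TotalMinutesAsleep")

# Pearson correlation of sleep time with each activity column
sleep_cor <- vapply(
  activity_cols,
  function(col) cor(preview$TotalMinutesAsleep, preview[[col]]),
  numeric(1)
)

print(round(sleep_cor, 2))
```

Applied to the full `merged_data`, the same call would give one coefficient per activity level next to the sedentary figure.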

Recommendations for Bellabeat

There is no gender information about the participants, which may bias these recommendations because Bellabeat’s target audience is women. Given the limited data set analysed, I can suggest the following:

Increasing average total steps may improve overall wellness. A weekly goal in the app that adapts to a user’s previous records may encourage women to walk more.
Sedentary time is the worst indicator analysed and has to be addressed. Notifications to stand up, stretch or take a short walk around the office may reduce it.
Mean sleeping time is just under 7 hours, and there is a strong correlation between time asleep and time spent in bed. As waking time is often non-negotiable, an in-app reward for going to bed earlier may increase sleep time and possibly its quality.
More data has to be collected from users to get more direct insights; data collected from women using the Bellabeat app and its other products would suit the business better.
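
As an illustration of the first recommendation, an adaptive weekly goal could be derived from the previous week’s records. The rule below is purely hypothetical: `next_weekly_goal`, the 5% increase and the 10,000-step cap are assumptions made for this sketch, not Bellabeat features.

```r
# Hypothetical rule: raise last week's daily average by a small step, up to a cap
next_weekly_goal <- function(daily_steps, increase = 0.05, cap = 10000) {
  baseline <- mean(daily_steps, na.rm = TRUE)
  round(min(baseline * (1 + increase), cap))
}

goal_typical <- next_weekly_goal(rep(7000, 7))   # average of 7,000 steps a day
goal_capped  <- next_weekly_goal(rep(12000, 7))  # already above the cap

print(goal_typical)
print(goal_capped)
```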

Session Info

sessionInfo()

## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22621)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8 
## [2] LC_CTYPE=English_United Kingdom.utf8   
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] lubridate_1.9.0  timechange_0.1.1 forcats_0.5.2    stringr_1.5.0   
##  [5] dplyr_1.0.10     purrr_1.0.0      readr_2.1.3      tidyr_1.2.1     
##  [9] tibble_3.1.8     ggplot2_3.4.0    tidyverse_1.3.2 
## 
## loaded via a namespace (and not attached):
##  [1] lattice_0.20-45     assertthat_0.2.1    digest_0.6.31      
##  [4] utf8_1.2.2          R6_2.5.1            cellranger_1.1.0   
##  [7] backports_1.4.1     reprex_2.0.2        evaluate_0.19      
## [10] highr_0.10          httr_1.4.4          pillar_1.8.1       
## [13] rlang_1.0.6         googlesheets4_1.0.1 readxl_1.4.1       
## [16] rstudioapi_0.14     jquerylib_0.1.4     Matrix_1.5-1       
## [19] rmarkdown_2.19      splines_4.2.2       labeling_0.4.2     
## [22] googledrive_2.0.0   bit_4.0.5           munsell_0.5.0      
## [25] broom_1.0.2         compiler_4.2.2      modelr_0.1.10      
## [28] xfun_0.36           pkgconfig_2.0.3     mgcv_1.8-41        
## [31] htmltools_0.5.4     tidyselect_1.2.0    fansi_1.0.3        
## [34] crayon_1.5.2        tzdb_0.3.0          dbplyr_2.2.1       
## [37] withr_2.5.0         grid_4.2.2          nlme_3.1-160       
## [40] jsonlite_1.8.4      gtable_0.3.1        lifecycle_1.0.3    
## [43] DBI_1.1.3           magrittr_2.0.3      scales_1.2.1       
## [46] cli_3.5.0           stringi_1.7.8       vroom_1.6.0        
## [49] cachem_1.0.6        farver_2.1.1        fs_1.5.2           
## [52] xml2_1.3.3          bslib_0.4.2         ellipsis_0.3.2     
## [55] generics_0.1.3      vctrs_0.5.1         tools_4.2.2        
## [58] bit64_4.0.5         glue_1.6.2          hms_1.1.2          
## [61] parallel_4.2.2      fastmap_1.1.0       yaml_2.3.6         
## [64] colorspace_2.0-3    gargle_1.2.1        rvest_1.0.3        
## [67] knitr_1.41          haven_2.5.1         sass_0.4.4

Bellabeat Case Study using R

Google Data Analytics Capstone Project

Sami Parkkali

14/01/2023