Case Study: Bellabeat

Summary

This case study analyzes fitness tracker data from Fitbit users and provides business recommendations to Bellabeat, a wellness company that manufactures high-tech, health-focused products for women. The conclusions of the analysis are intended to give the Bellabeat app new growth strategies and ideas.

Ask Phase

Business task

Provide marketing strategy recommendations for Bellabeat by analyzing trends in the usage of non-Bellabeat smart devices and drawing insights from them.

Our key stakeholders include the following:

Urška Sršen - Bellabeat cofounder and Chief Creative Officer;
Sando Mur - Bellabeat cofounder and key member of Bellabeat executive team;
Bellabeat Marketing Analytics team.

We focus our recommendations on the Bellabeat app, a Bellabeat product that provides users with personal health statistics on their activity, sleep, stress, menstrual cycle, and mindfulness habits. It is designed to empower women to lead a healthy lifestyle by keeping them informed, accountable, and motivated. The Bellabeat app collects data from the company’s smart wellness products.

Prepare Phase

Dataset information

The dataset used for this case study is Fitbit Fitness Tracker Data which is publicly available on Kaggle: https://www.kaggle.com/datasets/arashnic/fitbit.

It contains personal fitness tracker information, such as physical activity, heart rate and sleep, from Fitbit users who submitted their data through an Amazon Mechanical Turk survey between 04/12/2016 and 05/12/2016. The dataset is described as covering thirty users, although the tables we use contain up to 33 unique user IDs.

The dataset is open source. It can be viewed, downloaded, modified and reused by the public. Hence, we do not need the owner’s permission to use the dataset for this project.

Data organization

The dataset contains 18 CSV documents, all in a long format. Each Fitbit user has a unique ID and multiple rows allocated to them tracking different attributes by day, hour or minute.

Because the sample is small (about 30 users), we inspect the tables in Microsoft Excel and build pivot tables to better understand our dataset. After sorting and filtering the data via pivot tables, we count the number of observations and confirm that the observation period spans 31 days.

We observe that some tables are products of other tables merged together (e.g. dailyActivity_merged contains the same information as dailyCalories_merged, dailyIntensities_merged and dailySteps_merged combined). Other tables, such as heartrate_seconds_merged and weightLogInfo_merged, contain heart rate and weight stats for only 7 and 8 users respectively. We disregard these files because samples this small cannot be representative of the population and are unsuitable for identifying general trends.

We choose the following datasets to proceed with our analysis:

dailyActivity_merged: daily activity stats of 33 users over the period of 31 days. Columns include steps, distance, intensities, calories;
sleepDay_merged: daily sleep logs of 24 users during 31 days. Columns include: number of sleep records a day, total minutes asleep and total minutes in bed;
hourlyIntensities_merged: hourly total and average intensity of 33 users over 31 days;
hourlySteps_merged: hourly total steps of 33 users during the period of 31 days.

We read up on Fitbit’s measurement of active minutes to understand what the intensity metric means. Active minutes are minutes in which you exert more energy than at rest, pushing your metabolic equivalent of task (MET) above the resting value of 1 (sedentary minutes). Light-intensity activities such as slow walking require roughly twice the energy (MET ≈ 2) that people spend at rest. Fair-intensity activities (MET from 3 to 6) include fast walking and yoga, while high-intensity exercise (MET >= 6) involves running, playing soccer, fast cycling, etc.
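The MET bands described above can be written down as a small lookup. This is only a sketch: the function name met_category and the exact cutoffs (treating values above 1 and below 3 as light, and exactly 6 as high) are our assumptions for illustration, not Fitbit's published algorithm.

```r
# Hypothetical helper: map MET values to the intensity bands described in the text.
# Assumed cutoffs: <= 1 sedentary, (1, 3) light, [3, 6) fair, >= 6 high.
met_category <- function(met) {
  ifelse(met <= 1, "sedentary",
  ifelse(met < 3,  "light",
  ifelse(met < 6,  "fair", "high")))
}

# Made-up MET values, one per band plus the boundary at 6.
met_category(c(1, 2, 4.5, 6, 8))
```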

Data limitations

The number of study participants is low and might not represent the whole population accurately. We do not have any demographic information on the Fitbit users, so we cannot tell whether a sampling bias is present. The data was collected in 2016 and covers only one month. Given these limitations, we conduct this case study to practice data analysis skills.

Process Phase

We perform our analysis in R, using RStudio, because we can manipulate, process, and visualize data all in one place. The R language is beginner-friendly, and R packages make data analysis quick and efficient.

Installing R packages

We first install and load packages needed for the analysis:

install.packages("tidyverse")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

library(tidyverse)

## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──

## ✔ ggplot2 3.4.0     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

install.packages("skimr")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

library(skimr)
install.packages("janitor")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

library(janitor)

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

install.packages("lubridate")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

library(lubridate)

## Loading required package: timechange
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

install.packages("here")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

library(here)

## here() starts at /cloud/project

install.packages("ggpubr")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

library(ggpubr)

Importing datasets

We upload and import the four datasets we use for this case study.

daily_activity <- read.csv("dailyActivity_merged.csv")
daily_sleep <- read.csv("sleepDay_merged.csv")
hourly_steps <- read.csv("hourlySteps_merged.csv")
hourly_intensities <- read.csv("hourlyIntensities_merged.csv")

Previewing datasets

We preview the datasets to ensure all data was imported correctly.

head(daily_activity)

##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366    4/12/2016      13162          8.50            8.50
## 2 1503960366    4/13/2016      10735          6.97            6.97
## 3 1503960366    4/14/2016      10460          6.74            6.74
## 4 1503960366    4/15/2016       9762          6.28            6.28
## 5 1503960366    4/16/2016      12669          8.16            8.16
## 6 1503960366    4/17/2016       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728

head(daily_sleep)

##           Id              SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304
##   TotalTimeInBed
## 1            346
## 2            407
## 3            442
## 4            367
## 5            712
## 6            320

head(hourly_steps)

##           Id          ActivityHour StepTotal
## 1 1503960366 4/12/2016 12:00:00 AM       373
## 2 1503960366  4/12/2016 1:00:00 AM       160
## 3 1503960366  4/12/2016 2:00:00 AM       151
## 4 1503960366  4/12/2016 3:00:00 AM         0
## 5 1503960366  4/12/2016 4:00:00 AM         0
## 6 1503960366  4/12/2016 5:00:00 AM         0

head(hourly_intensities)

##           Id          ActivityHour TotalIntensity AverageIntensity
## 1 1503960366 4/12/2016 12:00:00 AM             20         0.333333
## 2 1503960366  4/12/2016 1:00:00 AM              8         0.133333
## 3 1503960366  4/12/2016 2:00:00 AM              7         0.116667
## 4 1503960366  4/12/2016 3:00:00 AM              0         0.000000
## 5 1503960366  4/12/2016 4:00:00 AM              0         0.000000
## 6 1503960366  4/12/2016 5:00:00 AM              0         0.000000

We use the glimpse function to further check every column and corresponding data type.

glimpse(daily_activity)

## Rows: 940
## Columns: 15
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate             <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ TotalSteps               <int> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes        <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes      <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes     <int> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes         <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories                 <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…

glimpse(daily_sleep)

## Rows: 413
## Columns: 5
## $ Id                 <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150…
## $ SleepDay           <chr> "4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM", "…
## $ TotalSleepRecords  <int> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ TotalMinutesAsleep <int> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2…
## $ TotalTimeInBed     <int> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3…

glimpse(hourly_steps)

## Rows: 22,099
## Columns: 3
## $ Id           <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityHour <chr> "4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM", "4/12/20…
## $ StepTotal    <int> 373, 160, 151, 0, 0, 0, 0, 0, 250, 1864, 676, 360, 253, 2…

glimpse(hourly_intensities)

## Rows: 22,099
## Columns: 4
## $ Id               <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 15039…
## $ ActivityHour     <chr> "4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM", "4/1…
## $ TotalIntensity   <int> 20, 8, 7, 0, 0, 0, 0, 0, 13, 30, 29, 12, 11, 6, 36, 5…
## $ AverageIntensity <dbl> 0.333333, 0.133333, 0.116667, 0.000000, 0.000000, 0.0…

Data cleaning

We start cleaning and formatting our dataframes. First, we check for duplicates and delete them if present.

sum(duplicated(daily_activity))

## [1] 0

sum(duplicated(daily_sleep))

## [1] 3

sum(duplicated(hourly_intensities))

## [1] 0

sum(duplicated(hourly_steps))

## [1] 0

The daily_sleep dataframe has three duplicate rows, which we remove to begin our cleaning process.

daily_sleep <- daily_sleep %>% 
  distinct()
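Before removing duplicates, it can be useful to look at the rows that will be dropped. The sketch below uses a small made-up table (toy_sleep is not part of the dataset) and base R's duplicated() and unique(), which here behave like distinct() on whole rows.

```r
# Toy data (made-up values) with one repeated row, mimicking a duplicated sleep log.
toy_sleep <- data.frame(
  Id = c(1, 1, 1, 2),
  SleepDay = c("4/12/2016", "4/13/2016", "4/13/2016", "4/12/2016"),
  TotalMinutesAsleep = c(327, 384, 384, 410)
)

# duplicated() flags the second and later copies of a row,
# so this subset shows exactly what will be dropped.
dropped_rows <- toy_sleep[duplicated(toy_sleep), ]
print(dropped_rows)

# unique() keeps the first copy of each row.
deduped <- unique(toy_sleep)
nrow(deduped)
```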

We double check if the duplicates have been removed.

sum(duplicated(daily_sleep))

## [1] 0

Duplicates have been removed successfully. We then drop any rows that contain missing values in all four datasets.

daily_activity <- daily_activity %>% 
  drop_na()
daily_sleep <- daily_sleep %>% 
  drop_na()
hourly_intensities <- hourly_intensities %>% 
  drop_na()
hourly_steps <- hourly_steps %>% 
  drop_na()
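Since dropping rows silently shrinks the data, a quick count of missing values shows how much would be lost. This is a sketch on made-up data (toy_activity and its values are assumptions); complete.cases() is used as a base R stand-in for what drop_na() does when called without column arguments.

```r
# Toy data (made-up values) with two missing entries in different rows.
toy_activity <- data.frame(
  Id = c(1, 1, 2, 2),
  TotalSteps = c(13162, NA, 9762, 12669),
  Calories = c(1985, 1797, NA, 1863)
)

# Missing values per column.
na_per_column <- colSums(is.na(toy_activity))
print(na_per_column)

# Rows that survive when every row with at least one NA is removed.
complete_rows <- toy_activity[complete.cases(toy_activity), ]
rows_lost <- nrow(toy_activity) - nrow(complete_rows)
rows_lost
```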

We initially confirmed the number of users in each data frame via Excel pivot tables:

24 participants in the sleep dataset;
33 participants in all other ones.

We verify the number of unique IDs to ensure they match the number of study participants.

n_distinct(daily_activity$Id)

## [1] 33

n_distinct(daily_sleep$Id)

## [1] 24

n_distinct(hourly_intensities$Id)

## [1] 33

n_distinct(hourly_steps$Id)

## [1] 33
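The number of unique IDs tells us how many users appear in a table, but not how many records each user contributed. A minimal sketch of that check, using made-up values (toy_sleep and the threshold of two records are assumptions for illustration):

```r
# Toy data (made-up values): three users with an uneven number of sleep records.
toy_sleep <- data.frame(
  Id = c(101, 101, 101, 202, 303, 303),
  TotalMinutesAsleep = c(327, 384, 412, 61, 506, 490)
)

# Number of distinct users, as n_distinct() would report.
n_users <- length(unique(toy_sleep$Id))

# Records per user; users with very few records can distort per-user averages.
records_per_user <- table(toy_sleep$Id)
sparse_users <- names(records_per_user)[records_per_user < 2]

n_users
sparse_users
```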

Data formatting

While examining the dataframes earlier, we noticed that the date columns are stored as character strings in all four dataframes. We convert the daily records from string to date format and the hourly records to date-time format.

daily_activity <- daily_activity %>%
  rename(Date = ActivityDate) %>%
  mutate(Date = mdy(Date))

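daily_sleep <- daily_sleep %>%
  rename(Date = SleepDay) %>%
  # SleepDay carries a midnight timestamp, so we keep only the date part;
  # this gives Date the same class as in daily_activity for the later merge
  mutate(Date = as_date(mdy_hms(Date)))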

hourly_intensities <- hourly_intensities %>% 
  rename(DateTime = ActivityHour) %>% 
  mutate(DateTime = as.POSIXct(DateTime, format="%m/%d/%Y %I:%M:%S %p"))

hourly_steps <- hourly_steps %>% 
  rename(DateTime = ActivityHour) %>% 
  mutate(DateTime = as.POSIXct(DateTime, format="%m/%d/%Y %I:%M:%S %p"))
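A failed parse returns NA rather than an error, so it is worth checking the result of the conversion. The sketch below parses a few timestamps in the same format as ActivityHour; fixing tz = "UTC" is our assumption to make the result reproducible (the code above uses the session time zone), and parsing %p relies on a locale that uses AM/PM.

```r
# Sample timestamps in the format used by the hourly tables.
raw_hours <- c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM", "4/12/2016 1:00:00 PM")

# %I is the 12-hour clock hour and %p the AM/PM marker.
parsed <- as.POSIXct(raw_hours, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Any NA here would mean a timestamp did not match the format.
stopifnot(!anyNA(parsed))

# 12:00 AM becomes hour 00, 1:00 PM becomes hour 13.
format(parsed, "%Y-%m-%d %H:%M:%S")
```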

Merging datasets

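We merge daily_sleep with daily_activity and hourly_steps with hourly_intensities so that we can look for correlations between variables, using Id and Date/DateTime as the join keys. Because merge() performs an inner join by default, the merged daily table keeps only the user-days that have a sleep record.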

daily_activity_sleep <- merge(daily_activity, daily_sleep, by = c ("Id", "Date"))
head(daily_activity_sleep)

##           Id       Date TotalSteps TotalDistance TrackerDistance
## 1 1503960366 2016-04-12      13162          8.50            8.50
## 2 1503960366 2016-04-13      10735          6.97            6.97
## 3 1503960366 2016-04-15       9762          6.28            6.28
## 4 1503960366 2016-04-16      12669          8.16            8.16
## 5 1503960366 2016-04-17       9705          6.48            6.48
## 6 1503960366 2016-04-19      15506          9.88            9.88
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.14                     1.26
## 4                        0               2.71                     0.41
## 5                        0               3.19                     0.78
## 6                        0               3.53                     1.32
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                2.83                       0                29
## 4                5.04                       0                36
## 5                2.51                       0                38
## 6                5.03                       0                50
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  34                  209              726     1745
## 4                  10                  221              773     1863
## 5                  20                  164              539     1728
## 6                  31                  264              775     2035
##   TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1                 1                327            346
## 2                 2                384            407
## 3                 1                412            442
## 4                 2                340            367
## 5                 1                700            712
## 6                 1                304            320

hourly_intensities_steps <- merge(hourly_intensities, hourly_steps, by = c ("Id", "DateTime"))
head(hourly_intensities_steps)

##           Id            DateTime TotalIntensity AverageIntensity StepTotal
## 1 1503960366 2016-04-12 00:00:00             20         0.333333       373
## 2 1503960366 2016-04-12 01:00:00              8         0.133333       160
## 3 1503960366 2016-04-12 02:00:00              7         0.116667       151
## 4 1503960366 2016-04-12 03:00:00              0         0.000000         0
## 5 1503960366 2016-04-12 04:00:00              0         0.000000         0
## 6 1503960366 2016-04-12 05:00:00              0         0.000000         0
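The choice of join type decides which rows survive a merge. The sketch below contrasts the default inner join with a left join (all.x = TRUE) on made-up tables; toy_activity and toy_sleep are assumptions for illustration, not the real data.

```r
# Toy data (made-up values): three activity days, only one of them with a sleep log.
toy_activity <- data.frame(
  Id = c(1, 1, 2),
  Date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  TotalSteps = c(13162, 10735, 4000)
)
toy_sleep <- data.frame(
  Id = 1,
  Date = as.Date("2016-04-12"),
  TotalMinutesAsleep = 327
)

# Default merge(): inner join, keeps only Id/Date pairs present in both tables.
inner <- merge(toy_activity, toy_sleep, by = c("Id", "Date"))

# all.x = TRUE: left join, keeps every activity day and fills missing sleep with NA.
left <- merge(toy_activity, toy_sleep, by = c("Id", "Date"), all.x = TRUE)

nrow(inner)
nrow(left)
sum(is.na(left$TotalMinutesAsleep))
```

With an inner join, activity summaries computed on the merged table describe only the days on which sleep was also logged.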

Analyze and Share Phase

Average numbers

We summarize observations from the merged daily_activity_sleep dataset to have a snapshot of average numbers among the users.

daily_activity_sleep %>% 
  select(TotalSteps, VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, SedentaryMinutes, Calories, TotalMinutesAsleep, TotalTimeInBed) %>% 
  summary()

##    TotalSteps    VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes
##  Min.   :   17   Min.   :  0.00    Min.   :  0.00      Min.   :  2.0       
##  1st Qu.: 5189   1st Qu.:  0.00    1st Qu.:  0.00      1st Qu.:158.0       
##  Median : 8913   Median :  9.00    Median : 11.00      Median :208.0       
##  Mean   : 8515   Mean   : 25.05    Mean   : 17.92      Mean   :216.5       
##  3rd Qu.:11370   3rd Qu.: 38.00    3rd Qu.: 26.75      3rd Qu.:263.0       
##  Max.   :22770   Max.   :210.00    Max.   :143.00      Max.   :518.0       
##  SedentaryMinutes    Calories    TotalMinutesAsleep TotalTimeInBed 
##  Min.   :   0.0   Min.   : 257   Min.   : 58.0      Min.   : 61.0  
##  1st Qu.: 631.2   1st Qu.:1841   1st Qu.:361.0      1st Qu.:403.8  
##  Median : 717.0   Median :2207   Median :432.5      Median :463.0  
##  Mean   : 712.1   Mean   :2389   Mean   :419.2      Mean   :458.5  
##  3rd Qu.: 782.8   3rd Qu.:2920   3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :1265.0   Max.   :4900   Max.   :796.0      Max.   :961.0

On average, Fitbit users walk fewer than the commonly recommended 10,000 steps a day. They do, however, take more than 8,000 daily steps, which is associated with a 51% lower risk of death from all causes over the following decade compared to adults walking 4,000 steps a day. Taking 12,000 steps a day is associated with a 65% lower all-cause mortality. (https://www.cdc.gov/media/releases/2020/p0324-daily-step-count.html)

Users sleep ~6.99 hours per day on average, slightly below the 7-9 hours recommended by the CDC. The daily sedentary period is ~11.87 hours, which suggests that the study participants could benefit from limiting their sedentary behavior to get the full health benefits of physical activity. (https://ijbnpa.biomedcentral.com/articles/10.1186/s12966-020-01044-0)
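The hour figures follow directly from the means in the summary output. As a quick check, the conversion can be sketched as follows (the vector name mean_minutes is ours; the values are the means reported by summary()):

```r
# Means in minutes, taken from the summary() output above.
mean_minutes <- c(asleep = 419.2, in_bed = 458.5, sedentary = 712.1)

# Convert minutes to hours and round to two decimals.
mean_hours <- round(mean_minutes / 60, 2)
print(mean_hours)
# asleep 6.99, in_bed 7.64, sedentary 11.87
```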

Steps and sleep per day of the week

We dive deeper into our data to learn which day of the week Fitbit users are the most/least active and how many hours they sleep on different days of the week.

weekday_activity_sleep <- daily_activity_sleep %>% 
  mutate(Weekday = weekdays(Date))

weekday_activity_sleep$Weekday <- ordered(weekday_activity_sleep$Weekday, levels=c("Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday"))

We summarize the numbers to find the average steps taken and average minutes asleep for each day of the week.

weekday_activity_sleep <- weekday_activity_sleep %>%
  group_by(Weekday) %>%
  summarize(DailySteps = mean(TotalSteps), DailySleep = mean(TotalMinutesAsleep))

head(weekday_activity_sleep)

## # A tibble: 6 × 3
##   Weekday   DailySteps DailySleep
##   <ord>          <dbl>      <dbl>
## 1 Monday         9273.       420.
## 2 Tuesday        9183.       405.
## 3 Wednesday      8023.       435.
## 4 Thursday       8184.       401.
## 5 Friday         7901.       405.
## 6 Saturday       9871.       419.
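The weekday labels used for this grouping come from weekdays(), which returns names in the session's locale, so the English level names only match in an English locale. A locale-independent variant can be sketched with the ISO weekday number ("%u", where Monday is 1); the vector names below are our own.

```r
# Three sample dates from the study period.
dates <- as.Date(c("2016-04-12", "2016-04-16", "2016-04-17"))

# English day names in the order we want on the plots.
day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")

# "%u" gives the ISO weekday number (1 = Monday, 7 = Sunday) regardless of locale.
iso_day <- as.integer(format(dates, "%u"))

# Ordered factor so that summaries and plots follow the Monday-to-Sunday order.
weekday <- factor(day_names[iso_day], levels = day_names, ordered = TRUE)
as.character(weekday)
```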

We visualize the new table with ggplot and ggarrange.

ggarrange(
  ggplot(weekday_activity_sleep) + geom_col(aes(Weekday, DailySteps), fill = "#E6A0C4") + geom_hline(yintercept = 8000) + labs(title = "Steps per day of the week", x = "", y = "") + theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust = 0.7)),
  ggplot(weekday_activity_sleep) + geom_col(aes(Weekday, DailySleep), fill = "#7394D4") + geom_hline(yintercept = 480) + labs(title = "Minutes asleep per day of the week", x = "", y = "") + theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust = 0.7))
)

As shown above, study participants fall short of the 8,000-step mark on Fridays and Sundays, most notably on Sundays with approximately 7,298 steps. On average, users do not sleep the recommended 8 hours (480 minutes) on any day of the week. They sleep the least on Tuesdays, Thursdays and Fridays.

Hourly steps/intensities throughout the day

Moving beyond days of the week, we look into hourly steps and intensity levels to better understand at what times of day Fitbit users are the most and least active.

Using hourly_intensities_steps dataframe, we separate date and time into two different columns and visualize average hourly steps/intensities per day.

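hourly_intensities_steps <- hourly_intensities_steps %>%
  # format() keeps the time component explicit for every row, including midnight
  mutate(Date = as_date(DateTime), Time = format(DateTime, "%H:%M:%S"), .after = Id) %>%
  select(-DateTime)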

head(hourly_intensities_steps)

##           Id       Date     Time TotalIntensity AverageIntensity StepTotal
## 1 1503960366 2016-04-12 00:00:00             20         0.333333       373
## 2 1503960366 2016-04-12 01:00:00              8         0.133333       160
## 3 1503960366 2016-04-12 02:00:00              7         0.116667       151
## 4 1503960366 2016-04-12 03:00:00              0         0.000000         0
## 5 1503960366 2016-04-12 04:00:00              0         0.000000         0
## 6 1503960366 2016-04-12 05:00:00              0         0.000000         0

average_intensities_steps <- hourly_intensities_steps %>% 
  group_by(Time) %>% 
  summarize(AverageSteps = mean(StepTotal), AverageTotalIntensity = mean(TotalIntensity))

ggplot(average_intensities_steps) + geom_col(aes(x = Time, y = AverageSteps, fill = AverageSteps)) + labs(title = "Hourly steps throughout the day", x = "", y = "") + scale_fill_gradient(low = "#ECCBAE", high = "#7394D4") + theme(axis.text.x = element_text(angle = 90))

ggplot(average_intensities_steps) + geom_col(aes(x = Time, y = AverageTotalIntensity, fill = AverageTotalIntensity)) + labs(title = "Hourly intensity throughout the day", x = "", y = "") + scale_fill_gradient(low = "#ECCBAE", high = "#7394D4") + theme(axis.text.x = element_text(angle = 90))

As we can see from the graphs, Fitbit users are the most active between 5 and 7pm. This could be explained by people getting off work and walking or exercising shortly afterwards. The activity level is also high between 12 and 2pm, which is normally the period for lunch breaks.

Correlations

We explore possible correlations between the following variables:

daily steps & calories;
daily steps & minutes asleep;
calories & minutes asleep.

ggarrange(
  
  ggplot(daily_activity_sleep, aes(TotalSteps, Calories)) + geom_jitter(color = "#E6A0C4", alpha = 0.7) + geom_smooth (method = loess, formula = y~x, color ="#D8A499", se = FALSE) + labs(title = "Daily steps vs calories", x = "Steps", y= "Calories") + theme(panel.background = element_blank()), 
  
  ggplot(daily_activity_sleep, aes(TotalSteps, TotalMinutesAsleep)) + geom_jitter(color = "#C6CDF7", alpha = 0.7) + geom_smooth (method = loess, formula = y~x, color ="#D8A499", se = FALSE) + labs(title = "Daily steps vs minutes asleep", x = "Steps", y= "Minutes asleep") + theme(panel.background = element_blank()),
  
  ggplot(daily_activity_sleep, aes(Calories, TotalMinutesAsleep)) + geom_jitter(color = "#7394D4", alpha = 0.7) + geom_smooth (method = loess, formula = y~x, color ="#D8A499", se = FALSE) + labs(title = "Calories vs minutes asleep", x = "Calories", y= "Minutes asleep") + theme(panel.background = element_blank())
)

We observe a positive correlation between steps taken and calories burned in a day, which is expected: the more you move, the more energy you spend. We find no clear correlation between daily steps and minutes asleep, or between minutes asleep and calories burned.
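Scatter plots show the direction of a relationship; a correlation coefficient puts a number on it. The sketch below computes Pearson's r with cor() on a made-up table (toy_days and its values are assumptions chosen to mimic the pattern described above, not results from the dataset). On the real data the same calls would take columns of daily_activity_sleep.

```r
# Toy data (made-up values): calories rise with steps, sleep does not.
toy_days <- data.frame(
  TotalSteps = c(2000, 5000, 8000, 11000, 14000),
  Calories = c(1600, 1900, 2300, 2500, 3000),
  TotalMinutesAsleep = c(430, 400, 445, 410, 425)
)

# Pearson correlation: close to 1 or -1 means a strong linear relationship,
# close to 0 means no linear relationship.
steps_calories <- cor(toy_days$TotalSteps, toy_days$Calories)
steps_sleep <- cor(toy_days$TotalSteps, toy_days$TotalMinutesAsleep)

round(c(steps_calories = steps_calories, steps_sleep = steps_sleep), 2)
```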

We dive deeper into the participants’ daily calorie expenditure and look for any relationship between different activity levels and calories.

ggarrange(
  
  ggplot(daily_activity_sleep, aes(VeryActiveMinutes, Calories)) + geom_jitter(color = "#E6A0C4", alpha = 0.7) + geom_smooth(method = loess, formula = y~x, color ="#E6A0C4", se = FALSE) + theme(panel.background = element_blank()), 
  
  ggplot(daily_activity_sleep, aes(FairlyActiveMinutes, Calories)) + geom_jitter(color = "#C6CDF7", alpha = 0.7) + geom_smooth(method = loess, formula = y~x, color ="#C6CDF7", se = FALSE) + theme(panel.background = element_blank()), 

  ggplot(daily_activity_sleep, aes(LightlyActiveMinutes, Calories)) + geom_jitter(color = "#D8A499", alpha = 0.7) + geom_smooth(method = loess, formula = y~x, color ="#D8A499", se = FALSE) + theme(panel.background = element_blank()), 

  ggplot(daily_activity_sleep, aes(SedentaryMinutes, Calories)) + geom_jitter(color = "#7394D4", alpha = 0.7) + geom_smooth(method = loess, formula = y~x, color ="#7394D4", se = FALSE) + theme(panel.background = element_blank())
)

A positive correlation between very active time and calories can be seen from the graph above. Users engaging in high-intensity activities burn more calories a day.

We examine any possible correlations between total minutes asleep and active minutes.

ggarrange(
  
  ggplot(daily_activity_sleep, aes(VeryActiveMinutes, TotalMinutesAsleep)) + geom_point(color = "#E6A0C4", alpha = 0.7) + geom_smooth(method = loess, formula = y~x, color = "#E6A0C4", se = FALSE) + theme(panel.background = element_blank()), 
  
  ggplot(daily_activity_sleep, aes(FairlyActiveMinutes, TotalMinutesAsleep)) + geom_point(color = "#C6CDF7", alpha = 0.7) + geom_smooth(method = loess, formula = y~x, color = "#C6CDF7", se = FALSE) + theme(panel.background = element_blank()), 
  
  ggplot(daily_activity_sleep, aes(LightlyActiveMinutes, TotalMinutesAsleep)) + geom_point(color = "#D8A499", alpha = 0.7) + geom_smooth(method = loess, formula = y~x, color = "#D8A499", se = FALSE) + theme(panel.background = element_blank()), 
  
  ggplot(daily_activity_sleep, aes(SedentaryMinutes, TotalMinutesAsleep)) + geom_point(color = "#7394D4", alpha = 0.7) + geom_smooth(method = loess, formula = y~x, color = "#7394D4", se = FALSE) + theme(panel.background = element_blank())
)

There is a clear negative correlation between total minutes asleep and sedentary minutes. As sedentary time increases, the duration of the users’ sleep decreases.

Classification of users per activity level

We divide the users into different categories based on their activity level (number of steps taken per day) to identify how active/sedentary the study participants are.

We first determine average daily steps taken by users.

daily_average <- daily_activity_sleep %>%
  group_by(Id) %>%
  summarize(MeanDailySteps = mean(TotalSteps), MeanDailyCalories = mean(Calories), MeanDailySleep = mean(TotalMinutesAsleep))

head(daily_average)

## # A tibble: 6 × 4
##           Id MeanDailySteps MeanDailyCalories MeanDailySleep
##        <dbl>          <dbl>             <dbl>          <dbl>
## 1 1503960366         12406.             1872.           360.
## 2 1644430081          7968.             2978.           294 
## 3 1844505072          3477              1676.           652 
## 4 1927972279          1490              2316.           417 
## 5 2026352035          5619.             1541.           506.
## 6 2320127002          5079              1804             61

Having mean daily steps, we now split users into the following groups:

sedentary, less than 5,000 steps per day;
lightly active, 5,000 - 7,499 steps;
fairly active, 7,500 - 9,999 steps;
active, 10,000 - 12,499 steps;
highly active, 12,500 & more.

Classifications are based on a medical article from MedicineNet: https://www.medicinenet.com/how_many_steps_a_day_is_considered_active/article.htm

daily_average_user <- daily_average %>% 
  mutate(UserType = case_when(
    # conditions are checked in order; using "<" for the upper bounds means
    # fractional means such as 7499.5 still fall into a group instead of NA
    MeanDailySteps < 5000 ~ "sedentary",
    MeanDailySteps < 7500 ~ "lightly active", 
    MeanDailySteps < 10000 ~ "fairly active", 
    MeanDailySteps < 12500 ~ "active", 
    MeanDailySteps >= 12500 ~ "highly active"
  ))
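Because mean daily steps are usually fractional, the bands are safest when expressed as half-open intervals. As an alternative sketch of the same classification, base R's cut() with right = FALSE assigns every value to exactly one band; the sample values in mean_steps are made up to probe the boundaries.

```r
# Made-up mean step counts, including values on and just below the band edges.
mean_steps <- c(3477, 5000, 7499.5, 7968.3, 12406.2, 12500)

# right = FALSE makes each interval closed on the left and open on the right,
# e.g. [5000, 7500), so 7499.5 is "lightly active" and 12500 is "highly active".
user_type <- cut(
  mean_steps,
  breaks = c(-Inf, 5000, 7500, 10000, 12500, Inf),
  labels = c("sedentary", "lightly active", "fairly active", "active", "highly active"),
  right = FALSE
)

data.frame(mean_steps, user_type)
```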

We find the percentage of each group relative to the total number of users.

daily_average_user_percent <- daily_average_user %>% 
  count(UserType, name = "Total") %>%
  mutate(TotalPercent = Total / sum(Total), Labels = scales::percent(TotalPercent)) %>%
  select(-Total)

daily_average_user_percent$UserType <- factor(daily_average_user_percent$UserType, levels = c("highly active", "active", "fairly active", "lightly active", "sedentary"))

head(daily_average_user_percent)

## # A tibble: 5 × 3
##   UserType       TotalPercent Labels
##   <fct>                 <dbl> <chr> 
## 1 active               0.167  16.7% 
## 2 fairly active        0.375  37.5% 
## 3 highly active        0.0417 4.2%  
## 4 lightly active       0.208  20.8% 
## 5 sedentary            0.208  20.8%
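The shares above can be reproduced from plain counts. In this sketch the counts per group (4, 9, 1, 5, 5) are our reconstruction from the percentages in the output, assuming 24 users; prop.table() and sprintf() are used in place of the dplyr pipeline.

```r
# Group counts reconstructed from the percentages above (assumes 24 users in total).
user_types <- rep(
  c("active", "fairly active", "highly active", "lightly active", "sedentary"),
  times = c(4, 9, 1, 5, 5)
)

# Share of each group relative to all users.
shares <- prop.table(table(user_types))

# Percentage labels with one decimal, as used on the pie chart.
labels <- sprintf("%.1f%%", 100 * as.numeric(shares))
setNames(labels, names(shares))
```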

The largest group of users (37.5%) is fairly active, while the smallest group (4.2%) is highly active. We visualize the new dataframe using a pie chart.

ggplot(daily_average_user_percent, aes(x = "", y = TotalPercent, fill = UserType)) +
geom_bar(stat = "identity", width = 1) + coord_polar("y", start = 0) + theme_minimal() + theme(axis.title.x = element_blank(), axis.title.y = element_blank(), axis.text.x = element_blank(), panel.grid = element_blank(), plot.title = element_text(hjust = 0.5, size=12, face = "bold")) + scale_fill_manual(values = c("#E6A0C4", "#7394D4", "#C6CDF7", "#FAD77B", "#D8A499")) + labs(title = "User type distribution") + geom_text(aes(label = Labels), position = position_stack(vjust = 0.5))

Conclusion/Act Phase

Overall Conclusions

Lack of steps/activity on Fridays and Sundays:

Fitbit users fall short of 8,000 steps on Fridays and Sundays. Bellabeat can implement notifications that encourage women to hit their daily activity goal every day. Awards (7-day step goal streak, 14-day step goal streak, etc.) and competitions with other users/friends could motivate them to stay consistent. A higher number of steps correlates with a higher number of calories burned, so displaying calorie expenditure could be motivational to some.

Users are the most active between 5-7pm and during lunch breaks:

Fitbit users are the most active between 5-7pm and 12-2pm. Bellabeat can send notifications to remind women to exercise or take a walk after work and during lunch breaks. A reward system could be set up to encourage more movement.

Not sleeping the recommended 8 hours:

Fitbit users are not sleeping the recommended 8 hours. To promote overall wellness and optimize physical performance among its customers, Bellabeat should remind them not to sacrifice sleep, as it is essential to good health and longevity. Suggesting breathing techniques before bedtime, reminding users to reduce device use in the late evening, and motivating them to keep a consistent sleep schedule are some ways to promote longer sleep.

Sedentary behavior & duration of sleep:

Higher sedentary time correlates with shorter sleep. Bellabeat should remind its users of the health risks of sedentary behavior and the importance of proper sleep hygiene. Notifications to stand up and walk around when a user has not moved for an hour, articles on the importance of sleep, and an option to set a desired bedtime with automatic reminders are all ways to reduce sedentary time and increase sleep duration.

Fairly active as base consumers:

The largest group of Fitbit users in the sample is fairly active. Bellabeat should educate its customers about the health benefits of living an active (or at least fairly active) lifestyle. Fitness/wellness articles and research studies with straight-to-the-point summaries could be published on a daily/weekly basis to keep users informed and motivated.

Recommendations for Bellabeat app

Frequent educational articles on the importance of exercising, walking and sleeping;
Reward system to keep users motivated: awards for 7-day step goal/sleep streak, longest walk/run/workout, etc. Opportunity to compete with other users and friends;
Notifications to stand up and move around at the end of each sedentary hour;
Reminders to start getting ready for sleep with low device use and breathing exercises;
Notifications to reach step/workout goals throughout the day, showing users how close they are to their daily target.