Bellabeat Report

Background

Bellabeat is a manufacturer of health-focused smart products for women. The company’s app and multiple smart devices collect data on activity, sleep, stress, hydration levels, and reproductive health to empower women with an understanding of their own health and habits. The company was founded in 2013 by Urška Sršen and Sando Mur and has expanded quickly since, now with the possibility to become a greater player in the global smart device market. Bellabeat membership is a subscription-based membership program that provides users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health, beauty, and mindfulness based on their lifestyle and goals.

The phases of the data analysis process (Ask, Prepare, Process, Analyze, Share and Act) are to be used in the analysis of the datasets.

Ask

Guiding Questions:

What is the problem you are trying to solve?
What are the trends in smart device usage and which of these are Bellabeat currently implementing.
How can your insights drive business decisions?
The identified trends should guide the company’s marketing strategy.

Business Task:

Analyze Fitbit data to gain insight and help guide marketing strategy for Bellabeat to grow as a global player.

Stakeholders:

Primary stakeholders: Urška Sršen and Sando Mur, executive team members.
Secondary stakeholders: Bellabeat marketing analytics team.

Prepare

Overview of the Data

Information on the Data Source:

Data is publicly available at FitBit Fitness Tracker Data on Kaggle (https://www.kaggle.com/arashnic/fitbit) and contains 18 csv files.
Generated by respondents from a survey via Amazon Mechanical Turk between 12 March 2016 to 12 May 2016. 33 FitBit users consented to the submission of personal tracker data.
Data collected includes physical activity recorded in minutes, heart rate, sleep monitoring, daily activity and steps.

Limitations of the Dataset:

Data was collected 5 years ago in 2016. Therefore, data may be outdated as users’ daily activity, fitness and sleeping habits, diet and food consumption may have changed since then.
Absence of key characteristics of the userss, such as gender, age, location, lifestyle.
Sample size of 30 FitBit users is not representative of the entire fitness population.
As data is collected in a survey, we are unable to ascertain its integrity or accuracy.

Is Data ROCCC:

A good data source is ROCCC which stands for Reliable, Original, Comprehensive, Current, and Cited.
Reliable — LOW — Not reliable as it only has 33 respondents and some of the dataset have less than that.
Original — LOW — Third party provider (Amazon Mechanical Turk)
Comprehensive — MED — Parameters match most of Bellabeat products’ parameters
Current — LOW — Data is 5 years old and may not be relevant
Cited — LOW — Data collected from third party, hence unknown

The data integrity and credibility is not sufficient to provide a comprehensive analysis of the company. Therefore, the following analysis can only provide high-level direction and should be verified with

Setting up the Working Environment

The analysis is to be carried out using RStudio. First, all relevant packages are loaded.

library("tidyverse")
library("here")
library("janitor")
library("skimr")
library("dplyr")
library("lubridate")
library("tidyr")
library("ggplot2")

Process

Importing Data

As there is only a small sample size of the datasets, it is prudent focus on the “daily” data which is more likely to provide high level insights.

daily_activity <- read_csv("dailyActivity_merged.csv")
daily_calories <- read_csv("dailyCalories_merged.csv")
daily_intensities <- read_csv("dailyIntensities_merged.csv")
daily_steps <- read_csv("dailySteps_merged.csv")
daily_sleep <- read_csv("sleepDay_merged.csv")

Cleaning the Data

Review the data to look for useful starting points and patterns.

glimpse(daily_activity)

## Rows: 940
## Columns: 15
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036~
## $ ActivityDate             <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/~
## $ TotalSteps               <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019~
## $ TotalDistance            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5~
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3~
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0~
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveMinutes        <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4~
## $ FairlyActiveMinutes      <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21~
## $ LightlyActiveMinutes     <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, ~
## $ SedentaryMinutes         <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818~
## $ Calories                 <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203~

glimpse(daily_calories)

## Rows: 940
## Columns: 3
## $ Id          <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960366~
## $ ActivityDay <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/2016", "4/16/~
## $ Calories    <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2035, 1786, 1775~

glimpse(daily_intensities)

## Rows: 940
## Columns: 10
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036~
## $ ActivityDay              <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/~
## $ SedentaryMinutes         <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818~
## $ LightlyActiveMinutes     <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, ~
## $ FairlyActiveMinutes      <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21~
## $ VeryActiveMinutes        <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4~
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0~
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3~
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5~

glimpse(daily_steps)

## Rows: 940
## Columns: 3
## $ Id          <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960366~
## $ ActivityDay <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/2016", "4/16/~
## $ StepTotal   <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019, 15506, 1054~

glimpse(daily_sleep)

## Rows: 413
## Columns: 5
## $ Id                 <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150~
## $ SleepDay           <chr> "4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM", "~
## $ TotalSleepRecords  <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ TotalMinutesAsleep <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2~
## $ TotalTimeInBed     <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3~

Check to see how many distinct ID’s (users) are in each data set.

n_distinct(daily_activity$Id)

## [1] 33

n_distinct(daily_calories$Id)

## [1] 33

n_distinct(daily_intensities$Id)

## [1] 33

n_distinct(daily_steps$Id)

## [1] 33

n_distinct(daily_sleep$Id)

## [1] 24

The sleep data has 9 fewer users.All other data sets have 33 users.

Following a closer inspection, the daily_activity data set contains the relevant data found in daily_calories, daily_intensities and daily_steps. Therefore, the focus can be dedicated to the daily_activity and daily_sleep data sets.

Check for (and remove) duplicates. The sleep data had three identical rows which were removed.

distinct(daily_activity)

## # A tibble: 940 x 15
##        Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie~
##     <dbl> <chr>             <dbl>         <dbl>           <dbl>            <dbl>
##  1 1.50e9 4/12/2016         13162          8.5             8.5                 0
##  2 1.50e9 4/13/2016         10735          6.97            6.97                0
##  3 1.50e9 4/14/2016         10460          6.74            6.74                0
##  4 1.50e9 4/15/2016          9762          6.28            6.28                0
##  5 1.50e9 4/16/2016         12669          8.16            8.16                0
##  6 1.50e9 4/17/2016          9705          6.48            6.48                0
##  7 1.50e9 4/18/2016         13019          8.59            8.59                0
##  8 1.50e9 4/19/2016         15506          9.88            9.88                0
##  9 1.50e9 4/20/2016         10544          6.68            6.68                0
## 10 1.50e9 4/21/2016          9819          6.34            6.34                0
## # ... with 930 more rows, and 9 more variables: VeryActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>

daily_sleep_updated <- distinct(daily_sleep)

Next, the date columns are to be converted to the correct “date” format.

daily_activity$ActivityDate <-as.Date(daily_activity$ActivityDate, format = "%m/%d/%Y")

The sleep data has time included in the date column. Prior to formatting the column, it is to be split so that the date is on its own.The time data is not necessary and so is removed. Two additional columns were added with the minutes asleep and minutes in bed converted to hours.

sleep_day_2 <- separate(daily_sleep_updated, col = SleepDay, into = c("Date", "Time", "Am/Pm"), sep = " ")

sleep_day_2$Date <-as.Date(sleep_day_2$Date, format = "%m/%d/%Y")

sleep_day_2$`Am/Pm` <- NULL
sleep_day_2$Time <- NULL

sleep_day_2$TotalHoursAsleep <- sleep_day_2$TotalMinutesAsleep/60
sleep_day_2$TotalHoursInBed <- sleep_day_2$TotalTimeInBed/60

Check through each table to see if they contain potentially invalid ID numbers (i.e. less than 10 digits)

subset(daily_activity, nchar(daily_activity$Id) < 10 | nchar(daily_activity$Id) > 10)

## # A tibble: 0 x 15
## # ... with 15 variables: Id <dbl>, ActivityDate <date>, TotalSteps <dbl>,
## #   TotalDistance <dbl>, TrackerDistance <dbl>, LoggedActivitiesDistance <dbl>,
## #   VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## #   LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## #   VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## #   LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>

subset(daily_sleep_updated, nchar(daily_sleep_updated$Id) < 10 | nchar(daily_sleep_updated$Id) > 10)

## # A tibble: 0 x 5
## # ... with 5 variables: Id <dbl>, SleepDay <chr>, TotalSleepRecords <dbl>,
## #   TotalMinutesAsleep <dbl>, TotalTimeInBed <dbl>

No invalid ID numbers.

Check for rows with NA or missing info.

drop_na(daily_activity)

## # A tibble: 940 x 15
##        Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie~
##     <dbl> <date>            <dbl>         <dbl>           <dbl>            <dbl>
##  1 1.50e9 2016-04-12        13162          8.5             8.5                 0
##  2 1.50e9 2016-04-13        10735          6.97            6.97                0
##  3 1.50e9 2016-04-14        10460          6.74            6.74                0
##  4 1.50e9 2016-04-15         9762          6.28            6.28                0
##  5 1.50e9 2016-04-16        12669          8.16            8.16                0
##  6 1.50e9 2016-04-17         9705          6.48            6.48                0
##  7 1.50e9 2016-04-18        13019          8.59            8.59                0
##  8 1.50e9 2016-04-19        15506          9.88            9.88                0
##  9 1.50e9 2016-04-20        10544          6.68            6.68                0
## 10 1.50e9 2016-04-21         9819          6.34            6.34                0
## # ... with 930 more rows, and 9 more variables: VeryActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>

drop_na(daily_sleep_updated)

## # A tibble: 410 x 5
##            Id SleepDay          TotalSleepRecor~ TotalMinutesAsl~ TotalTimeInBed
##         <dbl> <chr>                        <dbl>            <dbl>          <dbl>
##  1 1503960366 4/12/2016 12:00:~                1              327            346
##  2 1503960366 4/13/2016 12:00:~                2              384            407
##  3 1503960366 4/15/2016 12:00:~                1              412            442
##  4 1503960366 4/16/2016 12:00:~                2              340            367
##  5 1503960366 4/17/2016 12:00:~                1              700            712
##  6 1503960366 4/19/2016 12:00:~                1              304            320
##  7 1503960366 4/20/2016 12:00:~                1              360            377
##  8 1503960366 4/21/2016 12:00:~                1              325            364
##  9 1503960366 4/23/2016 12:00:~                1              361            384
## 10 1503960366 4/24/2016 12:00:~                1              430            449
## # ... with 400 more rows

Some of the steps and distance results may be 0, which would likely indicate the device was not used that day. As this may skew results, these values were removed.

daily_activity <- daily_activity %>% filter(TotalSteps !=0)
daily_activity <- daily_activity %>%  filter(TotalDistance !=0)

Add a day of the week column to the activity data, so that potential patterns can be evaluated.

daily_activity$Day <- weekdays(daily_activity$ActivityDate)
colnames(daily_activity)

##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"                 "Day"

Analyze

A summary of several variables related to the activity and sleep data is shown below.

daily_activity %>%
  select(TotalSteps, TotalDistance, SedentaryMinutes, LightlyActiveMinutes, FairlyActiveMinutes, VeryActiveMinutes, Calories) %>%
  summary()

##    TotalSteps    TotalDistance    SedentaryMinutes LightlyActiveMinutes
##  Min.   :    8   Min.   : 0.010   Min.   :   0.0   Min.   :  0.0       
##  1st Qu.: 4927   1st Qu.: 3.373   1st Qu.: 721.2   1st Qu.:147.0       
##  Median : 8054   Median : 5.590   Median :1020.5   Median :208.5       
##  Mean   : 8329   Mean   : 5.986   Mean   : 955.2   Mean   :210.3       
##  3rd Qu.:11096   3rd Qu.: 7.905   3rd Qu.:1189.0   3rd Qu.:272.0       
##  Max.   :36019   Max.   :28.030   Max.   :1440.0   Max.   :518.0       
##  FairlyActiveMinutes VeryActiveMinutes    Calories   
##  Min.   :  0.00      Min.   :  0.00    Min.   :  52  
##  1st Qu.:  0.00      1st Qu.:  0.00    1st Qu.:1857  
##  Median :  8.00      Median :  7.00    Median :2220  
##  Mean   : 14.79      Mean   : 23.04    Mean   :2362  
##  3rd Qu.: 21.00      3rd Qu.: 35.00    3rd Qu.:2832  
##  Max.   :143.00      Max.   :210.00    Max.   :4900

sleep_day_2 %>%  
  select(TotalHoursAsleep,
         TotalHoursInBed) %>%
  summary()

##  TotalHoursAsleep  TotalHoursInBed 
##  Min.   : 0.9667   Min.   : 1.017  
##  1st Qu.: 6.0167   1st Qu.: 6.729  
##  Median : 7.2083   Median : 7.717  
##  Mean   : 6.9862   Mean   : 7.641  
##  3rd Qu.: 8.1667   3rd Qu.: 8.767  
##  Max.   :13.2667   Max.   :16.017

Observations:

Average number of steps taken per day was 8329.
Average sedentary time was 955.2 minutes (15.9 hours).
Average very active time was 21.16 minutes
The average amount of calories burnt per day was around 2362 kcal.
Participants spent, on average, 7.64 hours in bed with an average time actually sleeping of 6.98 hours

Visualizations are shown below to highlight trends and patterns.

If more steps are taken, are more calories burned?

ggplot(data=daily_activity) +
  geom_point(mapping=aes(x=TotalSteps, y=Calories), color="blue") +
  geom_smooth(mapping=aes(x=TotalSteps, y=Calories), color="red") +
  labs(title="The Relationship Between Total Steps and Calories Burned", x="Total Steps", y="Calories Burned (kcal)") +
  theme(plot.title = element_text(hjust = 0.5))

The above graph indicates that users who logged more steps typically burned more calories.

Is there a correlation between level (or intensity) of the activity and how many calories are burned?

ggplot(data=daily_activity) +
  geom_point(mapping=aes(x=SedentaryMinutes, y=Calories), color="red") +
  geom_smooth(mapping=aes(x=SedentaryMinutes, y=Calories)) +
  labs(title="Calories Burned Relative to Sedentary Time", x="Sedentary Time (Min)", y="Calories Burned (kcal)")+ theme(plot.title = element_text(hjust = 0.5))

ggplot(data=daily_activity) +
  geom_point(mapping=aes(x=FairlyActiveMinutes, y=Calories), color="red") +
  geom_smooth(mapping=aes(x=FairlyActiveMinutes, y=Calories)) +
  labs(title="Calories Burned Relative to Fairly Active Time", x="Fairly Active (Min)", y="Calories Burned (kcal)")+ theme(plot.title = element_text(hjust = 0.5))

ggplot(data=daily_activity) +
  geom_point(mapping=aes(x=LightlyActiveMinutes, y=Calories), color="red") +
  geom_smooth(mapping=aes(x=LightlyActiveMinutes, y=Calories)) +
  labs(title="Calories Burned Relative to Light Activity Time", x="Lightly Active (Min)", y="Calories Burned (kcal)")+ theme(plot.title = element_text(hjust = 0.5))

ggplot(data=daily_activity) +
  geom_point(mapping=aes(x=VeryActiveMinutes, y=Calories), color="red") +
  geom_smooth(mapping=aes(x=VeryActiveMinutes, y=Calories)) +
  labs(title="Calories Burned Relative to Very Active Time", x="Very Active (Min)", y="Calories Burned (kcal)")+ theme(plot.title = element_text(hjust = 0.5))

In general, there is correlation between higher intensity activity resulting in more calories burned.

Is there a clear correlation between time spent in bed and how much actual sleep a user gets?

ggplot(data=sleep_day_2, mapping=aes(x=TotalHoursAsleep, y=TotalHoursInBed), color="purple") +
  geom_point() +
  geom_smooth() +
  labs(title="Total Sleep Time Relative to Time in Bed", x="Total Time Alseep (hours)", y="Total Time in Bed (hours)") +
  theme(plot.title = element_text(hjust = 0.5))

The amount of sleep a user got increased fairly linearly with how much time they actually spent in bed.

How frequently did users log data (how often did they use the app) and what days?

Note: First, the order the days appear in the output is to be correctly implemented.

day_order <- c( 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')

daily_activity %>% 
  mutate(Day = fct_relevel(.f = Day, levels = day_order)) %>% 
  ggplot(aes(x = Day)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 30)) +
  labs(title="Active Days", y="Logged activity (counts)") +
  theme(plot.title = element_text(hjust = 0.5))

The data indicates that app usage peaks on Tuesdays and Wednesdays before trending down until Mondays.

What percentage of the logged time was active (and to what intensity of activity)?

sum(daily_activity$SedentaryMinutes)

## [1] 823376

activity_status <- c("Sedentary", "LightlyActive", "FairlyActive", "VeryActive")
total_time <- c(
  sum(daily_activity$SedentaryMinutes), 
  sum(daily_activity$LightlyActiveMinutes), 
  sum(daily_activity$FairlyActiveMinutes), 
  sum(daily_activity$VeryActiveMinutes))

total_time_combined <- sum(total_time)

percentage <- c(
  (sum(daily_activity$SedentaryMinutes)*100/total_time_combined), 
  (sum(daily_activity$LightlyActiveMinutes)*100/total_time_combined), 
  (sum(daily_activity$FairlyActiveMinutes)*100/total_time_combined), 
  (sum(daily_activity$VeryActiveMinutes)*100/total_time_combined))

total_activity_status <- data.frame(activity_status, total_time, percentage)

pie(total_activity_status$total_time, main="Activity Breakdown", labels = paste0(activity_status, " = ", round(percentage, 0), "%"), edges = 200, radius = 1,  col=rainbow(4),  init.angle = 15, cex = 1)

A significant majority of the activity logged was sedentary time. Actual activity made up only a fifth of the time, with fairly and very active intensity being a small percentage.

Act

Several insights and suggestions, based on the analysis, is summarized below for the Shareholders:

The data indicates that as more steps were taken, more calories were burned. It is recommended that an incentive be employed to encourage users to walk (i.e. points system, reminders, etc. ) Discuss promotion of more consistent usage
The data also illustrates that the more intense activity resulted in more calories burned. A disproportionate amount of the activity is less intense. Ways to promote higher intensity activity, such as guided workouts, is recommended.
Emphasize better more consistent sleeping habits. The average sleep time is slightly below the recommended amount of 7 hours. Encouragement to go to bed earlier, through notifications or reminders, might be an effective way to remedy this.
Users were fairly consistent with their app usage, however, the data does highlight that usage starts strong on Tuesday and trends down through the week until Monday. This may be attributable to users losing motivation as the week progresses and bottoming out at the start of the work week (“Monday Blues”). Inspiring reminders and notifications could be a potential simple way to correct this.

Based on the limitations of the FitBit data, here are some recommendations for future analysis:

A larger sample size should be used in order to improve the statistical significance of the analysis.
A longer period of tracking data, ideally 6-12 months, to account for behavioural changes due to the seasonal change.
Additional metrics such as personal attributes (age) and hydration may also be useful to analyse.
Collection of internal data or from secondary data sources to increase credibility and reliability of the datasets.