Background

Bellabeat is a manufacturer of health-focused smart products for women. The company’s app and multiple smart devices collect data on activity, sleep, stress, hydration levels, and reproductive health to empower women with an understanding of their own health and habits. The company was founded in 2013 by Urška Sršen and Sando Mur and has expanded quickly since, now with the possibility to become a greater player in the global smart device market. Bellabeat membership is a subscription-based membership program that provides users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health, beauty, and mindfulness based on their lifestyle and goals.

The phases of the data analysis process (Ask, Prepare, Process, Analyze, Share and Act) are to be used in the analysis of the datasets.

Ask

Guiding Questions:

Business Task:

Stakeholders:

Prepare

Overview of the Data

Information on the Data Source:

  • Data is publicly available at FitBit Fitness Tracker Data on Kaggle (https://www.kaggle.com/arashnic/fitbit) and contains 18 csv files.
  • Generated by respondents from a survey via Amazon Mechanical Turk between 12 March 2016 to 12 May 2016. 33 FitBit users consented to the submission of personal tracker data.
  • Data collected includes physical activity recorded in minutes, heart rate, sleep monitoring, daily activity and steps.

Limitations of the Dataset:

  • Data was collected 5 years ago in 2016. Therefore, data may be outdated as users’ daily activity, fitness and sleeping habits, diet and food consumption may have changed since then.
  • Absence of key characteristics of the userss, such as gender, age, location, lifestyle.
  • Sample size of 30 FitBit users is not representative of the entire fitness population.
  • As data is collected in a survey, we are unable to ascertain its integrity or accuracy.

Is Data ROCCC:

  • A good data source is ROCCC which stands for Reliable, Original, Comprehensive, Current, and Cited.
  • Reliable — LOW — Not reliable as it only has 33 respondents and some of the dataset have less than that.
  • Original — LOW — Third party provider (Amazon Mechanical Turk)
  • Comprehensive — MED — Parameters match most of Bellabeat products’ parameters
  • Current — LOW — Data is 5 years old and may not be relevant
  • Cited — LOW — Data collected from third party, hence unknown

The data integrity and credibility is not sufficient to provide a comprehensive analysis of the company. Therefore, the following analysis can only provide high-level direction and should be verified with

Setting up the Working Environment

The analysis is to be carried out using RStudio. First, all relevant packages are loaded.

library("tidyverse")
library("here")
library("janitor")
library("skimr")
library("dplyr")
library("lubridate")
library("tidyr")
library("ggplot2")

Process

Importing Data

As there is only a small sample size of the datasets, it is prudent focus on the “daily” data which is more likely to provide high level insights.

daily_activity <- read_csv("dailyActivity_merged.csv")
daily_calories <- read_csv("dailyCalories_merged.csv")
daily_intensities <- read_csv("dailyIntensities_merged.csv")
daily_steps <- read_csv("dailySteps_merged.csv")
daily_sleep <- read_csv("sleepDay_merged.csv")

Cleaning the Data

Review the data to look for useful starting points and patterns.

glimpse(daily_activity)
## Rows: 940
## Columns: 15
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036~
## $ ActivityDate             <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/~
## $ TotalSteps               <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019~
## $ TotalDistance            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5~
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3~
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0~
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveMinutes        <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4~
## $ FairlyActiveMinutes      <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21~
## $ LightlyActiveMinutes     <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, ~
## $ SedentaryMinutes         <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818~
## $ Calories                 <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203~
glimpse(daily_calories)
## Rows: 940
## Columns: 3
## $ Id          <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960366~
## $ ActivityDay <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/2016", "4/16/~
## $ Calories    <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2035, 1786, 1775~
glimpse(daily_intensities)
## Rows: 940
## Columns: 10
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036~
## $ ActivityDay              <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/~
## $ SedentaryMinutes         <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818~
## $ LightlyActiveMinutes     <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, ~
## $ FairlyActiveMinutes      <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21~
## $ VeryActiveMinutes        <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4~
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0~
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3~
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5~
glimpse(daily_steps)
## Rows: 940
## Columns: 3
## $ Id          <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960366~
## $ ActivityDay <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/2016", "4/16/~
## $ StepTotal   <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019, 15506, 1054~
glimpse(daily_sleep)
## Rows: 413
## Columns: 5
## $ Id                 <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150~
## $ SleepDay           <chr> "4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM", "~
## $ TotalSleepRecords  <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ TotalMinutesAsleep <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2~
## $ TotalTimeInBed     <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3~

Check to see how many distinct ID’s (users) are in each data set.

n_distinct(daily_activity$Id)
## [1] 33
n_distinct(daily_calories$Id)
## [1] 33
n_distinct(daily_intensities$Id)
## [1] 33
n_distinct(daily_steps$Id)
## [1] 33
n_distinct(daily_sleep$Id)
## [1] 24

The sleep data has 9 fewer users.All other data sets have 33 users.

Following a closer inspection, the daily_activity data set contains the relevant data found in daily_calories, daily_intensities and daily_steps. Therefore, the focus can be dedicated to the daily_activity and daily_sleep data sets.

Check for (and remove) duplicates. The sleep data had three identical rows which were removed.

distinct(daily_activity)
## # A tibble: 940 x 15
##        Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie~
##     <dbl> <chr>             <dbl>         <dbl>           <dbl>            <dbl>
##  1 1.50e9 4/12/2016         13162          8.5             8.5                 0
##  2 1.50e9 4/13/2016         10735          6.97            6.97                0
##  3 1.50e9 4/14/2016         10460          6.74            6.74                0
##  4 1.50e9 4/15/2016          9762          6.28            6.28                0
##  5 1.50e9 4/16/2016         12669          8.16            8.16                0
##  6 1.50e9 4/17/2016          9705          6.48            6.48                0
##  7 1.50e9 4/18/2016         13019          8.59            8.59                0
##  8 1.50e9 4/19/2016         15506          9.88            9.88                0
##  9 1.50e9 4/20/2016         10544          6.68            6.68                0
## 10 1.50e9 4/21/2016          9819          6.34            6.34                0
## # ... with 930 more rows, and 9 more variables: VeryActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>
daily_sleep_updated <- distinct(daily_sleep)

Next, the date columns are to be converted to the correct “date” format.

daily_activity$ActivityDate <-as.Date(daily_activity$ActivityDate, format = "%m/%d/%Y")

The sleep data has time included in the date column. Prior to formatting the column, it is to be split so that the date is on its own.The time data is not necessary and so is removed. Two additional columns were added with the minutes asleep and minutes in bed converted to hours.

sleep_day_2 <- separate(daily_sleep_updated, col = SleepDay, into = c("Date", "Time", "Am/Pm"), sep = " ")

sleep_day_2$Date <-as.Date(sleep_day_2$Date, format = "%m/%d/%Y")

sleep_day_2$`Am/Pm` <- NULL
sleep_day_2$Time <- NULL

sleep_day_2$TotalHoursAsleep <- sleep_day_2$TotalMinutesAsleep/60
sleep_day_2$TotalHoursInBed <- sleep_day_2$TotalTimeInBed/60

Check through each table to see if they contain potentially invalid ID numbers (i.e. less than 10 digits)

subset(daily_activity, nchar(daily_activity$Id) < 10 | nchar(daily_activity$Id) > 10)
## # A tibble: 0 x 15
## # ... with 15 variables: Id <dbl>, ActivityDate <date>, TotalSteps <dbl>,
## #   TotalDistance <dbl>, TrackerDistance <dbl>, LoggedActivitiesDistance <dbl>,
## #   VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## #   LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## #   VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## #   LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>
subset(daily_sleep_updated, nchar(daily_sleep_updated$Id) < 10 | nchar(daily_sleep_updated$Id) > 10)
## # A tibble: 0 x 5
## # ... with 5 variables: Id <dbl>, SleepDay <chr>, TotalSleepRecords <dbl>,
## #   TotalMinutesAsleep <dbl>, TotalTimeInBed <dbl>

No invalid ID numbers.

Check for rows with NA or missing info.

drop_na(daily_activity)
## # A tibble: 940 x 15
##        Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie~
##     <dbl> <date>            <dbl>         <dbl>           <dbl>            <dbl>
##  1 1.50e9 2016-04-12        13162          8.5             8.5                 0
##  2 1.50e9 2016-04-13        10735          6.97            6.97                0
##  3 1.50e9 2016-04-14        10460          6.74            6.74                0
##  4 1.50e9 2016-04-15         9762          6.28            6.28                0
##  5 1.50e9 2016-04-16        12669          8.16            8.16                0
##  6 1.50e9 2016-04-17         9705          6.48            6.48                0
##  7 1.50e9 2016-04-18        13019          8.59            8.59                0
##  8 1.50e9 2016-04-19        15506          9.88            9.88                0
##  9 1.50e9 2016-04-20        10544          6.68            6.68                0
## 10 1.50e9 2016-04-21         9819          6.34            6.34                0
## # ... with 930 more rows, and 9 more variables: VeryActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>
drop_na(daily_sleep_updated)
## # A tibble: 410 x 5
##            Id SleepDay          TotalSleepRecor~ TotalMinutesAsl~ TotalTimeInBed
##         <dbl> <chr>                        <dbl>            <dbl>          <dbl>
##  1 1503960366 4/12/2016 12:00:~                1              327            346
##  2 1503960366 4/13/2016 12:00:~                2              384            407
##  3 1503960366 4/15/2016 12:00:~                1              412            442
##  4 1503960366 4/16/2016 12:00:~                2              340            367
##  5 1503960366 4/17/2016 12:00:~                1              700            712
##  6 1503960366 4/19/2016 12:00:~                1              304            320
##  7 1503960366 4/20/2016 12:00:~                1              360            377
##  8 1503960366 4/21/2016 12:00:~                1              325            364
##  9 1503960366 4/23/2016 12:00:~                1              361            384
## 10 1503960366 4/24/2016 12:00:~                1              430            449
## # ... with 400 more rows

Some of the steps and distance results may be 0, which would likely indicate the device was not used that day. As this may skew results, these values were removed.

daily_activity <- daily_activity %>% filter(TotalSteps !=0)
daily_activity <- daily_activity %>%  filter(TotalDistance !=0)

Add a day of the week column to the activity data, so that potential patterns can be evaluated.

daily_activity$Day <- weekdays(daily_activity$ActivityDate)
colnames(daily_activity)
##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"                 "Day"

Analyze

A summary of several variables related to the activity and sleep data is shown below.

daily_activity %>%
  select(TotalSteps, TotalDistance, SedentaryMinutes, LightlyActiveMinutes, FairlyActiveMinutes, VeryActiveMinutes, Calories) %>%
  summary()
##    TotalSteps    TotalDistance    SedentaryMinutes LightlyActiveMinutes
##  Min.   :    8   Min.   : 0.010   Min.   :   0.0   Min.   :  0.0       
##  1st Qu.: 4927   1st Qu.: 3.373   1st Qu.: 721.2   1st Qu.:147.0       
##  Median : 8054   Median : 5.590   Median :1020.5   Median :208.5       
##  Mean   : 8329   Mean   : 5.986   Mean   : 955.2   Mean   :210.3       
##  3rd Qu.:11096   3rd Qu.: 7.905   3rd Qu.:1189.0   3rd Qu.:272.0       
##  Max.   :36019   Max.   :28.030   Max.   :1440.0   Max.   :518.0       
##  FairlyActiveMinutes VeryActiveMinutes    Calories   
##  Min.   :  0.00      Min.   :  0.00    Min.   :  52  
##  1st Qu.:  0.00      1st Qu.:  0.00    1st Qu.:1857  
##  Median :  8.00      Median :  7.00    Median :2220  
##  Mean   : 14.79      Mean   : 23.04    Mean   :2362  
##  3rd Qu.: 21.00      3rd Qu.: 35.00    3rd Qu.:2832  
##  Max.   :143.00      Max.   :210.00    Max.   :4900
sleep_day_2 %>%  
  select(TotalHoursAsleep,
         TotalHoursInBed) %>%
  summary()
##  TotalHoursAsleep  TotalHoursInBed 
##  Min.   : 0.9667   Min.   : 1.017  
##  1st Qu.: 6.0167   1st Qu.: 6.729  
##  Median : 7.2083   Median : 7.717  
##  Mean   : 6.9862   Mean   : 7.641  
##  3rd Qu.: 8.1667   3rd Qu.: 8.767  
##  Max.   :13.2667   Max.   :16.017

Observations:

Share

Visualizations are shown below to highlight trends and patterns.

If more steps are taken, are more calories burned?

ggplot(data=daily_activity) +
  geom_point(mapping=aes(x=TotalSteps, y=Calories), color="blue") +
  geom_smooth(mapping=aes(x=TotalSteps, y=Calories), color="red") +
  labs(title="The Relationship Between Total Steps and Calories Burned", x="Total Steps", y="Calories Burned (kcal)") +
  theme(plot.title = element_text(hjust = 0.5))

The above graph indicates that users who logged more steps typically burned more calories.

Is there a correlation between level (or intensity) of the activity and how many calories are burned?

ggplot(data=daily_activity) +
  geom_point(mapping=aes(x=SedentaryMinutes, y=Calories), color="red") +
  geom_smooth(mapping=aes(x=SedentaryMinutes, y=Calories)) +
  labs(title="Calories Burned Relative to Sedentary Time", x="Sedentary Time (Min)", y="Calories Burned (kcal)")+ theme(plot.title = element_text(hjust = 0.5))

ggplot(data=daily_activity) +
  geom_point(mapping=aes(x=FairlyActiveMinutes, y=Calories), color="red") +
  geom_smooth(mapping=aes(x=FairlyActiveMinutes, y=Calories)) +
  labs(title="Calories Burned Relative to Fairly Active Time", x="Fairly Active (Min)", y="Calories Burned (kcal)")+ theme(plot.title = element_text(hjust = 0.5))

ggplot(data=daily_activity) +
  geom_point(mapping=aes(x=LightlyActiveMinutes, y=Calories), color="red") +
  geom_smooth(mapping=aes(x=LightlyActiveMinutes, y=Calories)) +
  labs(title="Calories Burned Relative to Light Activity Time", x="Lightly Active (Min)", y="Calories Burned (kcal)")+ theme(plot.title = element_text(hjust = 0.5))

ggplot(data=daily_activity) +
  geom_point(mapping=aes(x=VeryActiveMinutes, y=Calories), color="red") +
  geom_smooth(mapping=aes(x=VeryActiveMinutes, y=Calories)) +
  labs(title="Calories Burned Relative to Very Active Time", x="Very Active (Min)", y="Calories Burned (kcal)")+ theme(plot.title = element_text(hjust = 0.5))

In general, there is correlation between higher intensity activity resulting in more calories burned.

Is there a clear correlation between time spent in bed and how much actual sleep a user gets?

ggplot(data=sleep_day_2, mapping=aes(x=TotalHoursAsleep, y=TotalHoursInBed), color="purple") +
  geom_point() +
  geom_smooth() +
  labs(title="Total Sleep Time Relative to Time in Bed", x="Total Time Alseep (hours)", y="Total Time in Bed (hours)") +
  theme(plot.title = element_text(hjust = 0.5))

The amount of sleep a user got increased fairly linearly with how much time they actually spent in bed.

How frequently did users log data (how often did they use the app) and what days?

Note: First, the order the days appear in the output is to be correctly implemented.

day_order <- c( 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')

daily_activity %>% 
  mutate(Day = fct_relevel(.f = Day, levels = day_order)) %>% 
  ggplot(aes(x = Day)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 30)) +
  labs(title="Active Days", y="Logged activity (counts)") +
  theme(plot.title = element_text(hjust = 0.5))

The data indicates that app usage peaks on Tuesdays and Wednesdays before trending down until Mondays.

What percentage of the logged time was active (and to what intensity of activity)?

sum(daily_activity$SedentaryMinutes)
## [1] 823376
activity_status <- c("Sedentary", "LightlyActive", "FairlyActive", "VeryActive")
total_time <- c(
  sum(daily_activity$SedentaryMinutes), 
  sum(daily_activity$LightlyActiveMinutes), 
  sum(daily_activity$FairlyActiveMinutes), 
  sum(daily_activity$VeryActiveMinutes))

total_time_combined <- sum(total_time)

percentage <- c(
  (sum(daily_activity$SedentaryMinutes)*100/total_time_combined), 
  (sum(daily_activity$LightlyActiveMinutes)*100/total_time_combined), 
  (sum(daily_activity$FairlyActiveMinutes)*100/total_time_combined), 
  (sum(daily_activity$VeryActiveMinutes)*100/total_time_combined))

total_activity_status <- data.frame(activity_status, total_time, percentage)

pie(total_activity_status$total_time, main="Activity Breakdown", labels = paste0(activity_status, " = ", round(percentage, 0), "%"), edges = 200, radius = 1,  col=rainbow(4),  init.angle = 15, cex = 1)

A significant majority of the activity logged was sedentary time. Actual activity made up only a fifth of the time, with fairly and very active intensity being a small percentage.

Act

Several insights and suggestions, based on the analysis, is summarized below for the Shareholders:

Based on the limitations of the FitBit data, here are some recommendations for future analysis: