Introduction

Bellabeat is a high-tech company that manufactures health-focused smart products, including the Bellabeat application, Leaf, Time, and Spring. Its ecosystem of products and services focuses on women’s health by collecting and presenting users’ health data on activity, sleep, and stress.

The company expects to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company.

This is a case study from the Google Data Analytics Certification program, so I will follow the data analysis process, which includes six phases: Ask, Prepare, Process, Analyze, Share, and Act.

I will use R to analyze the data and document this case study.

Ask Phase

Business Task

Unlock new growth opportunities by analyzing trends in smart device usage and suggesting a data-driven marketing strategy.

Data analysis goal

  1. What are some trends in smart device usage?
  2. How could these trends apply to Bellabeat customers?
  3. How could these trends help influence Bellabeat marketing strategy?

Deliverables

Top high-level recommendations for the marketing team.

Key stakeholders

  • Urška Sršen, Bellabeat’s co-founder and Chief Creative Officer;
  • Sandro Mur, mathematician and Bellabeat’s co-founder;
  • Key members of the executive team, including marketing leads.

Prepare Phase

Source of data

FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius) contains personal fitness tracker data from thirty Fitbit users.

Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.

Determine the credibility & integrity of the data:

ROCCC Analysis

ROCCC (Reliable, Original, Comprehensive, Current & Cited) will be used to assess the credibility and integrity of the data.

  • Reliability: (LOW) Only 30 individuals were involved in this survey, and their gender is unknown. The sample is therefore very small and does not distinguish between men and women.
  • Originality: (LOW) The data was collected by a third party through a survey on Amazon Mechanical Turk.
  • Comprehensive: (MEDIUM) The datasets contain multiple parameters on daily activity, daily steps taken, and daily sleep time, which are required for the business task.
  • Current: (LOW) The dataset was collected in 2016 (over 8 years ago) and covers a short period, April to May 2016. The collection window is short and might not be representative of activity across the whole year.
  • Cited: (HIGH) This dataset is under CC0: Public Domain, made available by Mobius and hosted on Kaggle.

Data selection

There are 18 files in this dataset. Since the focus is to identify patterns of activity in smart device usage, I suggest the following files:

  • dailyActivity_merged.csv
  • dailyCalories_merged.csv
  • dailyIntensities_merged.csv
  • dailySteps_merged.csv
  • dailySleep_merged.csv
  • weightLogInfo_merged.csv

Process Phase

The required R packages are loaded as below:

library(tidyverse)
library(skimr)
library(janitor)
library(ggplot2)
library(lubridate)
library(here)

Import data and quick summary

Import selected datasets that will be used in analysis:

file | colnames | # obs | # variables
dailyActivity | Id, ActivityDate, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDistance, VeryActiveDistance, ModeratelyActiveDistance, LightActiveDistance, SedentaryActiveDistance, VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, SedentaryMinutes, Calories | 940 | 15
dailyCalories | Id, ActivityDay, Calories | 940 | 3
dailyIntensities | Id, ActivityDay, SedentaryMinutes, LightlyActiveMinutes, FairlyActiveMinutes, VeryActiveMinutes, SedentaryActiveDistance, LightActiveDistance, ModeratelyActiveDistance, VeryActiveDistance | 940 | 10
dailySteps | Id, ActivityDay, StepTotal | 940 | 3
dailySleep | Id, SleepDay, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed | 413 | 5
weightLogInfo | Id, Date, WeightKg, WeightPounds, Fat, BMI, IsManualReport, LogId | 67 | 8
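A summary like the one above can be generated rather than typed by hand. The following is a minimal sketch, assuming the imported data frames are collected in a named list; the helper `summarise_tables` and the toy tables are hypothetical names introduced for illustration, not part of the original analysis.

```r
# Hypothetical helper: build one summary row per data frame in a named list.
summarise_tables <- function(tables) {
  data.frame(
    file        = names(tables),
    colnames    = vapply(tables, function(df) paste(names(df), collapse = ", "), character(1)),
    n_obs       = vapply(tables, nrow, integer(1)),
    n_variables = vapply(tables, ncol, integer(1)),
    row.names   = NULL
  )
}

# Toy stand-ins for the imported tables (not the real data).
toy_tables <- list(
  dailySteps = data.frame(
    Id = c(1, 1, 2),
    ActivityDay = c("4/12/2016", "4/13/2016", "4/12/2016"),
    StepTotal = c(13162, 10735, 0)
  ),
  dailyCalories = data.frame(
    Id = c(1, 2),
    ActivityDay = c("4/12/2016", "4/12/2016"),
    Calories = c(1985, 1797)
  )
)

table_summary <- summarise_tables(toy_tables)
print(table_summary)
```

With the real data, the list would hold the six imported tables, for example `list(dailyActivity = dailyActivity, dailySleep = dailySleep)`.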

Next, a quick check of the data shows whether any cleaning is needed:

head(dailyActivity)
## # A tibble: 6 × 15
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
##        <dbl> <chr>             <dbl>         <dbl>           <dbl>
## 1 1503960366 4/12/2016         13162          8.5             8.5 
## 2 1503960366 4/13/2016         10735          6.97            6.97
## 3 1503960366 4/14/2016         10460          6.74            6.74
## 4 1503960366 4/15/2016          9762          6.28            6.28
## 5 1503960366 4/16/2016         12669          8.16            8.16
## 6 1503960366 4/17/2016          9705          6.48            6.48
## # ℹ 10 more variables: LoggedActivitiesDistance <dbl>,
## #   VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## #   LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## #   VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## #   LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>
str(dailyActivity)
## spc_tbl_ [940 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id                      : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ TotalSteps              : num [1:940] 13162 10735 10460 9762 12669 ...
##  $ TotalDistance           : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : num [1:940] 728 776 1218 726 773 ...
##  $ Calories                : num [1:940] 1985 1797 1776 1745 1863 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   ActivityDate = col_character(),
##   ..   TotalSteps = col_double(),
##   ..   TotalDistance = col_double(),
##   ..   TrackerDistance = col_double(),
##   ..   LoggedActivitiesDistance = col_double(),
##   ..   VeryActiveDistance = col_double(),
##   ..   ModeratelyActiveDistance = col_double(),
##   ..   LightActiveDistance = col_double(),
##   ..   SedentaryActiveDistance = col_double(),
##   ..   VeryActiveMinutes = col_double(),
##   ..   FairlyActiveMinutes = col_double(),
##   ..   LightlyActiveMinutes = col_double(),
##   ..   SedentaryMinutes = col_double(),
##   ..   Calories = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(dailyCalories)
## spc_tbl_ [940 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id         : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDay: chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ Calories   : num [1:940] 1985 1797 1776 1745 1863 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   ActivityDay = col_character(),
##   ..   Calories = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(dailyIntensities)
## spc_tbl_ [940 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id                      : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDay             : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ SedentaryMinutes        : num [1:940] 728 776 1218 726 773 ...
##  $ LightlyActiveMinutes    : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
##  $ FairlyActiveMinutes     : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
##  $ VeryActiveMinutes       : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
##  $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ LightActiveDistance     : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
##  $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
##  $ VeryActiveDistance      : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   ActivityDay = col_character(),
##   ..   SedentaryMinutes = col_double(),
##   ..   LightlyActiveMinutes = col_double(),
##   ..   FairlyActiveMinutes = col_double(),
##   ..   VeryActiveMinutes = col_double(),
##   ..   SedentaryActiveDistance = col_double(),
##   ..   LightActiveDistance = col_double(),
##   ..   ModeratelyActiveDistance = col_double(),
##   ..   VeryActiveDistance = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(dailySteps)
## spc_tbl_ [940 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id         : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDay: chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ StepTotal  : num [1:940] 13162 10735 10460 9762 12669 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   ActivityDay = col_character(),
##   ..   StepTotal = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(dailySleep)
## spc_tbl_ [413 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id                : num [1:413] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ SleepDay          : chr [1:413] "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
##  $ TotalSleepRecords : num [1:413] 1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: num [1:413] 327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : num [1:413] 346 407 442 367 712 320 377 364 384 449 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   SleepDay = col_character(),
##   ..   TotalSleepRecords = col_double(),
##   ..   TotalMinutesAsleep = col_double(),
##   ..   TotalTimeInBed = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(weightLogInfo)
## spc_tbl_ [67 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id            : num [1:67] 1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
##  $ Date          : chr [1:67] "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
##  $ WeightKg      : num [1:67] 52.6 52.6 133.5 56.7 57.3 ...
##  $ WeightPounds  : num [1:67] 116 116 294 125 126 ...
##  $ Fat           : num [1:67] 22 NA NA NA NA 25 NA NA NA NA ...
##  $ BMI           : num [1:67] 22.6 22.6 47.5 21.5 21.7 ...
##  $ IsManualReport: logi [1:67] TRUE TRUE FALSE TRUE TRUE TRUE ...
##  $ LogId         : num [1:67] 1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   Date = col_character(),
##   ..   WeightKg = col_double(),
##   ..   WeightPounds = col_double(),
##   ..   Fat = col_double(),
##   ..   BMI = col_double(),
##   ..   IsManualReport = col_logical(),
##   ..   LogId = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Data cleaning

Verifying the number of users

n_distinct(dailyActivity$Id)
## [1] 33
n_distinct(dailyCalories$Id)
## [1] 33
n_distinct(dailyIntensities$Id)
## [1] 33
n_distinct(dailySteps$Id)
## [1] 33
n_distinct(dailySleep$Id)
## [1] 24
n_distinct(weightLogInfo$Id)
## [1] 8

The survey was expected to have 30 participants, but the daily datasets contain 33 unique Ids, which suggests that some users may have created more than one Id. The dailySleep and weightLogInfo datasets cover only 24 and 8 users, so relative to the expected 30 participants they are missing information for 6 and 22 participants, respectively.
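To see which users are missing from a table, rather than only how many, the Id sets can be compared directly. This is a minimal sketch on toy Id vectors; `activity_ids` and `sleep_ids` are hypothetical stand-ins for `dailyActivity$Id` and `dailySleep$Id`.

```r
# Toy Id vectors standing in for dailyActivity$Id and dailySleep$Id.
activity_ids <- c(101, 101, 102, 103, 103, 104)
sleep_ids    <- c(101, 103, 103)

# Users who appear in the activity log but never in the sleep log.
ids_without_sleep <- setdiff(unique(activity_ids), unique(sleep_ids))

# Share of activity users who have at least one sleep record.
sleep_coverage <- mean(unique(activity_ids) %in% sleep_ids)

print(ids_without_sleep)
print(sleep_coverage)
```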

Duplicate

We will check each table for duplicates:

## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 3
## [1] 0
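The code behind these counts is not shown. One common way to obtain such counts is `sum(duplicated(df))`, which counts rows that exactly repeat an earlier row. The sketch below applies it to a toy table; the data is made up for illustration and this may not be the exact call used in the original analysis.

```r
library(dplyr)

# Toy sleep log with one exact duplicate row (not the real data).
toy_sleep <- tibble(
  Id = c(1, 1, 1, 2),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM",
               "4/13/2016 12:00:00 AM", "4/12/2016 12:00:00 AM"),
  TotalMinutesAsleep = c(327, 384, 384, 412)
)

# Count rows that are exact repeats of an earlier row.
n_duplicates <- sum(duplicated(toy_sleep))

# distinct() keeps the first copy of each repeated row.
toy_sleep_clean <- distinct(toy_sleep)

print(n_duplicates)
print(nrow(toy_sleep_clean))
```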

Remove duplicates and NA records

Knowing that dailySleep contains duplicates, we can remove them and keep only distinct, non-NA rows as follows:

dailyActivity <- dailyActivity %>% distinct() %>% drop_na()
dailyCalories <- dailyCalories %>% distinct() %>% drop_na()
dailyIntensities <- dailyIntensities %>% distinct() %>% drop_na()
dailySteps <- dailySteps %>% distinct() %>% drop_na()
dailySleep <- dailySleep %>% distinct() %>% drop_na()

For weightLogInfo, we will not remove NAs, because this table is entered manually by users, and some users may skip entering their weight information in the application.

Clean and rename columns

To ensure that column names follow one consistent format, we clean them and convert them to lowercase:

dailyActivity <- clean_names(dailyActivity)
dailyCalories <- clean_names(dailyCalories)
dailyIntensities <- clean_names(dailyIntensities)
dailySteps <- clean_names(dailySteps)
dailySleep <- clean_names(dailySleep)
weightLogInfo <- clean_names(weightLogInfo)

Make date and time columns consistent

Date/time columns should be converted from character to proper date types and, where a time component exists, separated into date and time columns. We will also add the weekday for further analysis.

dailyActivity <- dailyActivity %>%
  rename(date=activity_date) %>%
  mutate(date = as_date(date, format = "%m/%d/%Y")) %>%
  mutate(weekday = weekdays(date)) %>%
  mutate(weekday = ordered(weekday, levels=c("Monday", "Tuesday", "Wednesday", "Thursday","Friday", "Saturday", "Sunday")))

dailyCalories <- dailyCalories %>%
  rename(date = activity_day) %>%
  mutate(date = as_date(date, format = "%m/%d/%Y"))

dailyIntensities <- dailyIntensities %>%
  rename(date = activity_day) %>% 
  mutate(date = as_date(date, format = "%m/%d/%Y"))

dailySteps <- dailySteps %>%
  rename(date = activity_day) %>% 
  mutate(date = as_date(date, format = "%m/%d/%Y"))

dailySleep <- dailySleep %>%
  rename(date = sleep_day) %>%
  mutate(date = as_date(date, format = "%m/%d/%Y %I:%M:%S %p"))

weightLogInfo <- weightLogInfo %>%
  mutate(date=as.POSIXct(date, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())) %>%
  separate(date,into=c("date","time"),sep=" ")%>%
  mutate(date=as_date(date))

Merge data

We will join two tables, dailyActivity and dailySleep, to analyze the impact of daily activity on sleep.

daily_Activity_Sleep <- merge(dailyActivity,dailySleep,by=c("id","date"), all = TRUE)

# or, with dplyr (naming the join keys explicitly):
# daily_Activity_Sleep <- dailyActivity %>% 
#   full_join(dailySleep, by = c("id", "date"))
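A full join keeps every row from both tables, so days with activity but no sleep record end up with NA in the sleep columns. The sketch below shows this on toy tables; the data is made up for illustration and only mirrors the cleaned column names.

```r
library(dplyr)

# Toy activity table: three user-days.
toy_activity <- tibble(
  id = c(1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  total_steps = c(13162, 10735, 4000)
)

# Toy sleep table: only one of those user-days has a sleep record.
toy_sleep <- tibble(
  id = 1,
  date = as.Date("2016-04-12"),
  total_minutes_asleep = 327
)

# Full join on the shared keys keeps unmatched activity rows.
toy_joined <- full_join(toy_activity, toy_sleep, by = c("id", "date"))

# Number of user-days without a sleep record.
n_missing_sleep <- sum(is.na(toy_joined$total_minutes_asleep))

print(toy_joined)
print(n_missing_sleep)
```

An inner join would instead keep only days that have both activity and sleep records, which may be preferable when only the relationship between the two is of interest.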

Analyze and Share Phase

Quick summary

daily_Activity_Sleep %>%  
  select(total_steps,
         total_distance,
         ends_with("minutes"),
         calories, 
         total_minutes_asleep,
         total_time_in_bed) %>%
  summary()
##   total_steps    total_distance   very_active_minutes fairly_active_minutes
##  Min.   :    0   Min.   : 0.000   Min.   :  0.00      Min.   :  0.00       
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.:  0.00      1st Qu.:  0.00       
##  Median : 7406   Median : 5.245   Median :  4.00      Median :  6.00       
##  Mean   : 7638   Mean   : 5.490   Mean   : 21.16      Mean   : 13.56       
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.: 32.00      3rd Qu.: 19.00       
##  Max.   :36019   Max.   :28.030   Max.   :210.00      Max.   :143.00       
##                                                                            
##  lightly_active_minutes sedentary_minutes    calories    total_minutes_asleep
##  Min.   :  0.0          Min.   :   0.0    Min.   :   0   Min.   : 58.0       
##  1st Qu.:127.0          1st Qu.: 729.8    1st Qu.:1828   1st Qu.:361.0       
##  Median :199.0          Median :1057.5    Median :2134   Median :432.5       
##  Mean   :192.8          Mean   : 991.2    Mean   :2304   Mean   :419.2       
##  3rd Qu.:264.0          3rd Qu.:1229.5    3rd Qu.:2793   3rd Qu.:490.0       
##  Max.   :518.0          Max.   :1440.0    Max.   :4900   Max.   :796.0       
##                                                          NA's   :530         
##  total_time_in_bed
##  Min.   : 61.0    
##  1st Qu.:403.8    
##  Median :463.0    
##  Mean   :458.5    
##  3rd Qu.:526.0    
##  Max.   :961.0    
##  NA's   :530
  • The average sedentary (non-active) time is 991 minutes, or ~16.5 hours, which we should aim to reduce.
  • Based on active minutes, the majority of participants are lightly active.
  • On average, participants sleep 419 minutes, or ~7 hours.
  • The average number of steps per day is 7,638, a “somewhat active” level relative to the 10,000-step goal.
  • On average, participants burn 2,304 calories per day (~96 calories/hour).
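The unit conversions behind these bullets can be checked directly from the means in the summary() output. A minimal sketch; the variable names are introduced here for illustration only.

```r
# Means as reported in the summary() output.
mean_sedentary_minutes <- 991.2
mean_minutes_asleep    <- 419.2
mean_calories          <- 2304

# Minutes to hours, and daily calories to calories per hour.
sedentary_hours   <- mean_sedentary_minutes / 60
asleep_hours      <- mean_minutes_asleep / 60
calories_per_hour <- mean_calories / 24

print(round(c(sedentary_hours = sedentary_hours,
              asleep_hours = asleep_hours,
              calories_per_hour = calories_per_hour), 1))
```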

Correlations

We will now examine the correlations between the numeric variables:

Besides the near-perfect correlation between total_steps and total_distance (approximately 1), other pairs show a strong positive correlation (cor >= 0.7):

  • very_active_distance vs. very_active_minutes.
  • moderately_active_distance vs. fairly_active_minutes.
  • light_active_distance vs. lightly_active_minutes.
  • total_time_in_bed vs. total_minutes_asleep.

In what follows, we can therefore use either *_active_distance or *_active_minutes, and either total_steps or total_distance, for activity level analysis, and either total_minutes_asleep or total_time_in_bed for sleep quality.
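The correlation step itself is not shown. One way to compute such a matrix is base R's `cor()` with pairwise-complete observations, so that missing sleep values do not discard whole rows. The sketch below uses a small toy table, not the real merged data, and the 0.7 cutoff is the one quoted in the text.

```r
library(dplyr)

# Toy daily table with one missing sleep value (not the real data).
toy_daily <- tibble(
  total_steps          = c(13162, 10735, 10460, 9762, 12669, 9705),
  total_distance       = c(8.50, 6.97, 6.74, 6.28, 8.16, 6.48),
  very_active_minutes  = c(25, 21, 30, 29, 36, 38),
  total_minutes_asleep = c(327, 384, NA, 412, 340, 700)
)

# Pearson correlations, using all complete pairs for each variable pair.
cor_matrix <- cor(toy_daily, use = "pairwise.complete.obs")

# Index pairs with correlation >= 0.7, skipping the diagonal and mirrored half.
strong_pairs <- which(cor_matrix >= 0.7 & upper.tri(cor_matrix), arr.ind = TRUE)

print(round(cor_matrix, 2))
print(strong_pairs)
```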

Hypothesis

From this preliminary view, we can form some hypotheses about activity, calories burnt, sleep time, and weight:

  • There is a relationship between activity level and calories burnt.
  • There is a relationship between activity level and sleep time.
  • There is a relationship between activity level and weight.
  • There is a relationship between steps and distance.

Activity level

Let’s analyze the activity level distribution based on active/non-active minutes:
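As a rough first look at that distribution, the mean minutes per activity level reported by summary() can be turned into shares of tracked time. This is a minimal sketch that uses only those four reported means; it is not the chart from the original analysis.

```r
# Mean minutes per activity level, as reported in the summary() output.
mean_minutes <- c(
  very_active    = 21.16,
  fairly_active  = 13.56,
  lightly_active = 192.8,
  sedentary      = 991.2
)

# Share of tracked minutes spent at each level.
share <- mean_minutes / sum(mean_minutes)

print(round(100 * share, 1))
```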

Sedentary (non-active) minutes

Since we know the average non-active time is 16.5 hours, we need to analyze the distribution of sedentary minutes:

Activity level and Calories burnt

The visualization shows a positive correlation between very active minutes and calories burnt. For sedentary minutes, the relationship is initially positive but then turns negative: fewer calories are burned as sedentary minutes increase. We can therefore say that a person with more very active minutes tends to burn more calories in a day, and the more time they spend inactive, the fewer calories they burn.

Activity level and Sleep quality

  • There is a negative relationship between sedentary (non-active) minutes and minutes asleep: the more sedentary minutes, the less time users spend asleep.
  • For the other activity levels, beyond 50 active minutes there is:
    • a negative relationship between fairly active minutes and sleep quality.
    • a positive relationship between very active minutes and sleep quality.
  • For lightly active minutes, and for up to 50 fairly or very active minutes, total minutes asleep range from 250 to 550.

In the next section, we could check when users perform fairly and very active activities during the week, to see whether exercise timing affects sleep quality.

Steps and sleep goals for health

User type

According to the 10,000 steps goal program, 10,000 steps is the recommended daily target for healthy adults. The program also suggests pedometer indices as a guideline for steps and activity levels. We map these indices to sedentary, lightly active, fairly active, and very active to classify user types based on total daily steps:

  • Sedentary is less than 5,000 steps per day
  • Lightly active is 5,000 to 7,499 steps per day
  • Fairly active is 7,500 to 9,999 steps per day
  • Very active is 10,000 or more steps per day

With this classification based on total daily steps, we can see that user types are fairly evenly distributed among the activity levels.
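The thresholds above can be applied with `case_when()`. The sketch below assumes each user is classified by their mean daily steps, which the text does not state explicitly; the toy data and the `user_type` labels are for illustration only.

```r
library(dplyr)

# Toy daily step counts for four users (not the real data).
toy_steps <- tibble(
  id = c(1, 1, 2, 2, 3, 3, 4, 4),
  total_steps = c(13162, 10735, 3000, 4500, 7000, 6000, 8000, 9500)
)

# Assumption: classify each user by their mean daily steps.
user_types <- toy_steps %>%
  group_by(id) %>%
  summarise(mean_daily_steps = mean(total_steps), .groups = "drop") %>%
  mutate(user_type = case_when(
    mean_daily_steps < 5000  ~ "sedentary",
    mean_daily_steps < 7500  ~ "lightly active",
    mean_daily_steps < 10000 ~ "fairly active",
    TRUE                     ~ "very active"
  ))

print(user_types)

# Distribution of user types.
print(count(user_types, user_type))
```

Because `case_when()` evaluates its conditions in order, each branch only needs an upper bound.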

Steps and Distance relation

We can observe a strong relationship between distance and steps: the more steps a user takes, the more distance they cover.

Steps and sleep by weekday

The two charts show:

  • Users’ daily steps compared with the 7,500-step goal, the “somewhat active” level.
  • Users’ sleep minutes, which are below the recommended daily sleep (8 hours, or 480 minutes).

We should therefore alert users when their tracked steps or sleep minutes fall below those thresholds.

We can also see that users are more active on Tuesday and Saturday, when they take more steps, and that they tend to burn more calories on those days.
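The weekday comparison against the two thresholds can be sketched as a grouped summary. The 7,500-step and 480-minute thresholds come from the text; the toy data and the flag column names are assumptions for illustration.

```r
library(dplyr)

# Toy daily records with a weekday label (not the real data).
toy_daily <- tibble(
  weekday = c("Monday", "Monday", "Tuesday", "Tuesday", "Saturday"),
  total_steps = c(6000, 7000, 9000, 8000, 8200),
  total_minutes_asleep = c(420, 500, 400, 440, 490)
)

# Mean steps and sleep per weekday, flagged against the two thresholds.
weekday_summary <- toy_daily %>%
  group_by(weekday) %>%
  summarise(
    mean_steps = mean(total_steps),
    mean_minutes_asleep = mean(total_minutes_asleep, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    below_steps_goal = mean_steps < 7500,
    below_sleep_goal = mean_minutes_asleep < 480
  )

print(weekday_summary)
```

The same two comparisons, applied per user and per day instead of per weekday, would give the condition for the alerts suggested above.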

Activity level and weight relation

Since only 8 people recorded their weight, the sample is small, so the charts are for illustration rather than for drawing conclusions about weight and activity level.

Act Phase

After analyzing the Fitbit tracker data, I came up with the following data-driven recommendations for Bellabeat’s marketing strategy:

  1. Bellabeat can include a function in its application to alert users who tend to have high non-active minutes. It can also add a function to alert users who have not worn their tracker for a period of time.

  2. According to the CDC’s recommendation, taking 7,500 steps per day can help reduce the risk of all-cause mortality; Bellabeat can add a function to promote that benefit and track whether users meet this daily goal.

  3. Bellabeat can include timely notifications in Leaf/Time to motivate users to move around regularly and reduce their sedentary minutes.

  4. Bellabeat can add a function to remind users to enter their weight information, then use the relation between sedentary/active minutes and weight to promote an active lifestyle.

  5. Bellabeat can enhance its sleep tracking function to promote the relation between sleep and inactivity. It could track sleep and alert users who sleep less than 8 hours a day. Users can set a desired bedtime, and the Bellabeat application can notify them. Articles or podcasts about sleep techniques, or relaxing music to help users sleep better, would also be helpful.

THANK YOU.