Roadmap

1. Ask

2. Prepare

3. Process

4. Analyze

5. Share

6. Act


Introduction:

Bellabeat is a succesful small company that manufactures health-focused smart products. Bellabeat products empowered women with knowledge regarding their own health and habits. They have the potential to become a larger player in the global smart device market. In this case study, I analyzed FitBit Fitness Tracked Data to help find insights and recommendations that might help unlock new growth opportunities.
***

Ask

  1. The Business Task
    • Focus on a Bellabeat product and analyze smart device usage data in order to gain insight into how people are already using their smart devices
    • Using this information, offer high-level recommendations for how these trends can inform Bellabeat marketing strategy to unlock new growth opportunities for the company
  2. Key Stakeholders
    • Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team
    • Bellabeat marketing analytics team
    • Customers with membership
    • Customers
    • Potential Customers, e.g., users of other brand smart device products, people interested in improving their health and well being by using smart devices, people interested in looking cool and sporty, people interested in quantifying their habits into data.

Prepare

This case study used the FitBit Fitness Tracked Data posted in Kaggle. By looking into the data we could get several takeaways:

  1. The datasets were generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016
  2. Thirty eligible FitBit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring
  3. Individual reports can be parsed by export session ID or timestamp
  4. Variation between output represents use of different types of FitBit trackers and individual tracking behaviors / preferences
  5. The data license is CC0: Public Domain
  6. It was sourced from https://zenodo.org/record/53894#.X9oeh3Uzaao and ackowledged to Furberg, Robert; Brinton, Julia; Keating, Michael ; Ortiz, Alexa
  7. Posted to Kaggle by user named Möbius at 2020-12-16
  8. The data had been preprocessed
  9. Based on Kaggle’s evaluation, the data usability score is 10 out of 10, with regards to followings traits:
    • Easy to understand and includes essential metadata
    • Rich, Rich, machine readable file formats and metadata
    • Assurances the dataset is maintained
  10. The datasets consist of 18 files, but I will study mainly on dailyActivity_merged.csv file and sleepDay_merged.csv as supporting data. The reasons are:
    • I will not dive deep into the details in datasets based on hourly, minutes, and seconds to limit the scope of analysis
    • I will not use weightlog since only 8 IDs registered weight, while there are 33 IDs in total
    • dailyActivity_merged.csv and sleepDay_merged.csv are already merged versions of several other datasets
  11. The number of IDs tracking data in daily_activity is 33
  12. The number of IDs tracking data in sleepday is 24

Process

The cleaning and manipulation I had done to verify and add data into the informations presented in the Prepare phase before is:
  • Setting up the environment in R Studio
#fundamental building blocks to support the entire end-to-end workflow
library(tidyverse)
#to provide summary statistics
library(skimr)
#examining and cleaning dirty data
library(janitor)
#Lubridate makes it easier to do the things R does with date-times 
library(lubridate)
#Scales to better state the numbers shown in visualization
library(scales)
  • Importing the Dataset by using RStudio menu “Import Dataset From Text (readr)…
setwd("~/R/Case Study 2/Fitabase Data 4.12.16-5.12.16")
dailyActivity_merged <- read_csv("dailyActivity_merged.csv")
sleepDay_merged <- read_csv("sleepDay_merged.csv")
  • Cleaned and rename the data
daily_activity <- clean_names(dailyActivity_merged)
sleepday <- clean_names(sleepDay_merged)
  • Used Skimr package to get the general sense of the data
skim_without_charts(daily_activity)
Data summary
Name daily_activity
Number of rows 940
Number of columns 15
_______________________
Column type frequency:
character 1
numeric 14
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
activity_date 0 1 8 9 0 31 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
id 0 1 4.855407e+09 2.424805e+09 1503960366 2.320127e+09 4.445115e+09 6.962181e+09 8.877689e+09
total_steps 0 1 7.637910e+03 5.087150e+03 0 3.789750e+03 7.405500e+03 1.072700e+04 3.601900e+04
total_distance 0 1 5.490000e+00 3.920000e+00 0 2.620000e+00 5.240000e+00 7.710000e+00 2.803000e+01
tracker_distance 0 1 5.480000e+00 3.910000e+00 0 2.620000e+00 5.240000e+00 7.710000e+00 2.803000e+01
logged_activities_distance 0 1 1.100000e-01 6.200000e-01 0 0.000000e+00 0.000000e+00 0.000000e+00 4.940000e+00
very_active_distance 0 1 1.500000e+00 2.660000e+00 0 0.000000e+00 2.100000e-01 2.050000e+00 2.192000e+01
moderately_active_distance 0 1 5.700000e-01 8.800000e-01 0 0.000000e+00 2.400000e-01 8.000000e-01 6.480000e+00
light_active_distance 0 1 3.340000e+00 2.040000e+00 0 1.950000e+00 3.360000e+00 4.780000e+00 1.071000e+01
sedentary_active_distance 0 1 0.000000e+00 1.000000e-02 0 0.000000e+00 0.000000e+00 0.000000e+00 1.100000e-01
very_active_minutes 0 1 2.116000e+01 3.284000e+01 0 0.000000e+00 4.000000e+00 3.200000e+01 2.100000e+02
fairly_active_minutes 0 1 1.356000e+01 1.999000e+01 0 0.000000e+00 6.000000e+00 1.900000e+01 1.430000e+02
lightly_active_minutes 0 1 1.928100e+02 1.091700e+02 0 1.270000e+02 1.990000e+02 2.640000e+02 5.180000e+02
sedentary_minutes 0 1 9.912100e+02 3.012700e+02 0 7.297500e+02 1.057500e+03 1.229500e+03 1.440000e+03
calories 0 1 2.303610e+03 7.181700e+02 0 1.828500e+03 2.134000e+03 2.793250e+03 4.900000e+03
skim_without_charts(sleepday)
Data summary
Name sleepday
Number of rows 413
Number of columns 5
_______________________
Column type frequency:
character 1
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
sleep_day 0 1 20 21 0 31 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
id 0 1 5.000979e+09 2.06036e+09 1503960366 3977333714 4702921684 6962181067 8792009665
total_sleep_records 0 1 1.120000e+00 3.50000e-01 1 1 1 1 3
total_minutes_asleep 0 1 4.194700e+02 1.18340e+02 58 361 433 490 796
total_time_in_bed 0 1 4.586400e+02 1.27100e+02 61 403 463 526 961
  • Skim results shown that:
    • id column’s class was numeric, it had to be converted into character so it could be used to summarize the other attributes
    • activity_date column in daily_activity object and sleep_day column in sleepday object were character class type not date time and needed to be fixed
#separate date time in sleepday merged
sleepday <- sleepday %>% 
  separate(sleep_day,c("sleep_date","time")," ") 

#remove time column since it's only showing 12 am
sleepday$time <- NULL

#convert activity_date from chr to date
daily_activity$activity_date <- mdy(daily_activity$activity_date)
sleepday$sleep_date<- mdy(sleepday$sleep_date)

#convert id from numeric to character
daily_activity$id <- as.character(daily_activity$id)
sleepday$id <- as.character(sleepday$id)
  • Looked into the how many Ids tracked the data in daily_activity and sleepday
n_distinct(daily_activity$id)
## [1] 33
n_distinct(sleepday$id)
## [1] 24

Analyze

I decided to use ID as the basis to perform the analysis. This is related to the bussiness task thatinquire how people are already using their smart devices. First I looked into the daily_activity.

#group by id to observe what each id doing
da.id <- group_by(daily_activity, id)

#first we look at how many days the FitBit users participated in the data collection process
da.id.datelength <- summarize(da.id, min(activity_date), max(activity_date), length(activity_date))

#rename the column in da.id.datelength to make more sense
da.id.datelength <- rename(da.id.datelength, start_date = "min(activity_date)", end_date = "max(activity_date)", num_of_days = "length(activity_date)")

#count the number of IDs participated until 12th of May
da.id.fullparticipationcount <- count(da.id.datelength, num_of_days == 31)

#Visualize the participation graph
ggplot(data = da.id.datelength)+
  geom_bar(mapping = aes(x=num_of_days, fill=id))+
  ylim(0,30)+
  labs(title="Numbers of IDs by Length of Participations", subtitle="Sample of 33 FitBit Tracker IDs", y="IDs Count", x="Length of Participations", legend="ID")+
  annotate("text", x=28,y=29, label="21 FitBit IDs tracked data", size=2.5)+
  annotate("text", x=28,y=27, label="Since the 12th of April", size=2.5)+
  annotate("text", x=28,y=25, label="Up until the 12th of May", size=2.5)+
  scale_fill_discrete(name="ID")

#21 IDs tracked data up until 12th of may 
#while 12 IDs stopped tracking data before the end of data collection process

#I decided to filter out the 12 IDs that didn't track data until the 12th of May
da.id.complete <- da.id.datelength[,-(2:3)] %>% filter(num_of_days==31)

#I created table for the 21 IDs as the base of further analysis
da.id.base <- da.id.complete [,-2]

Armed with 21 IDs as Primary Keys, I continued my data journey into the data about daily activity. I wanted to see how the devices tracked time and distance related to calories burned.

First, I looked into the aspect of time.

# I filtered for the active minutes 
da.id.active_min_cal <- da.id %>% 
  select(-(2:10)) %>% 
  summarize(sum(very_active_minutes),sum(fairly_active_minutes),sum(lightly_active_minutes),sum(sedentary_minutes),sum(calories))

#Next I joined the 21 IDs as the base to to the active min see how these devices are used by their users
da.id.complete.active_min_cal <- left_join(da.id.base, da.id.active_min_cal, by="id")

#I needed to rename the column header to ease the process in the next step
names(da.id.complete.active_min_cal)[2:6] <- c("very_am_sum","fairly_am_sum","lightly_am_sum","sedentary_m_sum","calories_sum")

#I thought it was necessary to add one column of total_active_minutes consists of summary of all activity classifications minutes
da.id.complete.active_min_cal <- mutate(da.id.complete.active_min_cal, total_minutes = very_am_sum + fairly_am_sum +  lightly_am_sum + sedentary_m_sum)

#Then I visualized the general relationship between the minutes spent and the calories burned
ggplot(data = da.id.complete.active_min_cal)+
  geom_point(mapping = aes(x=total_minutes,y=calories_sum, color=id))+
  scale_y_continuous(name = "Calories Burned", labels=comma)+
  scale_x_continuous(name = "Total Minutes Spent", labels=comma)+
  labs(title="Calories Burned by Total Minutes Spent", subtitle="Sample of 21 FitBit Tracker IDs")

#there are two column with similar content, total distance and tracker distance
#checking whether or not they're filled with the same numerical data
dist.total_minus_tracker <- transmute(da.id, total_minus_tracker = total_distance - tracker_distance)
arrange(dist.total_minus_tracker, desc(total_minus_tracker))
## # A tibble: 940 x 2
## # Groups:   id [33]
##    id         total_minus_tracker
##    <chr>                    <dbl>
##  1 6962181067               1.83 
##  2 7007744171               1.16 
##  3 7007744171               1.16 
##  4 7007744171               1.14 
##  5 7007744171               1.07 
##  6 7007744171               1.06 
##  7 7007744171               1.05 
##  8 7007744171               0.980
##  9 7007744171               0.900
## 10 7007744171               0.880
## # ... with 930 more rows
# Filtered for the active distance
# By this time I had learned to immediately name the new column so I didn't have to rename it later
# I also realized that I don't have to use select(-) to filter out other columns since the summarize function aready doing that by itself
da.id.active_dist_cal <- da.id %>% 
  summarize(very_ad_sum = sum(very_active_distance), fairly_ad_sum = sum(moderately_active_distance), lightly_ad_sum = sum(light_active_distance), sedentary_d_sum = sum(sedentary_active_distance), calories_sum = sum(calories))

#Did they add up?
transmute(da.id, wrong_if_not_zero = (sum(total_distance) - sum(very_active_distance) - sum(moderately_active_distance) - sum(light_active_distance) - sum(sedentary_active_distance)))
## # A tibble: 940 x 2
## # Groups:   id [33]
##    id         wrong_if_not_zero
##    <chr>                  <dbl>
##  1 1503960366             0.130
##  2 1503960366             0.130
##  3 1503960366             0.130
##  4 1503960366             0.130
##  5 1503960366             0.130
##  6 1503960366             0.130
##  7 1503960366             0.130
##  8 1503960366             0.130
##  9 1503960366             0.130
## 10 1503960366             0.130
## # ... with 930 more rows
#Because the total calculation of very_ad_sum + fairly_ad_sum + lightly_ad_sum + sedentary_d_sum in some of the IDs didn't add up
#I decided to not use either the value in total_distance column or tracker_distance column
#Instead, I calculated the sum of each distance column to a new colum called total distance
da.id.active_dist_cal <- mutate(da.id.active_dist_cal, total_distance = very_ad_sum + fairly_ad_sum +  lightly_ad_sum + sedentary_d_sum)

#Then I joined it with the 21 base ID to filter out IDs that stopped tracking data before the 12th of May
da.id.complete.active_dist_cal <- left_join(da.id.base, da.id.active_dist_cal, by="id")

#I visualized the data to look at the relationship between distance tracked and calories burned
ggplot(data = da.id.complete.active_dist_cal)+
  geom_point(mapping = aes(x=total_distance,y=calories_sum, color=id))+
  scale_y_continuous(name = "Calories Burned", labels=comma)+
  scale_x_continuous(name = "Total Distance Reached", labels=comma)+
  labs(title="Calories Burned by Total Distance Reached", subtitle="Sample of 21 FitBit Tracker IDs")

sleepday.id <- sleepday %>% 
  group_by(id) %>% 
  summarise(total_time_in_bed = sum(total_time_in_bed), 
            total_sleep_time = sum(total_minutes_asleep))

#let's join the 21 base id to the sleep data with 24 id
sleepday.id.complete <- left_join(da.id.base, sleepday.id, by="id")  

Share

Conclusion

  1. From 30 people who participated, there are 33 IDs that parsed data, but only 21 IDs kept on tracking activites up until the end date at 12th of May, but only 16 DIs remaining that kept tracking sleep data. This shows that:
  • Some users used more than 1 devices
  • Some users lost interest in tracking their data
  • Tracking sleep data is not as attractive as tracking activity data

  1. The gap shown in the analysis of total minutes spent and calories burn indicate that people are either doing exercises for longer time, or in shorter time. Yet, both approaches were able to deliver similar amount of calories burned.

  1. The linear nature of relationship between the distance tracked and calories burn indicate that users of the devices in this datasets were mainly runners.

Act

Recommendation

  1. Before recommending actions, let us review the products already covered by Bellabeat:

    • Bellabeat App
    • Leaf tracker, worn as bracelet/necklace/clip
    • Time, tracker worn as a watch
    • Spring, smart water bottle
  2. Tracking sleep data is not as attractive as tracking activity data. This is probably due to the nature of wearables that might cause discomfort when users are trying to sleep. Bellabeat might unlock new great opportinity in this path by creating a product that focus on tracking sleep data without causing discomfort or disturb the attention of the user who are trying to sleep. I suggest something along smart mattress or bed or bed frame to track sleep data.

  3. Users were able to burn high amount of calories in either longer or shorter time of activities. But those users are the minority. Meanwhile most users burned relatively low or decent amount of calories in eitther longer or shorter time of activities. Ideally, it would be a great if users can burn high amount of calories in the shortest of time, push them forward to the upper left corner of the scatterplot. Bellabeat can tap into this notion of ideal world by:

  • Delivering insights about the “gap” from each user’s activity (wherever they are in the scatterplot) to the ideal user acttivity (in the upper left corner int the scatterplot) via the Bellabeat app
  • Offering training and curriculum to close that gap by special offering membership to complement/upgrade/replace the already running membership program.
  1. Since the data indicates that probably most users were runners, Bellabeat should forge a path into the running industry, especially for women who run. Since the data also shown that low - mid distance records were higher than the long distance records. We could apply the 2nd recommendation into creating a running program for women who want to empower themselves by becoming strong runners. Bellabeat could work with great marathon runners like Brigid Kosgei, Mao Ichiyama, or Sara Hall to help brand the direction Bellabeat is going to. Combined with the already collected data regarding women fitness, Bellabeat might lead the market in women runners wearables in the future.