Case Study 2 - How Can a Wellness Technology Company Play it Smart?

Roadmap

1. Ask

2. Prepare

3. Process

4. Analyze

6. Act

Introduction:

Bellabeat is a succesful small company that manufactures health-focused smart products. Bellabeat products empowered women with knowledge regarding their own health and habits. They have the potential to become a larger player in the global smart device market. In this case study, I analyzed FitBit Fitness Tracked Data to help find insights and recommendations that might help unlock new growth opportunities.
***

Ask

The Business Task
- Focus on a Bellabeat product and analyze smart device usage data in order to gain insight into how people are already using their smart devices
- Using this information, offer high-level recommendations for how these trends can inform Bellabeat marketing strategy to unlock new growth opportunities for the company
Key Stakeholders
- Urška Sršen: Bellabeat’s cofounder and Chief Creative Oﬃcer Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team
- Bellabeat marketing analytics team
- Customers with membership
- Customers
- Potential Customers, e.g., users of other brand smart device products, people interested in improving their health and well being by using smart devices, people interested in looking cool and sporty, people interested in quantifying their habits into data.

Prepare

This case study used the FitBit Fitness Tracked Data posted in Kaggle. By looking into the data we could get several takeaways:

The datasets were generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016
Thirty eligible FitBit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring
Individual reports can be parsed by export session ID or timestamp
Variation between output represents use of different types of FitBit trackers and individual tracking behaviors / preferences
The data license is CC0: Public Domain
It was sourced from https://zenodo.org/record/53894#.X9oeh3Uzaao and ackowledged to Furberg, Robert; Brinton, Julia; Keating, Michael ; Ortiz, Alexa
Posted to Kaggle by user named Möbius at 2020-12-16
The data had been preprocessed
Based on Kaggle’s evaluation, the data usability score is 10 out of 10, with regards to followings traits:
- Easy to understand and includes essential metadata
- Rich, Rich, machine readable file formats and metadata
- Assurances the dataset is maintained
The datasets consist of 18 files, but I will study mainly on dailyActivity_merged.csv file and sleepDay_merged.csv as supporting data. The reasons are:
- I will not dive deep into the details in datasets based on hourly, minutes, and seconds to limit the scope of analysis
- I will not use weightlog since only 8 IDs registered weight, while there are 33 IDs in total
- dailyActivity_merged.csv and sleepDay_merged.csv are already merged versions of several other datasets
The number of IDs tracking data in daily_activity is 33
The number of IDs tracking data in sleepday is 24

Process

The cleaning and manipulation I had done to verify and add data into the informations presented in the Prepare phase before is:

Setting up the environment in R Studio

#fundamental building blocks to support the entire end-to-end workflow
library(tidyverse)
#to provide summary statistics
library(skimr)
#examining and cleaning dirty data
library(janitor)
#Lubridate makes it easier to do the things R does with date-times 
library(lubridate)
#Scales to better state the numbers shown in visualization
library(scales)

Importing the Dataset by using RStudio menu “Import Dataset From Text (readr)…”

setwd("~/R/Case Study 2/Fitabase Data 4.12.16-5.12.16")
dailyActivity_merged <- read_csv("dailyActivity_merged.csv")
sleepDay_merged <- read_csv("sleepDay_merged.csv")

Cleaned and rename the data

daily_activity <- clean_names(dailyActivity_merged)
sleepday <- clean_names(sleepDay_merged)

Used Skimr package to get the general sense of the data

skim_without_charts(daily_activity)

Data summary
Name	daily_activity
Number of rows	940
Number of columns	15
_______________________
Column type frequency:
character	1
numeric	14
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
activity_date	0	1	8	9	0	31	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100
id	1	4.855407e+09	2.424805e+09	1503960366	2.320127e+09	4.445115e+09	6.962181e+09	8.877689e+09
total_steps	1	7.637910e+03	5.087150e+03	0	3.789750e+03	7.405500e+03	1.072700e+04	3.601900e+04
total_distance	1	5.490000e+00	3.920000e+00	0	2.620000e+00	5.240000e+00	7.710000e+00	2.803000e+01
tracker_distance	1	5.480000e+00	3.910000e+00	0	2.620000e+00	5.240000e+00	7.710000e+00	2.803000e+01
logged_activities_distance	1	1.100000e-01	6.200000e-01	0	0.000000e+00	0.000000e+00	0.000000e+00	4.940000e+00
very_active_distance	1	1.500000e+00	2.660000e+00	0	0.000000e+00	2.100000e-01	2.050000e+00	2.192000e+01
moderately_active_distance	1	5.700000e-01	8.800000e-01	0	0.000000e+00	2.400000e-01	8.000000e-01	6.480000e+00
light_active_distance	1	3.340000e+00	2.040000e+00	0	1.950000e+00	3.360000e+00	4.780000e+00	1.071000e+01
sedentary_active_distance	1	0.000000e+00	1.000000e-02	0	0.000000e+00	0.000000e+00	0.000000e+00	1.100000e-01
very_active_minutes	1	2.116000e+01	3.284000e+01	0	0.000000e+00	4.000000e+00	3.200000e+01	2.100000e+02
fairly_active_minutes	1	1.356000e+01	1.999000e+01	0	0.000000e+00	6.000000e+00	1.900000e+01	1.430000e+02
lightly_active_minutes	1	1.928100e+02	1.091700e+02	0	1.270000e+02	1.990000e+02	2.640000e+02	5.180000e+02
sedentary_minutes	1	9.912100e+02	3.012700e+02	0	7.297500e+02	1.057500e+03	1.229500e+03	1.440000e+03
calories	1	2.303610e+03	7.181700e+02	0	1.828500e+03	2.134000e+03	2.793250e+03	4.900000e+03

skim_without_charts(sleepday)

Data summary
Name	sleepday
Number of rows	413
Number of columns	5
_______________________
Column type frequency:
character	1
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
sleep_day	0	1	20	21	0	31	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100
id	1	5.000979e+09	2.06036e+09	1503960366	3977333714	4702921684	6962181067	8792009665
total_sleep_records	1	1.120000e+00	3.50000e-01	1	1	1	1	3
total_minutes_asleep	1	4.194700e+02	1.18340e+02	58	361	433	490	796
total_time_in_bed	1	4.586400e+02	1.27100e+02	61	403	463	526	961

Skim results shown that:
- id column’s class was numeric, it had to be converted into character so it could be used to summarize the other attributes
- activity_date column in daily_activity object and sleep_day column in sleepday object were character class type not date time and needed to be fixed

#separate date time in sleepday merged
sleepday <- sleepday %>% 
  separate(sleep_day,c("sleep_date","time")," ") 

#remove time column since it's only showing 12 am
sleepday$time <- NULL

#convert activity_date from chr to date
daily_activity$activity_date <- mdy(daily_activity$activity_date)
sleepday$sleep_date<- mdy(sleepday$sleep_date)

#convert id from numeric to character
daily_activity$id <- as.character(daily_activity$id)
sleepday$id <- as.character(sleepday$id)

Looked into the how many Ids tracked the data in daily_activity and sleepday

n_distinct(daily_activity$id)

## [1] 33

n_distinct(sleepday$id)

## [1] 24

Analyze

I decided to use ID as the basis to perform the analysis. This is related to the bussiness task thatinquire how people are already using their smart devices. First I looked into the daily_activity.

#group by id to observe what each id doing
da.id <- group_by(daily_activity, id)

#first we look at how many days the FitBit users participated in the data collection process
da.id.datelength <- summarize(da.id, min(activity_date), max(activity_date), length(activity_date))

#rename the column in da.id.datelength to make more sense
da.id.datelength <- rename(da.id.datelength, start_date = "min(activity_date)", end_date = "max(activity_date)", num_of_days = "length(activity_date)")

#count the number of IDs participated until 12th of May
da.id.fullparticipationcount <- count(da.id.datelength, num_of_days == 31)

#Visualize the participation graph
ggplot(data = da.id.datelength)+
  geom_bar(mapping = aes(x=num_of_days, fill=id))+
  ylim(0,30)+
  labs(title="Numbers of IDs by Length of Participations", subtitle="Sample of 33 FitBit Tracker IDs", y="IDs Count", x="Length of Participations", legend="ID")+
  annotate("text", x=28,y=29, label="21 FitBit IDs tracked data", size=2.5)+
  annotate("text", x=28,y=27, label="Since the 12th of April", size=2.5)+
  annotate("text", x=28,y=25, label="Up until the 12th of May", size=2.5)+
  scale_fill_discrete(name="ID")

#21 IDs tracked data up until 12th of may 
#while 12 IDs stopped tracking data before the end of data collection process

#I decided to filter out the 12 IDs that didn't track data until the 12th of May
da.id.complete <- da.id.datelength[,-(2:3)] %>% filter(num_of_days==31)

#I created table for the 21 IDs as the base of further analysis
da.id.base <- da.id.complete [,-2]

Armed with 21 IDs as Primary Keys, I continued my data journey into the data about daily activity. I wanted to see how the devices tracked time and distance related to calories burned.

First, I looked into the aspect of time.

# I filtered for the active minutes 
da.id.active_min_cal <- da.id %>% 
  select(-(2:10)) %>% 
  summarize(sum(very_active_minutes),sum(fairly_active_minutes),sum(lightly_active_minutes),sum(sedentary_minutes),sum(calories))

#Next I joined the 21 IDs as the base to to the active min see how these devices are used by their users
da.id.complete.active_min_cal <- left_join(da.id.base, da.id.active_min_cal, by="id")

#I needed to rename the column header to ease the process in the next step
names(da.id.complete.active_min_cal)[2:6] <- c("very_am_sum","fairly_am_sum","lightly_am_sum","sedentary_m_sum","calories_sum")

#I thought it was necessary to add one column of total_active_minutes consists of summary of all activity classifications minutes
da.id.complete.active_min_cal <- mutate(da.id.complete.active_min_cal, total_minutes = very_am_sum + fairly_am_sum +  lightly_am_sum + sedentary_m_sum)

#Then I visualized the general relationship between the minutes spent and the calories burned
ggplot(data = da.id.complete.active_min_cal)+
  geom_point(mapping = aes(x=total_minutes,y=calories_sum, color=id))+
  scale_y_continuous(name = "Calories Burned", labels=comma)+
  scale_x_continuous(name = "Total Minutes Spent", labels=comma)+
  labs(title="Calories Burned by Total Minutes Spent", subtitle="Sample of 21 FitBit Tracker IDs")

From this general look of relationship beetween total minutes spent and calories burned we could see a gap in the middle of the graph, while the plots are either clustered in the right side or to the left side.
- Plot cluster in the right side represent devices that tracked higher amount of minutes spent, and spread quite evenly in vertical manner which to say that some users waste a lot of time burning relatively small amount of calories, while some other burned relatively high amount of calories along with the long time spent, but the id trackers mainly sit in the middle, showing that most users spent a lot of time to burn relatively decent amount of calories.
- Plot cluster on the left side represent devices that tracked lower amount of minutes spent.
  - The ID devicess in most left cluster mostly sit bellow in the far left corner spending relatively short of time to burn relatively low amount of calories, meanwhile in the far left upper-corner, a single device tracked a high amount of calories burned in relatively short of time.
  - Still in the left cluster but closer to the middle, we observed several devices that tracked relatively shorter time while burning relatively decent amount of calories.
Now that we observed how times were spent by the users and tracked by the devices, we could take a peek at how distances were tracked by the IDs in the data.

#there are two column with similar content, total distance and tracker distance
#checking whether or not they're filled with the same numerical data
dist.total_minus_tracker <- transmute(da.id, total_minus_tracker = total_distance - tracker_distance)
arrange(dist.total_minus_tracker, desc(total_minus_tracker))

## # A tibble: 940 x 2
## # Groups:   id [33]
##    id         total_minus_tracker
##    <chr>                    <dbl>
##  1 6962181067               1.83 
##  2 7007744171               1.16 
##  3 7007744171               1.16 
##  4 7007744171               1.14 
##  5 7007744171               1.07 
##  6 7007744171               1.06 
##  7 7007744171               1.05 
##  8 7007744171               0.980
##  9 7007744171               0.900
## 10 7007744171               0.880
## # ... with 930 more rows

# Filtered for the active distance
# By this time I had learned to immediately name the new column so I didn't have to rename it later
# I also realized that I don't have to use select(-) to filter out other columns since the summarize function aready doing that by itself
da.id.active_dist_cal <- da.id %>% 
  summarize(very_ad_sum = sum(very_active_distance), fairly_ad_sum = sum(moderately_active_distance), lightly_ad_sum = sum(light_active_distance), sedentary_d_sum = sum(sedentary_active_distance), calories_sum = sum(calories))

#Did they add up?
transmute(da.id, wrong_if_not_zero = (sum(total_distance) - sum(very_active_distance) - sum(moderately_active_distance) - sum(light_active_distance) - sum(sedentary_active_distance)))

## # A tibble: 940 x 2
## # Groups:   id [33]
##    id         wrong_if_not_zero
##    <chr>                  <dbl>
##  1 1503960366             0.130
##  2 1503960366             0.130
##  3 1503960366             0.130
##  4 1503960366             0.130
##  5 1503960366             0.130
##  6 1503960366             0.130
##  7 1503960366             0.130
##  8 1503960366             0.130
##  9 1503960366             0.130
## 10 1503960366             0.130
## # ... with 930 more rows

#Because the total calculation of very_ad_sum + fairly_ad_sum + lightly_ad_sum + sedentary_d_sum in some of the IDs didn't add up
#I decided to not use either the value in total_distance column or tracker_distance column
#Instead, I calculated the sum of each distance column to a new colum called total distance
da.id.active_dist_cal <- mutate(da.id.active_dist_cal, total_distance = very_ad_sum + fairly_ad_sum +  lightly_ad_sum + sedentary_d_sum)

#Then I joined it with the 21 base ID to filter out IDs that stopped tracking data before the 12th of May
da.id.complete.active_dist_cal <- left_join(da.id.base, da.id.active_dist_cal, by="id")

#I visualized the data to look at the relationship between distance tracked and calories burned
ggplot(data = da.id.complete.active_dist_cal)+
  geom_point(mapping = aes(x=total_distance,y=calories_sum, color=id))+
  scale_y_continuous(name = "Calories Burned", labels=comma)+
  scale_x_continuous(name = "Total Distance Reached", labels=comma)+
  labs(title="Calories Burned by Total Distance Reached", subtitle="Sample of 21 FitBit Tracker IDs")

From this general look of relationship between total distance tracked and calories burned we could see that the plots are lined up in a loose manner, as distance increased so did the calories burned.
Next, I delved into the sleepday data, but there was a finding that affected my analysis decision.

sleepday.id <- sleepday %>% 
  group_by(id) %>% 
  summarise(total_time_in_bed = sum(total_time_in_bed), 
            total_sleep_time = sum(total_minutes_asleep))

#let's join the 21 base id to the sleep data with 24 id
sleepday.id.complete <- left_join(da.id.base, sleepday.id, by="id")

There were several NA in the observations when I left joined the da.id.base to the sleepday.id. This led to conclusion that NOT ALL devices that kept tracking activities data until 12th of May also kept tracking on sleep data. I decided not to use the sleep data further. But from this, we could say that tracking sleeping data is not as attractive to users than tracking activities.

Act

Recommendation

Before recommending actions, let us review the products already covered by Bellabeat:
- Bellabeat App
- Leaf tracker, worn as bracelet/necklace/clip
- Time, tracker worn as a watch
- Spring, smart water bottle
Tracking sleep data is not as attractive as tracking activity data. This is probably due to the nature of wearables that might cause discomfort when users are trying to sleep. Bellabeat might unlock new great opportinity in this path by creating a product that focus on tracking sleep data without causing discomfort or disturb the attention of the user who are trying to sleep. I suggest something along smart mattress or bed or bed frame to track sleep data.
Users were able to burn high amount of calories in either longer or shorter time of activities. But those users are the minority. Meanwhile most users burned relatively low or decent amount of calories in eitther longer or shorter time of activities. Ideally, it would be a great if users can burn high amount of calories in the shortest of time, push them forward to the upper left corner of the scatterplot. Bellabeat can tap into this notion of ideal world by:

Delivering insights about the “gap” from each user’s activity (wherever they are in the scatterplot) to the ideal user acttivity (in the upper left corner int the scatterplot) via the Bellabeat app
Offering training and curriculum to close that gap by special offering membership to complement/upgrade/replace the already running membership program.

Since the data indicates that probably most users were runners, Bellabeat should forge a path into the running industry, especially for women who run. Since the data also shown that low - mid distance records were higher than the long distance records. We could apply the 2nd recommendation into creating a running program for women who want to empower themselves by becoming strong runners. Bellabeat could work with great marathon runners like Brigid Kosgei, Mao Ichiyama, or Sara Hall to help brand the direction Bellabeat is going to. Combined with the already collected data regarding women fitness, Bellabeat might lead the market in women runners wearables in the future.

Case Study 2 - How Can a Wellness Technology Company Play it Smart?

Sasmito Y. H.

9/19/2021

Roadmap

1. Ask

2. Prepare

3. Process

4. Analyze

6. Act

Introduction:

Ask

Prepare

Process

The cleaning and manipulation I had done to verify and add data into the informations presented in the Prepare phase before is:

Analyze

Act

Recommendation

Case Study 2 - How Can a Wellness Technology Company Play it Smart?

Sasmito Y. H.

9/19/2021

Roadmap

1. Ask

2. Prepare

3. Process

4. Analyze

5. Share

6. Act

Introduction:

Ask

Prepare

Process

The cleaning and manipulation I had done to verify and add data into the informations presented in the Prepare phase before is:

Analyze

Share

Conclusion

Act

Recommendation