Bellabeat Case Study

Bellabeat Case Study: How Can a Wellness Technology Company Play It Smart?

Introduction

Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits. Since it was founded in 2013, Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for women.

Stakeholders

Urška Sršen — Bellabeat’s cofounder and Chief Creative Officer

Sando Mur — Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team

In this case study I will use the following steps for data analysis: Ask, Prepare, Process, Analyze, Share, and Act.

1. Ask

What are some trends in smart device usage?
How could these trends apply to Bellabeat customers?
How could these trends help influence Bellabeat’s marketing strategy?

Business Task

Sršen knows that an analysis of Bellabeat’s available consumer data would reveal more opportunities for growth. She has asked the marketing analytics team to focus on a Bellabeat product and analyze smart device usage data in order to gain insight into how people are already using their smart devices. Then, using this information, she would like high-level recommendations for how these trends can inform Bellabeat marketing strategy.

2. Prepare

The data for this case study comes from the FitBit Fitness Tracker Data set on Kaggle. It was collected from respondents to a survey distributed via Amazon Mechanical Turk between 03/12/2016 and 05/12/2016.

Does the data ROCCC? (Is it Reliable, Original, Comprehensive, Current and Cited?)

Reliable - The data set includes no information on gender, age, or lifestyle, and it has a sample of only 30 participants.

Original - The data comes from a third party (Amazon Mechanical Turk).

Comprehensive - Some fields are missing information because not all users wore their FitBit for the full 30 days.

Current - The data is from 2016. Lifestyles could have changed since then.

Cited - Unknown, as the data comes from a third party.

3. Process

Due to the large size of the collected data, I will be using R for this case study because of its ability to handle large data sets and its data visualization capabilities.

I load these packages to start the data cleaning process.

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.1.2

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1

## Warning: package 'ggplot2' was built under R version 4.1.2

## Warning: package 'tibble' was built under R version 4.1.2

## Warning: package 'tidyr' was built under R version 4.1.2

## Warning: package 'readr' was built under R version 4.1.2

## Warning: package 'purrr' was built under R version 4.1.2

## Warning: package 'dplyr' was built under R version 4.1.2

## Warning: package 'stringr' was built under R version 4.1.2

## Warning: package 'forcats' was built under R version 4.1.2

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(lubridate)

## Warning: package 'lubridate' was built under R version 4.1.2

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(dplyr)
library(ggplot2)
library(janitor)

## Warning: package 'janitor' was built under R version 4.1.2

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(ggpubr)

## Warning: package 'ggpubr' was built under R version 4.1.3

For this analysis, I decide to focus on data sets where the number of participants is consistent.

These .csv files are then loaded:

dailyAct <- read.csv("dailyActivity_merged.csv")
sleepDay <- read.csv("sleepDay_merged.csv")
dailySteps <- read.csv("dailySteps_merged.csv")
dailyCal <- read.csv("dailyCalories_merged.csv")
minsSleep <- read.csv("minuteSleep_merged_datetime.csv")
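
One quick way to confirm that the number of participants is consistent across files is to count the distinct Id values in each data frame. The sketch below is illustrative only: `count_participants` is a hypothetical helper, and the two small data frames are made-up stand-ins for the loaded files, not real FitBit records.

```r
# Hypothetical helper: number of distinct participants in a data frame
# that has an "Id" column (as the FitBit files do).
count_participants <- function(df) {
  length(unique(df$Id))
}

# Made-up stand-ins for two loaded files (not real FitBit records).
toy_activity <- data.frame(Id = c(1, 1, 2, 3), Calories = c(1985, 1797, 2100, 1650))
toy_sleep    <- data.frame(Id = c(1, 2, 2), TotalMinutesAsleep = c(327, 384, 412))

# One count per data frame; differing counts flag an inconsistent sample.
participant_counts <- sapply(
  list(activity = toy_activity, sleep = toy_sleep),
  count_participants
)
print(participant_counts)
```

With the case-study data, the list would hold dailyAct, sleepDay, dailySteps, dailyCal, and minsSleep instead of the toy data frames.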

I familiarize myself with the column names and types of variables.

glimpse(dailyAct)
str(dailyAct)

## Rows: 940
## Columns: 15
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036~
## $ ActivityDate             <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/~
## $ TotalSteps               <int> 13162, 10735, 10460, 9762, 12669, 9705, 13019~
## $ TotalDistance            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5~
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3~
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0~
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveMinutes        <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4~
## $ FairlyActiveMinutes      <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21~
## $ LightlyActiveMinutes     <int> 328, 217, 181, 209, 221, 164, 233, 264, 205, ~
## $ SedentaryMinutes         <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 818~
## $ Calories                 <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203~
## 'data.frame':    940 obs. of  15 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ TotalSteps              : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
##  $ TotalDistance           : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ Calories                : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...

glimpse(dailyCal)
str(dailyCal)

## Rows: 940
## Columns: 3
## $ Id          <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960366~
## $ ActivityDay <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/2016", "4/16/~
## $ Calories    <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2035, 1786, 1775~
## 'data.frame':    940 obs. of  3 variables:
##  $ Id         : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDay: chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ Calories   : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...

glimpse(minsSleep)
str(minsSleep)

## Rows: 188,521
## Columns: 4
## $ Id    <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960366, 1503~
## $ date  <chr> "4/12/2016 2:47:30", "4/12/2016 2:48:30", "4/12/2016 2:49:30", "~
## $ value <int> 3, 2, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 2, 1, 1, 1, 1, 1, 1~
## $ logId <dbl> 11380564589, 11380564589, 11380564589, 11380564589, 11380564589,~
## 'data.frame':    188521 obs. of  4 variables:
##  $ Id   : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ date : chr  "4/12/2016 2:47:30" "4/12/2016 2:48:30" "4/12/2016 2:49:30" "4/12/2016 2:50:30" ...
##  $ value: int  3 2 1 1 1 1 1 2 2 2 ...
##  $ logId: num  1.14e+10 1.14e+10 1.14e+10 1.14e+10 1.14e+10 ...

glimpse(dailySteps)
str(dailySteps)

## Rows: 940
## Columns: 3
## $ Id          <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960366~
## $ ActivityDay <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/2016", "4/16/~
## $ StepTotal   <int> 13162, 10735, 10460, 9762, 12669, 9705, 13019, 15506, 1054~
## 'data.frame':    940 obs. of  3 variables:
##  $ Id         : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDay: chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ StepTotal  : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...

The date columns in some data frames are stored as “chr” type. I convert them to a “date” type so that they are parsed and ordered correctly.

dailyAct <- dailyAct %>%
  rename(Date = ActivityDate) %>%
  mutate(Date = as_date(Date, format = "%m/%d/%Y"))

dailyCal <- dailyCal %>%
  rename(Date = ActivityDay) %>%
  mutate(Date = as_date(Date, format = "%m/%d/%Y"))

dailySteps <- dailySteps %>%
  rename(Date = ActivityDay) %>%
  mutate(Date = as_date(Date, format = "%m/%d/%Y"))

sleepDay <- sleepDay %>%
  rename(Date = SleepDay) %>%
  # as_date() ignores a tz argument, so only the format string is passed
  mutate(Date = as_date(Date, format = "%m/%d/%Y %I:%M:%S %p"))
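
Before converting a whole column, it can help to parse one sample value and confirm that the format string matches the raw text. The sketch below uses base R's `as.Date()` rather than lubridate's `as_date()`. The day-only string comes from the glimpse() output above; the time-stamped string is an assumed shape, inferred from the format string used for sleepDay.

```r
# Sample strings in the two layouts used by the FitBit files.
day_only  <- "4/12/2016"
with_time <- "4/12/2016 12:00:00 AM"  # assumed shape, based on the sleepDay format string

# Parse each with its matching format.
parsed_day  <- as.Date(day_only,  format = "%m/%d/%Y")
parsed_time <- as.Date(with_time, format = "%m/%d/%Y %I:%M:%S %p")

print(parsed_day)
print(parsed_time)

# A format that does not match the text returns NA instead of an error,
# so checking for NA after conversion catches silent failures.
bad_parse <- as.Date(day_only, format = "%Y-%m-%d")
print(is.na(bad_parse))
```

Running `sum(is.na(dailyAct$Date))` after the conversion would apply the same check to a full column.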

Now I check for duplicates in the data. The only data frame that had any duplicates was my “sleepDay” data frame.

sum(duplicated(sleepDay))

## [1] 3
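
The same check can be run over every data frame in one pass. This is a minimal sketch with a hypothetical `count_duplicates` helper and made-up stand-in data (not real FitBit records); one of the toy data frames deliberately repeats a row.

```r
# Hypothetical helper: count fully duplicated rows in a data frame.
count_duplicates <- function(df) {
  sum(duplicated(df))
}

# Made-up stand-ins (not real FitBit records); toy_sleep repeats one row.
toy_steps <- data.frame(Id = c(1, 1, 2), StepTotal = c(13162, 10735, 9762))
toy_sleep <- data.frame(Id = c(1, 1, 2), TotalMinutesAsleep = c(327, 327, 384))

# One duplicate count per data frame.
duplicate_counts <- sapply(
  list(steps = toy_steps, sleep = toy_sleep),
  count_duplicates
)
print(duplicate_counts)
```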

I remove the duplicates from the “sleepDay” data frame, along with any rows that contain missing values, so that repeated records don’t skew the results.

sleepDay <- sleepDay %>% 
  distinct() %>% 
  drop_na()

Checking to see if duplicates were removed.

sum(duplicated(sleepDay))

## [1] 0

4. Analyze

Now that the data is cleaned, I start seeking initial insights from the data.

I merge the “dailyAct” and “sleepDay” data frames using the “Id” and “Date” columns to make plotting the data more streamlined.

allActivity <- merge(dailyAct, sleepDay, by = c("Id", "Date"), all = TRUE)
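
Because `all = TRUE` keeps every row from both data frames, days that have activity data but no sleep record end up with NA in the sleep columns. That is a likely source of the “Removed ... rows” warnings in the sleep plots later on, though the sketch below only demonstrates the mechanism on made-up data.

```r
# Made-up stand-ins (not real FitBit records).
toy_activity <- data.frame(
  Id = c(1, 1, 2),
  Date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  Calories = c(1985, 1797, 2100)
)
toy_sleep <- data.frame(
  Id = 1,
  Date = as.Date("2016-04-12"),
  TotalMinutesAsleep = 327
)

# Full outer join, as in the merge above: unmatched rows get NA.
toy_merged <- merge(toy_activity, toy_sleep, by = c("Id", "Date"), all = TRUE)

# Number of activity days without a matching sleep record.
missing_sleep <- sum(is.na(toy_merged$TotalMinutesAsleep))
print(missing_sleep)
```

On the real data, `sum(is.na(allActivity$TotalMinutesAsleep))` gives the corresponding count.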

Adding a “Weekday” column lets me group and compare the records by day of the week.

allActivity <- allActivity %>% 
  # Date is already a date type, so weekdays() can use it directly
  mutate(Weekday = weekdays(Date))

allActivity$Weekday <- factor(allActivity$Weekday,
  levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))

I do quick plots to see some relations in the data.

I check how much distance users track with their FitBit on each day of the week, as a rough proxy for usage.

ggplot(data = allActivity, aes( x = Weekday, y = TrackerDistance, fill = Weekday)) + 
  geom_bar(stat = "identity")

It looks like there is a decline in usage as the week goes on. Some users are not using their FitBit throughout the week.
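
Summed distance mixes how far people moved with how many records exist for each day. A more direct look at usage is to count the records logged per weekday. This is a minimal sketch on made-up dates, using a hypothetical `records_per_weekday` helper; it derives the weekday from the ISO day number so that the result does not depend on the system locale.

```r
# Hypothetical helper: count records per weekday, Monday first.
records_per_weekday <- function(dates) {
  day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday")
  # "%u" gives the ISO weekday number: 1 = Monday ... 7 = Sunday.
  day_number <- as.integer(format(dates, "%u"))
  table(factor(day_names[day_number], levels = day_names))
}

# Made-up dates (not real FitBit records); 2016-04-12 was a Tuesday.
toy_dates <- as.Date(c("2016-04-12", "2016-04-12", "2016-04-13", "2016-04-17"))
weekday_counts <- records_per_weekday(toy_dates)
print(weekday_counts)
```

With the case-study data, the call would be `records_per_weekday(allActivity$Date)`.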

Next, I look at the relationship between calories burned and non-sedentary activity.

# x sums only the three active-minute columns, matching the axis label
ggplot(data = allActivity, aes(x = VeryActiveMinutes + FairlyActiveMinutes + LightlyActiveMinutes, y = Calories, color = Calories)) + 
  scale_color_gradient(low = "yellow", high = "blue") +
  geom_point() + stat_smooth(method = lm) + 
  labs(title = "Active Minutes and Calories Burned", x = "Active Minutes Combined (Very Active + Fairly Active + Lightly Active)")

## `geom_smooth()` using formula 'y ~ x'

There’s a clear positive correlation: the more active users were, the more calories they burned.
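
The line drawn by `stat_smooth(method = lm)` is an ordinary least-squares fit, and the same fit can be inspected directly with `lm()`. The sketch below uses made-up numbers and a hypothetical `ActiveMinutes` column standing in for the combined active minutes; a positive slope means more active minutes go with more calories burned.

```r
# Made-up values (not real FitBit records).
toy <- data.frame(
  ActiveMinutes = c(120, 180, 240, 300, 360),
  Calories      = c(1700, 1900, 2150, 2300, 2600)
)

# Same kind of model that stat_smooth(method = lm) fits: Calories ~ ActiveMinutes.
fit <- lm(Calories ~ ActiveMinutes, data = toy)

# Estimated extra calories per additional active minute.
slope <- unname(coef(fit)["ActiveMinutes"])
print(slope)
```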

Now I choose to focus on sleep patterns. Here I compare the time users spent in bed vs. their actual sleep time.

ggplot(data = allActivity, aes(x = TotalMinutesAsleep, y = TotalTimeInBed, color = Weekday)) +
  geom_point()

## Warning: Removed 530 rows containing missing values (geom_point).

It seems that some users are spending time in bed without actually being asleep. Sunday looks like the day when users most often spend time in bed without sleeping.
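
One way to check the Sunday impression numerically is to compute the minutes spent awake in bed (time in bed minus time asleep) and average them by weekday. This is a sketch on made-up rows; `MinutesAwakeInBed` is a hypothetical column name that does not exist in the data set.

```r
# Made-up rows (not real FitBit records).
toy <- data.frame(
  Weekday            = c("Sunday", "Sunday", "Monday", "Monday"),
  TotalMinutesAsleep = c(400, 420, 380, 410),
  TotalTimeInBed     = c(460, 470, 400, 425)
)

# Hypothetical derived column: time in bed that was not spent asleep.
toy$MinutesAwakeInBed <- toy$TotalTimeInBed - toy$TotalMinutesAsleep

# Mean awake-in-bed minutes per weekday; na.rm skips days with no sleep record.
awake_by_day <- tapply(toy$MinutesAwakeInBed, toy$Weekday, mean, na.rm = TRUE)
print(awake_by_day)
```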

With these two charts plotted, I want to see if there’s a correlation between calories burned and time asleep. I plot “Time Asleep” vs. “Calories Burned”, converting minutes into hours to make the sleep times easier to read.

ggplot(data = allActivity, aes(x = Calories, y = TotalMinutesAsleep / 60, color = Calories)) +
  scale_color_gradient(low = "yellow", high = "blue") +
  geom_point() + 
  stat_smooth(method = lm) +
  labs(title = "Total Hours Asleep vs. Calories Burned", y = "Hours Asleep", x = "Calories Burned")

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 530 rows containing non-finite values (stat_smooth).

## Warning: Removed 530 rows containing missing values (geom_point).

It looks like users slept a similar amount no matter how many calories they burned, so there is no apparent correlation.
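
The visual impression can be backed by a correlation coefficient. Because the merged data has rows with no sleep record, `cor()` needs `use = "complete.obs"` to skip them. This is a sketch on made-up values; a coefficient near 0 would support the reading that sleep time and calories burned are not linearly related.

```r
# Made-up values (not real FitBit records); NA marks days with no sleep record.
toy <- data.frame(
  Calories           = c(1800, 2000, 2200, 2400, 2600, 2800),
  TotalMinutesAsleep = c(420, NA, 400, 430, NA, 410)
)

# Pearson correlation over the rows where both values are present.
sleep_cal_cor <- cor(toy$Calories, toy$TotalMinutesAsleep, use = "complete.obs")
print(sleep_cal_cor)
```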

5. Share

Let’s gather some insights to draw conclusions for our stakeholders.

These plots will make it easy to compare different trends in the data.

tma <- ggplot(data = allActivity, aes(TotalMinutesAsleep / 60, SedentaryMinutes, color = TotalMinutesAsleep)) + 
  geom_point() +
  scale_color_gradient(low = "turquoise", high = "purple") +
  labs(title = "Sedentary Minutes vs Hours Asleep", y = "Sedentary Minutes", x = "Hours Asleep")

ttib <- ggplot(data = allActivity, aes(TotalTimeInBed / 60, SedentaryMinutes, color = TotalTimeInBed)) + 
  geom_point() + 
  scale_color_gradient(low = "turquoise", high = "purple") +
  labs(title = "Sedentary Minutes vs Hours in Bed", y = "Sedentary Minutes", x = "Hours in Bed")

ggarrange(ttib, tma, ncol = 2, nrow = 1)

## Warning: Removed 530 rows containing missing values (geom_point).
## Removed 530 rows containing missing values (geom_point).

There’s a slight uptick in the number of sedentary users spending more time awake in bed.

Now I do plots to see which FitBit user types are sleeping the most.

sms <- ggplot(data = allActivity, aes(TotalMinutesAsleep, SedentaryMinutes, color = TotalMinutesAsleep)) + 
  geom_point() +
  scale_color_gradient(low = "pink", high = "blue")

tss <- ggplot(data = allActivity, aes(TotalMinutesAsleep, TotalSteps, color = TotalMinutesAsleep)) + 
  geom_point() + 
  scale_color_gradient(low = "pink", high = "blue")

tds <- ggplot(data = allActivity, aes(TotalMinutesAsleep, TotalDistance, color = TotalMinutesAsleep)) + 
  geom_point() + 
  scale_color_gradient(low = "pink", high = "blue")

vas <- ggplot(data = allActivity, aes(TotalMinutesAsleep, VeryActiveMinutes, color = TotalMinutesAsleep)) + 
  geom_point() + 
  scale_color_gradient(low = "pink", high = "blue")

ggarrange(sms, tss, tds, vas, ncol = 2, nrow = 2)

## Warning: Removed 530 rows containing missing values (geom_point).
## Removed 530 rows containing missing values (geom_point).
## Removed 530 rows containing missing values (geom_point).
## Removed 530 rows containing missing values (geom_point).

Users who are more sedentary get less sleep than users who are more active, despite spending more time in bed (based on the previous chart).

I compare the different activity types and how many calories users burn with each.

smc <- ggplot(data = allActivity, aes(SedentaryMinutes, Calories, color = Calories)) + 
  geom_point() + 
  scale_color_gradient(low = "yellow", high = "blue")

tsc <- ggplot(data = allActivity, aes(TotalSteps, Calories, color = Calories)) + 
  geom_point() + 
  scale_color_gradient(low = "yellow", high = "blue")

tdc <- ggplot(data = allActivity, aes(TotalDistance, Calories, color = Calories)) + 
  geom_point() +  
  scale_color_gradient(low = "yellow", high = "blue")

vac <- ggplot(data = allActivity, aes(VeryActiveMinutes, Calories, color = Calories)) + 
  geom_point() + scale_color_gradient(low = "yellow", high = "blue")

ggarrange(smc,tsc, tdc, vac, ncol = 2, nrow = 2)

This series of plots shows that users who are active burn more calories than users who are sedentary.

Let’s look at how people are spending the majority of their time. I break down the minutes of activity into percentages in order to make a pie chart.

dataPercent <- allActivity %>%
  summarise(
    # Total tracked minutes, computed from the data instead of hard-coded
    sum_total = sum(VeryActiveMinutes + FairlyActiveMinutes + LightlyActiveMinutes + SedentaryMinutes),
    sum_fa = sum(FairlyActiveMinutes) / sum_total * 100,
    sum_va = sum(VeryActiveMinutes) / sum_total * 100,
    sum_la = sum(LightlyActiveMinutes) / sum_total * 100,
    sum_se = sum(SedentaryMinutes) / sum_total * 100) %>% 
  round(digits = 2)

# One slice per activity type, in the same order as the labels
piechart <- c(dataPercent$sum_va, dataPercent$sum_fa, dataPercent$sum_la, dataPercent$sum_se)

lbls <- c("Very Active", "Fairly Active", "Lightly Active", "Sedentary")

pie(piechart,
    labels = lbls,
    col = c("pink", "turquoise", "orange", "purple"),
    main = "Minutes of Activity")

Sedentary time makes up by far the largest share of users’ tracked minutes. A large share is to be expected since people still need to rest, but it’s disproportionately larger than the other activity types.
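
To make the shares readable on the chart itself, the percentage can be appended to each label. This is a sketch with made-up minute totals (not the values from the data set); `toy_minutes`, `toy_percent`, and `pie_labels` are hypothetical names.

```r
lbls <- c("Very Active", "Fairly Active", "Lightly Active", "Sedentary")

# Made-up minute totals per activity type, in the same order as lbls.
toy_minutes <- c(20, 15, 190, 775)

# Share of each activity type as a percentage of all minutes.
toy_percent <- round(toy_minutes / sum(toy_minutes) * 100, 1)

# Label text such as "Sedentary (77.5%)".
pie_labels <- paste0(lbls, " (", toy_percent, "%)")
print(pie_labels)
```

The resulting vector can be passed to `pie()` through its `labels` argument.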

6. Act

Let’s gather our conclusions and help our stakeholders make data-driven decisions based on the insights we gained from our analysis.

Observations and recommendations:

1. FitBit users aren’t consistently wearing their FitBit:

The Bellabeat app can send notifications reminding users to wear their device so that the data is as accurate as possible.

The app can have a feature where users earn milestones for the length of time they wear the device.

Add a feature that can allow users to connect with friends to compare wear times/steps.

2. Users are spending a lot of time sedentary, even though the FitBit is a device for fitness and health. The CDC recommends 150 minutes of physical activity a week, which is roughly 30 minutes a day, five days a week:

Have reminders set up if a user is sedentary for an extended period of time.

If a user cannot be active for any reason, have a feature in the app where you can tell the app you temporarily cannot be active so notifications don’t become problematic.

Add a feature to the app where the user can add their interests so then the app can suggest fun activities like sports, hiking, skating, etc.
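
To gauge how many users fall short of the CDC guideline, each user’s average daily active minutes can be compared with a daily target. Spread evenly over seven days, 150 minutes a week works out to about 21.4 minutes a day. The sketch below uses made-up rows and assumes that very active plus fairly active minutes are a reasonable stand-in for the activity the guideline counts; that mapping is an assumption, not something the data set states.

```r
# Made-up rows (not real FitBit records).
toy <- data.frame(
  Id                  = c(1, 1, 2, 2),
  VeryActiveMinutes   = c(25, 21, 0, 5),
  FairlyActiveMinutes = c(13, 19, 4, 6)
)

# 150 minutes a week spread evenly across seven days.
daily_target <- 150 / 7

# Assumed stand-in for guideline activity: very active + fairly active minutes.
toy$ActiveMinutes <- toy$VeryActiveMinutes + toy$FairlyActiveMinutes

# Average daily active minutes per user, compared with the target.
mean_active  <- tapply(toy$ActiveMinutes, toy$Id, mean)
meets_target <- mean_active >= daily_target
print(meets_target)
```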

3. Users are spending time in bed without actually sleeping. This may be because they browse content on their smartphones while in bed, and blue light from screens can negatively affect sleep cycles:

Have a feature where users can input their bedtime. The app can then track their screen time usage during this time and send notifications to remind them to reduce screen time in bed.
The app can have a section where sleeping tips and facts are posted for users that have trouble falling and staying asleep.

Marketing Suggestions:

Bellabeat can partner with a fitness clothing brand to offer coupon codes, so users feel more inclined to buy workout clothes and become more active.
Users can set their own fitness goals in increments and choose rewards from a rewards store featuring deals from business partners.
Bellabeat can sponsor fitness influencers to showcase easy and quick workouts for users.
Bellabeat can advertise in office-heavy areas, since office workers are more likely to be sedentary throughout the day. Sitting for extended periods of time can lead to health issues.
Offer a discount on the first Bellabeat purchase and a 60-day trial period.

Daphane Love

5.17.22