Bellabeat Case Study

Background

Bellabeat, co-founded by Urška Sršen, is a health-focused smart device manufacturer based in San Francisco. Focused on female empowerment, Bellabeat collects data on female activity, sleep, stress, and reproductive health. Multiple products have been launched since 2013, including the Bellabeat app, Leaf fitness tracker, Time wellness watch, Spring smart water bottle, and a Bellabeat membership that gives members personalized health guidance.

Ask

Business Objective

Urška Sršen has expressed interest in analyzing smart device data with the goal of discovering growth opportunities for the company. This analysis will focus on smart device data from outside Bellabeat to see how consumers use their health tech products, in hopes of improving Bellabeat’s marketing strategy.

Key Stakeholders

Urška Sršen: Bellabeat Co-founder and Chief Creative Officer
Sando Mur: Bellabeat Co-founder and mathematician; key member of Bellabeat’s executive team
Bellabeat Marketing Analytics Team: Responsible for collecting, analyzing, and reporting data to help guide marketing strategy

Analysis Questions

What are some trends in smart device usage?
How could these trends apply to Bellabeat customers?
How could these trends help influence Bellabeat marketing strategy?

Deliverables

Documentation of cleaning and manipulation of data
Supporting visualizations with key findings
Summary of the analysis
Recommendations based on the analysis

Prepare

Data Used

This analysis uses data from the Fitbit Fitness Tracker public data set, which includes health data from thirty Fitbit users over a one-month period (April 12 to May 12, 2016). The data includes sleep, total steps, calories, tracker distance, active minutes by intensity, weight, and heart rate. To improve Bellabeat marketing strategy, I will be focusing on calories, activity intensity, total steps, and sleep.

Data Ethics

After examining this data set, some limitations can be acknowledged. In the ‘weight’ data frame, only 8 Fitbit users consented to submit data into the application, so this parameter cannot be used. After the upcoming data merge, each merged data frame will include only 24 participants, which falls below the commonly cited minimum sample size of 30. However, we will continue for learning’s sake. Additionally, the Fitbit data set does not include demographic details such as gender and age, making it difficult to apply these findings to a solely female customer base such as Bellabeat’s.

daily_activity <- read.csv("/Users/grace/Library/Mobile Documents/com~apple~CloudDocs/Bellabeat/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")

daily_sleep <- read.csv("/Users/grace/Library/Mobile Documents/com~apple~CloudDocs/Bellabeat/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")

daily_intensities <- read.csv("/Users/grace/Library/Mobile Documents/com~apple~CloudDocs/Bellabeat/Fitabase Data 4.12.16-5.12.16/dailyIntensities_merged.csv")

daily_calories <- read.csv("/Users/grace/Library/Mobile Documents/com~apple~CloudDocs/Bellabeat/Fitabase Data 4.12.16-5.12.16/dailyCalories_merged.csv")
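The absolute paths above resolve only on the machine where the analysis was written. As a minimal sketch of a more portable setup, the base directory could be stored once and combined with each file name. The `data_dir` variable, the `read_fitbit()` helper, and the stand-in CSV values below are assumptions for illustration and are not part of the original analysis.

```r
# Hypothetical base directory; point this at wherever the Fitabase export lives.
data_dir <- file.path(tempdir(), "fitabase_demo")
dir.create(data_dir, showWarnings = FALSE)

# Stand-in file with made-up values so the sketch runs anywhere.
write.csv(
  data.frame(
    Id = c(1, 2),
    ActivityDay = c("4/12/2016", "4/13/2016"),
    Calories = c(1985, 1797)
  ),
  file.path(data_dir, "dailyCalories_merged.csv"),
  row.names = FALSE
)

# Illustrative helper: builds the full path and fails early if the file is missing.
read_fitbit <- function(file_name, dir = data_dir) {
  path <- file.path(dir, file_name)
  stopifnot(file.exists(path))
  read.csv(path)
}

daily_calories_demo <- read_fitbit("dailyCalories_merged.csv")
str(daily_calories_demo)
```

With this layout, moving the data folder means changing `data_dir` in one place rather than editing every `read.csv()` call.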

Process

Tools Used

For this analysis, I have chosen R for the usability of the R Markdown document and for the ggplot2 package, which I use to create the visualizations that support the statistical analysis.

Packages Used

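I have used the tidyverse, ggplot2, tidyr, skimr, janitor, lubridate, and dplyr packages to clean and present the data. Loading the tidyverse already attaches ggplot2, tidyr, and dplyr, so the separate library() calls for those three are redundant but harmless.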

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(ggplot2)

library(tidyr)

library(skimr)

library(janitor)

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(dplyr)

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

Format Dates

I will be merging the data frames together later in my analysis and, therefore, need the date columns to be converted from character strings to the Date class. I have used base R's as.Date() and as.POSIXct() functions to reformat the date columns.

daily_activity$ActivityDate <- as.Date(daily_activity$ActivityDate, "%m/%d/%Y")

daily_calories$ActivityDay <- as.Date(daily_calories$ActivityDay, "%m/%d/%Y")

daily_intensities$ActivityDay <- as.Date(daily_intensities$ActivityDay, "%m/%d/%Y")

daily_sleep$SleepDay <- as.Date(as.POSIXct(daily_sleep$SleepDay, format = "%m/%d/%Y %H:%M:%S", tz = "America/New_York"))
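Date parsing in R fails silently: a string that does not match the format becomes NA rather than raising an error. The sketch below uses made-up strings shaped like the formats above (the exact layout of the raw SleepDay values is an assumption) to show that as.Date() can read the date portion directly and that parse failures can be counted before merging.

```r
# Made-up strings in the two layouts implied by the format strings above.
activity_day <- c("4/12/2016", "5/1/2016")
sleep_day <- c("4/12/2016 12:00:00 AM", "5/1/2016 12:00:00 AM")

# as.Date() reads only as much of each string as the format needs,
# so the trailing clock time in sleep_day is ignored.
activity_date <- as.Date(activity_day, format = "%m/%d/%Y")
sleep_date <- as.Date(sleep_day, format = "%m/%d/%Y")

# A failed parse becomes NA rather than an error, so count failures explicitly.
parse_failures <- sum(is.na(activity_date)) + sum(is.na(sleep_date))

print(sleep_date)
print(parse_failures)
```

A nonzero `parse_failures` would mean some rows could never match on date in the later merge.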

Rename Date Columns

Here, I have renamed the date columns from all data frames with dplyr in preparation for the upcoming merge.

daily_activity <- daily_activity %>% 
  rename(date = ActivityDate)

daily_calories <- daily_calories %>% 
  rename(date = ActivityDay)

daily_intensities <- daily_intensities %>% 
  rename(date = ActivityDay)

daily_sleep <- daily_sleep %>% 
  rename(date = SleepDay)

Clean Data Frame Names

Using the janitor package, I have cleaned the column names of each data frame into a consistent snake_case format.

daily_activity <- daily_activity %>% 
  clean_names()

daily_calories <- daily_calories %>% 
  clean_names()

daily_intensities <- daily_intensities %>% 
  clean_names()

daily_sleep <- daily_sleep %>% 
  clean_names()

Check for Duplicates

get_dupes(daily_activity)

## No variable names specified - using all columns.

## No duplicate combinations found of: id, date, total_steps, total_distance, tracker_distance, logged_activities_distance, very_active_distance, moderately_active_distance, light_active_distance, ... and 6 other variables

##  [1] id                         date                      
##  [3] total_steps                total_distance            
##  [5] tracker_distance           logged_activities_distance
##  [7] very_active_distance       moderately_active_distance
##  [9] light_active_distance      sedentary_active_distance 
## [11] very_active_minutes        fairly_active_minutes     
## [13] lightly_active_minutes     sedentary_minutes         
## [15] calories                   dupe_count                
## <0 rows> (or 0-length row.names)

get_dupes(daily_calories)

## No variable names specified - using all columns.

## No duplicate combinations found of: id, date, calories

## [1] id         date       calories   dupe_count
## <0 rows> (or 0-length row.names)

get_dupes(daily_intensities)

## No variable names specified - using all columns.

## No duplicate combinations found of: id, date, sedentary_minutes, lightly_active_minutes, fairly_active_minutes, very_active_minutes, sedentary_active_distance, light_active_distance, moderately_active_distance, very_active_distance

##  [1] id                         date                      
##  [3] sedentary_minutes          lightly_active_minutes    
##  [5] fairly_active_minutes      very_active_minutes       
##  [7] sedentary_active_distance  light_active_distance     
##  [9] moderately_active_distance very_active_distance      
## [11] dupe_count                
## <0 rows> (or 0-length row.names)

get_dupes(daily_sleep)

## No variable names specified - using all columns.

##           id       date total_sleep_records total_minutes_asleep
## 1 4388161847 2016-05-05                   1                  471
## 2 4388161847 2016-05-05                   1                  471
## 3 4702921684 2016-05-07                   1                  520
## 4 4702921684 2016-05-07                   1                  520
## 5 8378563200 2016-04-25                   1                  388
## 6 8378563200 2016-04-25                   1                  388
##   total_time_in_bed dupe_count
## 1               495          2
## 2               495          2
## 3               543          2
## 4               543          2
## 5               402          2
## 6               402          2

Three duplicate entries (six rows in total) were found in the daily_sleep data frame.

Remove Duplicate Rows

Using the dplyr package, I have eliminated duplicate rows from the daily_sleep data frame and returned only unique values.

daily_sleep <- daily_sleep %>% 
  distinct()
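The same check and removal can be sketched in base R on a toy data frame. The ids and values below are made up, and unique() stands in for dplyr's distinct(); both keep the first copy of each fully identical row.

```r
# Toy sleep log (made-up values); the second row repeats the first.
sleep_demo <- data.frame(
  id = c(1, 1, 2, 3),
  date = as.Date(c("2016-05-05", "2016-05-05", "2016-05-07", "2016-04-25")),
  total_minutes_asleep = c(471, 471, 520, 388)
)

# duplicated() flags the second and later copies of a fully identical row.
n_dupes <- sum(duplicated(sleep_demo))

# unique() keeps the first copy of each row.
sleep_dedup <- unique(sleep_demo)

# After de-duplication, no repeated rows should remain.
stopifnot(!any(duplicated(sleep_dedup)))
print(c(before = nrow(sleep_demo), after = nrow(sleep_dedup), removed = n_dupes))
```

Comparing the row counts before and after confirms that only the extra copies were dropped.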

Merge Data Frames

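I have created two merged data frames, joining on two fields: id and date. Because merge() performs an inner join by default, only id and date combinations present in both inputs are kept. The activity_sleep data frame will compare activity and sleep, and the intensities_sleep data frame will compare intensities and sleep.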

activity_sleep <- merge(daily_activity, daily_sleep, by = c('id','date'))

intensities_sleep <- merge(daily_intensities, daily_sleep, by = c('id','date'))
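The drop in participants after merging follows from how merge() joins. The toy frames below (made-up ids and values) sketch the difference between the default inner join and a left join, and show one way to list the participants an inner join removes.

```r
# Toy frames with made-up ids: id 3 has activity but no sleep log.
activity_demo <- data.frame(
  id = c(1, 1, 2, 3),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12", "2016-04-12")),
  total_steps = c(8000, 9500, 4000, 12000)
)
sleep_demo <- data.frame(
  id = c(1, 2),
  date = as.Date(c("2016-04-12", "2016-04-12")),
  total_minutes_asleep = c(420, 380)
)

# Default merge(): inner join, keeps only id/date pairs found in both frames.
inner <- merge(activity_demo, sleep_demo, by = c("id", "date"))

# all.x = TRUE: left join, keeps every activity row and fills missing sleep with NA.
left <- merge(activity_demo, sleep_demo, by = c("id", "date"), all.x = TRUE)

# Participants that the inner join drops entirely.
dropped_ids <- setdiff(unique(activity_demo$id), unique(inner$id))
print(dropped_ids)
```

An inner join suits this analysis because every comparison needs both activity and sleep values, but it is worth knowing which participants it excludes.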

Analyze and Share

Data summary

summary(activity_sleep)

##        id                 date             total_steps    total_distance  
##  Min.   :1.504e+09   Min.   :2016-04-12   Min.   :   17   Min.   : 0.010  
##  1st Qu.:3.977e+09   1st Qu.:2016-04-19   1st Qu.: 5189   1st Qu.: 3.592  
##  Median :4.703e+09   Median :2016-04-27   Median : 8913   Median : 6.270  
##  Mean   :4.995e+09   Mean   :2016-04-26   Mean   : 8515   Mean   : 6.012  
##  3rd Qu.:6.962e+09   3rd Qu.:2016-05-04   3rd Qu.:11370   3rd Qu.: 8.005  
##  Max.   :8.792e+09   Max.   :2016-05-12   Max.   :22770   Max.   :17.540  
##  tracker_distance logged_activities_distance very_active_distance
##  Min.   : 0.010   Min.   :0.0000             Min.   : 0.000      
##  1st Qu.: 3.592   1st Qu.:0.0000             1st Qu.: 0.000      
##  Median : 6.270   Median :0.0000             Median : 0.570      
##  Mean   : 6.007   Mean   :0.1089             Mean   : 1.446      
##  3rd Qu.: 7.950   3rd Qu.:0.0000             3rd Qu.: 2.360      
##  Max.   :17.540   Max.   :4.0817             Max.   :12.540      
##  moderately_active_distance light_active_distance sedentary_active_distance
##  Min.   :0.0000             Min.   :0.010         Min.   :0.0000000        
##  1st Qu.:0.0000             1st Qu.:2.540         1st Qu.:0.0000000        
##  Median :0.4200             Median :3.665         Median :0.0000000        
##  Mean   :0.7439             Mean   :3.791         Mean   :0.0009268        
##  3rd Qu.:1.0375             3rd Qu.:4.918         3rd Qu.:0.0000000        
##  Max.   :6.4800             Max.   :9.480         Max.   :0.1100000        
##  very_active_minutes fairly_active_minutes lightly_active_minutes
##  Min.   :  0.00      Min.   :  0.00        Min.   :  2.0         
##  1st Qu.:  0.00      1st Qu.:  0.00        1st Qu.:158.0         
##  Median :  9.00      Median : 11.00        Median :208.0         
##  Mean   : 25.05      Mean   : 17.92        Mean   :216.5         
##  3rd Qu.: 38.00      3rd Qu.: 26.75        3rd Qu.:263.0         
##  Max.   :210.00      Max.   :143.00        Max.   :518.0         
##  sedentary_minutes    calories    total_sleep_records total_minutes_asleep
##  Min.   :   0.0    Min.   : 257   Min.   :1.00        Min.   : 58.0       
##  1st Qu.: 631.2    1st Qu.:1841   1st Qu.:1.00        1st Qu.:361.0       
##  Median : 717.0    Median :2207   Median :1.00        Median :432.5       
##  Mean   : 712.1    Mean   :2389   Mean   :1.12        Mean   :419.2       
##  3rd Qu.: 782.8    3rd Qu.:2920   3rd Qu.:1.00        3rd Qu.:490.0       
##  Max.   :1265.0    Max.   :4900   Max.   :3.00        Max.   :796.0       
##  total_time_in_bed
##  Min.   : 61.0    
##  1st Qu.:403.8    
##  Median :463.0    
##  Mean   :458.5    
##  3rd Qu.:526.0    
##  Max.   :961.0

n_distinct(activity_sleep$id)

## [1] 24

summary(intensities_sleep)

##        id                 date            sedentary_minutes
##  Min.   :1.504e+09   Min.   :2016-04-12   Min.   :   0.0   
##  1st Qu.:3.977e+09   1st Qu.:2016-04-19   1st Qu.: 631.2   
##  Median :4.703e+09   Median :2016-04-27   Median : 717.0   
##  Mean   :4.995e+09   Mean   :2016-04-26   Mean   : 712.1   
##  3rd Qu.:6.962e+09   3rd Qu.:2016-05-04   3rd Qu.: 782.8   
##  Max.   :8.792e+09   Max.   :2016-05-12   Max.   :1265.0   
##  lightly_active_minutes fairly_active_minutes very_active_minutes
##  Min.   :  2.0          Min.   :  0.00        Min.   :  0.00     
##  1st Qu.:158.0          1st Qu.:  0.00        1st Qu.:  0.00     
##  Median :208.0          Median : 11.00        Median :  9.00     
##  Mean   :216.5          Mean   : 17.92        Mean   : 25.05     
##  3rd Qu.:263.0          3rd Qu.: 26.75        3rd Qu.: 38.00     
##  Max.   :518.0          Max.   :143.00        Max.   :210.00     
##  sedentary_active_distance light_active_distance moderately_active_distance
##  Min.   :0.0000000         Min.   :0.010         Min.   :0.0000            
##  1st Qu.:0.0000000         1st Qu.:2.540         1st Qu.:0.0000            
##  Median :0.0000000         Median :3.665         Median :0.4200            
##  Mean   :0.0009268         Mean   :3.791         Mean   :0.7439            
##  3rd Qu.:0.0000000         3rd Qu.:4.918         3rd Qu.:1.0375            
##  Max.   :0.1100000         Max.   :9.480         Max.   :6.4800            
##  very_active_distance total_sleep_records total_minutes_asleep
##  Min.   : 0.000       Min.   :1.00        Min.   : 58.0       
##  1st Qu.: 0.000       1st Qu.:1.00        1st Qu.:361.0       
##  Median : 0.570       Median :1.00        Median :432.5       
##  Mean   : 1.446       Mean   :1.12        Mean   :419.2       
##  3rd Qu.: 2.360       3rd Qu.:1.00        3rd Qu.:490.0       
##  Max.   :12.540       Max.   :3.00        Max.   :796.0       
##  total_time_in_bed
##  Min.   : 61.0    
##  1st Qu.:403.8    
##  Median :463.0    
##  Mean   :458.5    
##  3rd Qu.:526.0    
##  Max.   :961.0

n_distinct(intensities_sleep$id)

## [1] 24

There are 24 distinct user IDs in both merged data frames. The summary() output reports no NA counts for any column, so there are no missing values and no need to remove empty columns or rows.
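Missing values can also be counted directly rather than read off the summary output. This is a minimal sketch on a toy data frame with one deliberately missing value; the data is made up.

```r
# Toy frame with one deliberately missing value (made-up data).
demo <- data.frame(
  id = c(1, 2, 3),
  total_steps = c(8000, NA, 4000),
  calories = c(2100, 1900, 1750)
)

# Number of NA cells per column; all zeros means nothing is missing.
na_per_column <- colSums(is.na(demo))
print(na_per_column)

# Number of rows with no missing value in any column.
complete_rows <- sum(complete.cases(demo))
print(complete_rows)
```

Applied to activity_sleep and intensities_sleep, a result of all zeros from colSums(is.na(...)) would confirm the statement above.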

Activity Levels

This pie chart represents how Fitbit users spend their logged fitness time. After examining the activity intensities data frame, I have found that users are primarily sedentary (82.3%), followed by lightly active (15.8%), very active (1.72%), and fairly active (1.11%). We can take this a step further and see how each activity level correlates with time spent asleep.

#assign variables
light <- sum(daily_intensities$lightly_active_minutes)
sed <- sum(daily_intensities$sedentary_minutes)
fair <- sum(daily_intensities$fairly_active_minutes)
very <- sum(daily_intensities$very_active_minutes)
total_minutes <- light+sed+fair+very
#calculate percentages
sed_per <- sed/total_minutes
light_per <- light/total_minutes
fair_per <- fair/total_minutes
very_per <- very/total_minutes
#assign slices
slices <- c(sed_per,light_per,fair_per,very_per)
#build labels from the computed shares so they cannot drift from the data
lbls <- paste0(c("Sedentary", "Lightly Active", "Fairly Active", "Very Active"), "- ", signif(100*slices, 3), "%")
#create pie chart (called directly; assigning its result to `pie` would mask the function)
pie(slices, labels = lbls, main = "Activity Levels", col = c("#C399F2","#7BB6F0","#F6B4D0","#7BE0D6"))

#Very active and sleep
cor(activity_sleep$very_active_minutes, activity_sleep$total_minutes_asleep)

## [1] -0.08812658

#Fairly active and sleep
cor(activity_sleep$fairly_active_minutes, activity_sleep$total_minutes_asleep)

## [1] -0.2492079

#Lightly active and sleep
cor(activity_sleep$lightly_active_minutes, activity_sleep$total_minutes_asleep)

## [1] 0.02758336

#Sedentary mins and sleep
cor(activity_sleep$sedentary_minutes, activity_sleep$total_minutes_asleep)

## [1] -0.6010731

Out of these four variables, sedentary minutes has the strongest correlation with sleep. With a correlation coefficient of -0.60, more sedentary minutes are associated with less sleep. This can be visualized in the following plot.

ggplot(data = activity_sleep, aes(x = sedentary_minutes, y = total_minutes_asleep))+
  geom_point(col = "#9D59EA")+
  geom_smooth(method = "lm")+
  labs(title = "Sleep and Sedentary Minutes", x = "Sedentary Minutes", y = "Minutes Asleep")

## `geom_smooth()` using formula 'y ~ x'
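The cor() calls above measure the strength of each relationship but not its statistical significance. As a sketch, cor.test() from base R's stats package reports a confidence interval and p-value alongside the coefficient. The data below is synthetic and built to be negatively related, so its numbers say nothing about the Fitbit data.

```r
set.seed(42)  # reproducible made-up data

# Synthetic stand-ins for the two columns, built to be negatively related.
sedentary_minutes <- runif(200, min = 400, max = 1000)
total_minutes_asleep <- 900 - 0.6 * sedentary_minutes + rnorm(200, sd = 60)

# Pearson correlation with a 95% confidence interval and a p-value.
result <- cor.test(sedentary_minutes, total_minutes_asleep, method = "pearson")

print(result$estimate)  # sample correlation coefficient
print(result$conf.int)  # 95% confidence interval for the coefficient
print(result$p.value)   # p-value for the null hypothesis of zero correlation
```

One caveat for the real data: each participant contributes many daily rows, so the rows are not independent and a p-value computed this way would likely be optimistic.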

Total Active Minutes

After looking at activity separated by intensity, we can now move on to examine trends associated with total active minutes.

activity_sleep <- activity_sleep %>% 
    mutate(total_active_mins = lightly_active_minutes+fairly_active_minutes+very_active_minutes)

hist(activity_sleep$total_active_mins, breaks = 10, col = "#C399F2", main = "Total Active Minutes Distribution", xlab = "Minutes")

This histogram shows a roughly bell-shaped distribution of total active minutes across all participants. Based on this distribution, the most common range is 250 to 300 active minutes per day at any intensity.
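A histogram can suggest a bell shape but does not by itself establish normality. The sketch below shows two common base R checks, a Shapiro-Wilk test and a Q-Q plot, on a synthetic vector standing in for total_active_mins. The mean and standard deviation used to simulate it are arbitrary assumptions.

```r
set.seed(1)  # reproducible made-up data

# Synthetic stand-in for total_active_mins, drawn from a normal distribution.
total_active_mins <- rnorm(300, mean = 260, sd = 90)

# Shapiro-Wilk test: a small p-value is evidence against normality.
sw <- shapiro.test(total_active_mins)
print(sw$p.value)

# Q-Q plot: points close to the reference line suggest an approximately normal shape.
qqnorm(total_active_mins, main = "Q-Q Plot of Total Active Minutes (synthetic)")
qqline(total_active_mins)
```

Running the same two checks on activity_sleep$total_active_mins would show how closely the real distribution follows a normal curve.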

Another possible correlation we can look at is between total activity and calories burned.

ggplot(data = activity_sleep, aes(x = total_active_mins, y = calories))+
  geom_point(col = "#9D59EA")+
  geom_smooth(method = "lm")+
  labs(title = "Active Minutes vs Calories", x = "Total Active Minutes", y = "Calories")

## `geom_smooth()` using formula 'y ~ x'

cor(activity_sleep$total_active_mins, activity_sleep$calories)

## [1] 0.3899832

Based on the scatter plot and correlation coefficient of 0.39, there is a moderate relationship between total active minutes and calories burned. By increasing total active minutes, users may be able to increase total calorie burn.

One way to increase total active minutes may be to increase total steps throughout the day. The number of steps also appears to be positively related to calories, as depicted in the chart below.

ggplot(daily_activity, aes(x = total_steps, y = calories))+
  geom_point(color = "#9D59EA")+
  geom_smooth(method = "lm")+
  labs(title = "Daily Calories vs Total Steps", x = "Steps", y = "Calories")

## `geom_smooth()` using formula 'y ~ x'

Active Minutes by Intensity

Next, excluding sedentary minutes, we can compare the activity intensity categories by looking at a box-and-whisker plot of the distribution of daily minutes spent in each category: light, fair, and very active.

boxplot(activity_sleep$lightly_active_minutes, activity_sleep$fairly_active_minutes, activity_sleep$very_active_minutes, col = c("#C399F2", "#F6B4D0", "#7BB6F0"), xlab = "Activity Intensity", ylab = "Minutes", main = "Light, Fair, and Very Active Minutes", names = c("Light","Fair","Very"))

Based on this chart, Fitbit users spend most of their active time in the “lightly active” range. Next, we can determine which level of activity, if any, correlates with other variables such as calories burned.

#lightly active vs calories
cor(activity_sleep$lightly_active_minutes, activity_sleep$calories)

## [1] 0.1137661

#fairly active vs calories
cor(activity_sleep$fairly_active_minutes, activity_sleep$calories)

## [1] 0.1759878

#very active vs calories
cor(activity_sleep$very_active_minutes, activity_sleep$calories)

## [1] 0.6111983

Based on this output, there is a moderately strong (0.61) correlation between very active minutes and calories, a weak (0.11) correlation between lightly active minutes and calories, and a weak (0.18) correlation between fairly active minutes and calories. The visualization below shows the positive relationship between very active minutes and calories burned.

ggplot(data = activity_sleep, aes(x = very_active_minutes, y = calories))+
  geom_point(col = "#9D59EA")+
  geom_smooth(method = "lm")+
  labs(title = "Very Active Minutes vs Calories", x = "Very Active Minutes", y = "Calories")

## `geom_smooth()` using formula 'y ~ x'
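The three separate cor() calls used for the intensity comparison can be collapsed into one, since cor() accepts a data frame of predictors and a single target vector. The toy values below are made up and stand in for activity_sleep, so the printed coefficients are illustrative only.

```r
# Toy frame with made-up values standing in for activity_sleep.
demo <- data.frame(
  lightly_active_minutes = c(150, 210, 260, 190, 300, 120),
  fairly_active_minutes = c(10, 0, 25, 15, 5, 30),
  very_active_minutes = c(0, 20, 45, 10, 60, 5),
  calories = c(1800, 2200, 2900, 2000, 3300, 1700)
)

intensity_cols <- c(
  "lightly_active_minutes",
  "fairly_active_minutes",
  "very_active_minutes"
)

# One call returns a one-column matrix with a coefficient per intensity column.
intensity_vs_calories <- cor(demo[, intensity_cols], demo$calories)
print(round(intensity_vs_calories, 2))
```

Keeping the column names in a single vector also makes it easy to add or remove intensity levels without duplicating code.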

Act

Summary

So far, four relevant correlations have been discovered after analyzing this data:

Very Active Minutes and Calories (+)
Total Active Minutes and Calories (+)
Total Steps and Calories (+)
Sedentary Minutes and Sleep (-)

Based on these insights, more sedentary time during the day is associated with less sleep, although correlation alone does not show which causes the other. This could also be viewed as a type of feedback loop, since getting less sleep could in turn lead to more sedentary minutes during the day. Next, higher levels and intensities of activity correlate with burning more calories. Therefore, increasing very active minutes as well as total active minutes may lead to a higher total calorie burn.

Recommendations

After my analysis, the following recommendations may improve Bellabeat’s marketing strategy.

Vibration System
Goal: Increase total active minutes
Bellabeat could incorporate a vibration system in which the Leaf fitness tracker vibrates after users have been sedentary for a set amount of time. People often get caught up in sedentary activities such as watching TV or working at a desk. A vibration would break the monotony and remind users to stand up and take even a short stroll.

Wellness Contest
Goal: Increase very active minutes
Bellabeat could launch a workout program in which users challenge each other to log their activities under a point system. Different activities may be worth different points, e.g. very active minutes being worth more than lightly active minutes. Points earned in the promotion could be redeemed as discounts on other Bellabeat products, which may increase sales.

Bedtime Notifications
Goal: Decrease sedentary minutes and increase minutes asleep
Median sedentary time for the Fitbit users in the merged sample was 717 minutes, just under 12 hours. Additionally, mean time asleep for these users was 419 minutes, just under 7 hours. A joint consensus statement of the American Academy of Sleep Medicine and Sleep Research Society recommends 7 or more hours of sleep per night for healthy adults (Watson, 2015). To increase time asleep, Bellabeat could introduce a daily calendar, or give users the option to sync an existing calendar into the app, so that they can set a bedtime and wake time. The app would then notify users when it is time to go to bed in order to get the recommended amount of sleep.

Thank you for reading. Comments and feedback are always welcome!

Citations

Watson NF, Badr MS, Belenky G, et al. Recommended amount of sleep for a healthy adult: a joint consensus statement of the American Academy of Sleep Medicine and Sleep Research Society. Sleep. 2015;38(6):843–844.