Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products that inform and inspire women around the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits.
Since its founding in 2013, Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for women. By 2016, it had opened offices around the world and launched multiple products, which became available through a growing number of online retailers in addition to the company's own e-commerce channel on its website. Bellabeat has invested in traditional advertising media such as radio, out-of-home billboards, print, and television, but focuses heavily on digital marketing: it invests year-round in Google Search, maintains active Facebook and Instagram pages, and consistently engages consumers on Twitter. Additionally, Bellabeat runs video ads on YouTube and display ads on the Google Display Network to support campaigns around key marketing dates.
Sršen knows that an analysis of Bellabeat’s available consumer data would reveal more opportunities for growth. She has asked the marketing analytics team to focus on a Bellabeat product and analyze smart device usage data in order to gain insight into how people are already using their smart devices. Then, using this information, she would like high-level recommendations for how these trends can inform Bellabeat marketing strategy.
The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to the company's line of smart wellness products.
Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.
This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.
This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.
Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.
Sršen asks you to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices. She then wants you to select one Bellabeat product to apply these insights to in your presentation. These questions will guide your analysis.
The key stakeholders at Bellabeat are its founders, Urška Sršen and Sando Mur. As the company's founders, they will not be able to look deeply into the analysis itself.
Analytics team - this team can assist with common questions I may have about a project's validity, as they can look deeper into my analysis than my stakeholders can.
‘FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius): This Kaggle data set contains personal fitness tracker from thirty fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.’
I’ll be using a Kaggle dataset for this case study. The data is stored in wide format: each row represents one user-day, and each metric has its own column.
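As a quick, hypothetical illustration of what wide format means for the dailyActivity_merged.csv file used later (each row is one user-day and each metric has its own column), tidyr's pivot_longer() could reshape it into long format; this sketch is for illustration only and is not part of the analysis.

library(tidyverse)

# Illustration only: reshape the wide daily activity table into long format,
# where each metric/value pair becomes its own row.
daily_activity <- read_csv('dailyActivity_merged.csv')

daily_activity_long <- pivot_longer(daily_activity,
  cols = c(TotalSteps, TotalDistance, SedentaryMinutes, Calories),
  names_to = "metric",
  values_to = "value")

head(daily_activity_long)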
I evaluated the data against the ROCCC criteria: Reliable, Original, Comprehensive, Current, and Cited.
This data is relatively reliable. Credibility and bias concerns are limited, since the sample meets the commonly cited minimum of 30 participants at which the CLT (central limit theorem) is usually invoked. Of course, as the sample size increases, confidence improves and the margin of error shrinks. I am encouraged to look for more data sources, but that is optional here since this dataset meets the minimum requirement, albeit with a larger margin of error than a bigger sample would give.
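As a rough, hedged sketch of that trade-off (it assumes the users behave like a simple random sample and uses mean daily steps as the example metric; it is not part of the required analysis):

library(tidyverse)

# Rough illustration: approximate 95% margin of error for mean daily steps,
# and how it would shrink if the sample of users were larger.
daily_activity <- read_csv('dailyActivity_merged.csv')

per_user <- summarise(group_by(daily_activity, Id),
                      mean_steps = mean(TotalSteps), .groups = "drop")

s <- sd(per_user$mean_steps)   # spread of per-user average daily steps
n <- nrow(per_user)            # number of distinct users (roughly thirty)

1.96 * s / sqrt(n)       # margin of error at the current sample size
1.96 * s / sqrt(4 * n)   # quadrupling the sample would roughly halve it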
This data isn't as original as it could be; it has already been studied by many other junior analysts, so it's unlikely that I'll discover something 'unheard of' in the data I'll be using.
This data is fairly comprehensive: it is well organized and includes many different CSV files available for use.
This data isn't current; it was collected in the spring of 2016, which makes it around seven years old.
This data is cited and available for public use; it is licensed CC0: Public Domain and was made available through Mobius. In the event I discover PII (personally identifiable information), I will remove it from the data frames or tables I use, document my changes, and mask the affected values, for example by blurring or hashing them; a hypothetical sketch of that masking step is shown below. I verified the data's integrity by working through the ROCCC criteria above.
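There is no obvious PII in this dataset beyond the numeric Id column, but here is a hypothetical sketch of that masking step (the replacement scheme is my own choice, not something prescribed by the dataset):

library(tidyverse)

# Hypothetical sketch: replace real user Ids with anonymous sequential codes
# before sharing any output; the Id-to-code mapping would not be published.
daily_activity <- read_csv('dailyActivity_merged.csv')

daily_activity_masked <- mutate(daily_activity,
  Id = paste0("user_", as.integer(factor(Id))))

head(daily_activity_masked$Id)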
In this case study, I'll mainly use R to sort, filter, and clean my data. I know the basics of SQL, but I do not have access to premium features with my free BigQuery account.
First, I'd like to check the structure of the data I'll be using. To do this, I need to load the relevant packages and run a few lines of code.
library(tidyverse)  # also attaches dplyr, readr, ggplot2, and friends

# import the daily activity data
daily_activity <- read_csv('dailyActivity_merged.csv')

# cleaned data: drop duplicate rows and rows containing missing values
daily_activity_cleaned <- na.omit(distinct(daily_activity))

# inspect the structure of the cleaned data
str(daily_activity_cleaned)
glimpse(daily_activity_cleaned)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## tibble [940 × 15] (S3: tbl_df/tbl/data.frame)
## $ Id : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ TotalSteps : num [1:940] 13162 10735 10460 9762 12669 ...
## $ TotalDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : num [1:940] 728 776 1218 726 773 ...
## $ Calories : num [1:940] 1985 1797 1776 1745 1863 ...
## Rows: 940
## Columns: 15
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ TotalSteps <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
ActivityDate is stored as a character (chr) column, which is incorrect; it should be a Date. Now that I know what is wrong with the data, I'll develop the code to clean it.
library(tidyverse)

# import the daily activity data
daily_activity <- read_csv('dailyActivity_merged.csv')

# cleaned data: drop duplicate rows and rows containing missing values
daily_activity_cleaned <- na.omit(distinct(daily_activity))

# convert ActivityDate from character to Date
daily_activity_cleaned$ActivityDate <- as.Date(
  daily_activity_cleaned$ActivityDate, format = "%m/%d/%Y")

# check the structure, then summarize each column (min, max, quartiles, mean)
str(daily_activity_cleaned)
knitr::kable(summary(daily_activity_cleaned))
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## tibble [940 × 15] (S3: tbl_df/tbl/data.frame)
## $ Id : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : Date[1:940], format: "2016-04-12" "2016-04-13" ...
## $ TotalSteps : num [1:940] 13162 10735 10460 9762 12669 ...
## $ TotalDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : num [1:940] 728 776 1218 726 773 ...
## $ Calories : num [1:940] 1985 1797 1776 1745 1863 ...
| Id | ActivityDate | TotalSteps | TotalDistance | TrackerDistance | LoggedActivitiesDistance | VeryActiveDistance | ModeratelyActiveDistance | LightActiveDistance | SedentaryActiveDistance | VeryActiveMinutes | FairlyActiveMinutes | LightlyActiveMinutes | SedentaryMinutes | Calories | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. :1.504e+09 | Min. :2016-04-12 | Min. : 0 | Min. : 0.000 | Min. : 0.000 | Min. :0.0000 | Min. : 0.000 | Min. :0.0000 | Min. : 0.000 | Min. :0.000000 | Min. : 0.00 | Min. : 0.00 | Min. : 0.0 | Min. : 0.0 | Min. : 0 | |
| 1st Qu.:2.320e+09 | 1st Qu.:2016-04-19 | 1st Qu.: 3790 | 1st Qu.: 2.620 | 1st Qu.: 2.620 | 1st Qu.:0.0000 | 1st Qu.: 0.000 | 1st Qu.:0.0000 | 1st Qu.: 1.945 | 1st Qu.:0.000000 | 1st Qu.: 0.00 | 1st Qu.: 0.00 | 1st Qu.:127.0 | 1st Qu.: 729.8 | 1st Qu.:1828 | |
| Median :4.445e+09 | Median :2016-04-26 | Median : 7406 | Median : 5.245 | Median : 5.245 | Median :0.0000 | Median : 0.210 | Median :0.2400 | Median : 3.365 | Median :0.000000 | Median : 4.00 | Median : 6.00 | Median :199.0 | Median :1057.5 | Median :2134 | |
| Mean :4.855e+09 | Mean :2016-04-26 | Mean : 7638 | Mean : 5.490 | Mean : 5.475 | Mean :0.1082 | Mean : 1.503 | Mean :0.5675 | Mean : 3.341 | Mean :0.001606 | Mean : 21.16 | Mean : 13.56 | Mean :192.8 | Mean : 991.2 | Mean :2304 | |
| 3rd Qu.:6.962e+09 | 3rd Qu.:2016-05-04 | 3rd Qu.:10727 | 3rd Qu.: 7.713 | 3rd Qu.: 7.710 | 3rd Qu.:0.0000 | 3rd Qu.: 2.053 | 3rd Qu.:0.8000 | 3rd Qu.: 4.782 | 3rd Qu.:0.000000 | 3rd Qu.: 32.00 | 3rd Qu.: 19.00 | 3rd Qu.:264.0 | 3rd Qu.:1229.5 | 3rd Qu.:2793 | |
| Max. :8.878e+09 | Max. :2016-05-12 | Max. :36019 | Max. :28.030 | Max. :28.030 | Max. :4.9421 | Max. :21.920 | Max. :6.4800 | Max. :10.710 | Max. :0.110000 | Max. :210.00 | Max. :143.00 | Max. :518.0 | Max. :1440.0 | Max. :4900 |
After reviewing the summary, the data appears to fall within reasonable range constraints and looks valid, though I had to convert the dates from character to Date type so they would be treated as dates rather than text in the summary. A quick programmatic version of this range check is sketched below.
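As a small sketch of that range check done in code rather than by eyeballing the summary table (this chunk is my own addition and simply reuses the cleaning steps above; the date window comes from the summary):

library(tidyverse)

# rebuild the cleaned table so this check runs on its own
daily_activity <- read_csv('dailyActivity_merged.csv')
daily_activity_cleaned <- na.omit(distinct(daily_activity))
daily_activity_cleaned$ActivityDate <- as.Date(
  daily_activity_cleaned$ActivityDate, format = "%m/%d/%Y")

# count rows that violate basic constraints: a day has at most 1440 minutes,
# steps and distances cannot be negative, and dates should fall in the window
summarise(daily_activity_cleaned,
  over_1440_minutes  = sum(SedentaryMinutes + LightlyActiveMinutes +
                           FairlyActiveMinutes + VeryActiveMinutes > 1440),
  negative_steps     = sum(TotalSteps < 0),
  negative_distance  = sum(TotalDistance < 0),
  dates_out_of_range = sum(ActivityDate < as.Date("2016-04-12") |
                           ActivityDate > as.Date("2016-05-12")))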
I've chosen R as the main language for this project, since my free BigQuery account does not include premium features. In the following code chunk, I reuse the basic cleaning steps from the earlier process, with na.omit() to remove NA values. This ensures my data is clean, and I verify it by summarizing the data. These changes are all documented in my RMD file.
Code I developed
My finalized processing code
library(tidyverse)

# import the daily activity data
daily_activity <- read_csv('dailyActivity_merged.csv')

# cleaned data: drop duplicate rows and rows containing missing values
daily_activity_cleaned <- na.omit(distinct(daily_activity))

# convert ActivityDate from character to Date
daily_activity_cleaned$ActivityDate <- as.Date(
  daily_activity_cleaned$ActivityDate, format = "%m/%d/%Y")

# keep only the columns needed for the analysis, then print the result
daily_activity_processed <- select(daily_activity_cleaned,
  SedentaryMinutes, TotalSteps, Calories, TrackerDistance, TotalDistance)
daily_activity_processed
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 940 × 5
## SedentaryMinutes TotalSteps Calories TrackerDistance TotalDistance
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 728 13162 1985 8.5 8.5
## 2 776 10735 1797 6.97 6.97
## 3 1218 10460 1776 6.74 6.74
## 4 726 9762 1745 6.28 6.28
## 5 773 12669 1863 8.16 8.16
## 6 539 9705 1728 6.48 6.48
## 7 1149 13019 1921 8.59 8.59
## 8 775 15506 2035 9.88 9.88
## 9 818 10544 1786 6.68 6.68
## 10 838 9819 1775 6.34 6.34
## # ℹ 930 more rows
In this phase, I build on the processing code from the previous step. The cleaned variables are already set up, so I just need to add my graphs and plots. The data was properly formatted during the prepare/process phases.
I'll now generate some plots to analyze the data.
library(tidyverse)

# import the daily activity data
daily_activity <- read_csv('dailyActivity_merged.csv')

# cleaned data: drop duplicate rows and rows containing missing values
daily_activity_cleaned <- na.omit(distinct(daily_activity))

# convert ActivityDate from character to Date
daily_activity_cleaned$ActivityDate <- as.Date(
  daily_activity_cleaned$ActivityDate, format = "%m/%d/%Y")

# keep only the columns needed for the analysis, then print the result
daily_activity_processed <- select(daily_activity_cleaned,
  SedentaryMinutes, TotalSteps, Calories, TrackerDistance, TotalDistance)
daily_activity_processed

# Total Steps vs Sedentary Minutes: jitter plot with a smoothed trend line
ggplot(data = daily_activity_processed, aes(x = SedentaryMinutes, y = TotalSteps)) +
  geom_jitter(aes(color = SedentaryMinutes), size = 1) +
  geom_smooth(color = 'red', linetype = 'dashed', linewidth = 1) +
  scale_y_continuous(limits = c(0, 20000))

# Calories vs Sedentary Minutes
ggplot(data = daily_activity_processed, aes(x = SedentaryMinutes, y = Calories)) +
  geom_jitter(aes(color = SedentaryMinutes), size = 1) +
  geom_smooth(color = 'red', linetype = 'dashed', linewidth = 1)

# Total Steps vs Calories
ggplot(data = daily_activity_processed, aes(x = Calories, y = TotalSteps)) +
  geom_jitter(aes(color = Calories), size = 1) +
  geom_smooth(color = 'red', linetype = 'dashed', linewidth = 1)

# Total Distance vs Tracker Distance
ggplot(data = daily_activity_processed, aes(x = TrackerDistance, y = TotalDistance)) +
  geom_jitter(aes(color = TrackerDistance), size = 1) +
  geom_smooth(color = 'red', linetype = 'dashed', linewidth = 1)
## # A tibble: 940 × 5
## SedentaryMinutes TotalSteps Calories TrackerDistance TotalDistance
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 728 13162 1985 8.5 8.5
## 2 776 10735 1797 6.97 6.97
## 3 1218 10460 1776 6.74 6.74
## 4 726 9762 1745 6.28 6.28
## 5 773 12669 1863 8.16 8.16
## 6 539 9705 1728 6.48 6.48
## 7 1149 13019 1921 8.59 8.59
## 8 775 15506 2035 9.88 9.88
## 9 818 10544 1786 6.68 6.68
## 10 838 9819 1775 6.34 6.34
## # ℹ 930 more rows
Analyzing these graphs, I am surprised by the first two: there appears to be some correlation between TotalSteps and SedentaryMinutes and between Calories and SedentaryMinutes, and the two plots look almost identical. The lower two graphs are not surprising: as total steps increase, so do calories burned, and the same close relationship holds between TrackerDistance and TotalDistance. Looking closely, though, some records show a higher TrackerDistance than TotalDistance. A quick numerical check of these relationships is sketched below.
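To put rough numbers behind these visual impressions, here is a quick check I added using base R's cor() on the cleaned columns (a sketch, not part of the original chunks):

library(tidyverse)

# rebuild the cleaned table so this check runs on its own
daily_activity <- read_csv('dailyActivity_merged.csv')
daily_activity_cleaned <- na.omit(distinct(daily_activity))

# Pearson correlations for the pairs examined in the plots above
cor(daily_activity_cleaned$TotalSteps, daily_activity_cleaned$SedentaryMinutes)
cor(daily_activity_cleaned$Calories, daily_activity_cleaned$SedentaryMinutes)
cor(daily_activity_cleaned$TotalSteps, daily_activity_cleaned$Calories)
cor(daily_activity_cleaned$TrackerDistance, daily_activity_cleaned$TotalDistance)

# how many records show a higher TrackerDistance than TotalDistance?
sum(daily_activity_cleaned$TrackerDistance > daily_activity_cleaned$TotalDistance)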