Introduction

Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products that inform and inspire women around the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits.

Founded in 2013, Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for women. By 2016, Bellabeat had opened offices around the world and launched multiple products. Bellabeat products became available through a growing number of online retailers in addition to the company’s own e-commerce channel on its website. The company has invested in traditional advertising media, such as radio, out-of-home billboards, print, and television, but focuses extensively on digital marketing. Bellabeat invests year-round in Google Search, maintains active Facebook and Instagram pages, and consistently engages consumers on Twitter. Additionally, Bellabeat runs video ads on YouTube and display ads on the Google Display Network to support campaigns around key marketing dates.

Sršen knows that an analysis of Bellabeat’s available consumer data would reveal more opportunities for growth. She has asked the marketing analytics team to focus on a Bellabeat product and analyze smart device usage data in order to gain insight into how people are already using their smart devices. Then, using this information, she would like high-level recommendations for how these trends can inform Bellabeat marketing strategy.

Products

Bellabeat app

The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to Bellabeat’s line of smart wellness products.

Leaf

Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.

Time

This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.

Spring

This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.

Bellabeat membership

Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.

Ask

What is the business objective?

Sršen asks you to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices. She then wants you to select one Bellabeat product to apply these insights to in your presentation. These questions will guide your analysis.

Identify key stakeholders or teams

The key stakeholders at Bellabeat are its founders, Urška Sršen and Sando Mur. As founders, they are unlikely to have time to dig deep into an analysis, so they need high-level findings.

Analytics team - This team can help with questions I may have about the project’s validity, as they can look deeper into my analysis than my stakeholders can.

Prepare

Data location

‘FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius): This Kaggle data set contains personal fitness tracker from thirty fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.’

I’ll be using a Kaggle dataset for this case study. The data is stored in wide format.
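To make “wide format” concrete, here is a minimal base-R sketch. The two-row data frame is a small sample built from values that appear later in this case study, and `stack()` is just one way to reshape it; this step is an illustration and not part of my actual pipeline.

```r
# Two-row sample in wide format: one row per day, one column per metric.
wide <- data.frame(
  ActivityDate = c("4/12/2016", "4/13/2016"),
  TotalSteps   = c(13162, 10735),
  Calories     = c(1985, 1797)
)

# Long format: one row per (day, metric) pair.
long <- cbind(
  ActivityDate = rep(wide$ActivityDate, times = 2),
  stack(wide[c("TotalSteps", "Calories")])
)
names(long) <- c("ActivityDate", "value", "metric")
print(long)
```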

Does the data ROCCC?

Reliable, Original, Comprehensive, Current, and Cited.

Reliable

This data is relatively reliable. It meets the commonly cited minimum sample size of 30 participants associated with the CLT (central limit theorem), although sample size alone does not rule out bias in how the participants were selected. As the sample size increases, confidence increases and the margin of error decreases. I am encouraged to look for more data sources, but this is optional because the dataset meets the minimum requirement, albeit with a larger margin of error.
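As a rough illustration of how sample size affects the margin of error, the sketch below uses the standard formula for a proportion at 95% confidence, z * sqrt(p(1 - p) / n). The function name `margin_of_error` and the worst-case `p = 0.5` are my own assumptions for illustration, and the formula assumes a simple random sample, which this dataset may not be.

```r
# Margin of error for a proportion at a given confidence level.
# Assumes a simple random sample; p = 0.5 is the worst case.
margin_of_error <- function(n, p = 0.5, conf_level = 0.95) {
  z <- qnorm(1 - (1 - conf_level) / 2)  # about 1.96 for 95%
  z * sqrt(p * (1 - p) / n)
}

sample_sizes <- c(30, 100, 1000)
data.frame(
  n   = sample_sizes,
  moe = round(margin_of_error(sample_sizes), 3)
)
```

Under these assumptions, 30 participants give a margin of roughly 18 percentage points, compared with about 3 points for 1,000.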

Original

This data isn’t as original as it could be, i.e. it has already been studied by many junior analysts, so it’s unlikely that I’ll discover something ‘unheard of’ in it.

Comprehensive

This data is fairly comprehensive: it is well organized and comes with many different CSV files available for use.

Current

This data isn’t current; it was collected in 2016, so it’s around 7 years old.

Cited

This data is cited and available for public use. It’s licensed “(CC0: Public Domain, dataset made available through Mobius)”. In the event I discover PII (personally identifiable information), I will remove it from the data frames or tables I use, or mask or hash it, and I’ll document my changes. I verified my data’s integrity by using the ROCCC process.
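A minimal sketch of how an identifying column could be masked, assuming a hypothetical data frame with a `name` column (this dataset’s `Id` values already appear to be anonymous numbers). Replacing each distinct value with a sequential label is a simple stand-in for hashing and is my own illustration.

```r
# Hypothetical data containing PII in the `name` column.
users <- data.frame(
  name  = c("Ana", "Bea", "Ana", "Cy"),
  steps = c(13162, 10735, 10460, 9762)
)

# Replace each distinct name with a stable, non-identifying label.
users$user_key <- sprintf("user_%02d", match(users$name, unique(users$name)))
users$name <- NULL  # drop the original PII column
print(users)
```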

Process

Processing the data from dirty to clean

In this case study, I’ll mainly use R to sort, filter, and clean my data. I know the basics of SQL, but I do not have access to premium features with my free account.

First, I would like to check the structure and contents of the data I’ll be using. To do this, I need to load the relevant packages and run a few lines of code.

library(tidyverse)  # attaches dplyr, readr, and the rest of the core tidyverse

# read the raw daily activity table
daily_activity <- read_csv('dailyActivity_merged.csv')

# cleaned data: duplicate rows removed, then rows containing NA removed
daily_activity_cleaned <- na.omit(distinct(daily_activity))

# inspect the structure of the cleaned data
str(daily_activity_cleaned)
glimpse(daily_activity_cleaned)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## tibble [940 × 15] (S3: tbl_df/tbl/data.frame)
##  $ Id                      : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ TotalSteps              : num [1:940] 13162 10735 10460 9762 12669 ...
##  $ TotalDistance           : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : num [1:940] 728 776 1218 726 773 ...
##  $ Calories                : num [1:940] 1985 1797 1776 1745 1863 ...
## Rows: 940
## Columns: 15
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate             <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ TotalSteps               <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes        <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes      <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes     <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes         <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories                 <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…

ActivityDate is stored as character (chr), which is the wrong type for a date column. Now that I know what is wrong with the data, I’ll proceed to develop my code to clean it.
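Before converting the whole column, a quick check of the format string on a single value can catch mistakes, since `as.Date()` returns `NA` when the format doesn’t match. A small sketch using the first date from the output above:

```r
# The source dates are month/day/four-digit-year.
parsed <- as.Date("4/12/2016", format = "%m/%d/%Y")
print(parsed)

# A mismatched format (year first, dashes) fails to parse and gives NA.
wrong <- as.Date("4/12/2016", format = "%Y-%m-%d")
print(is.na(wrong))
```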

My code
library(tidyverse)  # attaches dplyr, readr, and the rest of the core tidyverse

daily_activity <- read_csv('dailyActivity_merged.csv')

# cleaned data: duplicate rows removed, then rows containing NA removed
daily_activity_cleaned <- na.omit(distinct(daily_activity))

# convert ActivityDate from character to Date
daily_activity_cleaned$ActivityDate <- as.Date(
  daily_activity_cleaned$ActivityDate, format = "%m/%d/%Y")

# show the structure, then min/max, quartiles, and mean for each column
str(daily_activity_cleaned)
knitr::kable(summary(daily_activity_cleaned))
Its output
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## tibble [940 × 15] (S3: tbl_df/tbl/data.frame)
##  $ Id                      : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : Date[1:940], format: "2016-04-12" "2016-04-13" ...
##  $ TotalSteps              : num [1:940] 13162 10735 10460 9762 12669 ...
##  $ TotalDistance           : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : num [1:940] 728 776 1218 726 773 ...
##  $ Calories                : num [1:940] 1985 1797 1776 1745 1863 ...
Summary statistics (one row per column of the data)

Variable                  Min.        1st Qu.     Median      Mean        3rd Qu.     Max.
Id                        1.504e+09   2.320e+09   4.445e+09   4.855e+09   6.962e+09   8.878e+09
ActivityDate              2016-04-12  2016-04-19  2016-04-26  2016-04-26  2016-05-04  2016-05-12
TotalSteps                0           3790        7406        7638        10727       36019
TotalDistance             0.000       2.620       5.245       5.490       7.713       28.030
TrackerDistance           0.000       2.620       5.245       5.475       7.710       28.030
LoggedActivitiesDistance  0.0000      0.0000      0.0000      0.1082      0.0000      4.9421
VeryActiveDistance        0.000       0.000       0.210       1.503       2.053       21.920
ModeratelyActiveDistance  0.0000      0.0000      0.2400      0.5675      0.8000      6.4800
LightActiveDistance       0.000       1.945       3.365       3.341       4.782       10.710
SedentaryActiveDistance   0.000000    0.000000    0.000000    0.001606    0.000000    0.110000
VeryActiveMinutes         0.00        0.00        4.00        21.16       32.00       210.00
FairlyActiveMinutes       0.00        0.00        6.00        13.56       19.00       143.00
LightlyActiveMinutes      0.0         127.0       199.0       192.8       264.0       518.0
SedentaryMinutes          0.0         729.8       1057.5      991.2       1229.5      1440.0
Calories                  0           1828        2134        2304        2793        4900
Takeaway

After reviewing the summary, the values appear to fall within reasonable ranges, so I consider the data valid. I did have to change the data type of the dates so that R treats them as dates rather than text.
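The range check here was done by eye from the summary. As a sketch, the same constraints could be asserted in code; the three-row sample comes from the output above, while the specific rules (no negative values, at most 1,440 tracked minutes in a day) are my own assumptions about what “reasonable” means.

```r
# Sample rows taken from the cleaned data shown above.
sample_rows <- data.frame(
  TotalSteps           = c(13162, 10735, 10460),
  VeryActiveMinutes    = c(25, 21, 30),
  FairlyActiveMinutes  = c(13, 19, 11),
  LightlyActiveMinutes = c(328, 217, 181),
  SedentaryMinutes     = c(728, 776, 1218),
  Calories             = c(1985, 1797, 1776)
)

minute_cols <- c("VeryActiveMinutes", "FairlyActiveMinutes",
                 "LightlyActiveMinutes", "SedentaryMinutes")
total_minutes <- rowSums(sample_rows[minute_cols])

# stopifnot() raises an error if any rule is violated.
stopifnot(
  all(sample_rows >= 0),      # no negative measurements
  all(total_minutes <= 1440)  # a day has 1,440 minutes
)
print(total_minutes)
```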

Process

I’ve chosen to use R as the main language for this project since my free BigQuery account lacks premium features. In the following code chunk, I reuse the basic cleaning steps from the previous phase, including na.omit to remove NA values. This ensures my data is clean, and I verify it by summarizing the data. These changes are all documented in my RMD file.
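To document what the cleaning step removes, the number of duplicate rows and rows with missing values can be counted before cleaning. A minimal sketch on a hypothetical four-row data frame; `duplicated()`, `unique()`, and `complete.cases()` are base-R counterparts of the `distinct()` and `na.omit()` steps.

```r
# Hypothetical raw data: row 3 duplicates row 1, row 4 has a missing value.
raw <- data.frame(
  Id         = c(1, 2, 1, 3),
  TotalSteps = c(13162, 10735, 13162, NA)
)

n_duplicates <- sum(duplicated(raw))
n_incomplete <- sum(!complete.cases(raw))

cleaned <- na.omit(unique(raw))

cat("duplicate rows:", n_duplicates, "\n")
cat("rows with NA:  ", n_incomplete, "\n")
cat("rows kept:     ", nrow(cleaned), "of", nrow(raw), "\n")
```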

Code I developed

My finalized processing code

library(tidyverse)  # attaches dplyr, readr, and the rest of the core tidyverse

daily_activity <- read_csv('dailyActivity_merged.csv')

# cleaned data: duplicate rows removed, then rows containing NA removed
daily_activity_cleaned <- na.omit(distinct(daily_activity))

# convert ActivityDate from character to Date
daily_activity_cleaned$ActivityDate <- as.Date(
  daily_activity_cleaned$ActivityDate, format = "%m/%d/%Y")

# keep only the columns I want to analyze, then print a preview
daily_activity_processed <- select(
  daily_activity_cleaned,
  SedentaryMinutes, TotalSteps, Calories, TrackerDistance, TotalDistance
)
print(daily_activity_processed)
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 940 × 5
##    SedentaryMinutes TotalSteps Calories TrackerDistance TotalDistance
##               <dbl>      <dbl>    <dbl>           <dbl>         <dbl>
##  1              728      13162     1985            8.5           8.5 
##  2              776      10735     1797            6.97          6.97
##  3             1218      10460     1776            6.74          6.74
##  4              726       9762     1745            6.28          6.28
##  5              773      12669     1863            8.16          8.16
##  6              539       9705     1728            6.48          6.48
##  7             1149      13019     1921            8.59          8.59
##  8              775      15506     2035            9.88          9.88
##  9              818      10544     1786            6.68          6.68
## 10              838       9819     1775            6.34          6.34
## # ℹ 930 more rows

Analyze

In this phase, I build on the processing code. I’ve already set up cleaned variables, so now I just need to add my graphs and plots. My data was properly formatted in the prepare/process phase.

I’ll now plot some graphs to analyze the data.

library(tidyverse)  # attaches ggplot2, dplyr, readr, and the rest of the core tidyverse

daily_activity <- read_csv('dailyActivity_merged.csv')

# cleaned data: duplicate rows removed, then rows containing NA removed
daily_activity_cleaned <- na.omit(distinct(daily_activity))

# convert ActivityDate from character to Date
daily_activity_cleaned$ActivityDate <- as.Date(
  daily_activity_cleaned$ActivityDate, format = "%m/%d/%Y")

# keep only the columns used in the plots, then print a preview
daily_activity_processed <- select(
  daily_activity_cleaned,
  SedentaryMinutes, TotalSteps, Calories, TrackerDistance, TotalDistance
)
print(daily_activity_processed)

# jittered scatter plot with a smoothed trend line: total steps vs sedentary minutes
# note: the y-axis limits drop days above 20,000 steps before the trend line is fitted
ggplot(data = daily_activity_processed, aes(x = SedentaryMinutes, y = TotalSteps)) +
  geom_jitter(aes(color = SedentaryMinutes), size = 1) +
  geom_smooth(color = 'red', linetype = 'dashed', linewidth = 1) +
  scale_y_continuous(limits = c(0, 20000))

# calories vs sedentary minutes
# (linewidth replaces size for lines as of ggplot2 3.4.0)
ggplot(data = daily_activity_processed, aes(x = SedentaryMinutes, y = Calories)) +
  geom_jitter(aes(color = SedentaryMinutes), size = 1) +
  geom_smooth(color = 'red', linetype = 'dashed', linewidth = 1)

# total steps vs calories
ggplot(data = daily_activity_processed, aes(x = Calories, y = TotalSteps)) +
  geom_jitter(aes(color = Calories), size = 1) +
  geom_smooth(color = 'red', linetype = 'dashed', linewidth = 1)

# total distance vs tracker distance
ggplot(data = daily_activity_processed, aes(x = TrackerDistance, y = TotalDistance)) +
  geom_jitter(aes(color = TrackerDistance), size = 1) +
  geom_smooth(color = 'red', linetype = 'dashed', linewidth = 1)
## # A tibble: 940 × 5
##    SedentaryMinutes TotalSteps Calories TrackerDistance TotalDistance
##               <dbl>      <dbl>    <dbl>           <dbl>         <dbl>
##  1              728      13162     1985            8.5           8.5 
##  2              776      10735     1797            6.97          6.97
##  3             1218      10460     1776            6.74          6.74
##  4              726       9762     1745            6.28          6.28
##  5              773      12669     1863            8.16          8.16
##  6              539       9705     1728            6.48          6.48
##  7             1149      13019     1921            8.59          8.59
##  8              775      15506     2035            9.88          9.88
##  9              818      10544     1786            6.68          6.68
## 10              838       9819     1775            6.34          6.34
## # ℹ 930 more rows

Analysis

I was surprised by these graphs. There appears to be some relationship between TotalSteps and SedentaryMinutes, and between Calories and SedentaryMinutes; those two graphs look almost identical. The lower two graphs are less surprising: as total steps increase, so do calories, and the same is true for tracker distance and total distance. Looking closely, though, some points appear to have a higher TrackerDistance than TotalDistance.
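To put a number on how strongly these variables move together, `cor()` gives the Pearson coefficient for each pair. The sketch below uses only the ten rows printed above, so its values are illustrative and not results for the full 940-row table; in the real analysis the same call would run on `daily_activity_processed`.

```r
# First ten rows of the processed data, as printed above.
first_rows <- data.frame(
  SedentaryMinutes = c(728, 776, 1218, 726, 773, 539, 1149, 775, 818, 838),
  TotalSteps       = c(13162, 10735, 10460, 9762, 12669, 9705, 13019, 15506, 10544, 9819),
  Calories         = c(1985, 1797, 1776, 1745, 1863, 1728, 1921, 2035, 1786, 1775)
)

# Pearson correlation matrix; values near +1 or -1 suggest a strong
# linear relationship, values near 0 a weak one.
cor_matrix <- cor(first_rows)
print(round(cor_matrix, 2))
```

Pearson correlation only measures linear association, so a curved pattern like the one in the first two graphs may produce a low coefficient even when a relationship exists.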

Share

How do the graphs relate to the objective?

How do these graphs relate to my business objective? I was asked to look into how consumers use non-Bellabeat products. In this case study, I made 4 graphs, each telling its own story. I was then asked to apply the findings to a Bellabeat product, and so I will.

What story does the data tell?

Consumers who use non-Bellabeat products tend to log more steps on days when they rest (are sedentary) for between 500 and 1,100 minutes. The second graph visualizes a similar relationship: these consumers burn more calories on days with 500 to 1,100 resting minutes. I’d classify that relationship as a possible correlation; I don’t want to call it causation without exploring further.
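One way to check the 500 to 1,100 minute pattern beyond reading the chart is to bin sedentary minutes and compare average steps per bin. A sketch on the ten rows printed earlier; the bin edges come from my reading of the graph, and ten rows are far too few to draw conclusions from.

```r
sedentary <- c(728, 776, 1218, 726, 773, 539, 1149, 775, 818, 838)
steps     <- c(13162, 10735, 10460, 9762, 12669, 9705, 13019, 15506, 10544, 9819)

# Bin sedentary minutes into below, inside, and above the 500-1100 range.
bins <- cut(
  sedentary,
  breaks = c(0, 500, 1100, 1440),
  labels = c("0-500", "500-1100", "1100-1440"),
  include.lowest = TRUE
)

# Average steps per bin; empty bins come back as NA.
mean_steps <- tapply(steps, bins, mean)
print(mean_steps)
```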

Consumers burn more calories the more steps they take. ‘Calories’ and ‘TotalSteps’ appear to move together, which is odd but interesting.

In this case study, I was able to answer the question of “Sršen asks you to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices.”

Leaf

Leaf is a product developed by Bellabeat to track activity. In the context of this case study, the Bellabeat Leaf could be optimized so that, worn as a necklace, it encourages resting periods. For example, it could vibrate to give people reminders wherever they are wearing it.

For the tracker, there may be some margin of error with the GPS, as some people appear to have a greater tracked distance than total distance. The difference is very minor. It might be possible to exaggerate these numbers and claim Bellabeat is more accurate as a way to win consumers over, but I think that would be very toxic.
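Because `geom_jitter()` adds random noise to each point, small gaps between the two distances on the chart may not exist in the data. A sketch of a direct check, again on the ten rows printed earlier and using the rounded values as displayed (on the full table the same comparison would use `daily_activity_processed`):

```r
distances <- data.frame(
  TrackerDistance = c(8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.88, 6.68, 6.34),
  TotalDistance   = c(8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.88, 6.68, 6.34)
)

# Count rows where the tracker reports more distance than the total.
n_tracker_higher <- sum(distances$TrackerDistance > distances$TotalDistance)
# Count rows where the two values differ at all.
n_different <- sum(distances$TrackerDistance != distances$TotalDistance)

cat("tracker > total:", n_tracker_higher, "\n")
cat("values differ:  ", n_different, "\n")
```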

For converting people into members, I think the resting reminders could be helpful. The Bellabeat Leaf can remind you gently by vibrating a few times. Over time, consistent reminders like this could get people into the habit of listening to their Leaf.

Conclusion

Add gentle reminders for the Leaf for resting periods.

If you have any comments on my case study, feel free to share them. I would be interested in re-evaluating parts of the study for my own growth.

note: running out of my monthly allotted time here on Posit, sorry for text errors in advance…