Bellabeat Case Study

How Can a Wellness Technology Company Play It Smart?

According to Forbes.com, Bellabeat is a data-oriented wellness tech company founded in 2013 by Sandro Mur, Urška Sršen, and Lovepreet Singh. Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for women.

Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company.

Business Task:

I was asked to analyze smart device usage data to gain insight into how consumers use non-Bellabeat smart devices. I was then asked to select one of Bellabeat’s products and apply these insights to it in my presentation.

The insights I discover will help guide the company's marketing strategy. I will present my analysis to the Bellabeat executive team, along with my high-level recommendations for Bellabeat’s marketing strategy.

Case Study Road Map

Guiding questions:

Q1: Where is your data stored?
A1: FitBit Fitness Tracker Data

Q2: How is the data organized? Is it in long or wide format?
A2: The data comprises 18 files. Most of them are in long format, and some are in wide format.

Q3: Are there issues with bias or credibility in this data?
A3: The dataset was made accessible by Urška Sršen. Mrs. Sršen indicated that the dataset might have some limitations, and she encouraged the team to consider adding other data to help address them. The data appears to be internally consistent, but the dataset is not perfect. I couldn’t find descriptions of the files, and the sample size was small: only 33 users out of an estimated 30 million FitBit users in 2016, which means the dataset accounts for only about 0.00011% of the total population. The sample should have contained about 385 participants to reach a confidence level of 95% with a margin of error of +/- 5%. There is also the question of how FitBit's demographics compare to Bellabeat's. At the time of this analysis, we couldn’t determine whether the two are similar. Gender, for example, is very important for Bellabeat, but the FitBit data did not contain any information about whether the participants were female, male, or both. We feel there is a bias in the sample, and gender could be a source of bias too, since we couldn’t confirm it in the FitBit data.

Q4: Is your data ROCCC?
A4: No. ROCCC stands for the five criteria below, each rated for this dataset:
1- R = Reliability: Low. The data is not reliable enough to guide the company's marketing strategy, due to the low number of participants and because their gender is unknown (gender is a very important factor for Bellabeat).
2- O = Originality: Low. This is third-party data.
3- C = Comprehensiveness: Low. The data is not comprehensive; it contains no information about the participants' demographics, such as gender, age, location, and health status.
4- C = Current: Low. This analysis was done in 2022, and the FitBit dataset is outdated: it was created back in 2016. In the past 6 years there have been many changes and new trends in consumer wellness smart devices, as well as in the way the data is collected.
5- C = Cited: Low. The dataset was generated by a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016, and we can’t check whether this is a reliable source.

Q5: How are you addressing licensing, privacy, security, and accessibility?
A5: The data gathered has been anonymized; no personal information is included.

Q6 + Q7 + Q8: How did you verify the data’s integrity? How does it help you answer your question? Are there any problems with the data?
A6 + A7 + A8: The data is insufficient to provide good, comprehensive insights to Bellabeat. Our analysis can only provide a few hints; larger and more reliable datasets are needed to give better direction and insights.
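
The sample-size figures quoted in A3 can be checked with a short calculation. The sketch below assumes Cochran's formula for estimating a proportion, with the most conservative value p = 0.5 and a finite population correction. The function name sample_size is hypothetical, and the 30 million population figure is simply the estimate quoted in A3, not a verified number.

```r
# Minimal sketch: required sample size for a given confidence level and
# margin of error (Cochran's formula with finite population correction).
# sample_size() is a hypothetical helper, not part of the original analysis.
sample_size <- function(confidence = 0.95, margin = 0.05, p = 0.5,
                        population = Inf) {
  z  <- qnorm(1 - (1 - confidence) / 2)  # two-sided critical value (~1.96)
  n0 <- z^2 * p * (1 - p) / margin^2     # size for an infinite population
  if (is.finite(population)) {
    n0 <- n0 / (1 + (n0 - 1) / population)  # finite population correction
  }
  ceiling(n0)
}

# Population estimate taken from A3 (assumption, not verified here)
fitbit_users <- 30e6

n_required <- sample_size(population = fitbit_users)
share_pct  <- 33 / fitbit_users * 100  # share of the population sampled, in %

print(n_required)  # 385
print(share_pct)
```

With only 33 participants, the sample is more than ten times smaller than this calculation suggests, which is consistent with the low reliability rating given in A4.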

Using RStudio for data cleaning, transformation, analysis, and visualization:

Installing and loading the required R packages

install.packages("tidyverse")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("dplyr")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("readr")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("tidyr")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("devtools")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("lubridate")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("janitor")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("ggplot2")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.0 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(dplyr)

library(readr)

library(tidyr)

library(devtools)

## Loading required package: usethis

library(lubridate)

## Loading required package: timechange

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(janitor)

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(ggplot2)

Import dataset “dailyActivity_merged.csv”

daily_activity <- read.csv("dailyActivity_merged.csv")

head(daily_activity)

##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366    4/12/2016      13162          8.50            8.50
## 2 1503960366    4/13/2016      10735          6.97            6.97
## 3 1503960366    4/14/2016      10460          6.74            6.74
## 4 1503960366    4/15/2016       9762          6.28            6.28
## 5 1503960366    4/16/2016      12669          8.16            8.16
## 6 1503960366    4/17/2016       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728

colnames(daily_activity)

##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"

The daily_activity dataset contains many rows where TotalDistance or TotalSteps is 0. I will remove these rows so that they do not skew the results.

daily_activity <- daily_activity %>%
  filter(TotalDistance != 0, TotalSteps != 0)

Checking daily_activity for duplicates

nrow(daily_activity[duplicated(daily_activity),])

## [1] 0

Import dataset “sleepDay_merged.csv”

sleep_day <- read.csv("sleepDay_merged.csv")

head(sleep_day)

##           Id              SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304
##   TotalTimeInBed
## 1            346
## 2            407
## 3            442
## 4            367
## 5            712
## 6            320

colnames(sleep_day)

## [1] "Id"                 "SleepDay"           "TotalSleepRecords" 
## [4] "TotalMinutesAsleep" "TotalTimeInBed"

Separating the date and time in the sleep_day dataset into two different columns, Date and Time

sleep_day_2 <- sleep_day %>% 
  separate(SleepDay, c("Date", "Time"), " ")

## Warning: Expected 2 pieces. Additional pieces discarded in 413 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
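
The warning above appears because each SleepDay value splits into three pieces on the space character (date, time, and the AM/PM marker), so the third piece is discarded. As a hedged alternative, the sketch below parses the date directly with base R. It uses three SleepDay values copied from the head(sleep_day) output above; the variable names are illustrative only.

```r
# Minimal sketch: extract the date from SleepDay strings without splitting.
# The three values are copied from the head(sleep_day) output above.
sleep_day_raw <- c("4/12/2016 12:00:00 AM",
                   "4/13/2016 12:00:00 AM",
                   "4/15/2016 12:00:00 AM")

# as.Date() reads only as much of each string as the format needs,
# so the trailing time and AM/PM marker are ignored.
sleep_date <- as.Date(sleep_day_raw, format = "%m/%d/%Y")

# Check whether every timestamp is midnight, in which case the time
# carries no information and only the date is needed.
all_midnight <- all(grepl("12:00:00 AM$", sleep_day_raw))

print(sleep_date)
print(all_midnight)
```

Storing the result as a Date rather than a character column also makes it possible to sort by day or compute weekdays later on.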

head(sleep_day_2)

##           Id      Date     Time TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00                 1                327
## 2 1503960366 4/13/2016 12:00:00                 2                384
## 3 1503960366 4/15/2016 12:00:00                 1                412
## 4 1503960366 4/16/2016 12:00:00                 2                340
## 5 1503960366 4/17/2016 12:00:00                 1                700
## 6 1503960366 4/19/2016 12:00:00                 1                304
##   TotalTimeInBed
## 1            346
## 2            407
## 3            442
## 4            367
## 5            712
## 6            320

colnames(sleep_day_2)

## [1] "Id"                 "Date"               "Time"              
## [4] "TotalSleepRecords"  "TotalMinutesAsleep" "TotalTimeInBed"

Checking sleep_day_2 for duplicates

nrow(sleep_day_2[duplicated(sleep_day_2),])

## [1] 3

sleep_day_2 contains 3 duplicate rows, which should be removed.

nrow(sleep_day_2)

## [1] 413

sleep_day_2 <- unique(sleep_day_2)
nrow(sleep_day_2)

## [1] 410

Below, I analyze the daily_activity data frame and create a visualisation of TotalSteps vs. Calories burned, to get a general idea of the relationship between the activity level (more steps taken) and the number of calories burned.

As the graph below shows, there is a positive relationship between the number of steps taken and the calories burned: the more active the individual is, the higher the number of calories burned.

ggplot(data=daily_activity) +
  geom_point(mapping=aes(x=TotalSteps, y=Calories), color="purple") +
  geom_smooth(mapping=aes(x=TotalSteps, y=Calories)) +
  labs(title="Relationship Between TotalSteps Vs. Calories Burned", x="TotalSteps", y="Calories Burned (kcal)")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
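
The strength of this relationship can also be expressed as a number. The sketch below computes a Pearson correlation coefficient on the six rows shown in the head(daily_activity) output above. These rows all belong to a single user, so the result is only an illustration of the technique; on the real data, the same cor() call would be applied to the full daily_activity data frame.

```r
# Minimal sketch: Pearson correlation between steps and calories.
# The six rows are copied from the head(daily_activity) output above and
# come from one user only, so they are not representative of the dataset.
activity_sample <- data.frame(
  TotalSteps = c(13162, 10735, 10460, 9762, 12669, 9705),
  Calories   = c(1985, 1797, 1776, 1745, 1863, 1728)
)

# A value close to +1 indicates a strong positive linear relationship.
steps_calories_r <- cor(activity_sample$TotalSteps,
                        activity_sample$Calories,
                        method = "pearson")

print(steps_calories_r)
```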

Below, I analyze the sleep_day_2 data frame and create a visualisation of Total Time in Bed vs. Total Minutes Asleep, to see whether there is a correlation between the time a person spends in bed and the amount of time that person sleeps.

As the graph below shows, there seems to be a linear relationship between the time spent in bed and the minutes a person slept.

sleep_day_2 %>% 
  select(TotalSleepRecords,
         TotalMinutesAsleep,
         TotalTimeInBed) %>% 
  summary()

##  TotalSleepRecords TotalMinutesAsleep TotalTimeInBed 
##  Min.   :1.00      Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:1.00      1st Qu.:361.0      1st Qu.:403.8  
##  Median :1.00      Median :432.5      Median :463.0  
##  Mean   :1.12      Mean   :419.2      Mean   :458.5  
##  3rd Qu.:1.00      3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :3.00      Max.   :796.0      Max.   :961.0

ggplot(data=sleep_day_2) +
  geom_point(mapping=aes(x=TotalMinutesAsleep, y=TotalTimeInBed), color="red") +
  geom_smooth(mapping=aes(x=TotalMinutesAsleep, y=TotalTimeInBed)) +
  labs(title="Total Minutes Asleep vs. the Total Time in Bed", x="Total Minutes Asleep (min)", y="Total Time in Bed (min)")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
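
One way to examine the linear relationship more closely is to fit a straight line and look at the gap between time in bed and time asleep. The sketch below does this for the six rows shown in the head(sleep_day) output above; it illustrates the approach on that small sample and is not a result for the whole dataset.

```r
# Minimal sketch: linear fit of time in bed against minutes asleep.
# The six rows are copied from the head(sleep_day) output above.
sleep_sample <- data.frame(
  TotalMinutesAsleep = c(327, 384, 412, 340, 700, 304),
  TotalTimeInBed     = c(346, 407, 442, 367, 712, 320)
)

# Fit TotalTimeInBed as a linear function of TotalMinutesAsleep.
fit   <- lm(TotalTimeInBed ~ TotalMinutesAsleep, data = sleep_sample)
slope <- unname(coef(fit)["TotalMinutesAsleep"])

# Minutes spent in bed but not asleep, per record.
minutes_awake_in_bed <- sleep_sample$TotalTimeInBed -
  sleep_sample$TotalMinutesAsleep

print(slope)
print(minutes_awake_in_bed)
```

A slope near 1 would mean that each extra minute asleep corresponds to about one extra minute in bed, while the awake-in-bed gap is the quantity an app could try to help users reduce.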

Summary & Recommendations for Bellabeat:

Based on the datasets from FitBit, and keeping their limitations in mind, I see that Bellabeat can use its app to encourage users to set new goals and to alert them when they have not been active at a specific time or on a specific day. Bellabeat can also offer in-app incentives, partner with gyms to offer discounted memberships, or invite users to join online activity groups and live online workouts.