According to Forbes.com, Bellabeat is a data-oriented
wellness tech company that was founded by Sandro Mur, Urška Sršen, and
Lovepreet Singh in 2013. Bellabeat has grown rapidly and
positioned itself as a tech-driven wellness company for
women.
Urška Sršen, cofounder and Chief Creative Officer of
Bellabeat, believes that analyzing smart device fitness data could help
unlock new growth opportunities for the company.
Business Task:
I was asked to analyze smart device usage data to gain insight into how consumers use non-Bellabeat smart devices. I was then asked to select one of Bellabeat’s products and apply these insights to it in my presentation.
The insights I discover will help guide the company's marketing strategy. I will present my analysis to the Bellabeat executive team, along with my high-level recommendations for Bellabeat’s marketing strategy.
Case Study Road Map
Q1: Where is your data stored?
A1: FitBit Fitness Tracker Data
Q2: How is the data organized? Is it in long or wide
format?
A2: The dataset contains 18 data files. Most of them
are in long format; some are in wide format.
Q3: Are there issues with bias or credibility in this
data?
A3: The dataset was made accessible by
Urška Sršen. Mrs. Sršen indicated that the dataset might have some
limitations, and she encouraged the team to consider adding other data
to help address them. The integrity of the data appears to
be acceptable, but the dataset is not perfect. I couldn’t find descriptions
of the files, and the sample size was small: only 33 users out of an
estimated 30 million FitBit users in 2016, which means the dataset
accounts for only about 0.0001% of the total population. The sample
should have included around 380 participants to reach a
confidence level of 95% with a margin of error of +/- 5%. There is also
the question of FitBit's demographics compared to Bellabeat's. At the time of
this analysis, I couldn’t determine whether the two demographics are
similar. Gender, for example, is very important for Bellabeat, but
the FitBit data contains no information about the participants' gender,
whether they were female, male, or both. I believe there is a bias in the
sample, and gender could be a source of bias too, since I couldn’t confirm
the gender of the participants in the FitBit data.
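The sample-size figure quoted above can be checked with a few lines of R. This is only a rough sketch using Cochran's formula with a finite population correction; the 95% confidence level, the 5% margin of error, and the population of about 30 million come from the answer above, while the proportion p = 0.5 (maximum variability) and all variable names are my own assumptions.

```r
# Assumptions: p = 0.5 (maximum variability) is not from the source;
# the other inputs are the figures quoted in A3.
population <- 30e6   # estimated FitBit users in 2016
confidence <- 0.95   # confidence level
margin     <- 0.05   # margin of error
p          <- 0.5    # assumed proportion

z  <- qnorm(1 - (1 - confidence) / 2)    # two-sided critical value, about 1.96
n0 <- z^2 * p * (1 - p) / margin^2       # Cochran's formula, infinite population
n  <- n0 / (1 + (n0 - 1) / population)   # finite population correction

required_n   <- ceiling(n)               # round up to whole participants
sample_share <- 33 / population * 100    # share of the population, in percent

required_n
sample_share
```

Under these assumptions the required sample comes out at 385 participants, in line with the figure of around 380, while the 33 available users are roughly one ten-thousandth of a percent of the population.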
Q4: Does your data ROCCC?
A4: No. I will
go through each part of ROCCC in turn:
1- R = Reliability: Low. The data is
not reliable enough to help guide the company's marketing
strategy, because of the low number of participants and because their
gender is unknown (gender is a very important factor for Bellabeat).
2- O = Originality: Low. This is third-party data.
3- C =
Comprehensiveness: Low. The data is not comprehensive: it has no information
about the participants' demographics, such as gender, age, location, and
health status.
4- C = Current: Low. This analysis was
done in 2022, and the FitBit dataset is outdated: it was created back in
2016. In the six years since, there have been many changes and new trends in
consumer wellness smart devices, as well as in the way the
data is collected.
5- C = Cited: Low. The dataset comes from a
survey distributed via Amazon Mechanical Turk between
03.12.2016-05.12.2016, and I can’t check whether this is a reliable source
or not.
Q5: How are you addressing licensing, privacy, security, and
accessibility?
A5: The data has been
anonymized and includes no personal information.
Q6 + Q7 + Q8: How did you verify the data’s integrity? How does
it help you answer your question? Are there any problems with the
data?
A6 + A7 + A8: The data is insufficient to
provide good, comprehensive insights for Bellabeat. My analysis
can only provide a few hints; larger and more reliable datasets are
needed to give better direction and insights.
install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("readr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("tidyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("devtools")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("lubridate")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("janitor")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("ggplot2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.0
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(dplyr)
library(readr)
library(tidyr)
library(devtools)
## Loading required package: usethis
library(lubridate)
## Loading required package: timechange
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(ggplot2)
daily_activity <- read.csv("dailyActivity_merged.csv")
head(daily_activity)
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 4/12/2016 13162 8.50 8.50
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 1.88 0.55
## 2 0 1.57 0.69
## 3 0 2.44 0.40
## 4 0 2.14 1.26
## 5 0 2.71 0.41
## 6 0 3.19 0.78
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 6.06 0 25
## 2 4.71 0 21
## 3 3.91 0 30
## 4 2.83 0 29
## 5 5.04 0 36
## 6 2.51 0 38
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 13 328 728 1985
## 2 19 217 776 1797
## 3 11 181 1218 1776
## 4 34 209 726 1745
## 5 10 221 773 1863
## 6 20 164 539 1728
colnames(daily_activity)
## [1] "Id" "ActivityDate"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories"
# Drop days with no recorded movement (presumably days the tracker was not worn)
daily_activity <- daily_activity %>% filter(TotalDistance != 0, TotalSteps != 0)
# Count fully duplicated rows
nrow(daily_activity[duplicated(daily_activity),])
## [1] 0
sleep_day <- read.csv("sleepDay_merged.csv")
head(sleep_day)
## Id SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM 1 327
## 2 1503960366 4/13/2016 12:00:00 AM 2 384
## 3 1503960366 4/15/2016 12:00:00 AM 1 412
## 4 1503960366 4/16/2016 12:00:00 AM 2 340
## 5 1503960366 4/17/2016 12:00:00 AM 1 700
## 6 1503960366 4/19/2016 12:00:00 AM 1 304
## TotalTimeInBed
## 1 346
## 2 407
## 3 442
## 4 367
## 5 712
## 6 320
colnames(sleep_day)
## [1] "Id" "SleepDay" "TotalSleepRecords"
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
sleep_day_2 <- sleep_day %>%
separate(SleepDay, c("Date", "Time"), " ")
## Warning: Expected 2 pieces. Additional pieces discarded in 413 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
head(sleep_day_2)
## Id Date Time TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 1 327
## 2 1503960366 4/13/2016 12:00:00 2 384
## 3 1503960366 4/15/2016 12:00:00 1 412
## 4 1503960366 4/16/2016 12:00:00 2 340
## 5 1503960366 4/17/2016 12:00:00 1 700
## 6 1503960366 4/19/2016 12:00:00 1 304
## TotalTimeInBed
## 1 346
## 2 407
## 3 442
## 4 367
## 5 712
## 6 320
colnames(sleep_day_2)
## [1] "Id" "Date" "Time"
## [4] "TotalSleepRecords" "TotalMinutesAsleep" "TotalTimeInBed"
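The warning from separate() above appears because each SleepDay value has three space-separated pieces (date, time, and AM/PM), so the AM/PM part is discarded and the Date column remains plain text. As a hedged alternative, the sketch below parses the date part directly with base R; the small sleep_sample data frame is a hypothetical stand-in built from two values shown in the head(sleep_day) output, and lubridate's mdy_hms() would be another option.

```r
# Two SleepDay strings copied from the head(sleep_day) output above
sleep_sample <- data.frame(
  Id       = c(1503960366, 1503960366),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM"),
  stringsAsFactors = FALSE
)

# Parse the month/day/year prefix; the time text after it is not needed
# by this format, so it is left out of the resulting Date
sleep_sample$Date <- as.Date(sleep_sample$SleepDay, format = "%m/%d/%Y")

class(sleep_sample$Date)
sleep_sample$Date
```

A real Date column sorts chronologically and can be passed to functions such as weekdays(), which a character column cannot.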
nrow(sleep_day_2[duplicated(sleep_day_2),])
## [1] 3
nrow(sleep_day_2)
## [1] 413
sleep_day_2 <- unique(sleep_day_2)
nrow(sleep_day_2)
## [1] 410
ggplot(data=daily_activity) +
geom_point(mapping=aes(x=TotalSteps, y=Calories), color="purple") +
geom_smooth(mapping=aes(x=TotalSteps, y=Calories)) +
labs(title="Relationship Between Total Steps and Calories Burned", x="Total Steps", y="Calories Burned (kcal)")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
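To put a number on the relationship shown in the plot, a correlation coefficient can be computed. The sketch below is only an illustration: it uses the six rows printed by head(daily_activity) earlier, all from one participant, so its result says nothing reliable about the full dataset. The names sample_activity and steps_calories_cor are hypothetical.

```r
# First six rows of daily_activity, copied from the head() output above.
# Six days from a single participant are far too few to generalise from.
sample_activity <- data.frame(
  TotalSteps = c(13162, 10735, 10460, 9762, 12669, 9705),
  Calories   = c(1985, 1797, 1776, 1745, 1863, 1728)
)

# Pearson correlation: strength of the linear relationship, from -1 to 1
steps_calories_cor <- cor(sample_activity$TotalSteps, sample_activity$Calories)
steps_calories_cor

# On the full data frame the same call would be:
# cor(daily_activity$TotalSteps, daily_activity$Calories)
```

A value close to 1 would indicate that days with more steps tend to be days with more calories burned, which is what the upward trend in the plot suggests.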
sleep_day_2 %>%
select(TotalSleepRecords,
TotalMinutesAsleep,
TotalTimeInBed) %>%
summary()
## TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## Min. :1.00 Min. : 58.0 Min. : 61.0
## 1st Qu.:1.00 1st Qu.:361.0 1st Qu.:403.8
## Median :1.00 Median :432.5 Median :463.0
## Mean :1.12 Mean :419.2 Mean :458.5
## 3rd Qu.:1.00 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :3.00 Max. :796.0 Max. :961.0
ggplot(data=sleep_day_2) +
geom_point(mapping=aes(x=TotalMinutesAsleep, y=TotalTimeInBed), color="red") +
geom_smooth(mapping=aes(x=TotalMinutesAsleep, y=TotalTimeInBed)) +
labs(title="Total Minutes Asleep vs. Total Time in Bed", x="Total Minutes Asleep", y="Total Time in Bed (min)")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
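The gap between time in bed and time asleep can also be expressed per record. The sketch below is a minimal illustration using the six rows printed by head(sleep_day) earlier; the column names MinutesAwakeInBed and SleepEfficiency are my own and are not part of the FitBit data.

```r
# Six sleep records copied from the head(sleep_day) output above
sample_sleep <- data.frame(
  TotalMinutesAsleep = c(327, 384, 412, 340, 700, 304),
  TotalTimeInBed     = c(346, 407, 442, 367, 712, 320)
)

# Minutes spent in bed but not asleep
sample_sleep$MinutesAwakeInBed <-
  sample_sleep$TotalTimeInBed - sample_sleep$TotalMinutesAsleep

# Share of time in bed that was spent asleep (1 = asleep the whole time)
sample_sleep$SleepEfficiency <-
  sample_sleep$TotalMinutesAsleep / sample_sleep$TotalTimeInBed

sample_sleep
```

Applied to sleep_day_2, the same two columns would show which records have the most time awake in bed, which could support reminders about bedtime routines in the app.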
Based on the datasets from FitBit, and keeping the limitations of this data in mind, I see that Bellabeat can use its app to encourage users to set new goals and to alert them when they have been inactive at a specific time or on a specific day. The company could also offer in-app incentives, partner with gyms to offer discounted memberships, or invite users to join online activity groups and live online workouts.