The following case study analyzes fitness tracker data from Fitbit users and provides business recommendations to Bellabeat, a wellness company that manufactures high-tech, health-focused products for women. The conclusions of the analysis are intended to give the Bellabeat app new growth strategies and ideas.
Provide marketing strategy recommendations for Bellabeat by analyzing trends in non-Bellabeat smart device usage and gaining insights from them.
Our key stakeholders include the following:
We focus our recommendations on the Bellabeat app, one of the Bellabeat products, which provides users with personal health statistics on their activity, sleep, stress, menstrual cycle, and mindfulness habits. It is designed to empower women to lead a healthy lifestyle by keeping them informed, accountable, and motivated. The Bellabeat app collects data from the company’s smart wellness products.
The dataset used for this case study is Fitbit Fitness Tracker Data which is publicly available on Kaggle: https://www.kaggle.com/datasets/arashnic/fitbit.
It contains personal fitness tracker information, such as physical activity, heart rate, and sleep, from thirty Fitbit users who submitted their data through an Amazon Mechanical Turk survey between 04/12/2016 and 05/12/2016.
The dataset is open source. It can be viewed, downloaded, modified, and reused by the public. Hence, we do not need the owner’s permission to use the dataset for this project.
The dataset contains 18 CSV documents, all in a long format. Each Fitbit user has a unique ID and multiple rows allocated to them tracking different attributes by day, hour or minute.
As our sample appears to be small, with 30 users, we inspect the tables in Microsoft Excel and build pivot tables to better understand the dataset. After sorting and filtering the data via pivot tables, we count the number of observations and confirm that the period of analysis is 31 days.
We observe that some tables are the product of merging other tables (e.g. dailyActivity_merged contains the same information as dailyCalories_merged, dailyIntensities_merged, and dailySteps_merged combined). Other tables, such as heartrate_seconds_merged and weightLogInfo_merged, present heart rate and weight stats of 7 and 8 users respectively. We disregard these tables because such small samples cannot be representative of the population and do not support identifying general trends.
We choose the following datasets to proceed with our analysis:
dailyActivity_merged: daily activity stats of 33 users over the period of 31 days. Columns include steps, distance, intensities, calories;
sleepDay_merged: daily sleep logs of 24 users during 31 days. Columns include: number of sleep records a day, total minutes asleep and total minutes in bed;
hourlyIntensities_merged: hourly total and average intensity of 33 users over 31 days;
hourlySteps_merged: hourly total steps of 33 users during the period of 31 days.
We read into Fitbit’s measurement of active minutes to understand what the intensity metric means. Active minutes are minutes in which you exert more energy than at rest, raising your metabolic equivalent of task (MET) above the resting value of 1 (sedentary minutes). Light-intensity activities such as brisk walking require roughly twice the energy (MET=~2) that people spend at rest. Fair-intensity activities (MET from 3 to 6) include fast walking and yoga, while high-intensity exercise (MET>=6) involves running, playing soccer, fast cycling, etc.
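The intensity bands described above can be written as a small lookup. This is a minimal sketch in base R, not Fitbit's actual algorithm: the function name `classify_met` is hypothetical, and the 1.5 MET cutoff between sedentary and light activity is an assumption, since the description only gives approximate values for those two bands.

```r
# Hypothetical helper: map MET values to the intensity bands described in the text.
# Assumption: 1.5 MET separates sedentary from light activity (not stated in the source).
classify_met <- function(met) {
  bands <- cut(
    met,
    breaks = c(-Inf, 1.5, 3, 6, Inf),
    labels = c("sedentary", "light", "fair", "high"),
    right = FALSE  # intervals are [lower, upper), so MET = 6 falls in "high"
  )
  as.character(bands)
}

# Example call on a few illustrative MET values
classify_met(c(1, 2, 4.5, 6))
```

Using left-closed intervals keeps the boundary behavior in line with the "MET>=6" wording for high-intensity exercise.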
The number of study participants is quite low and might not represent the whole population accurately. We do not possess any demographic information on Fitbit users to understand if a sampling bias could be present. The study was conducted over six years ago and lasted for only one month. Taking into account the dataset’s limitations, we conduct this case study to practice data analysis skills.
We perform our analysis in R because we can manipulate, process, and visualize data all in one place, RStudio. The R language is beginner-friendly, and R packages make data analysis quick and efficient.
We first install and load packages needed for the analysis:
install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
install.packages("skimr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(skimr)
install.packages("janitor")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
install.packages("lubridate")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(lubridate)
## Loading required package: timechange
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
install.packages("here")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(here)
## here() starts at /cloud/project
install.packages("ggpubr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(ggpubr)
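Calling install.packages() on every run reinstalls packages that are already present. One possible alternative, sketched below, is to install only what is missing. The helper name `find_missing_packages` is hypothetical, and the install and load lines are left commented out so the sketch has no side effects.

```r
# Hypothetical helper: return the packages from `required` that are not installed.
find_missing_packages <- function(required,
                                  installed = rownames(installed.packages())) {
  setdiff(required, installed)
}

# The six packages used in this case study
required_packages <- c("tidyverse", "skimr", "janitor",
                       "lubridate", "here", "ggpubr")

missing_packages <- find_missing_packages(required_packages)

# Uncomment to install only the missing packages and then load everything:
# if (length(missing_packages) > 0) install.packages(missing_packages)
# invisible(lapply(required_packages, library, character.only = TRUE))
```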
We upload and import the four datasets we use for this case study.
daily_activity <- read.csv("dailyActivity_merged.csv")
daily_sleep <- read.csv("sleepDay_merged.csv")
hourly_steps <- read.csv("hourlySteps_merged.csv")
hourly_intensities <- read.csv("hourlyIntensities_merged.csv")
We preview the datasets to ensure all data was imported correctly.
head(daily_activity)
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 4/12/2016 13162 8.50 8.50
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 1.88 0.55
## 2 0 1.57 0.69
## 3 0 2.44 0.40
## 4 0 2.14 1.26
## 5 0 2.71 0.41
## 6 0 3.19 0.78
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 6.06 0 25
## 2 4.71 0 21
## 3 3.91 0 30
## 4 2.83 0 29
## 5 5.04 0 36
## 6 2.51 0 38
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 13 328 728 1985
## 2 19 217 776 1797
## 3 11 181 1218 1776
## 4 34 209 726 1745
## 5 10 221 773 1863
## 6 20 164 539 1728
head(daily_sleep)
## Id SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM 1 327
## 2 1503960366 4/13/2016 12:00:00 AM 2 384
## 3 1503960366 4/15/2016 12:00:00 AM 1 412
## 4 1503960366 4/16/2016 12:00:00 AM 2 340
## 5 1503960366 4/17/2016 12:00:00 AM 1 700
## 6 1503960366 4/19/2016 12:00:00 AM 1 304
## TotalTimeInBed
## 1 346
## 2 407
## 3 442
## 4 367
## 5 712
## 6 320
head(hourly_steps)
## Id ActivityHour StepTotal
## 1 1503960366 4/12/2016 12:00:00 AM 373
## 2 1503960366 4/12/2016 1:00:00 AM 160
## 3 1503960366 4/12/2016 2:00:00 AM 151
## 4 1503960366 4/12/2016 3:00:00 AM 0
## 5 1503960366 4/12/2016 4:00:00 AM 0
## 6 1503960366 4/12/2016 5:00:00 AM 0
head(hourly_intensities)
## Id ActivityHour TotalIntensity AverageIntensity
## 1 1503960366 4/12/2016 12:00:00 AM 20 0.333333
## 2 1503960366 4/12/2016 1:00:00 AM 8 0.133333
## 3 1503960366 4/12/2016 2:00:00 AM 7 0.116667
## 4 1503960366 4/12/2016 3:00:00 AM 0 0.000000
## 5 1503960366 4/12/2016 4:00:00 AM 0 0.000000
## 6 1503960366 4/12/2016 5:00:00 AM 0 0.000000
We use the glimpse function to further check every column and corresponding data type.
glimpse(daily_activity)
## Rows: 940
## Columns: 15
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ TotalSteps <int> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes <int> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
glimpse(daily_sleep)
## Rows: 413
## Columns: 5
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150…
## $ SleepDay <chr> "4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM", "…
## $ TotalSleepRecords <int> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ TotalMinutesAsleep <int> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2…
## $ TotalTimeInBed <int> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3…
glimpse(hourly_steps)
## Rows: 22,099
## Columns: 3
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityHour <chr> "4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM", "4/12/20…
## $ StepTotal <int> 373, 160, 151, 0, 0, 0, 0, 0, 250, 1864, 676, 360, 253, 2…
glimpse(hourly_intensities)
## Rows: 22,099
## Columns: 4
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 15039…
## $ ActivityHour <chr> "4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM", "4/1…
## $ TotalIntensity <int> 20, 8, 7, 0, 0, 0, 0, 0, 13, 30, 29, 12, 11, 6, 36, 5…
## $ AverageIntensity <dbl> 0.333333, 0.133333, 0.116667, 0.000000, 0.000000, 0.0…
We start cleaning and formatting our dataframes. First, we check for duplicates and delete them if present.
sum(duplicated(daily_activity))
## [1] 0
sum(duplicated(daily_sleep))
## [1] 3
sum(duplicated(hourly_intensities))
## [1] 0
sum(duplicated(hourly_steps))
## [1] 0
The daily_sleep dataframe has three duplicate rows, which we remove as the first step of our cleaning process.
daily_sleep <- daily_sleep %>%
distinct()
We double-check that the duplicates have been removed.
sum(duplicated(daily_sleep))
## [1] 0
The duplicates have been removed successfully. Next, we drop any rows with missing values from all four datasets.
daily_activity <- daily_activity %>%
drop_na()
daily_sleep <- daily_sleep %>%
drop_na()
hourly_intensities <- hourly_intensities %>%
drop_na()
hourly_steps <- hourly_steps %>%
drop_na()
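Before removing rows, it can help to see which rows are duplicated and how many values are missing in each column. The sketch below uses a small made-up data frame (the values are illustrative, not taken from the Fitbit files) and base R equivalents of distinct() and drop_na().

```r
# Toy data: rows 1 and 2 are exact duplicates, row 3 has a missing value
toy_sleep <- data.frame(
  Id = c(1, 1, 1, 2),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/12/2016 12:00:00 AM",
               "4/13/2016 12:00:00 AM", "4/12/2016 12:00:00 AM"),
  TotalMinutesAsleep = c(327, 327, NA, 400)
)

# Show every copy of a duplicated row, not only the later occurrence
dup_rows <- toy_sleep[duplicated(toy_sleep) |
                        duplicated(toy_sleep, fromLast = TRUE), ]

# Count missing values per column before dropping anything
na_per_column <- colSums(is.na(toy_sleep))

# Base R equivalents of distinct() followed by drop_na()
cleaned <- na.omit(unique(toy_sleep))

nrow(dup_rows)
na_per_column
nrow(cleaned)
```

Inspecting the counts first makes it clear how many observations the cleaning step discards.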
We initially confirmed the number of users in each data frame via Excel pivot tables.
We now verify the number of unique IDs in R to ensure it matches the number of study participants in each table.
n_distinct(daily_activity$Id)
## [1] 33
n_distinct(daily_sleep$Id)
## [1] 24
n_distinct(hourly_intensities$Id)
## [1] 33
n_distinct(hourly_steps$Id)
## [1] 33
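Since the sleep table covers fewer users than the activity tables, it can be useful to list which IDs are missing from it. A minimal sketch on made-up IDs (the data frames and values here are illustrative assumptions):

```r
# Toy tables with hypothetical user IDs
toy_activity <- data.frame(Id = c(101, 101, 102, 103))
toy_sleep    <- data.frame(Id = c(101, 103, 103))

# Number of distinct users per table, computed in one pass
users_per_table <- sapply(
  list(activity = toy_activity, sleep = toy_sleep),
  function(df) length(unique(df$Id))
)

# Users who logged activity but never logged sleep
ids_without_sleep <- setdiff(unique(toy_activity$Id), unique(toy_sleep$Id))

users_per_table
ids_without_sleep
```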
While examining the dataframes earlier, we noticed that the date columns are stored as character strings in all four dataframes. We convert the daily records to date format and the hourly records to date-time format.
daily_activity <- daily_activity %>%
rename(Date = ActivityDate) %>%
mutate(Date = mdy(Date))
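daily_sleep <- daily_sleep %>%
rename(Date = SleepDay) %>%
# SleepDay carries a midnight time stamp; keep only the date so the type matches daily_activity$Date
mutate(Date = as_date(mdy_hms(Date)))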
hourly_intensities <- hourly_intensities %>%
rename(DateTime = ActivityHour) %>%
mutate(DateTime = as.POSIXct(DateTime, format="%m/%d/%Y %I:%M:%S %p"))
hourly_steps <- hourly_steps %>%
rename(DateTime = ActivityHour) %>%
mutate(DateTime = as.POSIXct(DateTime, format="%m/%d/%Y %I:%M:%S %p"))
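The format strings used in these conversions can be checked on a couple of sample values. The sketch below uses base R only and sets tz = "UTC" explicitly; that time zone is an assumption made here for reproducibility, because as.POSIXct() otherwise falls back to the session's local time zone. It also assumes a locale in which %p recognizes "AM" and "PM".

```r
# Sample strings in the same layout as the Fitbit exports
raw_day  <- c("4/12/2016", "5/12/2016")
raw_hour <- c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 PM")

# Day records: month/day/four-digit year
parsed_day <- as.Date(raw_day, format = "%m/%d/%Y")

# Hour records: 12-hour clock (%I) with an AM/PM marker (%p)
parsed_hour <- as.POSIXct(raw_hour,
                          format = "%m/%d/%Y %I:%M:%S %p",
                          tz = "UTC")

# A failed parse yields NA, so checking for NA is a quick sanity test
anyNA(parsed_day)
anyNA(parsed_hour)

format(parsed_hour, "%Y-%m-%d %H:%M")
```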
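We merge daily_sleep with daily_activity, and hourly_steps with hourly_intensities, so that we can look for correlations between variables. We join the tables on Id and Date (daily tables) or on Id and DateTime (hourly tables). Because merge() performs an inner join by default, the merged daily table keeps only the days for which a user has both an activity record and a sleep record.
daily_activity_sleep <- merge(daily_activity, daily_sleep, by = c("Id", "Date"))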
head(daily_activity_sleep)
## Id Date TotalSteps TotalDistance TrackerDistance
## 1 1503960366 2016-04-12 13162 8.50 8.50
## 2 1503960366 2016-04-13 10735 6.97 6.97
## 3 1503960366 2016-04-15 9762 6.28 6.28
## 4 1503960366 2016-04-16 12669 8.16 8.16
## 5 1503960366 2016-04-17 9705 6.48 6.48
## 6 1503960366 2016-04-19 15506 9.88 9.88
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 1.88 0.55
## 2 0 1.57 0.69
## 3 0 2.14 1.26
## 4 0 2.71 0.41
## 5 0 3.19 0.78
## 6 0 3.53 1.32
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 6.06 0 25
## 2 4.71 0 21
## 3 2.83 0 29
## 4 5.04 0 36
## 5 2.51 0 38
## 6 5.03 0 50
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 13 328 728 1985
## 2 19 217 776 1797
## 3 34 209 726 1745
## 4 10 221 773 1863
## 5 20 164 539 1728
## 6 31 264 775 2035
## TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1 1 327 346
## 2 2 384 407
## 3 1 412 442
## 4 2 340 367
## 5 1 700 712
## 6 1 304 320
hourly_intensities_steps <- merge(hourly_intensities, hourly_steps, by = c("Id", "DateTime"))
head(hourly_intensities_steps)
## Id DateTime TotalIntensity AverageIntensity StepTotal
## 1 1503960366 2016-04-12 00:00:00 20 0.333333 373
## 2 1503960366 2016-04-12 01:00:00 8 0.133333 160
## 3 1503960366 2016-04-12 02:00:00 7 0.116667 151
## 4 1503960366 2016-04-12 03:00:00 0 0.000000 0
## 5 1503960366 2016-04-12 04:00:00 0 0.000000 0
## 6 1503960366 2016-04-12 05:00:00 0 0.000000 0
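To see how many rows a join keeps or discards, the default inner join can be compared with a left join. This is a sketch on made-up rows (the toy values are illustrative, not from the dataset); `all.x = TRUE` is the base R merge() argument that keeps unmatched rows from the first table.

```r
# Toy daily tables: user 1 has no sleep record for 2016-04-13
toy_activity <- data.frame(
  Id = c(1, 1, 2),
  Date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  TotalSteps = c(13162, 10735, 5000)
)
toy_sleep <- data.frame(
  Id = c(1, 2),
  Date = as.Date(c("2016-04-12", "2016-04-12")),
  TotalMinutesAsleep = c(327, 400)
)

# Default merge(): inner join, only rows present in both tables
inner <- merge(toy_activity, toy_sleep, by = c("Id", "Date"))

# Left join: keep every activity row, fill missing sleep values with NA
left <- merge(toy_activity, toy_sleep, by = c("Id", "Date"), all.x = TRUE)

# Number of activity days that the inner join silently drops
dropped_days <- nrow(left) - nrow(inner)
dropped_days
```

An inner join suits a correlation between activity and sleep, since both values must be present, but the dropped rows are worth counting before drawing conclusions.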
Lack of steps/activity on Fridays and Sundays:
Fitbit users do not walk the recommended 8,000 steps on Fridays and Sundays. Bellabeat can implement notifications that encourage women to hit their activity goal every day. Awards (7-day step goal streak, 14-day step goal streak, etc.) and competitions with other users or friends could motivate them to stay consistent. A higher number of steps correlates with a higher number of calories burned, so displaying calorie expenditure could be motivational to some.
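A weekday comparison of this kind can be computed by averaging steps per day of the week and flagging days below the 8,000-step target. The sketch below runs on made-up step counts (illustrative values only) and uses the ISO weekday number from format(..., "%u"), where 1 is Monday and 7 is Sunday, to avoid locale-dependent day names.

```r
# Toy daily step counts: two Fridays and two Saturdays
toy_activity <- data.frame(
  Date = as.Date(c("2016-04-15", "2016-04-16", "2016-04-22", "2016-04-23")),
  TotalSteps = c(6000, 9000, 7000, 11000)
)

# ISO weekday number: 1 = Monday ... 7 = Sunday
toy_activity$WeekdayNum <- as.integer(format(toy_activity$Date, "%u"))

# Average steps per weekday
avg_steps <- aggregate(TotalSteps ~ WeekdayNum, data = toy_activity, FUN = mean)

# Flag weekdays whose average falls below the 8,000-step target
avg_steps$BelowTarget <- avg_steps$TotalSteps < 8000
avg_steps
```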
Users are the most active between 5-7pm and during lunch breaks:
Fitbit users are the most active between 5-7pm and 12-2pm. Bellabeat can send notifications to remind women to exercise or take a walk after work and during lunch breaks. A reward system could be set up to encourage more movement.
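Peak hours like these can be found by averaging intensity per hour of the day. A minimal sketch on made-up hourly records (the values are illustrative, and tz = "UTC" is an assumption so that the extracted hour does not depend on the session's time zone):

```r
# Toy hourly intensity records across two days
toy_hourly <- data.frame(
  DateTime = as.POSIXct(
    c("2016-04-12 12:00:00", "2016-04-12 18:00:00",
      "2016-04-13 03:00:00", "2016-04-13 12:00:00", "2016-04-13 18:00:00"),
    tz = "UTC"
  ),
  TotalIntensity = c(20, 30, 0, 10, 40)
)

# Extract the hour of day (0-23) from each time stamp
toy_hourly$Hour <- as.integer(format(toy_hourly$DateTime, "%H", tz = "UTC"))

# Average intensity for each hour of the day
avg_by_hour <- aggregate(TotalIntensity ~ Hour, data = toy_hourly, FUN = mean)

# Hour with the highest average intensity
peak_hour <- avg_by_hour$Hour[which.max(avg_by_hour$TotalIntensity)]
peak_hour
```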
Not sleeping the recommended 8 hours:
Fitbit users are not sleeping the recommended 8 hours. To promote overall wellness and optimize physical performance among its customers, Bellabeat should remind them not to sacrifice sleep, as it is essential to good health and longevity. Suggesting breathing techniques before bedtime, reminding users to reduce device use in the late evening, and motivating them to keep a consistent sleep schedule are some ways to promote longer sleep.
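One way to quantify a finding like this is to convert minutes asleep to hours and compute the share of nights below the 8-hour target. The sketch below uses made-up sleep records, so the resulting figures are illustrative only.

```r
# Toy sleep records (minutes asleep per night)
toy_sleep <- data.frame(TotalMinutesAsleep = c(327, 384, 412, 500))

# Convert minutes to hours
toy_sleep$HoursAsleep <- toy_sleep$TotalMinutesAsleep / 60

# Average sleep duration and share of nights shorter than 8 hours
mean_hours <- mean(toy_sleep$HoursAsleep)
share_below_8h <- mean(toy_sleep$HoursAsleep < 8)

mean_hours
share_below_8h
```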
Sedentary behavior & duration of sleep:
Higher sedentary time correlates with lower duration of sleep. Bellabeat should remind its users of the health risks of sedentary behavior and the significance of proper sleep hygiene. Notifications to stand up and walk around if a user does not move for an hour, articles on the importance of sleep, and an option to set a desired bedtime with automatic reminders are all ways to reduce sedentary time and increase sleep duration.
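A relationship of this kind is typically measured with a correlation coefficient. The sketch below computes Pearson's r with base R's cor() on made-up values chosen to show a negative relationship; it illustrates the calculation only and says nothing about the strength of the effect in the real data. A correlation also does not establish that sedentary time causes shorter sleep.

```r
# Toy merged daily records: sedentary minutes and minutes asleep
toy_merged <- data.frame(
  SedentaryMinutes   = c(600, 700, 800, 900, 1000),
  TotalMinutesAsleep = c(480, 450, 400, 380, 330)
)

# Pearson correlation coefficient; values near -1 mean a strong negative relationship
r <- cor(toy_merged$SedentaryMinutes, toy_merged$TotalMinutesAsleep)
r
```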
Fairly active as base consumers:
The majority of Fitbit users are fairly active. Bellabeat needs to educate its customers about the health benefits of living an active (or at least fairly active) lifestyle. Fitness and wellness articles and research studies with straight-to-the-point summaries should be published on a daily or weekly basis to keep users informed and motivated.
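Grouping users into activity levels requires cutoffs, which this section does not state. The sketch below assumes commonly used step-count bands (under 5,000 sedentary, 5,000 to 7,499 lightly active, 7,500 to 9,999 fairly active, 10,000 and above very active); these thresholds and the toy step counts are assumptions for illustration.

```r
# Toy daily step counts for three hypothetical users
toy_activity <- data.frame(
  Id = c(1, 1, 2, 2, 3, 3),
  TotalSteps = c(3000, 4000, 8000, 9000, 12000, 11000)
)

# Average daily steps per user
mean_steps <- aggregate(TotalSteps ~ Id, data = toy_activity, FUN = mean)

# Assumed cutoffs (not taken from the source): 5000, 7500, 10000 steps
mean_steps$UserType <- cut(
  mean_steps$TotalSteps,
  breaks = c(-Inf, 5000, 7500, 10000, Inf),
  labels = c("sedentary", "lightly active", "fairly active", "very active"),
  right = FALSE  # intervals are [lower, upper)
)

# Share of users in each activity level
prop.table(table(mean_steps$UserType))
```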
Frequent educational articles on the importance of exercising, walking and sleeping;
Reward system to keep users motivated: awards for 7-day step goal/sleep streak, longest walk/run/workout, etc. Opportunity to compete with other users and friends;
Notifications to stand up and move around at the end of each sedentary hour;
Reminders to start getting ready for sleep with low device use and breathing exercises;
Notifications to reach step/workout goals throughout the day that display how close users are to their daily target.