Introduction

This is a case study about Bellabeat. Bellabeat is a high-tech manufacturer of health-focused product for women. The products are Bellabeat app, Leaf, Time and Spring. Bellabeat also offers a subscription-based membership program for the users. The marketing strategies of Bellabeat are investing year-round in Google Search, maintaining active Facebook and Instagram pages, and consistently engaging consumers on Twitter. Bellabeat also runs video ads on Youtube and ads on Google Display Network.


ASK Phase

Business Task

To identify trends in smart device usage and determine on ways to apply insights gained to improve Bellabeat products to better serve the customers.

Key Stakeholders

  • Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer.
  • Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team.

PREPARE Phase

Dataset Used

The dataset that will be used in this case study is FitBit Fitness Tracker Data made by Möbius which is avaliable on Kaggle. This dataset is under CC0: Public Domain license.

This dataset was generated by respondents to a distributed survey via Amazon Mechanical Turk between 03/12/2016 to 05/12/2016. This dataset contains personal tracker data from 30 eligible FitBit users.

There are 18 csv files in the original dataset. Before moving further, it is better to check whether the dataset ROCCC or not.

  • Reliable: The dataset was collected from 30 FitBit users which is considered a small dataset, so might not be super reliable.
  • Original: It is a third party data collected from a survey via Amazon Mechanical Turk.
  • Comprehensive: The dataset is comprehensive in terms of details which includes steps, calories, activity levels and sleep data of the users. However, the collected data spans across two months only which might be slightly insufficient.
  • Current: The dataset contains data between 03/12/2016 to 05/12/2016 which is around 7 years old.
  • Cited: The source is well cited.

Data Selection

Since there are 18 csv files in the dataset, not all csv files are chosen for this case study. Only the csv files listed below are selected:

  • dailyActivity_merged.csv
  • dailyCalories_merged.csv
  • dailyIntensities_merged.csv
  • dailySteps_merged.csv
  • sleepDay_merged.csv
  • weightLogInfo_merged.csv
  • hourlyCalories_merged.csv
  • hourlySteps_merged.csv

PROCESS Phase

The pre-processing of the csv files is performed using Google Sheets and Google BigQuery before importing into RStudio.

First, data cleaning procedures which includes trimming white spaces (if any) and removing duplicates (if any) are performed for all csv files. There are no white spaces in all the csv files. However, we have found and removed 3 duplicated rows from sleepDay_merged.csv.

Upon further investigations using Google Sheets, it was confirmed that all columns in dailyCalories_merged.csv, dailySteps_merged.csv and dailyIntensities_merged.csv are included in dailyActivity_merged.csv. Therefore, those 3 csv files can be ignored. The cleaned dailyActivity_merged.csv is then saved as daily_activity_cleaned.csv.

Next, hourlyCalories_merged.csv and hourlySteps_merged.csv are joined using Google BigQuery. The query used to perform this join is:

SELECT
  calories.Id AS Id,
  calories.Date AS Date,
  calories.Time AS Time,
  calories.Calories AS Calories,
  steps.StepTotal AS Total_Steps
FROM
  my-data-project-1-401106.Bellabeat.hr_calories AS calories
INNER JOIN
  my-data-project-1-401106.Bellabeat.hr_steps AS steps
ON 
  calories.Id = steps.Id AND calories.Date = steps.Date AND calories.Time = steps.Time
ORDER BY
  Id, Date, Time

The results of the query is exported and saved as a csv file named hour_calories_steps.csv .


ANALYZE Phase

Loading Needed R Packages

First, we load the R packages needed for our analysis.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load csv Files

Next, we will load the 4 csv files using the read_csv() function.

activity <- read_csv("daily_activity_cleaned.csv")
sleep <- read_csv("sleepDay_merged.csv")
weight <- read_csv("weightLogInfo_merged.csv")
calo_step <- read_csv("hour_calories_steps.csv")

Exploring the data

With the csv files loaded, we can start to explore the datasets. The first details that will be checked is the number of distinct Id in each of the datasets.

n_distinct(activity$Id)
## [1] 33
n_distinct(sleep$Id)
## [1] 24
n_distinct(weight$Id)
## [1] 8
n_distinct(calo_step$Id)
## [1] 33

The output shows the number of participants present in each of the dataset. Notice that all 33 participants involved in contributing daily activities, calories and steps data. Only 24 participants contribute sleep data. Since there are just 8 participants contribute to weight data, it is reasonable to not include it in the analysis as the sample size is insignificant.

First, we will focus on activity dataset.

head(as.data.frame(activity))
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366    4/12/2016      13162          8.50            8.50
## 2 1503960366    4/13/2016      10735          6.97            6.97
## 3 1503960366    4/14/2016      10460          6.74            6.74
## 4 1503960366    4/15/2016       9762          6.28            6.28
## 5 1503960366    4/16/2016      12669          8.16            8.16
## 6 1503960366    4/17/2016       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728

Note that the columns LoggedActivitiesDistance and SedentaryActiveDistance will be removed since it does not provide information that will impact the analysis.

activity <- activity[-c(6,10)]

Also, the column ActivityDate is renamed as Date. Then, view the data frame again to check the changes made.

activity <- activity %>% rename(Date=ActivityDate)
head(as.data.frame(activity))
##           Id      Date TotalSteps TotalDistance TrackerDistance
## 1 1503960366 4/12/2016      13162          8.50            8.50
## 2 1503960366 4/13/2016      10735          6.97            6.97
## 3 1503960366 4/14/2016      10460          6.74            6.74
## 4 1503960366 4/15/2016       9762          6.28            6.28
## 5 1503960366 4/16/2016      12669          8.16            8.16
## 6 1503960366 4/17/2016       9705          6.48            6.48
##   VeryActiveDistance ModeratelyActiveDistance LightActiveDistance
## 1               1.88                     0.55                6.06
## 2               1.57                     0.69                4.71
## 3               2.44                     0.40                3.91
## 4               2.14                     1.26                2.83
## 5               2.71                     0.41                5.04
## 6               3.19                     0.78                2.51
##   VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
## 1                25                  13                  328              728
## 2                21                  19                  217              776
## 3                30                  11                  181             1218
## 4                29                  34                  209              726
## 5                36                  10                  221              773
## 6                38                  20                  164              539
##   Calories
## 1     1985
## 2     1797
## 3     1776
## 4     1745
## 5     1863
## 6     1728

After that, we will explore some descriptive statistics for activity dataset.

activity %>% 
  select(TotalSteps, TotalDistance, Calories) %>% 
  summary()
##    TotalSteps    TotalDistance       Calories   
##  Min.   :    0   Min.   : 0.000   Min.   :   0  
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.:1828  
##  Median : 7406   Median : 5.245   Median :2134  
##  Mean   : 7638   Mean   : 5.490   Mean   :2304  
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:2793  
##  Max.   :36019   Max.   :28.030   Max.   :4900
activity %>% 
  select(SedentaryMinutes, LightlyActiveMinutes, FairlyActiveMinutes, VeryActiveMinutes) %>% 
  summary()
##  SedentaryMinutes LightlyActiveMinutes FairlyActiveMinutes VeryActiveMinutes
##  Min.   :   0.0   Min.   :  0.0        Min.   :  0.00      Min.   :  0.00   
##  1st Qu.: 729.8   1st Qu.:127.0        1st Qu.:  0.00      1st Qu.:  0.00   
##  Median :1057.5   Median :199.0        Median :  6.00      Median :  4.00   
##  Mean   : 991.2   Mean   :192.8        Mean   : 13.56      Mean   : 21.16   
##  3rd Qu.:1229.5   3rd Qu.:264.0        3rd Qu.: 19.00      3rd Qu.: 32.00   
##  Max.   :1440.0   Max.   :518.0        Max.   :143.00      Max.   :210.00

Some discoveries from the outputs above:

  • The average total steps per day for all the participants are 7638 steps, which is slightly less than the recommended 10000 steps by Centers for Disease Control and Prevention (CDC). Since taking less than 5000 steps per day is considered sedentary, the participants in the dataset can be considered slightly active than most of the population.
  • The mean calories burned throughout the day is 2304.
  • The unit for distance is not mentioned in the dataset. Thus, we can only state that the average distance per day is 5.49 unit distance.
  • Notice that the mean sedentary minutes is around 991 minutes or approximately 16 hours per day, which is astonishing.
  • On average, there is only less than half an hour of active minutes per day (including fairly active and very active minutes).

Next, we focus on calo_step dataset, which is an hourly dataset.

head(as.data.frame(calo_step))
##           Id       Date     Time Calories Total_Steps
## 1 1503960366 2016-04-12 00:00:00       81         373
## 2 1503960366 2016-04-12 01:00:00       61         160
## 3 1503960366 2016-04-12 02:00:00       59         151
## 4 1503960366 2016-04-12 03:00:00       47           0
## 5 1503960366 2016-04-12 04:00:00       48           0
## 6 1503960366 2016-04-12 05:00:00       48           0

Similarly, descriptive statistics for the dataset calo_step is presented below.

calo_step %>% 
  select(Calories, Total_Steps) %>% 
  summary()
##     Calories       Total_Steps     
##  Min.   : 42.00   Min.   :    0.0  
##  1st Qu.: 63.00   1st Qu.:    0.0  
##  Median : 83.00   Median :   40.0  
##  Mean   : 97.39   Mean   :  320.2  
##  3rd Qu.:108.00   3rd Qu.:  357.0  
##  Max.   :948.00   Max.   :10554.0

Some discoveries from the output above:

  • The average calories burned per hour is 97.39.
  • The average total steps per hour is 320.2.

Moving on, we focus on sleep dataset.

head(as.data.frame(sleep))
##           Id  SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1 1503960366 4/12/2016                 1                327            346
## 2 1503960366 4/13/2016                 2                384            407
## 3 1503960366 4/15/2016                 1                412            442
## 4 1503960366 4/16/2016                 2                340            367
## 5 1503960366 4/17/2016                 1                700            712
## 6 1503960366 4/19/2016                 1                304            320

Since, we have two variables, which are TotalMinutesAsleep and TotalTimeInBed, we can calculate the percentage of time where the participants are actually asleep while laying in bed.

sleep <- sleep %>% mutate(PercentSleep = (TotalMinutesAsleep/TotalTimeInBed)*100)
head(sleep)
## # A tibble: 6 × 6
##           Id SleepDay  TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
##        <dbl> <chr>                 <dbl>              <dbl>          <dbl>
## 1 1503960366 4/12/2016                 1                327            346
## 2 1503960366 4/13/2016                 2                384            407
## 3 1503960366 4/15/2016                 1                412            442
## 4 1503960366 4/16/2016                 2                340            367
## 5 1503960366 4/17/2016                 1                700            712
## 6 1503960366 4/19/2016                 1                304            320
## # ℹ 1 more variable: PercentSleep <dbl>

Similarly, we will explore some descriptive statistics for the dataset.

sleep %>% 
  select(TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed, PercentSleep) %>% 
  summary()
##  TotalSleepRecords TotalMinutesAsleep TotalTimeInBed   PercentSleep   
##  Min.   :1.00      Min.   : 58.0      Min.   : 61.0   Min.   : 49.84  
##  1st Qu.:1.00      1st Qu.:361.0      1st Qu.:403.8   1st Qu.: 91.18  
##  Median :1.00      Median :432.5      Median :463.0   Median : 94.26  
##  Mean   :1.12      Mean   :419.2      Mean   :458.5   Mean   : 91.65  
##  3rd Qu.:1.00      3rd Qu.:490.0      3rd Qu.:526.0   3rd Qu.: 96.06  
##  Max.   :3.00      Max.   :796.0      Max.   :961.0   Max.   :100.00

Some discoveries from the output above:

  • The average percentage of time where the participants are actually asleep while laying in bed is 91.65%, with a mean total minutes asleep of around 419 minutes (approximately 7 hours) per night. The amount of sleep for the participants getting per night is considered healthy as it is recommended for adults to sleep 7 or more hours per night.
  • Most participants in the dataset sleep 1 time per day.

SHARE Phase

Visualization of Dataset

Moving on, we will try to find relationships or trends in the dataset. First, we visualize the variables in activity dataset. The first scatter plot is to illustrate the relationship between calories burned per day and total steps per day.

ggplot(data = activity, aes(x = TotalSteps, y = Calories)) + 
  geom_point() + geom_smooth() +  
  labs(title = "Calories Burned per Day Vs. Total Steps per Day", x = "Total Steps per Day",
       y = "Calories Burned per Day")

From the scatter plot above, we can notice that there is a positive correlation between calories burned per day and total steps per day. This comes with no surprise because the more we walk, the more calories burned.

Next, another scatter plot is plotted to show the relationship between very active minutes per day and total steps per day.

ggplot(data = activity, aes(x = TotalSteps, y = VeryActiveMinutes)) + 
  geom_point() + geom_smooth() +  
  labs(title = "Very Active Minutes per Day Vs. Total Steps per Day", x = "Total Steps per Day",
       y = "Very Active Minutes per Day")

From the scatter plot above, we shows that there is a positive correlation between the two variables. We can make a guess where those days with a higher number of total steps involve very active periods in the day where the participants engage in jogging or running session.

Moving on, the activity and sleep dataset will be joined together using inner_join() function on columns Id and Date. This joined dataset will be used to discover relationship between daily activities and sleep.

join_acti_slp <- inner_join(x = activity, y = sleep, by = c("Id"="Id","Date"="SleepDay"))

The scatter plot below shows the relationship between sedentary minutes per day and total minutes asleep at night.

ggplot(data = join_acti_slp, aes(x = TotalMinutesAsleep, y = SedentaryMinutes)) + 
  geom_point() + geom_smooth() +  
  labs(title = "Sedentary Minutes per Day Vs. Total Minutes Asleep at Night",
       x = "Total Minutes Asleep at Night", y = "Sedentary Minutes per Day")

From the scatter plot above, notice that there is a negative correlation between the two variables. The participants will have less amount of sleep at night if they have a higher amount of sedentary minutes in that day.

Next, we will perform aggregation on calo_step dataset for visualization. The dataset is grouped by time and mean calories burned per hour and mean total steps per hour are calculated for every hour of the day.

calo_step_agg <- calo_step %>% group_by(Time) %>% 
  summarize(mean_hr_calories=mean(Calories), mean_hr_steps=mean(Total_Steps))

Two column charts are plotted using this dataset as shown below.

ggplot(data = calo_step_agg) +
  geom_col(mapping = aes(x = Time, y = mean_hr_calories)) +
  labs(title = "Average Calories Burned throughout the Day",
       y = "Calories Burned")

ggplot(data = calo_step_agg) +
  geom_col(mapping = aes(x = Time, y = mean_hr_steps)) +
  labs(title = "Average Number of Steps Walked throughout the Day",
       y = "Number of Steps Walked")

From the first column chart above, notice that most calories are burned in the evening around 5 pm to 7 pm. This corresponds well to the second column chart which indicates that there are higher number of steps walked in the same period.

Insights Gained from Analysis and Visualization

  1. There is a positive correlation between calories burned per day and total steps per day.
  2. There is a positive correlation between very active minutes per day and total steps per day.
  3. There is a negative correlation between sedentary minutes per day and total minutes asleep at night.
  4. Most calories are burned in the evening around 5 pm to 7 pm.
  5. There are higher number of steps walked around 5 pm to 7 pm.
  6. The average total steps per day for all the participants are slightly less than the recommended amount by CDC.

ACT phase

With the insights gained, we hope to solve the stated business task which is to determine on ways to apply insights gained to improve Bellabeat products to better serve the customers. The aim of Bellabeat is to empower women with knowledge about their own health and habits. Therefore, the target audience of the company is women. Notice that most activities happen around 5pm to 7pm, this might indicates that the most of the participants are in the workforce and working until 5pm. Thus, the main target audience of the company can be women working a full-time job in office. The advertisement about helping women in the workforce to achieve a healthier lifestyle through using Bellabeat’s products can be created and advertised on different social media platforms.

Recommendations

  • Since most physical activities happen in the evening, Bellabeat app can use notification to notify users to engage in light activities to increase their activity levels during break times (lunch break, after dinner, or even in the early morning). It is ideal to pair the notification with some empowering and motivating quotes.
  • Include an option to display a widget on home screen of the phone which display a progress bar of reaching daily recommended amount of 10000 steps.
  • Leaf and Time can be used to collect user’s sleep data. The sleep data collected can be used by the app to determine normal sleep cycle of the user. Then, Bellabeat app can notify the user to go to bed at the user’s normal sleeping hour to create a healthy sleeping habit for the user.
  • Bellabeat membership can offer its users with different fitness goals with personalized workout routines and videos to follow along. Also, Bellabeat can provide free trial of the membership. Since there is a negative correlation (not a causation) between sedentary minutes per day and total minutes asleep at night, having some workout which decrease the sedentary minutes per day might be able to help user to improve their sleep.

Thank you for going through my first case study on Bellabeat. I will appreciate any comments and recommendations for improvement.