Capstone: Bellabeat Case Study

“Leaf Urban”, one of Bellabeat’s stylish products

1. Introduction

The Google Data Analytics Certificate offered on Coursera.org is an 8-course curriculum that teaches entry-level data analysts a variety of data-related skills and techniques. Through videos led by Google employee instructors, quizzes on course content, and hands-on practice with tools like SQL, Excel, Tableau, and R, the curriculum teaches how to work with data at every step of its journey. In preparation for real entry-level jobs, the courses offer real-life data sets and concrete ways to practice data skills within a methodical framework called the “Six Steps of Data Analysis”: Ask, Prepare, Process, Analyze, Share, and Act. This case study uses the same framework, excluding the “Act” phase, to investigate the smart device company Bellabeat.

The background information, scenario, and requirements for this company were provided by the Google Data Analytics Certificate, specifically its eighth and final module, “Data Analytics Capstone Project: Complete a Case Study”. The goal of this case study is to demonstrate the knowledge, process, problem-solving, and skills learned in the previous modules of the curriculum and apply them to a real-world use case.

2. Company background

Bellabeat, founded by Urška Sršen, is a high-tech company that manufactures health-focused smart products. Sršen used her background as an artist to develop beautifully designed technology that informs and inspires women around the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits.

By 2016, Bellabeat had opened offices around the world and launched multiple products. Bellabeat products became available through a growing number of online retailers in addition to their own e-commerce channel on their website. The company has invested in traditional advertising media, such as radio, out-of-home billboards, print, and television, but focuses on digital marketing extensively.

2.1 Business Task

Sršen knows that an analysis of Bellabeat’s available consumer data would reveal more opportunities for growth. She has asked the marketing analytics team to focus on a Bellabeat product and analyze smart device usage data in order to gain insight into how people are already using their smart devices. Then, using this information, she would like high-level recommendations for how these trends can inform Bellabeat marketing strategy.

This case study will focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The discovered insights will then help guide marketing strategy for the company. The analysis and high-level recommendations for Bellabeat’s marketing strategy will be presented to the Bellabeat executive team.

2.2 Stakeholders

Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team
Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy. You joined this team six months ago and have been busy learning about Bellabeat’s mission and business goals, as well as how you, as a junior data analyst, can help Bellabeat achieve them.

2.3 Products

Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.
Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.
Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.
Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.
Bellabeat membership: Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.

2.4 Source of data

The data used for this case study is Fitbit fitness tracker data sourced from Kaggle.com (CC0: Public Domain, dataset made available through Mobius).

The information, formatting, and guidelines of this case study come from Coursera’s Google Data Analytics Capstone module (the 8th and final module): Case Study 2: How Can a Wellness Technology Company Play it Smart?.

3. Ask

3.1 Guiding Questions

What are some trends in smart device usage?
How could these trends apply to Bellabeat customers?
How could these trends help influence Bellabeat marketing strategy?

Additional guiding questions:

Who are the major players in the smart device market?
Which Bellabeat product aligns with these trends the most?
What are some limitations and assumptions of the data and of our analysis?
What haven’t we considered?

3.2 Key Tasks

Identify consumer usage trends of non-Bellabeat devices using publicly available data.
Relate aspects of identified trends to Bellabeat consumer base.
Choose a Bellabeat product that will best fit these trends and boost Bellabeat’s growth.
Identify a marketing strategy based on these trends and chosen product.
Provide recommendations to Bellabeat’s stakeholders.
Define any limitations, assumptions, and missing information.

3.3 Deliverables

A clear summary of the business task.
A description of all data sources used.
Documentation of any cleaning or manipulation of data.
A summary of your analysis.
Supporting visualizations and key findings.
Your top high-level content recommendations based on your analysis.

3.4 Business Task

We want to find any relevant trends from the data source and create recommendations to key stakeholders about Bellabeat’s marketing strategy.

4. Prepare

4.1 Package loading

Packages used in this case study:

library(tidyverse) #collection of R packages which we will be using often

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.1.1     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(dplyr) #for data manipulation
library(ggplot2) #data visualization package
library(ggpubr) #extensive visualizations with ggplot2
library(sqldf) #for running SQL commands within R

## Loading required package: gsubfn

## Loading required package: proto

## Warning in fun(libname, pkgname): couldn't connect to display ":0"

## Loading required package: RSQLite

library(lubridate) #for working with dates in R

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(janitor) #for data examination and cleaning

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(skimr) #for summary statistics in R
library(tidyr) #for organizing tabular data
library(RColorBrewer) #for color palettes

4.2 Importing data sets with assignments

## Reads in dataset containing daily activities. 
activity <- read.csv("dailyActivity_merged.csv")
## Reads in dataset containing daily calorie expenditures.
calories <- read.csv("dailyCalories_merged.csv")
## Reads in dataset containing daily intensities.
intensities <- read.csv("dailyIntensities_merged.csv")
## Reads in dataset containing daily steps.
steps <- read.csv("dailySteps_merged.csv")
## Reads in dataset for sleep data.
sleep <- read.csv("sleepDay_merged.csv")
## Reads in data set for logged weight.
weight <- read.csv("weightLogInfo_merged.csv")
## Reads in data set for hourly steps.
hourly_steps <- read.csv("hourlySteps_merged.csv")

4.3 Data verification

Examining the first few rows of every data set:

head(activity)

##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366    4/12/2016      13162          8.50            8.50
## 2 1503960366    4/13/2016      10735          6.97            6.97
## 3 1503960366    4/14/2016      10460          6.74            6.74
## 4 1503960366    4/15/2016       9762          6.28            6.28
## 5 1503960366    4/16/2016      12669          8.16            8.16
## 6 1503960366    4/17/2016       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728

head(calories)

##           Id ActivityDay Calories
## 1 1503960366   4/12/2016     1985
## 2 1503960366   4/13/2016     1797
## 3 1503960366   4/14/2016     1776
## 4 1503960366   4/15/2016     1745
## 5 1503960366   4/16/2016     1863
## 6 1503960366   4/17/2016     1728

head(intensities)

##           Id ActivityDay SedentaryMinutes LightlyActiveMinutes
## 1 1503960366   4/12/2016              728                  328
## 2 1503960366   4/13/2016              776                  217
## 3 1503960366   4/14/2016             1218                  181
## 4 1503960366   4/15/2016              726                  209
## 5 1503960366   4/16/2016              773                  221
## 6 1503960366   4/17/2016              539                  164
##   FairlyActiveMinutes VeryActiveMinutes SedentaryActiveDistance
## 1                  13                25                       0
## 2                  19                21                       0
## 3                  11                30                       0
## 4                  34                29                       0
## 5                  10                36                       0
## 6                  20                38                       0
##   LightActiveDistance ModeratelyActiveDistance VeryActiveDistance
## 1                6.06                     0.55               1.88
## 2                4.71                     0.69               1.57
## 3                3.91                     0.40               2.44
## 4                2.83                     1.26               2.14
## 5                5.04                     0.41               2.71
## 6                2.51                     0.78               3.19

head(sleep)

##           Id              SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304
##   TotalTimeInBed
## 1            346
## 2            407
## 3            442
## 4            367
## 5            712
## 6            320

head(weight)

##           Id                  Date WeightKg WeightPounds Fat   BMI
## 1 1503960366  5/2/2016 11:59:59 PM     52.6     115.9631  22 22.65
## 2 1503960366  5/3/2016 11:59:59 PM     52.6     115.9631  NA 22.65
## 3 1927972279  4/13/2016 1:08:52 AM    133.5     294.3171  NA 47.54
## 4 2873212765 4/21/2016 11:59:59 PM     56.7     125.0021  NA 21.45
## 5 2873212765 5/12/2016 11:59:59 PM     57.3     126.3249  NA 21.69
## 6 4319703577 4/17/2016 11:59:59 PM     72.4     159.6147  25 27.45
##   IsManualReport        LogId
## 1           True 1.462234e+12
## 2           True 1.462320e+12
## 3          False 1.460510e+12
## 4           True 1.461283e+12
## 5           True 1.463098e+12
## 6           True 1.460938e+12

head(steps)

##           Id ActivityDay StepTotal
## 1 1503960366   4/12/2016     13162
## 2 1503960366   4/13/2016     10735
## 3 1503960366   4/14/2016     10460
## 4 1503960366   4/15/2016      9762
## 5 1503960366   4/16/2016     12669
## 6 1503960366   4/17/2016      9705

head(hourly_steps)

##           Id          ActivityHour StepTotal
## 1 1503960366 4/12/2016 12:00:00 AM       373
## 2 1503960366  4/12/2016 1:00:00 AM       160
## 3 1503960366  4/12/2016 2:00:00 AM       151
## 4 1503960366  4/12/2016 3:00:00 AM         0
## 5 1503960366  4/12/2016 4:00:00 AM         0
## 6 1503960366  4/12/2016 5:00:00 AM         0

The columns in calories, intensities, and steps appear to be subsets of the columns in activity. Using a technique from another Kaggle capstone project, we can check this with SQL queries in R via sqldf():

sqldf("SELECT COUNT()
      FROM activity
      LEFT JOIN calories ON
      activity.Id = calories.Id AND
      activity.ActivityDate = calories.ActivityDay AND
      activity.Calories = calories.Calories")

##   COUNT()
## 1     940

sqldf("SELECT COUNT()
      FROM activity
      LEFT JOIN steps ON
      activity.Id = steps.Id AND
      activity.ActivityDate = steps.ActivityDay AND
      activity.TotalSteps = steps.StepTotal")

##   COUNT()
## 1     940

sqldf("SELECT COUNT()
      FROM activity 
      LEFT JOIN intensities  ON 
      activity.Id = intensities.Id AND 
      activity.ActivityDate = intensities.ActivityDay AND 
      activity.SedentaryMinutes = intensities.SedentaryMinutes AND
      activity.LightlyActiveMinutes = intensities.LightlyActiveMinutes AND
      activity.FairlyActiveMinutes = intensities.FairlyActiveMinutes AND
      activity.VeryActiveMinutes = intensities.VeryActiveMinutes AND
      activity.SedentaryActiveDistance = intensities.SedentaryActiveDistance AND
      activity.LightActiveDistance = intensities.LightActiveDistance AND
      activity.ModeratelyActiveDistance = intensities.ModeratelyActiveDistance AND
      activity.VeryActiveDistance = intensities.VeryActiveDistance")

##   COUNT()
## 1     940

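Each query returned 940 rows, which matches the 940 rows in activity and is consistent with the columns from calories, steps, and intensities already being included in activity. Note that a LEFT JOIN keeps every row of activity whether or not a match is found, so an INNER JOIN would be a stricter test. On this basis, we leave the calories, steps, and intensities data sets out of our analysis.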
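Because a left join returns every row of the left table even when nothing matches, a stricter way to test whether one table is contained in another is to count the rows that fail to match. The sketch below is a minimal illustration in base R using small made-up data frames (toy_activity and toy_calories are hypothetical stand-ins, not the real files); zero unmatched rows would support dropping the smaller table.

```r
# Toy data standing in for activity and calories (values are made up).
toy_activity <- data.frame(
  Id = c(1, 1, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016"),
  Calories = c(1985, 1797, 2100)
)
toy_calories <- data.frame(
  Id = c(1, 1, 2),
  ActivityDay = c("4/12/2016", "4/13/2016", "4/12/2016"),
  Calories = c(1985, 1797, 2100)
)

# Inner join on all three columns: only rows present in both tables survive.
matched <- merge(
  toy_calories, toy_activity,
  by.x = c("Id", "ActivityDay", "Calories"),
  by.y = c("Id", "ActivityDate", "Calories")
)

# Rows of toy_calories with no exact counterpart in toy_activity.
n_unmatched <- nrow(toy_calories) - nrow(matched)
n_unmatched
```

This simple count assumes the join keys are unique within each table; with duplicated keys, merge() can return more rows than either input.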

n_distinct(activity$Id)

## [1] 33

n_distinct(sleep$Id)

## [1] 24

n_distinct(weight$Id)

## [1] 8

Only 8 of the 33 users logged weight data, so the weight data set is too small to draw conclusions from and will not be included.

# list rows of data that have missing values

activity[!complete.cases(activity),]

##  [1] Id                       ActivityDate             TotalSteps              
##  [4] TotalDistance            TrackerDistance          LoggedActivitiesDistance
##  [7] VeryActiveDistance       ModeratelyActiveDistance LightActiveDistance     
## [10] SedentaryActiveDistance  VeryActiveMinutes        FairlyActiveMinutes     
## [13] LightlyActiveMinutes     SedentaryMinutes         Calories                
## <0 rows> (or 0-length row.names)

sleep[!complete.cases(sleep),]

## [1] Id                 SleepDay           TotalSleepRecords  TotalMinutesAsleep
## [5] TotalTimeInBed    
## <0 rows> (or 0-length row.names)

Checking for and removing duplicates:

sum(duplicated(activity))

## [1] 0

sum(duplicated(sleep))

## [1] 3

The sleep table has 3 duplicate rows; let’s remove them:

sleep <- sleep[!duplicated(sleep), ]

sum(duplicated(sleep))

## [1] 0
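To make the duplicate check concrete, here is a small sketch on made-up data (toy_sleep is hypothetical). duplicated() flags the second and later copies of a row, so negating it keeps the first occurrence of each; base R's unique() on a data frame does the same thing.

```r
# Toy sleep log with one exact duplicate row (values are made up).
toy_sleep <- data.frame(
  Id = c(1, 1, 1, 2),
  SleepDay = c("4/12/2016", "4/12/2016", "4/13/2016", "4/12/2016"),
  TotalMinutesAsleep = c(327, 327, 384, 400)
)

# TRUE only for rows that repeat an earlier row in every column.
dup_flags <- duplicated(toy_sleep)

# Keep the first occurrence of each row.
deduped <- toy_sleep[!dup_flags, ]

# unique() applied to a data frame drops the same rows.
same_as_unique <- identical(deduped, unique(toy_sleep))
```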

5. Process

5.1 Fix date formatting

# activity
activity$ActivityDate=as.POSIXct(activity$ActivityDate,
                                 format='%m/%d/%Y',
                                 tz=Sys.timezone())
activity$Date<-format(activity$ActivityDate,
                      format='%m/%d/%Y')
head(activity)

##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366   2016-04-12      13162          8.50            8.50
## 2 1503960366   2016-04-13      10735          6.97            6.97
## 3 1503960366   2016-04-14      10460          6.74            6.74
## 4 1503960366   2016-04-15       9762          6.28            6.28
## 5 1503960366   2016-04-16      12669          8.16            8.16
## 6 1503960366   2016-04-17       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories       Date
## 1                  13                  328              728     1985 04/12/2016
## 2                  19                  217              776     1797 04/13/2016
## 3                  11                  181             1218     1776 04/14/2016
## 4                  34                  209              726     1745 04/15/2016
## 5                  10                  221              773     1863 04/16/2016
## 6                  20                  164              539     1728 04/17/2016

# sleep
sleep$SleepDay=as.POSIXct(sleep$SleepDay,
                          format='%m/%d/%Y %I:%M:%S %p',
                          tz=Sys.timezone())
sleep$Date<-format(sleep$SleepDay,
                   format='%m/%d/%Y')

head(sleep)

##           Id   SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1 1503960366 2016-04-12                 1                327            346
## 2 1503960366 2016-04-13                 2                384            407
## 3 1503960366 2016-04-15                 1                412            442
## 4 1503960366 2016-04-16                 2                340            367
## 5 1503960366 2016-04-17                 1                700            712
## 6 1503960366 2016-04-19                 1                304            320
##         Date
## 1 04/12/2016
## 2 04/13/2016
## 3 04/15/2016
## 4 04/16/2016
## 5 04/17/2016
## 6 04/19/2016

# hourly steps
hourly_steps$ActivityHour=as.POSIXct(hourly_steps$ActivityHour,
                                     format='%m/%d/%Y %I:%M:%S %p', 
                                     tz=Sys.timezone())
hourly_steps$Date <- format(hourly_steps$ActivityHour,
                            format='%m/%d/%Y')
hourly_steps$Hour <- format(hourly_steps$ActivityHour,
                            format='%H:%M:%S') #24-hour clock, so AM and PM hours stay distinct
head(hourly_steps)

##           Id        ActivityHour StepTotal       Date     Hour
## 1 1503960366 2016-04-12 00:00:00       373 04/12/2016 00:00:00
## 2 1503960366 2016-04-12 01:00:00       160 04/12/2016 01:00:00
## 3 1503960366 2016-04-12 02:00:00       151 04/12/2016 02:00:00
## 4 1503960366 2016-04-12 03:00:00         0 04/12/2016 03:00:00
## 5 1503960366 2016-04-12 04:00:00         0 04/12/2016 04:00:00
## 6 1503960366 2016-04-12 05:00:00         0 04/12/2016 05:00:00
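A note on the hour format codes used when working with timestamps: in strptime-style formats, %I is the hour on a 12-hour clock (01 to 12) and %H is the hour on a 24-hour clock (00 to 23). The sketch below uses two made-up timestamps and a fixed UTC time zone (both are assumptions chosen for reproducibility) to show that formatting with %I but without %p makes 1 AM and 1 PM indistinguishable.

```r
# Two made-up timestamps twelve hours apart.
stamps <- c("4/12/2016 1:00:00 AM", "4/12/2016 1:00:00 PM")

# Parse with the 12-hour code plus the AM/PM marker.
parsed <- as.POSIXct(stamps, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# 12-hour output without %p loses the AM/PM distinction.
hour_12 <- format(parsed, "%I:%M:%S")

# 24-hour output keeps the two times distinct.
hour_24 <- format(parsed, "%H:%M:%S")
```

Parsing %p depends on the session locale recognising "AM" and "PM", which holds for the English and C locales.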

5.2 Merging data tables

intersect(as.character(sleep$Date), as.character(activity$Date))

##  [1] "04/12/2016" "04/13/2016" "04/15/2016" "04/16/2016" "04/17/2016"
##  [6] "04/19/2016" "04/20/2016" "04/21/2016" "04/23/2016" "04/24/2016"
## [11] "04/25/2016" "04/26/2016" "04/28/2016" "04/29/2016" "04/30/2016"
## [16] "05/01/2016" "05/02/2016" "05/03/2016" "05/05/2016" "05/06/2016"
## [21] "05/07/2016" "05/08/2016" "05/09/2016" "05/10/2016" "05/11/2016"
## [26] "04/14/2016" "04/22/2016" "04/27/2016" "05/04/2016" "05/12/2016"
## [31] "04/18/2016"

# left join: keep every activity row and add sleep data where it exists
activity_sleep<-merge(activity, sleep,
             by=c("Id", "Date"), all.x = TRUE)

head(activity_sleep)

##           Id       Date ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 04/12/2016   2016-04-12      13162          8.50            8.50
## 2 1503960366 04/13/2016   2016-04-13      10735          6.97            6.97
## 3 1503960366 04/14/2016   2016-04-14      10460          6.74            6.74
## 4 1503960366 04/15/2016   2016-04-15       9762          6.28            6.28
## 5 1503960366 04/16/2016   2016-04-16      12669          8.16            8.16
## 6 1503960366 04/17/2016   2016-04-17       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories   SleepDay
## 1                  13                  328              728     1985 2016-04-12
## 2                  19                  217              776     1797 2016-04-13
## 3                  11                  181             1218     1776       <NA>
## 4                  34                  209              726     1745 2016-04-15
## 5                  10                  221              773     1863 2016-04-16
## 6                  20                  164              539     1728 2016-04-17
##   TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1                 1                327            346
## 2                 2                384            407
## 3                NA                 NA             NA
## 4                 1                412            442
## 5                 2                340            367
## 6                 1                700            712
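The NA values in the sleep columns come from the all.x = TRUE argument, which makes merge() behave like a left join. The following sketch contrasts that with merge()'s default inner join on small made-up tables (toy_activity and toy_sleep are hypothetical).

```r
# Toy tables: three activity days, but only one has a sleep record.
toy_activity <- data.frame(
  Id = c(1, 1, 2),
  Date = c("04/12/2016", "04/13/2016", "04/12/2016"),
  TotalSteps = c(13162, 10735, 5000)
)
toy_sleep <- data.frame(
  Id = 1,
  Date = "04/12/2016",
  TotalMinutesAsleep = 327
)

# Left join: all activity rows are kept; missing sleep values become NA.
left_joined <- merge(toy_activity, toy_sleep, by = c("Id", "Date"), all.x = TRUE)

# Inner join (the default): only rows with a match in both tables are kept.
inner_joined <- merge(toy_activity, toy_sleep, by = c("Id", "Date"))
```

A left join preserves the full activity record for step and calorie analysis, at the cost of having to handle NA values whenever sleep columns are used.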

5.3 Summary statistics of data

activity_sleep %>% 
  select(TotalSteps, TotalDistance, Calories, VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, SedentaryMinutes, TotalMinutesAsleep, TotalTimeInBed, TotalSleepRecords) %>% 
  summary()

##    TotalSteps    TotalDistance       Calories    VeryActiveMinutes
##  Min.   :    0   Min.   : 0.000   Min.   :   0   Min.   :  0.00   
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.:1828   1st Qu.:  0.00   
##  Median : 7406   Median : 5.245   Median :2134   Median :  4.00   
##  Mean   : 7638   Mean   : 5.490   Mean   :2304   Mean   : 21.16   
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:2793   3rd Qu.: 32.00   
##  Max.   :36019   Max.   :28.030   Max.   :4900   Max.   :210.00   
##                                                                   
##  FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes TotalMinutesAsleep
##  Min.   :  0.00      Min.   :  0.0        Min.   :   0.0   Min.   : 58.0     
##  1st Qu.:  0.00      1st Qu.:127.0        1st Qu.: 729.8   1st Qu.:361.0     
##  Median :  6.00      Median :199.0        Median :1057.5   Median :432.5     
##  Mean   : 13.56      Mean   :192.8        Mean   : 991.2   Mean   :419.2     
##  3rd Qu.: 19.00      3rd Qu.:264.0        3rd Qu.:1229.5   3rd Qu.:490.0     
##  Max.   :143.00      Max.   :518.0        Max.   :1440.0   Max.   :796.0     
##                                                            NA's   :530       
##  TotalTimeInBed  TotalSleepRecords
##  Min.   : 61.0   Min.   :1.000    
##  1st Qu.:403.8   1st Qu.:1.000    
##  Median :463.0   Median :1.000    
##  Mean   :458.5   Mean   :1.119    
##  3rd Qu.:526.0   3rd Qu.:1.000    
##  Max.   :961.0   Max.   :3.000    
##  NA's   :530     NA's   :530

Users took an average of 7,638 steps per day, which is lower than the widely cited goal of 10,000 steps per day.
On days with a sleep record, users slept an average of about 7 hours (419 minutes).
Users were sedentary for an average of 16.52 hours per day, so over two-thirds of the day was spent inactive.
Users expended an average of 2,304 calories per day.
The means of VeryActiveMinutes and FairlyActiveMinutes are much greater than their respective medians, so the data is most likely right-skewed: a few observations have very high activity minutes while most have low activity minutes.
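The hour figures above follow directly from the means in the summary table. As a quick check of the arithmetic (the constants are copied from the summary() output; the variable names are only illustrative):

```r
# Means taken from the summary() output, in minutes.
mean_sedentary_min <- 991.2
mean_asleep_min <- 419.2

# Convert minutes to hours.
sedentary_hours <- mean_sedentary_min / 60
asleep_hours <- mean_asleep_min / 60

# Share of a 24-hour day (1440 minutes) spent sedentary.
sedentary_share <- mean_sedentary_min / 1440

round(c(sedentary_hours = sedentary_hours,
        asleep_hours = asleep_hours,
        sedentary_share = sedentary_share), 2)
```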

#data frame for highlighting outliers in total steps
highlight_df <- activity_sleep %>% 
  filter(TotalSteps>25000)

5.4 Classify observations

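We will separate observations into four fitness groups based on daily steps (“walking lifestyle”): Sedentary, Slightly Active, Fairly Active, and Very Active. The cutoffs are the first quartile (3,790), the mean (about 7,638), and the third quartile (10,727) of TotalSteps from the summary above.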

#cutoffs: 1st quartile (3790), mean, and 3rd quartile (10727) of TotalSteps
mean_steps <- mean(activity_sleep$TotalSteps)
activity_sleep$walking_lifestyle <- case_when(
  activity_sleep$TotalSteps <= 3790       ~ "Sedentary",
  activity_sleep$TotalSteps <= mean_steps ~ "Slightly Active",
  activity_sleep$TotalSteps <= 10727      ~ "Fairly Active",
  activity_sleep$TotalSteps > 10727       ~ "Very Active"
)
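The same binning can also be expressed with base R's cut(), which may be easier to read when there are several thresholds. This is only a sketch on a made-up vector (toy_steps), and it hard-codes the rounded mean of 7638 from the summary as an assumption rather than recomputing it from the data.

```r
# Made-up daily step counts, including values that sit exactly on a cutoff.
toy_steps <- c(0, 3790, 5000, 7638, 9000, 10727, 20000)

# Rounded mean from the summary table; an assumption for this toy example.
mean_steps_rounded <- 7638

# right = TRUE makes each interval closed on the right, e.g. (3790, 7638],
# which reproduces the "<=" comparisons used in the classification.
lifestyle <- cut(
  toy_steps,
  breaks = c(-Inf, 3790, mean_steps_rounded, 10727, Inf),
  labels = c("Sedentary", "Slightly Active", "Fairly Active", "Very Active"),
  right = TRUE
)

as.character(lifestyle)
```

cut() returns an ordered set of factor levels in the order given by labels, which also keeps legends in a sensible order when plotting.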



head(activity_sleep)

##           Id       Date ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 04/12/2016   2016-04-12      13162          8.50            8.50
## 2 1503960366 04/13/2016   2016-04-13      10735          6.97            6.97
## 3 1503960366 04/14/2016   2016-04-14      10460          6.74            6.74
## 4 1503960366 04/15/2016   2016-04-15       9762          6.28            6.28
## 5 1503960366 04/16/2016   2016-04-16      12669          8.16            8.16
## 6 1503960366 04/17/2016   2016-04-17       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories   SleepDay
## 1                  13                  328              728     1985 2016-04-12
## 2                  19                  217              776     1797 2016-04-13
## 3                  11                  181             1218     1776       <NA>
## 4                  34                  209              726     1745 2016-04-15
## 5                  10                  221              773     1863 2016-04-16
## 6                  20                  164              539     1728 2016-04-17
##   TotalSleepRecords TotalMinutesAsleep TotalTimeInBed walking_lifestyle
## 1                 1                327            346       Very Active
## 2                 2                384            407       Very Active
## 3                NA                 NA             NA     Fairly Active
## 4                 1                412            442     Fairly Active
## 5                 2                340            367       Very Active
## 6                 1                700            712     Fairly Active

6. Analyze and Share

6.1 Scatterplot of variables

activity_sleep$pc <- predict(prcomp(~TotalSteps+Calories, activity_sleep))[,1]

ggplot(activity_sleep, aes(x=TotalSteps, y=Calories, color=walking_lifestyle)) +
  geom_jitter(alpha=.5) +
  #highlights outliers on the scatterplot
  geom_point(data=highlight_df, 
             aes(x=TotalSteps, y=Calories),
             color='red',
             alpha=0.5)+
  geom_smooth(size=0.5, show.legend=FALSE)+
  labs(colour="Lifestyle",
       title= "Daily Steps vs Calories Burned", 
       x="Daily Steps",
       y="Calories Burned",
       caption="Data Source: Fitabase Data, 4.12.16-5.12.16")+
  #apply the complete theme first so the title tweak below is not overwritten
  theme_bw(base_size=16)+
  theme(plot.title=element_text(hjust=0.5))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Insights:

Generally, taking more steps in a day is associated with burning more calories that day. The red points mark outlier days with more than 25,000 steps.

Does activity affect sleep?

ggplot(activity_sleep, aes(x=TotalSteps, y=TotalMinutesAsleep, color=walking_lifestyle))+
  geom_jitter(size=0.5, alpha=0.5)+
  geom_smooth(show.legend=FALSE)+
  labs(colour="Lifestyle", title="Daily Steps vs Minutes slept", x="Daily Steps", y="Minutes Slept", caption="Data Source: Fitabase Data, 4.12.16-5.12.16")+
  theme_minimal()+
  theme(plot.title=element_text(hjust=0.5))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

## Warning: Removed 530 rows containing non-finite values (stat_smooth).

## Warning: Removed 530 rows containing missing values (geom_point).

Insights:

There is no obvious correlation between the number of steps a user took and how long they slept on a given day.
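
The warnings above report 530 removed rows out of the 940 daily observations, because days without a sleep record carry NA in the sleep columns. A small sketch of how such rows can be counted and excluded before measuring correlation follows; the `merged` data frame and its values are hypothetical.

```r
# Hypothetical merged table: sleep is NA on days with no sleep record.
merged <- data.frame(
  TotalSteps         = c(3000, 8000, 12000, 500, 10000, 6000),
  TotalMinutesAsleep = c(420, NA, 380, NA, 450, 400)
)

# Number of rows a plot of sleep against steps would have to drop.
n_missing <- sum(is.na(merged$TotalMinutesAsleep))

# Correlation computed only on rows where both values are present.
r_sleep <- cor(merged$TotalSteps, merged$TotalMinutesAsleep,
               use = "complete.obs")

print(n_missing)
print(round(r_sleep, 3))
```

Any conclusion about sleep therefore rests on the subset of days that have a sleep record, not on the full table.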

6.2 Distributions by group

A table that shows the counts and percentages for each group:

user_type <- activity_sleep %>% 
  group_by(walking_lifestyle) %>% 
  summarise(total=n()) %>% 
  mutate(total_percent=scales::percent (total/sum(total)))

user_type

## # A tibble: 4 × 3
##   walking_lifestyle total total_percent
##   <chr>             <int> <chr>        
## 1 Fairly Active       223 23.72%       
## 2 Sedentary           236 25.11%       
## 3 Slightly Active     246 26.17%       
## 4 Very Active         235 25.00%
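
For comparison, the same kind of count-and-share table can be produced in base R with `table()` and `prop.table()`. This is a sketch on a hypothetical vector of labels, not the notebook's data.

```r
# Hypothetical category labels, one per daily observation.
lifestyle <- c("Sedentary", "Very Active", "Sedentary", "Fairly Active",
               "Slightly Active", "Very Active", "Sedentary", "Slightly Active")

# Count observations per category, then convert counts to percentages.
counts <- table(lifestyle)
shares <- round(100 * prop.table(counts), 1)

print(counts)
print(shares)
```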

Pie chart showing percentages of user categories. The code was taken from this tutorial on making pie charts in ggplot2.

#pie chart showing the percentage of observations in each user category

ggplot(user_type,
       aes(x="", y=total, fill=walking_lifestyle))+
  geom_bar(stat="identity", width=1, color="white")+
  coord_polar("y", start=0)+
  theme_void()+
  theme(plot.title=element_text(hjust=0.5))+
  #add percentages as labels on chart
  geom_text(aes(label=total_percent),
            color="white",
            position=position_stack(vjust=0.5))+
  labs(title="User Category Types",
       fill="Lifestyle")

Insights:

The observations are split roughly evenly across the four categories, with each accounting for about a quarter of the total (23.7% to 26.2%).

Let’s see these counts visualized:

6.2.2 Histograms of different variables

#binwidth using the Freedman-Diaconis rule: 2*IQR/n^(1/3), where n is the number of observations (rows)
bw <- 2*IQR(activity_sleep$TotalSteps)/nrow(activity_sleep)^(1/3)

#histogram of total steps
ggplot(activity_sleep, aes(TotalSteps, fill=walking_lifestyle)) +
  geom_histogram(binwidth=bw, alpha=0.5) +
  geom_vline(aes(xintercept=mean(TotalSteps)),
             color="black",
             linetype="dashed",
             size=1,
             alpha=0.5)+
  labs(fill="Lifestyle",
       title="Distribution of daily steps",
       x="Steps",
       y="Count",
       caption="Data Source: Fitabase Data, 4.12.16-5.12.16")+
  theme(plot.title=element_text(hjust=0.5))+
  scale_fill_discrete(name = "Lifestyle")

Some insights:

The histogram is right-skewed, with the bulk of the observations falling around the mean of 7,638 steps.
Most of the outliers fall in the Very Active lifestyle category, at 20,000 steps or more.
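
The Freedman-Diaconis rule used for the bin width is 2 × IQR / n^(1/3), where n is the number of observations. On a data frame, `length()` returns the number of columns, so `nrow()` is the call that gives n. The sketch below illustrates the difference on a hypothetical data frame `df`; the step values are made up.

```r
# Hypothetical step counts standing in for activity_sleep$TotalSteps.
steps <- c(1200, 3400, 5600, 7100, 7600, 8200, 9900, 11000, 14500, 22000)
df <- data.frame(steps = steps, dummy = 0)

n_obs  <- nrow(df)    # number of observations (rows)
n_cols <- length(df)  # number of columns, which is not the sample size

# Freedman-Diaconis bin width: 2 * IQR / n^(1/3)
bw <- 2 * IQR(df$steps) / n_obs^(1/3)

print(c(n_obs = n_obs, n_cols = n_cols))
print(bw)
```

Base R's `nclass.FD()` applies the same rule to suggest a number of bins rather than a bin width.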

6.3 Boxplots of variables

How do the different user categories (Sedentary/Slightly Active/Fairly Active/Very Active) compare on other variables?

Sedentary Minutes
Very Active Minutes
Calories
Sleep
Distance

ggplot(activity_sleep, 
       aes(x=walking_lifestyle, y= SedentaryMinutes, fill=walking_lifestyle))+
  geom_boxplot(show.legend=FALSE)+
  scale_fill_brewer(palette="Blues")+
   theme(plot.title=element_text(hjust=0.5))+
  labs(title="Daily Sedentary Minutes by Walking Lifestyle",
       x="Walking Lifestyle",
       y="Sedentary Minutes",
       caption="Data Source: Fitabase Data 4.12.16-5.12.16")

As expected, users with a sedentary walking lifestyle logged more sedentary minutes.

The discrepancy between each walking category is small.
The range for the Sedentary Category is large, where some observations have close to 0 sedentary minutes. Low steps and low sedentary minutes may indicate users were doing some other type of exercise which did not involve walking.

ggplot(activity_sleep,
       aes(x=walking_lifestyle, y=VeryActiveMinutes,
           fill=walking_lifestyle))+
  geom_boxplot(show.legend=FALSE)+
  scale_fill_brewer(palette="Blues")+
   theme(plot.title=element_text(hjust=0.5))+
  labs(title="Daily Very Active Minutes by Walking Lifestyle",
       x="Walking Lifestyle",
       y="Very Active Minutes",
       caption="Data Source: Fitabase Data 4.12.16-5.12.16")

The more steps one takes, the more very active minutes one tends to have:

Users who get lots of daily steps also have more very active minutes, possibly indicating that they are doing high-intensity cardio sports.

ggplot(activity_sleep,
       aes(x=walking_lifestyle, y=Calories,
           fill=walking_lifestyle))+
  geom_boxplot(show.legend=FALSE)+
  scale_fill_brewer(palette="Blues")+
   theme(plot.title=element_text(hjust=0.5))+
  labs(title="Daily Calories Burnt by Walking Lifestyle",
       x="Walking Lifestyle",
       y="Calories Burnt",
       caption="Data Source: Fitabase Data 4.12.16-5.12.16")

Daily calories burnt tend to increase as the user’s daily steps increase.

The range of calories burnt was large for every user category.

ggplot(activity_sleep,
       aes(x=walking_lifestyle, y=TotalDistance,
           fill=walking_lifestyle))+
  geom_boxplot(show.legend=FALSE)+
  scale_fill_brewer(palette="Blues")+
   theme(plot.title=element_text(hjust=0.5))+
  labs(title="Daily Distance by Walking Lifestyle",
       x="Walking Lifestyle",
       y="Daily Distance (in km)",
       caption="Data Source: Fitabase Data 4.12.16-5.12.16")

As expected, the more daily steps a user takes, the greater their daily distance.

Note the very small interquartile range.
Very Active has many outliers - this makes sense, considering our distribution is right-skewed for steps.
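
The outliers in these boxplots follow the default `geom_boxplot()` convention: points more than 1.5 × IQR beyond the quartiles are drawn individually. A minimal sketch of that rule on hypothetical distances (the `distance` vector is made up for illustration):

```r
# Hypothetical daily distances in km, with one unusually long day.
distance <- c(2.1, 3.4, 4.0, 4.8, 5.2, 5.9, 6.3, 7.0, 7.4, 19.5)

# First and third quartiles and the interquartile range.
q   <- unname(quantile(distance, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences at 1.5 * IQR beyond the quartiles.
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Values outside the fences are the points plotted as outliers.
outliers <- distance[distance < lower_fence | distance > upper_fence]
print(outliers)
```

In a right-skewed variable the upper fence is crossed far more often than the lower one, which is consistent with the outliers appearing above the Very Active box.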

ggplot(activity_sleep, 
       aes(x=walking_lifestyle, y=TotalMinutesAsleep,
           fill=walking_lifestyle))+
  geom_boxplot(show.legend=FALSE)+
   theme(plot.title=element_text(hjust=0.5))+
  labs(title="Daily Minutes Asleep by Walking Lifestyle",
       x="Walking Lifestyle",
       y="Minutes Asleep",
       caption="Data Source: Fitabase Data 4.12.16-5.12.16")+
  scale_fill_brewer(palette="Blues")

## Warning: Removed 530 rows containing non-finite values (stat_boxplot).

Considering the high number of outliers in the data, this graph does not allow us to draw a conclusion about walking lifestyle and daily minutes asleep.

6.4 Usage

Participants submitted personal tracker data, and 33 distinct user IDs appear in the data, as the usage counts below show. How often did users wear their Fitbit to track their activity?

#makes a new data table grouped by ID.

usage <- activity_sleep %>% 
  group_by(Id) %>% 
  summarize(tracked_days=n(),
            avg_steps=mean(TotalSteps)) %>% 
  mutate(usage=case_when(
    tracked_days >= 1 & tracked_days <= 10 ~ "low",
    tracked_days > 10 & tracked_days <= 20 ~ "mid",
    tracked_days > 20 & tracked_days <= 31 ~ "high"
  ))

head(usage)

## # A tibble: 6 × 4
##           Id tracked_days avg_steps usage
##        <dbl>        <int>     <dbl> <chr>
## 1 1503960366           31    12117. high 
## 2 1624580081           31     5744. high 
## 3 1644430081           30     7283. high 
## 4 1844505072           31     2580. high 
## 5 1927972279           31      916. high 
## 6 2022484408           31    11371. high

We can get an idea of total usage by finding number of users within each usage category:

usage_categories <- usage %>% 
  group_by(usage) %>% 
  summarize(user_count=n()) %>% 
  mutate(total_percent=scales::percent(user_count/sum(user_count)))

head(usage_categories)

## # A tibble: 3 × 3
##   usage user_count total_percent
##   <chr>      <int> <chr>        
## 1 high          29 87.9%        
## 2 low            1 3.0%         
## 3 mid            3 9.1%

87.9% of users have high usage, meaning at least 21 days recorded.
Only one user recorded 10 or fewer days.
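
The low/mid/high banding can also be expressed with base R's `cut()`, which makes the interval boundaries explicit. This is a sketch on a hypothetical vector of tracked-day counts, not a replacement for the `case_when()` call above.

```r
# Hypothetical tracked-day counts, including the boundary values.
tracked_days <- c(4, 10, 11, 20, 21, 31)

# Right-closed intervals (0,10], (10,20], (20,31] mirror the three bands.
usage_band <- cut(tracked_days,
                  breaks = c(0, 10, 20, 31),
                  labels = c("low", "mid", "high"))

print(data.frame(tracked_days, usage_band))
```

Because the intervals are closed on the right, a user with exactly 10 tracked days falls in the low band and one with exactly 20 falls in the mid band.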

6.5 Usage time frame

Almost 88% of users used their fitness trackers on at least 21 days. Did users use their trackers more towards the beginning or the end of the trial?

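#each row of hourly_steps is one tracked hour for one user, so the rows per date
#divided by the 33 users give the average hours tracked per user on that date
hourly_trend <- hourly_steps %>% 
  group_by(Date) %>% 
  summarize(avg_usage_hrs=n()/33)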

head(hourly_trend)

## # A tibble: 6 × 2
##   Date       avg_usage_hrs
##   <chr>              <dbl>
## 1 04/12/2016          24  
## 2 04/13/2016          24  
## 3 04/14/2016          24  
## 4 04/15/2016          23.8
## 5 04/16/2016          23.3
## 6 04/17/2016          23.3
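
The averaging step can be illustrated on a toy log. The sketch assumes, as the calculation above appears to, that each row represents one tracked hour for one user; the `hourly` data frame and its IDs are hypothetical.

```r
# Hypothetical hourly log: one row per user per tracked hour.
hourly <- data.frame(
  Id   = c(1, 1, 1, 2, 2, 3, 1, 2),
  Date = c(rep("04/12/2016", 6), rep("04/13/2016", 2))
)

# Number of distinct users in the whole log.
n_users <- length(unique(hourly$Id))

# Rows per date divided by the user count = average tracked hours per user.
rows_per_day <- table(hourly$Date)
avg_hours <- as.numeric(rows_per_day) / n_users
names(avg_hours) <- names(rows_per_day)

print(avg_hours)
```

Dividing by the overall user count, instead of the users active on each date, means a user who stops wearing the tracker pulls the daily average down, which is the behavior wanted when measuring retention.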

library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

#convert Date column from character to date
hourly_trend$Date <- as.Date(hourly_trend$Date, "%m/%d/%Y")

ggplot(hourly_trend, 
       aes(x=Date, y=avg_usage_hrs))+
  scale_x_date(breaks= date_breaks("1 day"),
               labels=date_format("%b-%d"),
               limits=(c(min(hourly_trend$Date),
                         max(hourly_trend$Date))),
               expand=c(.02,.02))+
  scale_y_continuous(limits=c(0,25),
                     breaks=seq(0,max(hourly_trend$avg_usage_hrs), by=4),
                     expand= c(0,.7))+
  labs(title="Daily Usage 4/12/2016-5/12/2016",
       y="Hours Tracker is Used",
       caption="Data Source: Fitabase Data 4.12.16-5.12.16")+
  theme(axis.text.x=element_text(angle=90),
        plot.title=element_text(size=16),
        panel.grid.major.x=element_line(color="grey60",
                                        linetype="solid",size=0.1),
        panel.background=element_blank())+
  geom_line()

The idea for this code is credited to Kaggle user IrenaShen1 through this case study workbook.

From 4/12/2016 to 5/12/2016, users’ tracker usage declined.

Usage averaged close to 24 hours a day from April 12th to April 26th.
Usage dropped by about 3 hours on average after the first 18 days of the study.
Usage began to free fall on May 5th, declining from an average of 20 hours a day (May 5) to an average of 8 hours a day a week later (May 12).

7. Act - Recommendations

7.1 Insights from analysis

The average number of daily steps by users was 7,638 steps from 4/12/2016 - 5/12/2016.
Users averaged around 7 hours of daily sleep from 4/12/2016 - 5/12/2016.
There was a positive trend between the number of daily steps and the number of calories burnt for a user, based on the available tracker data.
The distribution of daily steps was right-skewed, meaning that the outliers were mostly days with step counts far above the mean.
A user’s walking lifestyle is linked to calories burnt, distance walked, number of daily active minutes, and number of daily sedentary minutes. However, a user’s walking lifestyle is not linked to sleep.
Tracker usage decreased over time and dropped heavily by the end of the study, from an average daily use of 24 hours to 8 hours.

7.2 Recommendation

From these trends, we can make several recommendations to Urška Sršen and Bellabeat’s marketing strategy team. These key findings indicate a strategy pertaining to the Bellabeat app.

Based on the findings, retention appears to be a major issue: average daily tracker use fell by about 16 hours between the beginning and the end of the study. To elicit excitement and build staying power, it is important to engage users through development of the app. The app has a poor 2.6/5 stars on the Google Play Store, compared with 4.4/5 stars on Apple’s App Store - our recommendations focus on bolstering the app’s success with Bellabeat’s users.

User retention rate is defined as the number of days that a user continues to use a certain product after its purchase or acquisition¹. Tactics for building a user-engaging app strategy include tracking data as early as possible², personalizing user experiences³, offering two-way communication between the brand and customer⁴, adjusting push notifications so that users are more receptive to them⁵, reducing the cognitive load - or unnecessary noise - within the app⁶, mapping the user interface according to “thumb zones”⁷, optimizing the app’s onboarding process⁸, and gamifying the app via reward systems and serotonin responses⁹.

Bellabeat wants to gain market share by outcompeting competitors such as Fitbit, so its users’ relationship with their Bellabeat products - app included - has to improve, starting with the current 2.6 Google Play Store rating. The trends indicate that user retention lasts up to 30 days and then drops substantially, which raises the question: how can Bellabeat retain its customers better than Fitbit does?

We recommend that Bellabeat invest in software developers and engineers who can create or update a fitness app that matches the aesthetic beauty of the company’s fitness devices. With a focus on customer service, user interface, and tracker reliability, the app will boost Bellabeat’s own reputation in the fitness tracker market while keeping users engaged for longer periods.

7.3 Limitations

There were several limiting factors to the data:

The gender of each user was not specified in the data. Bellabeat is a company focused on fitness devices catered to women, so it would have been helpful to see how different genders used their fitness trackers.
The data is limited to a one-month range from 4/12/2016-5/12/2016. First, we don’t know how the knowledge of this range impacted user retention of the tracker - there is a possibility that users stopped using the tracker since they knew the study would end soon. Second, a month’s worth of data does not give us other important trends to consider like seasonality.
The data is taken from a small sample of roughly 30 participants (33 user IDs appear in the usage table). A sample this small gives outliers more weight and produces higher margins of error.
The data was limited to fitness data taken from the tracker and did not consider data from a possible app that is connected to the tracker.
In addition to not knowing the user gender data, the user demographic data was limited. We were missing features such as age, weight, BMI as well as other health data like nutrition.
There is a limit in skill too. Since this is my first R notebook, I am certainly missing many approaches and ideas for visualizing, predicting, and analyzing the source data.
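
The sample-size limitation above can be made concrete with the usual margin of error for a mean, t × sd / sqrt(n). The sketch below is illustrative only: the `margin_of_error` helper and the standard deviation of 5,000 steps are hypothetical, and it assumes independent observations.

```r
# Hypothetical helper: margin of error of a sample mean at a confidence level.
margin_of_error <- function(sd, n, conf = 0.95) {
  # Critical t value for a two-sided interval with n - 1 degrees of freedom.
  t_crit <- qt(1 - (1 - conf) / 2, df = n - 1)
  t_crit * sd / sqrt(n)
}

# Assumed standard deviation of daily steps (illustrative value only).
moe_30  <- margin_of_error(sd = 5000, n = 30)
moe_300 <- margin_of_error(sd = 5000, n = 300)

print(round(c(n_30 = moe_30, n_300 = moe_300)))
```

With the same spread, a tenfold larger sample shrinks the margin of error by roughly a factor of three, which is why conclusions drawn from about 30 users should be treated with caution.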

This was my very first R Workbook - thank you for reading through until the end. Here is a picture of a kitten as a reward

Jayke Sudana

Last updated: 5/1/2022