Capstone Case Study

Introduction

Welcome to my case study. My analysis will be on Bellabeat, a high-tech manufacturer of health-focused products for women.

Scenario

Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and chief creative officer at Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. I have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices.

Stakeholders

Urška Sršen: Bellabeat’s cofounder and chief creative officer
Sando Mur: Mathematician and Bellabeat’s cofounder, key member of the Bellabeat executive team.
Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing and reporting data that helps guide Bellabeat’s marketing strategy.

Products

Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.
Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.
Time: The wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.
Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.
Bellabeat membership: Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.

Business Task

Analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices.

Questions:

What are some trends in smart device usage?
How could these trends apply to Bellabeat customers?
How could these trends help influence Bellabeat marketing strategy?

Deliverables:

A clear summary of the business task
A description of all data sources used
Documentation of any cleaning or manipulation of data
A summary of my analysis
Supporting visualizations and key findings
Top high-level content recommendations based on my analysis

Data Sources

Fitbit Fitness Tracker data (CC0: Public Domain, dataset made available through Mobius). This dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. This Kaggle dataset contains personal fitness tracker data from thirty Fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It contains a total of 18 CSV files and includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.

Limitations

Outdated - Data was collected in 2016
Small sample size - A sample size of 30 participants can skew our analysis and risk sampling bias
Demographics - Bellabeat is a company whose products are manufactured for and used by women. This data does not specify gender or age.
Third party source - The data was collected through Amazon Mechanical Turk, a third party source, which makes our data less reliable.
Short time period - Data was collected for a month, which is a short time period. Therefore, I would recommend collecting our own data or using other sources.

Loading packages and preparing the data

I will use R for all phases of my analysis.

install.packages("tidyverse")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(readr)
install.packages("dplyr")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

library(dplyr)
install.packages("here")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

library(here)

## here() starts at /cloud/project

install.packages("janitor")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

library(janitor)

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

install.packages("skimr")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

library(skimr)
install.packages("lubridate")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

install.packages("ggplot2")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

library(ggplot2)

Importing data

dailyActivity_merged, sleepDay_merged

dailyActivity_merged <- read_csv("Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")

## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

sleepDay_merged <- read_csv("Fitabase Data 4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")

## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Taking a closer look at imported data structure

str(dailyActivity_merged)

## spec_tbl_df [940 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id                      : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ TotalSteps              : num [1:940] 13162 10735 10460 9762 12669 ...
##  $ TotalDistance           : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : num [1:940] 728 776 1218 726 773 ...
##  $ Calories                : num [1:940] 1985 1797 1776 1745 1863 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   ActivityDate = col_character(),
##   ..   TotalSteps = col_double(),
##   ..   TotalDistance = col_double(),
##   ..   TrackerDistance = col_double(),
##   ..   LoggedActivitiesDistance = col_double(),
##   ..   VeryActiveDistance = col_double(),
##   ..   ModeratelyActiveDistance = col_double(),
##   ..   LightActiveDistance = col_double(),
##   ..   SedentaryActiveDistance = col_double(),
##   ..   VeryActiveMinutes = col_double(),
##   ..   FairlyActiveMinutes = col_double(),
##   ..   LightlyActiveMinutes = col_double(),
##   ..   SedentaryMinutes = col_double(),
##   ..   Calories = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

str(sleepDay_merged)

## spec_tbl_df [413 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id                : num [1:413] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ SleepDay          : chr [1:413] "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
##  $ TotalSleepRecords : num [1:413] 1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: num [1:413] 327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : num [1:413] 346 407 442 367 712 320 377 364 384 449 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   SleepDay = col_character(),
##   ..   TotalSleepRecords = col_double(),
##   ..   TotalMinutesAsleep = col_double(),
##   ..   TotalTimeInBed = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

head(dailyActivity_merged)

## # A tibble: 6 × 15
##        Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie…
##     <dbl> <chr>             <dbl>         <dbl>           <dbl>            <dbl>
## 1  1.50e9 4/12/2016         13162          8.5             8.5                 0
## 2  1.50e9 4/13/2016         10735          6.97            6.97                0
## 3  1.50e9 4/14/2016         10460          6.74            6.74                0
## 4  1.50e9 4/15/2016          9762          6.28            6.28                0
## 5  1.50e9 4/16/2016         12669          8.16            8.16                0
## 6  1.50e9 4/17/2016          9705          6.48            6.48                0
## # … with 9 more variables: VeryActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>

head(sleepDay_merged)

## # A tibble: 6 × 5
##           Id SleepDay           TotalSleepRecor… TotalMinutesAsl… TotalTimeInBed
##        <dbl> <chr>                         <dbl>            <dbl>          <dbl>
## 1 1503960366 4/12/2016 12:00:0…                1              327            346
## 2 1503960366 4/13/2016 12:00:0…                2              384            407
## 3 1503960366 4/15/2016 12:00:0…                1              412            442
## 4 1503960366 4/16/2016 12:00:0…                2              340            367
## 5 1503960366 4/17/2016 12:00:0…                1              700            712
## 6 1503960366 4/19/2016 12:00:0…                1              304            320

colnames(dailyActivity_merged)

##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"

colnames(sleepDay_merged)

## [1] "Id"                 "SleepDay"           "TotalSleepRecords" 
## [4] "TotalMinutesAsleep" "TotalTimeInBed"

How many unique participants are there in each dataframe?

n_distinct(dailyActivity_merged$Id)

## [1] 33

n_distinct(sleepDay_merged$Id)

## [1] 24
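
The two counts differ, so some participants logged activity but never logged sleep. A minimal sketch of how one might list those participants and count records per participant, using small made-up ID vectors rather than the real data (the names `activity_ids` and `sleep_ids` are illustrative, not part of the original analysis):

```r
# Hypothetical participant IDs standing in for dailyActivity_merged$Id
# and sleepDay_merged$Id; the values are made up for illustration.
activity_ids <- c(101, 101, 102, 103, 103, 104)
sleep_ids    <- c(101, 103, 103)

# IDs that appear in the activity data but never in the sleep data
no_sleep_ids <- base::setdiff(unique(activity_ids), unique(sleep_ids))
print(no_sleep_ids)

# Number of records per participant, to spot irregular device use
records_per_id <- table(activity_ids)
print(records_per_id)
```

On the real data, the same two calls could be applied to the Id columns of the two data frames.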

Check for duplicate rows

nrow(dailyActivity_merged)

## [1] 940

nrow(sleepDay_merged)

## [1] 413

nrow(unique(dailyActivity_merged))

## [1] 940

nrow(unique(sleepDay_merged))

## [1] 410

Removing duplicate rows from sleepDay_merged

sleepDay <- unique(sleepDay_merged)
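
Before dropping duplicates, it can be useful to look at which rows are repeated. The sketch below uses a made-up table (`sleep_toy` and its values are assumptions for illustration) to show that `duplicated()` flags the repeated rows and that filtering on it matches `unique()`:

```r
# Made-up sleep records; the third row repeats the second exactly.
sleep_toy <- data.frame(
  Id = c(1, 1, 1, 2),
  SleepDay = c("4/12/2016", "4/13/2016", "4/13/2016", "4/12/2016"),
  TotalMinutesAsleep = c(327, 384, 384, 412)
)

# duplicated() marks the second and later copies of a repeated row
dup_rows <- sleep_toy[duplicated(sleep_toy), ]
print(dup_rows)

# Dropping the flagged rows keeps one copy of each record
sleep_dedup <- sleep_toy[!duplicated(sleep_toy), ]
print(nrow(sleep_dedup) == nrow(unique(sleep_toy)))
```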

Create a new data frame, daily_activity_1, keeping the columns of interest and renaming ActivityDate to Date

daily_activity_1 <- dailyActivity_merged %>%
  select("Id","Date"= "ActivityDate","TotalSteps", "SedentaryMinutes", "VeryActiveMinutes","FairlyActiveMinutes", "LightlyActiveMinutes", "Calories")
view(daily_activity_1)

Formatting dates in daily_activity_1

daily_activity_1$date <- mdy(daily_activity_1$Date)
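
SleepDay is also stored as text, with a midnight timestamp attached (for example "4/12/2016 12:00:00 AM"), so it would need similar treatment before the two tables can be compared day by day. A minimal base R sketch, assuming the time part is always midnight as in the rows shown earlier; `sleep_day_text` and `sleep_date` are illustrative names, and lubridate's `mdy_hms()` would be an alternative:

```r
# Example strings in the format shown by head(sleepDay_merged)
sleep_day_text <- c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM")

# Drop everything after the first space, then parse month/day/year
date_part  <- sub(" .*$", "", sleep_day_text)
sleep_date <- as.Date(date_part, format = "%m/%d/%Y")
print(sleep_date)
```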

Create a new data frame, sleep_1, and convert total minutes asleep to total hours asleep (rounded to the nearest hour)

sleep_1 <- sleepDay %>%
  select("Id", "SleepDay", "TotalMinutesAsleep", "TotalTimeInBed")%>%
  filter(TotalMinutesAsleep !=0)
sleep_1$Total_hrs_asleep <- round(sleep_1$TotalMinutesAsleep/60)

Merge daily_activity_1 to sleep_1 to create merged_data dataframe

merged_data <- merge(daily_activity_1, sleep_1, by = "Id")
summary(merged_data)

##        Id                Date             TotalSteps    SedentaryMinutes
##  Min.   :1.504e+09   Length:12348       Min.   :    0   Min.   :   0.0  
##  1st Qu.:3.977e+09   Class :character   1st Qu.: 4660   1st Qu.: 659.0  
##  Median :4.703e+09   Mode  :character   Median : 8585   Median : 734.0  
##  Mean   :5.021e+09                      Mean   : 8108   Mean   : 799.4  
##  3rd Qu.:6.962e+09                      3rd Qu.:11317   3rd Qu.: 853.0  
##  Max.   :8.792e+09                      Max.   :22988   Max.   :1440.0  
##  VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes    Calories   
##  Min.   :  0.00    Min.   :  0.00      Min.   :  0.0        Min.   :   0  
##  1st Qu.:  0.00    1st Qu.:  0.00      1st Qu.:144.0        1st Qu.:1776  
##  Median :  8.00    Median : 10.00      Median :200.0        Median :2158  
##  Mean   : 23.94    Mean   : 17.34      Mean   :199.8        Mean   :2323  
##  3rd Qu.: 36.00    3rd Qu.: 24.00      3rd Qu.:258.0        3rd Qu.:2859  
##  Max.   :210.00    Max.   :143.00      Max.   :518.0        Max.   :4900  
##       date              SleepDay         TotalMinutesAsleep TotalTimeInBed 
##  Min.   :2016-04-12   Length:12348       Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:2016-04-19   Class :character   1st Qu.:361.0      1st Qu.:402.0  
##  Median :2016-04-27   Mode  :character   Median :432.0      Median :462.0  
##  Mean   :2016-04-26                      Mean   :419.1      Mean   :458.2  
##  3rd Qu.:2016-05-04                      3rd Qu.:492.0      3rd Qu.:526.0  
##  Max.   :2016-05-12                      Max.   :796.0      Max.   :961.0  
##  Total_hrs_asleep
##  Min.   : 1.00   
##  1st Qu.: 6.00   
##  Median : 7.00   
##  Mean   : 6.99   
##  3rd Qu.: 8.00   
##  Max.   :13.00

n_distinct(merged_data$Id)

## [1] 24

Looking at the summary, it seems the participants may not use their devices on a regular basis, since the columns tracking steps, calories and activity minutes have 0 as their minimum, which doesn’t make sense for a full day of wear. The sleep data frame only has 24 participants and daily activity has 33. Mean total steps was 8,108, which is below the 10,000 steps per day the CDC recommends. Mean time asleep was about 7 hours (419.1 minutes), which is at the low end of the 7-9 hours per day recommended by the National Sleep Foundation. The mean sedentary time was 799.4 minutes, which is over 13 hours.
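
One caveat about the summary above: merging on Id alone pairs every activity row for a participant with every sleep row for that participant, which is why merged_data has 12,348 rows, far more than either input. Averages computed on such a table are not day-matched. The sketch below uses made-up tables (`activity_toy`, `sleep_toy`, and most of their values are assumptions for illustration) to contrast merging on Id with merging on Id and date:

```r
# Toy activity table: participant 1 has three days, participant 2 has two
activity_toy <- data.frame(
  Id = c(1, 1, 1, 2, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-14",
                   "2016-04-12", "2016-04-13")),
  TotalSteps = c(13162, 10735, 10460, 4000, 5200)
)

# Toy sleep table: sleep was not logged on every day
sleep_toy <- data.frame(
  Id = c(1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-14", "2016-04-13")),
  TotalMinutesAsleep = c(327, 412, 450)
)

# Merging on Id only: each activity day is paired with every sleep day
by_id <- merge(activity_toy, sleep_toy, by = "Id")
print(nrow(by_id))       # 3 * 2 + 2 * 1 = 8 rows

# Merging on Id and date: one row per participant-day present in both tables
by_id_date <- merge(activity_toy, sleep_toy, by = c("Id", "date"))
print(nrow(by_id_date))  # 3 rows
```

Applying this to the real data would require a parsed date column in both data frames, and the resulting row count and summary statistics would differ from those reported here.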

Remove rows with zero Total Steps or Calories

# Zero steps or calories suggest the device was not worn that day
merged_data_2 <- merged_data %>%
  filter(TotalSteps != 0) %>%
  filter(Calories != 0)
view(merged_data_2)
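
Before filtering, it may help to know how many rows are affected. A minimal sketch on a made-up table (`toy` and its values are assumptions for illustration) that counts zero entries per column and then applies the same condition in base R:

```r
# Made-up daily records; zeros stand for days with nothing tracked
toy <- data.frame(
  TotalSteps = c(13162, 0, 10460, 0, 9762),
  Calories   = c(1985, 0, 1776, 1500, 1745)
)

# Count zero entries in each column
zero_counts <- colSums(toy == 0)
print(zero_counts)

# Keep only rows where both steps and calories are non-zero
toy_clean <- toy[toy$TotalSteps != 0 & toy$Calories != 0, ]
print(nrow(toy_clean))
```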

Calculating sum of minutes activity

VeryActiveMins <- sum(daily_activity_1$VeryActiveMinutes)
FairlyActiveMins <- sum(daily_activity_1$FairlyActiveMinutes)
LightlyActiveMins <- sum(daily_activity_1$LightlyActiveMinutes)
SedentaryMins <- sum(daily_activity_1$SedentaryMinutes)
TotalMinsActivity <- VeryActiveMins + FairlyActiveMins + LightlyActiveMins + SedentaryMins
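
A quick sanity check on these columns is that the four activity levels for one day should never add up to more than 1,440 minutes. The sketch below uses the first four rows shown in the str() output above; the variable names are illustrative, and reading any shortfall below 1,440 as sleep or non-wear time is an assumption, not something the data states:

```r
# First four daily records from the str(dailyActivity_merged) output
minutes <- data.frame(
  VeryActiveMinutes    = c(25, 21, 30, 29),
  FairlyActiveMinutes  = c(13, 19, 11, 34),
  LightlyActiveMinutes = c(328, 217, 181, 209),
  SedentaryMinutes     = c(728, 776, 1218, 726)
)

# Total tracked minutes per day
tracked <- rowSums(minutes)
print(tracked)

# Flag any day that exceeds the 1,440 minutes available
over_limit <- tracked > 1440
print(any(over_limit))

# Share of each activity level across these four days, in percent
share <- round(colSums(minutes) / sum(minutes) * 100, 1)
print(share)
```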

Visualizing Data

ggplot(data = daily_activity_1) +
  geom_point(mapping = aes(x = SedentaryMinutes, y = TotalSteps), color = "red") +
  labs(title = "Total Steps vs. Sedentary Minutes")

ggplot(data = daily_activity_1) +
  geom_point(mapping = aes(x = LightlyActiveMinutes, y = TotalSteps), color = "dark green") +
  labs(title = "Total Steps vs. Lightly Active Minutes")

ggplot(data = daily_activity_1) +
  geom_point(mapping = aes(x = VeryActiveMinutes, y = TotalSteps), color = "orange") +
  labs(title = "Total Steps vs. Very Active Minutes")

The graphs above show how daily steps relate to sedentary, lightly active, and very active minutes. Most recorded time seems to be sedentary to lightly active.

slices <- c(VeryActiveMins,FairlyActiveMins,LightlyActiveMins,SedentaryMins)
lbls <- c("VeryActive","FairlyActive","LightlyActive","Sedentary")
pct <- round(slices/sum(slices)*100)
lbls <- paste(lbls, pct)
lbls <- paste(lbls, "%", sep="")
pie(slices, labels = lbls, col = topo.colors(length(lbls)), main = "Percentage of Activity")

This pie chart shows the share of each activity level, including sedentary minutes, in the total minutes recorded by participants over one month.

ggplot(data = merged_data_2) +
  geom_point(mapping = aes(x = TotalMinutesAsleep, y = TotalTimeInBed), color = "blue") +
  labs(title = "Time in Bed vs. Minutes Asleep")

This graph shows that in general, most participants spent their time in bed sleeping.
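
One way to put a number on this is the ratio of minutes asleep to minutes in bed. The sketch below uses the six rows shown by head(sleepDay_merged) above, all from a single participant, so it illustrates the calculation rather than a result for the whole sample; `efficiency` and `awake_in_bed` are illustrative names:

```r
# Values from the head(sleepDay_merged) output (one participant only)
asleep <- c(327, 384, 412, 340, 700, 304)
in_bed <- c(346, 407, 442, 367, 712, 320)

# Fraction of time in bed that was spent asleep
efficiency <- asleep / in_bed
print(round(efficiency, 2))

# Minutes spent in bed but awake
awake_in_bed <- in_bed - asleep
print(awake_in_bed)
```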

ggplot(data = daily_activity_1) +
  geom_point(mapping = aes(x = TotalSteps, y = Calories), color = "purple") +
  geom_smooth(mapping = aes(x = TotalSteps, y = Calories)) +
  labs(title = "Total Steps vs. Calories")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

This graph shows the positive relationship between Total Steps and calories burned.
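
The strength of such a relationship can be summarized with a correlation coefficient. The sketch below uses only the first five rows shown in the str() output above, all from one participant, so the values it produces illustrate the calculation and say nothing about the full dataset:

```r
# First five daily records from the str(dailyActivity_merged) output
steps    <- c(13162, 10735, 10460, 9762, 12669)
calories <- c(1985, 1797, 1776, 1745, 1863)

# Pearson correlation measures linear association
pearson <- cor(steps, calories)
print(pearson)

# Spearman correlation measures monotonic (rank-based) association
spearman <- cor(steps, calories, method = "spearman")
print(spearman)
```

On the real data, the same calls could be applied to the TotalSteps and Calories columns of daily_activity_1.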

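# stat = "summary" plots the mean of TotalSteps for each value of Total_hrs_asleep;
# a plain geom_bar() would plot the number of rows instead
ggplot(data = merged_data_2) +
  geom_bar(mapping = aes(x = Total_hrs_asleep, y = TotalSteps), stat = "summary", fun = "mean") +
  labs(title = "Average Total Steps vs. Hours Asleep", x = "Hours Asleep", y = "Average Total Steps")

This graph shows how average daily steps vary with hours slept.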

Key Findings

The data is limited and it seems that there are participants who didn’t track or log their data.
Most participants have recorded sedentary to light activity.
The average total steps is below the CDC recommended daily steps for a healthy lifestyle.
Sleep habits could be slightly improved. The average is about 7 hours, at the low end of the recommended 7-9 hours. Participants who sleep 7-8 hours appear to log a higher total step count.
Sleep habits are not logged consistently. Participants may not want to wear the devices when sleeping, or may be too busy to record their total sleep in the morning.

Recommendations

Use the Bellabeat app to send users notifications that it’s time to start moving when periods of sedentary activity are detected.
Through the app, allow users to set goals, and send reminders and progress updates throughout the day to help motivate users to achieve them.
Use the Bellabeat app and membership to give users helpful tips and recommend other useful resources for better sleep habits and making time for more physical activity.
Develop a device that users can wear comfortably at night to accurately record sleep habits.
Offer incentives for more tracked daily activity. This could encourage users to wear their devices and be more active. For example, users could earn points that can be redeemed for Bellabeat products or monthly membership subscriptions.
Devices could vibrate or chime to signal it’s time to start moving when users don’t have access to their apps.