Bellabeat Capstone Project

Ask–defining the task

In this case study for the Google Data Analytics Professional Certificate through Coursera, Urska Srsen, Chief Creative Officer of Bellabeat, a high-tech manufacturer of health-focused products for women, is asking for trends in the way consumers use fitness trackers that can inform Bellabeat’s marketing strategies. The data set provided for analysis is from a group of 30 FitBit users who voluntarily responded to a request for data. The case study guidelines ask that I focus on growth opportunities and on one of their fitness products. Using the Fitbit data, we should be able to look at trends showing how fitness trackers are used and apply those insights to make high level recommendations to inform Bellabeat’s marketing strategy. Stakeholders who would be presented with the recommendations include Bellabeat’s marketing team and cofounders.

Prepare–describing the data sources

Provided data set

Here is the information Srsen provided: FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius): This Kaggle data set contains personal fitness tracker from thirty Fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.

This data set can be found at http://kaggle.com/arashic/fitbit. The data is stored in several CSV files, some in both long or wide form. The data from individual FitBit users has been merged into separate CSV files according to steps, daily activity, sleep, heart rate, intensity, calories burned, MET, and weight. The users are only identified by an id number.

The Ivy product made by Bellabeat tracks activities, steps, sleep, heart rate, meditation, and burned calories according to their website, and is designed specifically for women. The FitBit data set does include four of these measurements: activity, sleep, heart rate, and calories. The 30 respondents whose data make up the FitBit data set have been anonymized and may or may not be women. Trends from the data set can still be applicable, although the it does have limitations. Not knowing the sex of the individuals, the lack of stress measurements, and the small sample size (30) are all limitations of the data.

Additional resources that I needed

The metadata for the FitBit Fitness Tracker Data doesn’t contain descriptions or explanations of the column names. However, FitBit does provide a database dictionary in a pdf online at http://fitabase.com/resources/knowledge-base/exporting-data/data-dictionaries/ This was very helpful in decrypting the meaning of the column names, the metrics, and measurement units used.

Process-cleaning and manipulation of data

Uploading CSV files to R from kaggle.com/arashnic/fitbit

I downloaded a zipped file from Kaggle to my computer. From the files menu in RStudio, I clicked Upload and selected the zipped file.

Installing and loading common packages and libraries

install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Loading CSV files to create data frames

Because Bellabeat products and FitBit products both record activity,sleep data, and heart rate data, I will load the sleep, activity, and heart rate CSV files into data frames so that I can view these files. The names for the data frames will be a shortened form of the CSV file names. The glimpse function lets me see information about each data frame.

sleep_day <- read_csv("Fitabase Data/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
## Rows: 0 Columns: 0
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(sleep_day)
## Rows: 0
## Columns: 0
minute_sleep <- read_csv("Fitabase Data/Fitabase Data 4.12.16-5.12.16/minuteSleep_merged.csv")
## Rows: 188521 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): date
## dbl (3): Id, value, logId
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(minute_sleep)
## Rows: 188,521
## Columns: 4
## $ Id    <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960366, 1503…
## $ date  <chr> "4/12/2016 2:47:30 AM", "4/12/2016 2:48:30 AM", "4/12/2016 2:49:…
## $ value <dbl> 3, 2, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 2, 1, 1, 1, 1, 1, 1…
## $ logId <dbl> 11380564589, 11380564589, 11380564589, 11380564589, 11380564589,…
daily_activity <- read_csv("Fitabase Data/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(daily_activity)
## Rows: 940
## Columns: 15
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate             <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ TotalSteps               <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes        <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes      <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes     <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes         <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories                 <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
heartrate_seconds_merged <- read_csv("Fitabase Data/Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv")
## Rows: 2483658 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Time
## dbl (2): Id, Value
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(heartrate_seconds_merged)
## Rows: 2,483,658
## Columns: 3
## $ Id    <dbl> 2022484408, 2022484408, 2022484408, 2022484408, 2022484408, 2022…
## $ Time  <chr> "4/12/2016 7:21:00 AM", "4/12/2016 7:21:05 AM", "4/12/2016 7:21:…
## $ Value <dbl> 97, 102, 105, 103, 101, 95, 91, 93, 94, 93, 92, 89, 83, 61, 60, …

The sleep_day data frame was devoid of data (the size of the file also told us that). I will go back to Kaggle and attempt to access that again.

The other data frame for sleep, minute_sleep, contains information including the user’s id, date, sleep state, and unique id for each sleep event.

The daily_activity data frame contains the user’s id, date, steps, distance, activity intensity level by distance and by time, and calories expended. There is a column to allow for logged (not tracked) activity distances, but in the entries shown by the glimpse function, no distances were logged. To see if there is any data for logged activities, I used the filter function to see entries with values greater than zero in that column.

(logging_occurances <- filter(daily_activity, LoggedActivitiesDistance > "0"))
## # A tibble: 32 × 15
##        Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie…
##     <dbl> <chr>             <dbl>         <dbl>           <dbl>            <dbl>
##  1 6.78e9 4/26/2016          7091          5.27            5.27             1.96
##  2 6.96e9 4/21/2016         11835          9.71            7.88             4.08
##  3 6.96e9 4/25/2016         13239          9.27            9.08             2.79
##  4 6.96e9 5/9/2016          12342          8.72            8.68             3.17
##  5 7.01e9 4/12/2016         14172         10.3             9.48             4.87
##  6 7.01e9 4/13/2016         12862          9.65            8.60             4.85
##  7 7.01e9 4/14/2016         11179          8.24            7.48             3.29
##  8 7.01e9 4/18/2016         14816         11.0             9.91             4.93
##  9 7.01e9 4/19/2016         14194         10.5             9.5              4.94
## 10 7.01e9 4/20/2016         15566         11.3            10.4              4.92
## # … with 22 more rows, and 9 more variables: VeryActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>

I downloaded the sleep day CSV file from Kaggle to my computer independently of the other files in a non zipped format and was able to get the data into a spreadsheet. Now I that I have the spreadsheet, I can load the files into RStudio. This gives me access to the data that didn’t load in the sleep set previously.

sleepDay_merged <- read_csv("Fitabase Data/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv%3FX-Goog-Algorithm=GOOG4-RSA-SHA256&amp;X-Goog-Credential=gcp-kaggle-com@kaggle-161607.iam.gserviceaccount.com%2F20220210%2Fauto%2Fstorage%2Fgoog4_request&amp;X-Goog-Date=20220210T174452Z&amp;X-.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(sleepDay_merged)
## Rows: 413
## Columns: 5
## $ Id                 <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150…
## $ SleepDay           <chr> "4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM", "…
## $ TotalSleepRecords  <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ TotalMinutesAsleep <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2…
## $ TotalTimeInBed     <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3…

###Getting a better idea of what is going on in each data frame with some statistical summaries

daily_activity %>%
  select(TotalSteps,TotalDistance,SedentaryMinutes) %>%
  summary()
##    TotalSteps    TotalDistance    SedentaryMinutes
##  Min.   :    0   Min.   : 0.000   Min.   :   0.0  
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 729.8  
##  Median : 7406   Median : 5.245   Median :1057.5  
##  Mean   : 7638   Mean   : 5.490   Mean   : 991.2  
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:1229.5  
##  Max.   :36019   Max.   :28.030   Max.   :1440.0
sleepDay_merged %>%
  select(TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed) %>%
  summary()
##  TotalSleepRecords TotalMinutesAsleep TotalTimeInBed 
##  Min.   :1.000     Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:1.000     1st Qu.:361.0      1st Qu.:403.0  
##  Median :1.000     Median :433.0      Median :463.0  
##  Mean   :1.119     Mean   :419.5      Mean   :458.6  
##  3rd Qu.:1.000     3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :3.000     Max.   :796.0      Max.   :961.0

Sometimes there is more than one sleep record for a day – up to a max of 3 sleep sessions in a day.

heartrate_seconds_merged %>%
  select(Value) %>%
  summary()
##      Value       
##  Min.   : 36.00  
##  1st Qu.: 63.00  
##  Median : 73.00  
##  Mean   : 77.33  
##  3rd Qu.: 88.00  
##  Max.   :203.00

The statistics for heart rate value don’t seem very useful when grouped as a whole. Here is a look at the data in the heart rate file grouped by respondent Id.

by_id <- group_by(heartrate_seconds_merged, Id)
heart_rate_summary_by_id <- summarise(by_id, count = n(), mean = mean(Value, na.rm = TRUE), min = min(Value, na.rm = TRUE), max = max(Value, na.rm = TRUE))
print(heart_rate_summary_by_id)
## # A tibble: 14 × 5
##            Id  count  mean   min   max
##         <dbl>  <int> <dbl> <dbl> <dbl>
##  1 2022484408 154104  80.2    38   203
##  2 2026352035   2490  93.8    63   125
##  3 2347167796 152683  76.7    49   195
##  4 4020332650 285461  82.3    46   191
##  5 4388161847 249748  66.1    39   180
##  6 4558609924 192168  81.7    44   199
##  7 5553957443 255174  68.6    47   165
##  8 5577150313 248560  69.6    36   174
##  9 6117666160 158899  83.7    52   189
## 10 6775888955  32771  92.0    55   177
## 11 6962181067 266326  77.7    47   184
## 12 7007744171 133592  91.1    54   166
## 13 8792009665 122841  72.5    43   158
## 14 8877689391 228841  83.6    46   180

The meta data for the FitBit data set tells us that there are 30 respondents. I’m checking to see how many unique user ids are in the data frames.

n_distinct(daily_activity $ Id)
## [1] 33
n_distinct(minute_sleep $ Id)
## [1] 24
n_distinct(sleepDay_merged $ Id)
## [1] 24
n_distinct(heartrate_seconds_merged $ Id)
## [1] 14

The daily_activity data frame shows 33 unique ids. There is a possibility that some of the respondents had more than one type of device that they used to record fitness data, and also a likely possibility that trackers were lost and replaced with new ones. To explain the lower number of unique user id numbers in the sleep data frames, it is possible that not all volunteers chose to record their sleep. Even fewer respondents(14) have recorded heart rate files.

Merging the daily sleep data frame with the daily activity data frame

Both of these data frames have the id column which can be used to join them. However, they each have date columns with different names and different configurations of date. Before joining them I will clean up the date column in the sleepDay_merged table, rename columns to fit convention and be more informative in some cases, and select only the data I want to join.

daily_activity_ready <- select(daily_activity, Id, ActivityDate, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDistance, VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, SedentaryMinutes, Calories) %>%
  rename(id = Id) %>%
  rename(distance_in_km = TotalDistance) %>%
  rename(total_steps = TotalSteps) %>%
  rename(sedentary_minutes = SedentaryMinutes) %>%
  rename(new_date = ActivityDate) %>%
  rename(tracked_kms = TrackerDistance) %>%
  rename(logged_kms = LoggedActivitiesDistance) %>%
  rename(very_active_minutes = VeryActiveMinutes) %>%
  rename(fairly_active_minutes = FairlyActiveMinutes) %>%
  rename(lightly_active_minutes = LightlyActiveMinutes)
daily_sleep_ready <- sleepDay_merged %>%
  separate(SleepDay, into = c("new_date", "hour", "am_or_pm"), sep = " ") %>%
  rename(id = Id) %>%
  rename(minutes_asleep = TotalMinutesAsleep) %>%
  rename(minutes_in_bed = TotalTimeInBed) %>%
  select(id, new_date, minutes_asleep, minutes_in_bed)

Now that the tables for daily sleep and daily activity are compatible, I can join them.

combined_data <- left_join(daily_activity_ready, daily_sleep_ready)
## Joining, by = c("id", "new_date")
glimpse(combined_data)
## Rows: 943
## Columns: 13
## $ id                     <dbl> 1503960366, 1503960366, 1503960366, 1503960366,…
## $ new_date               <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/20…
## $ total_steps            <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019, …
## $ distance_in_km         <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.88,…
## $ tracked_kms            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.88,…
## $ logged_kms             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ very_active_minutes    <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 41,…
## $ fairly_active_minutes  <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21, …
## $ lightly_active_minutes <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, 21…
## $ sedentary_minutes      <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818, …
## $ Calories               <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2035,…
## $ minutes_asleep         <dbl> 327, 384, NA, 412, 340, 700, NA, 304, 360, 325,…
## $ minutes_in_bed         <dbl> 346, 407, NA, 442, 367, 712, NA, 320, 377, 364,…

To check that we still have all 33 distinct user Id numbers:

n_distinct(combined_data $ id)
## [1] 33
head(combined_data)
## # A tibble: 6 × 13
##           id new_date  total_steps distance_in_km tracked_kms logged_kms
##        <dbl> <chr>           <dbl>          <dbl>       <dbl>      <dbl>
## 1 1503960366 4/12/2016       13162           8.5         8.5           0
## 2 1503960366 4/13/2016       10735           6.97        6.97          0
## 3 1503960366 4/14/2016       10460           6.74        6.74          0
## 4 1503960366 4/15/2016        9762           6.28        6.28          0
## 5 1503960366 4/16/2016       12669           8.16        8.16          0
## 6 1503960366 4/17/2016        9705           6.48        6.48          0
## # … with 7 more variables: very_active_minutes <dbl>,
## #   fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
## #   sedentary_minutes <dbl>, Calories <dbl>, minutes_asleep <dbl>,
## #   minutes_in_bed <dbl>

Changing the date column from character to double and adding a weekday column

The date column is a character data type. I would like to change it to make it more useful as a date. I will mutate the dataframe to change the date to a date type and add a new column to show the day of the week for the activity.

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
combined_data_completed <- combined_data %>%
  mutate(new_date = mdy(new_date)) %>%
  mutate(day_of_the_week = weekdays(new_date))

combined_data_completed$day_of_the_week <- factor(combined_data_completed$day_of_the_week, levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday" ))
head(combined_data_completed)
## # A tibble: 6 × 14
##           id new_date   total_steps distance_in_km tracked_kms logged_kms
##        <dbl> <date>           <dbl>          <dbl>       <dbl>      <dbl>
## 1 1503960366 2016-04-12       13162           8.5         8.5           0
## 2 1503960366 2016-04-13       10735           6.97        6.97          0
## 3 1503960366 2016-04-14       10460           6.74        6.74          0
## 4 1503960366 2016-04-15        9762           6.28        6.28          0
## 5 1503960366 2016-04-16       12669           8.16        8.16          0
## 6 1503960366 2016-04-17        9705           6.48        6.48          0
## # … with 8 more variables: very_active_minutes <dbl>,
## #   fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
## #   sedentary_minutes <dbl>, Calories <dbl>, minutes_asleep <dbl>,
## #   minutes_in_bed <dbl>, day_of_the_week <fct>

Share – Providing Insights, Graphics, and Recommendations

Providing Insights, Graphics, and High Level Recommendations in slide presentation format:

Link to Google Slide Presentation

Insights from Fitbit Data about How Smart Devices Are Used

  • Many respondents are not recording all the data available to them. All respondents reported activity; a smaller number also recorded sleep data; and even fewer of the respondents reported heart rate data.

  • Some days the respondents left their devices at home.

  • The FitBit data shows that users prefer to have their distance tracked automatically as opposed to logging their distance manually. Even when they had left their device behind for the day they did not later log a daily distance.

  • Smart device users may use their trackers when training for half marathons. Surprisingly, a user who normally averages small daily distances also recorded a distance equal to a half marathon.

Supporting Graphics

Applying Insights to Marketing

Users aren’t taking advantage of all the types of data their trackers record. Maybe they don’t know how the data could be useful to them. I would suggest when selling Ivy to package it with another of Bellabeat’s products, the Bellabeat membership, so that a coach can guide customers through getting all the benefits of their Ivy.

When I see that users of smart trackers are leaving their device behind for at least 24 hours, I think that one of two things may have happened: either having to charge the device made them forget it, or the device didn’t look formal enough to wear for what they had to do that day. Marketing can focus on the Ivy’s long charge and emphasize its design qualities that make it always appropriate.

Users prefer to track their distances instead of logging their distances. Marketing can emphasize the ease of tracking distance with Ivy and to lack of need to log activities.

Customers may use smart devices to train for or participate in half marathons. Marketing can find inspiration in the stories of women who regularly achieve long distances and also in the stories of women who usually have short average distances but sometimes reach impressive milestones.

Act – Final Conclusions

The marketing team can emphasize long charge, fashion-conscious design, ease of tracking, and inspirational training stories in their digital marketing of the Ivy to find new growth opportunities.

The Ivy can be sold in a package with the Bellabeat Membership to make women feel more confident that they are getting the full benefit of their Ivy by using more of the metrics tracked to support their well being.