Bellabeat Capstone Project

Ask–defining the task

In this case study for the Google Data Analytics Professional Certificate through Coursera, Urska Srsen, Chief Creative Officer of Bellabeat, a high-tech manufacturer of health-focused products for women, is asking for trends in the way consumers use fitness trackers that can inform Bellabeat’s marketing strategies. The data set provided for analysis is from a group of 30 Fitbit users who voluntarily responded to a request for data. The case study guidelines ask that I focus on growth opportunities and on one of Bellabeat’s fitness products. Using the Fitbit data, I should be able to identify trends in how fitness trackers are used and apply those insights to make high-level recommendations for Bellabeat’s marketing strategy. Stakeholders who would be presented with the recommendations include Bellabeat’s marketing team and cofounders.

Prepare–describing the data sources

Provided data set

Here is the information Srsen provided: FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius): This Kaggle data set contains personal fitness tracker data from thirty Fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.

This data set can be found at http://kaggle.com/arashnic/fitbit. The data is stored in several CSV files, some in both long and wide form. The data from individual FitBit users has been merged into separate CSV files according to steps, daily activity, sleep, heart rate, intensity, calories burned, MET, and weight. The users are identified only by an id number.

The Ivy product made by Bellabeat tracks activities, steps, sleep, heart rate, meditation, and burned calories according to their website, and is designed specifically for women. The FitBit data set includes five of these measurements: activity, steps, sleep, heart rate, and calories. The 30 respondents whose data make up the FitBit data set have been anonymized and may or may not be women. Trends from the data set can still be applicable, although it does have limitations: the sex of the individuals is unknown, there are no stress measurements, and the sample size (30) is small.

Additional resources that I needed

The metadata for the FitBit Fitness Tracker Data doesn’t contain descriptions or explanations of the column names. However, Fitabase provides data dictionaries as a PDF online at http://fitabase.com/resources/knowledge-base/exporting-data/data-dictionaries/. This was very helpful in deciphering the meaning of the column names, the metrics, and the measurement units used.

Process–cleaning and manipulation of data

Uploading CSV files to R from kaggle.com/arashnic/fitbit

I downloaded a zipped file from Kaggle to my computer. From the files menu in RStudio, I clicked Upload and selected the zipped file.

Installing and loading common packages and libraries

install.packages("tidyverse")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Loading CSV files to create data frames

Because Bellabeat products and FitBit products both record activity, sleep, and heart rate data, I will load the sleep, activity, and heart rate CSV files into data frames so that I can view these files. The names for the data frames will be a shortened form of the CSV file names. The glimpse function lets me see information about each data frame.

sleep_day <- read_csv("Fitabase Data/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")

## Rows: 0 Columns: 0
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(sleep_day)

## Rows: 0
## Columns: 0

minute_sleep <- read_csv("Fitabase Data/Fitabase Data 4.12.16-5.12.16/minuteSleep_merged.csv")

## Rows: 188521 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): date
## dbl (3): Id, value, logId
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(minute_sleep)

## Rows: 188,521
## Columns: 4
## $ Id    <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960366, 1503…
## $ date  <chr> "4/12/2016 2:47:30 AM", "4/12/2016 2:48:30 AM", "4/12/2016 2:49:…
## $ value <dbl> 3, 2, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 2, 1, 1, 1, 1, 1, 1…
## $ logId <dbl> 11380564589, 11380564589, 11380564589, 11380564589, 11380564589,…

daily_activity <- read_csv("Fitabase Data/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")

## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(daily_activity)

## Rows: 940
## Columns: 15
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate             <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ TotalSteps               <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes        <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes      <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes     <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes         <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories                 <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…

heartrate_seconds_merged <- read_csv("Fitabase Data/Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv")

## Rows: 2483658 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Time
## dbl (2): Id, Value
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(heartrate_seconds_merged)

## Rows: 2,483,658
## Columns: 3
## $ Id    <dbl> 2022484408, 2022484408, 2022484408, 2022484408, 2022484408, 2022…
## $ Time  <chr> "4/12/2016 7:21:00 AM", "4/12/2016 7:21:05 AM", "4/12/2016 7:21:…
## $ Value <dbl> 97, 102, 105, 103, 101, 95, 91, 93, 94, 93, 92, 89, 83, 61, 60, …

The sleep_day data frame came back empty (the size of the file on disk suggested the same). I will go back to Kaggle and try downloading that file again.
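A quick size check before reading can catch an empty file like this one. The sketch below uses a temporary folder and two made-up file names (not the Kaggle files), purely for illustration:

```r
# Hypothetical folder and file names, created only for this illustration.
demo_dir <- file.path(tempdir(), "fitbit_size_check")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("Id,Value", "1,97"), file.path(demo_dir, "has_rows.csv"))
invisible(file.create(file.path(demo_dir, "empty.csv")))  # zero-byte file

# List every CSV with its size in bytes and flag the empty ones.
csv_files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
sizes <- data.frame(
  file  = basename(csv_files),
  bytes = file.size(csv_files)
)
sizes$is_empty <- sizes$bytes == 0
print(sizes)
```

Pointing `demo_dir` at the real data folder would list the downloaded files the same way.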

The other data frame for sleep, minute_sleep, contains information including the user’s id, date, sleep state, and unique id for each sleep event.

The daily_activity data frame contains the user’s id, date, steps, distance, activity intensity level by distance and by time, and calories expended. There is a column to allow for logged (not tracked) activity distances, but in the entries shown by the glimpse function, no distances were logged. To see if there is any data for logged activities, I used the filter function to see entries with values greater than zero in that column.

(logging_occurances <- filter(daily_activity, LoggedActivitiesDistance > 0))

## # A tibble: 32 × 15
##        Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie…
##     <dbl> <chr>             <dbl>         <dbl>           <dbl>            <dbl>
##  1 6.78e9 4/26/2016          7091          5.27            5.27             1.96
##  2 6.96e9 4/21/2016         11835          9.71            7.88             4.08
##  3 6.96e9 4/25/2016         13239          9.27            9.08             2.79
##  4 6.96e9 5/9/2016          12342          8.72            8.68             3.17
##  5 7.01e9 4/12/2016         14172         10.3             9.48             4.87
##  6 7.01e9 4/13/2016         12862          9.65            8.60             4.85
##  7 7.01e9 4/14/2016         11179          8.24            7.48             3.29
##  8 7.01e9 4/18/2016         14816         11.0             9.91             4.93
##  9 7.01e9 4/19/2016         14194         10.5             9.5              4.94
## 10 7.01e9 4/20/2016         15566         11.3            10.4              4.92
## # … with 22 more rows, and 9 more variables: VeryActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>
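One thing to watch for in a filter like this: the threshold should be a bare number, not a quoted string. When one side of `>` is a character value, R converts the other side to text and compares the two as strings, which can give surprising answers. A small illustration with made-up values:

```r
# Numeric comparison: 10 is greater than 9.
10 > 9      # TRUE

# Mixed comparison: 10 is converted to the string "10" and compared
# character by character, so "1" sorts before "9".
10 > "9"    # FALSE

# Applied to a made-up vector of distances:
distances <- c(0, 0.5, 10, 25)
numeric_result <- distances > 9     # compares as numbers
string_result  <- distances > "9"   # compares as text
numeric_result
string_result
```

A threshold of zero happens to behave the same either way for positive values, but the numeric form is the safer habit.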

I downloaded the sleep day CSV file from Kaggle to my computer separately from the other files, in unzipped form, and was able to open the data in a spreadsheet. Now that I have the file, I can load it into RStudio. This gives me access to the data that didn’t load in the sleep set previously.

sleepDay_merged <- read_csv("Fitabase Data/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv%3FX-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com@kaggle-161607.iam.gserviceaccount.com%2F20220210%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20220210T174452Z&X-.csv")

## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(sleepDay_merged)

## Rows: 413
## Columns: 5
## $ Id                 <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150…
## $ SleepDay           <chr> "4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM", "…
## $ TotalSleepRecords  <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ TotalMinutesAsleep <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2…
## $ TotalTimeInBed     <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3…

Getting a better idea of what is going on in each data frame with some statistical summaries

daily_activity %>%
  select(TotalSteps,TotalDistance,SedentaryMinutes) %>%
  summary()

##    TotalSteps    TotalDistance    SedentaryMinutes
##  Min.   :    0   Min.   : 0.000   Min.   :   0.0  
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 729.8  
##  Median : 7406   Median : 5.245   Median :1057.5  
##  Mean   : 7638   Mean   : 5.490   Mean   : 991.2  
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:1229.5  
##  Max.   :36019   Max.   :28.030   Max.   :1440.0

sleepDay_merged %>%
  select(TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed) %>%
  summary()

##  TotalSleepRecords TotalMinutesAsleep TotalTimeInBed 
##  Min.   :1.000     Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:1.000     1st Qu.:361.0      1st Qu.:403.0  
##  Median :1.000     Median :433.0      Median :463.0  
##  Mean   :1.119     Mean   :419.5      Mean   :458.6  
##  3rd Qu.:1.000     3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :3.000     Max.   :796.0      Max.   :961.0

Sometimes there is more than one sleep record for a day – up to a max of 3 sleep sessions in a day.

heartrate_seconds_merged %>%
  select(Value) %>%
  summary()

##      Value       
##  Min.   : 36.00  
##  1st Qu.: 63.00  
##  Median : 73.00  
##  Mean   : 77.33  
##  3rd Qu.: 88.00  
##  Max.   :203.00

The statistics for heart rate value don’t seem very useful when grouped as a whole. Here is a look at the data in the heart rate file grouped by respondent Id.

by_id <- group_by(heartrate_seconds_merged, Id)
heart_rate_summary_by_id <- summarise(by_id, count = n(), mean = mean(Value, na.rm = TRUE), min = min(Value, na.rm = TRUE), max = max(Value, na.rm = TRUE))
print(heart_rate_summary_by_id)

## # A tibble: 14 × 5
##            Id  count  mean   min   max
##         <dbl>  <int> <dbl> <dbl> <dbl>
##  1 2022484408 154104  80.2    38   203
##  2 2026352035   2490  93.8    63   125
##  3 2347167796 152683  76.7    49   195
##  4 4020332650 285461  82.3    46   191
##  5 4388161847 249748  66.1    39   180
##  6 4558609924 192168  81.7    44   199
##  7 5553957443 255174  68.6    47   165
##  8 5577150313 248560  69.6    36   174
##  9 6117666160 158899  83.7    52   189
## 10 6775888955  32771  92.0    55   177
## 11 6962181067 266326  77.7    47   184
## 12 7007744171 133592  91.1    54   166
## 13 8792009665 122841  72.5    43   158
## 14 8877689391 228841  83.6    46   180

The metadata for the FitBit data set tells us that there are 30 respondents. I’m checking to see how many unique user ids are in the data frames.

n_distinct(daily_activity $ Id)

## [1] 33

n_distinct(minute_sleep $ Id)

## [1] 24

n_distinct(sleepDay_merged $ Id)

## [1] 24

n_distinct(heartrate_seconds_merged $ Id)

## [1] 14

The daily_activity data frame shows 33 unique ids, three more than the 30 respondents described in the metadata. It is possible that some respondents used more than one device to record fitness data, or that trackers were lost and replaced with new ones. The lower number of unique ids in the sleep data frames may mean that not all volunteers chose to record their sleep. Even fewer respondents (14) have recorded heart rate data.
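To see which ids are missing from a table rather than just how many, the id sets can be compared directly. A minimal sketch with made-up ids (the toy frames stand in for the activity and sleep tables; the values are invented):

```r
library(dplyr)

# Toy stand-ins for the activity and sleep tables; ids are invented.
activity_toy <- tibble(Id = c(101, 101, 102, 103, 104))
sleep_toy    <- tibble(Id = c(101, 103, 103))

# Ids that recorded activity but never recorded sleep.
ids_without_sleep <- setdiff(unique(activity_toy$Id), unique(sleep_toy$Id))
ids_without_sleep

# Number of days recorded per id, which helps spot ids with very few entries.
days_per_id <- count(activity_toy, Id, name = "days_recorded")
days_per_id
```

Ids with only a handful of recorded days would be candidates for the replaced-tracker explanation, though that would still need checking.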

Merging the daily sleep data frame with the daily activity data frame

Both of these data frames have the id column which can be used to join them. However, they each have date columns with different names and different configurations of date. Before joining them I will clean up the date column in the sleepDay_merged table, rename columns to fit convention and be more informative in some cases, and select only the data I want to join.

daily_activity_ready <- select(daily_activity, Id, ActivityDate, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDistance, VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, SedentaryMinutes, Calories) %>%
  rename(id = Id) %>%
  rename(distance_in_km = TotalDistance) %>%
  rename(total_steps = TotalSteps) %>%
  rename(sedentary_minutes = SedentaryMinutes) %>%
  rename(new_date = ActivityDate) %>%
  rename(tracked_kms = TrackerDistance) %>%
  rename(logged_kms = LoggedActivitiesDistance) %>%
  rename(very_active_minutes = VeryActiveMinutes) %>%
  rename(fairly_active_minutes = FairlyActiveMinutes) %>%
  rename(lightly_active_minutes = LightlyActiveMinutes)

daily_sleep_ready <- sleepDay_merged %>%
  separate(SleepDay, into = c("new_date", "hour", "am_or_pm"), sep = " ") %>%
  rename(id = Id) %>%
  rename(minutes_asleep = TotalMinutesAsleep) %>%
  rename(minutes_in_bed = TotalTimeInBed) %>%
  select(id, new_date, minutes_asleep, minutes_in_bed)
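An alternative to splitting the string, shown here only as a sketch, is to parse the full timestamp with lubridate and keep the date part. The sample values are copied from the `SleepDay` column shown earlier. With this approach, the activity table's date column would need converting to a date type as well before the two tables could be joined on it.

```r
library(lubridate)

# Two sample timestamps in the same format as the SleepDay column.
sleep_day_strings <- c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM")

# mdy_hms() parses month/day/year plus a clock time;
# as_date() then drops the time of day.
sleep_dates <- as_date(mdy_hms(sleep_day_strings))
sleep_dates
```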

Now that the tables for daily sleep and daily activity are compatible, I can join them.

combined_data <- left_join(daily_activity_ready, daily_sleep_ready)

## Joining, by = c("id", "new_date")

glimpse(combined_data)

## Rows: 943
## Columns: 13
## $ id                     <dbl> 1503960366, 1503960366, 1503960366, 1503960366,…
## $ new_date               <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/20…
## $ total_steps            <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019, …
## $ distance_in_km         <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.88,…
## $ tracked_kms            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.88,…
## $ logged_kms             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ very_active_minutes    <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 41,…
## $ fairly_active_minutes  <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21, …
## $ lightly_active_minutes <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, 21…
## $ sedentary_minutes      <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818, …
## $ Calories               <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2035,…
## $ minutes_asleep         <dbl> 327, 384, NA, 412, 340, 700, NA, 304, 360, 325,…
## $ minutes_in_bed         <dbl> 346, 407, NA, 442, 367, 712, NA, 320, 377, 364,…
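The joined frame has 943 rows, three more than the 940 rows in `daily_activity`. A left join adds rows whenever a key on the left matches more than one row on the right, so one possible cause (not checked here) is repeated id/date rows in the sleep table. A toy example of that effect, with invented values:

```r
library(dplyr)

activity_toy <- tibble(
  id       = c(1, 1, 2),
  new_date = c("4/12/2016", "4/13/2016", "4/12/2016"),
  steps    = c(13162, 10735, 9000)
)

# The first sleep row appears twice, as a duplicated record would.
sleep_toy <- tibble(
  id             = c(1, 1, 2),
  new_date       = c("4/12/2016", "4/12/2016", "4/12/2016"),
  minutes_asleep = c(327, 327, 400)
)

joined_raw   <- left_join(activity_toy, sleep_toy, by = c("id", "new_date"))
joined_dedup <- left_join(activity_toy, distinct(sleep_toy), by = c("id", "new_date"))

nrow(joined_raw)    # 4: one more row than activity_toy
nrow(joined_dedup)  # 3: same number of rows as activity_toy
```

Running `distinct()` on the sleep table before joining, or comparing `nrow()` before and after, is a cheap way to confirm or rule this out.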

To check that we still have all 33 distinct user Id numbers:

n_distinct(combined_data $ id)

## [1] 33

head(combined_data)

## # A tibble: 6 × 13
##           id new_date  total_steps distance_in_km tracked_kms logged_kms
##        <dbl> <chr>           <dbl>          <dbl>       <dbl>      <dbl>
## 1 1503960366 4/12/2016       13162           8.5         8.5           0
## 2 1503960366 4/13/2016       10735           6.97        6.97          0
## 3 1503960366 4/14/2016       10460           6.74        6.74          0
## 4 1503960366 4/15/2016        9762           6.28        6.28          0
## 5 1503960366 4/16/2016       12669           8.16        8.16          0
## 6 1503960366 4/17/2016        9705           6.48        6.48          0
## # … with 7 more variables: very_active_minutes <dbl>,
## #   fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
## #   sedentary_minutes <dbl>, Calories <dbl>, minutes_asleep <dbl>,
## #   minutes_in_bed <dbl>

Changing the date column from character to date and adding a weekday column

The date column is a character data type. I would like to change it to make it more useful as a date. I will mutate the data frame to change the column to a date type and add a new column to show the day of the week for the activity.

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

combined_data_completed <- combined_data %>%
  mutate(new_date = mdy(new_date)) %>%
  mutate(day_of_the_week = weekdays(new_date))

combined_data_completed$day_of_the_week <- factor(combined_data_completed$day_of_the_week, levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday" ))
head(combined_data_completed)

## # A tibble: 6 × 14
##           id new_date   total_steps distance_in_km tracked_kms logged_kms
##        <dbl> <date>           <dbl>          <dbl>       <dbl>      <dbl>
## 1 1503960366 2016-04-12       13162           8.5         8.5           0
## 2 1503960366 2016-04-13       10735           6.97        6.97          0
## 3 1503960366 2016-04-14       10460           6.74        6.74          0
## 4 1503960366 2016-04-15        9762           6.28        6.28          0
## 5 1503960366 2016-04-16       12669           8.16        8.16          0
## 6 1503960366 2016-04-17        9705           6.48        6.48          0
## # … with 8 more variables: very_active_minutes <dbl>,
## #   fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
## #   sedentary_minutes <dbl>, Calories <dbl>, minutes_asleep <dbl>,
## #   minutes_in_bed <dbl>, day_of_the_week <fct>
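`weekdays()` returns names in the session's locale, so the hard-coded English levels above assume an English locale. A sketch of an alternative that avoids typing the level names, using lubridate's `wday()` (the sample dates are taken from the output above):

```r
library(lubridate)

sample_dates <- as.Date(c("2016-04-12", "2016-04-16", "2016-04-17"))

# label = TRUE returns an ordered factor; with the default week start
# the levels run Sunday through Saturday.
day_labels <- wday(sample_dates, label = TRUE, abbr = FALSE)
day_labels

# The numeric form counts Sunday as 1 unless the week start option is changed.
day_numbers <- wday(sample_dates)
day_numbers
```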

Analyze – Identifying Trends and Relationships

Using ggplot to visually explore the sleep and activity data

ggplot(data = combined_data, mapping = aes(x = total_steps, y = distance_in_km)) +
  geom_point()

I find it interesting to see the difference in stride length become apparent as the distances get longer. Also, three of the data points show total daily distances equivalent to or greater than a half marathon - about 21.1km.

I want to look at the same plot, faceted by id to look at individual patterns of activity, and also see which users had a daily recorded distance of over 21km.

install.packages("viridis")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)

library(viridis)

## Loading required package: viridisLite

ggplot(data = combined_data, mapping = aes(x = total_steps, y = distance_in_km, color = distance_in_km > 21 )) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90))+
  facet_wrap(~id) +
  scale_color_viridis(discrete = TRUE, option = "cividis")

I’m surprised to see that one of the respondents who had one of the daily totals of distance equal to a half marathon normally stays below 10km. This respondent is the second graph on the first row above. That kind of data could be inspiring to other users who want to try a new level of distance. The other two data points of greatest daily distance are both from the last respondent graphed. This respondent has a wide range of daily activity distances.

I’m interested to see if there is a correlation with sleep. First I’ll try sleep and total distance.

ggplot(data = combined_data, mapping = aes(x = minutes_asleep, y = distance_in_km)) +
  geom_point() +
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

## Warning: Removed 530 rows containing non-finite values (stat_smooth).

## Warning: Removed 530 rows containing missing values (geom_point).

There is very little correlation. Now I’ll try minutes in bed and minutes asleep.
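Before moving on, a correlation coefficient would put a number on the sleep and distance relationship. A minimal sketch using only the first six values shown in the glimpse output, so the result is not the coefficient for the full data; on the real data the same call would take `combined_data$minutes_asleep` and `combined_data$distance_in_km`:

```r
# First six values from the glimpse output; NA marks a day with no sleep record.
minutes_asleep_toy <- c(327, 384, NA, 412, 340, 700)
distance_toy       <- c(8.50, 6.97, 6.74, 6.28, 8.16, 6.48)

# use = "complete.obs" drops pairs where either value is missing,
# similar to the rows ggplot removes with a warning.
r_value <- cor(minutes_asleep_toy, distance_toy, use = "complete.obs")
r_value
```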

ggplot(data = combined_data, mapping = aes(x = minutes_asleep, y = minutes_in_bed)) +
  geom_point() +
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

## Warning: Removed 530 rows containing non-finite values (stat_smooth).

## Warning: Removed 530 rows containing missing values (geom_point).

Since minutes asleep is a component of minutes in bed, there is a correlation, although there are some outliers. What about the time it takes to get to sleep and the distance completed during the day? Could there be a correlation there? I need to add a new column for time awake in bed by mutating the data frame and calculating that column from the minutes in bed and minutes asleep columns.

mutate(combined_data, awake_in_bed = minutes_in_bed - minutes_asleep)

## # A tibble: 943 × 14
##            id new_date  total_steps distance_in_km tracked_kms logged_kms
##         <dbl> <chr>           <dbl>          <dbl>       <dbl>      <dbl>
##  1 1503960366 4/12/2016       13162           8.5         8.5           0
##  2 1503960366 4/13/2016       10735           6.97        6.97          0
##  3 1503960366 4/14/2016       10460           6.74        6.74          0
##  4 1503960366 4/15/2016        9762           6.28        6.28          0
##  5 1503960366 4/16/2016       12669           8.16        8.16          0
##  6 1503960366 4/17/2016        9705           6.48        6.48          0
##  7 1503960366 4/18/2016       13019           8.59        8.59          0
##  8 1503960366 4/19/2016       15506           9.88        9.88          0
##  9 1503960366 4/20/2016       10544           6.68        6.68          0
## 10 1503960366 4/21/2016        9819           6.34        6.34          0
## # … with 933 more rows, and 8 more variables: very_active_minutes <dbl>,
## #   fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
## #   sedentary_minutes <dbl>, Calories <dbl>, minutes_asleep <dbl>,
## #   minutes_in_bed <dbl>, awake_in_bed <dbl>
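Because `mutate()` returns a new data frame and leaves its input untouched, the new column only persists if the result is assigned to a name. A short sketch with toy values (`sleep_toy` and `sleep_with_awake` are illustrative names, not objects from this analysis):

```r
library(dplyr)

sleep_toy <- tibble(
  minutes_in_bed = c(346, 407, 442),
  minutes_asleep = c(327, 384, 412)
)

# Without assignment the result is printed and then discarded.
mutate(sleep_toy, awake_in_bed = minutes_in_bed - minutes_asleep)
"awake_in_bed" %in% names(sleep_toy)   # FALSE

# With assignment the new column can be reused in later plots.
sleep_with_awake <- mutate(sleep_toy, awake_in_bed = minutes_in_bed - minutes_asleep)
sleep_with_awake$awake_in_bed          # 19 23 30
```

Assigning once would avoid repeating the same `mutate()` call inside each plot.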

ggplot(data = mutate(combined_data, awake_in_bed = minutes_in_bed - minutes_asleep), mapping = aes(x = awake_in_bed, y = distance_in_km)) +
  geom_point()+
  labs(title = "Distance Completed and Minutes Awake in Bed")

## Warning: Removed 530 rows containing missing values (geom_point).

As much as I want there to be a correlation between getting exercise and being able to drift off to sleep quickly, I am not seeing it here. I think other variables could be affecting time in bed awake, like personal habits. Some of the data points are showing people lying in bed awake for more than two hours. I hope they’re reading or watching a movie.

Here is a faceted look at that same graph. I’ve also added color to make it easier to see longer times awake in bed with an arbitrary division at greater than 30 minutes awake in bed.

mutate(combined_data, minutes_awake_in_bed = minutes_in_bed - minutes_asleep)

## # A tibble: 943 × 14
##            id new_date  total_steps distance_in_km tracked_kms logged_kms
##         <dbl> <chr>           <dbl>          <dbl>       <dbl>      <dbl>
##  1 1503960366 4/12/2016       13162           8.5         8.5           0
##  2 1503960366 4/13/2016       10735           6.97        6.97          0
##  3 1503960366 4/14/2016       10460           6.74        6.74          0
##  4 1503960366 4/15/2016        9762           6.28        6.28          0
##  5 1503960366 4/16/2016       12669           8.16        8.16          0
##  6 1503960366 4/17/2016        9705           6.48        6.48          0
##  7 1503960366 4/18/2016       13019           8.59        8.59          0
##  8 1503960366 4/19/2016       15506           9.88        9.88          0
##  9 1503960366 4/20/2016       10544           6.68        6.68          0
## 10 1503960366 4/21/2016        9819           6.34        6.34          0
## # … with 933 more rows, and 8 more variables: very_active_minutes <dbl>,
## #   fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
## #   sedentary_minutes <dbl>, Calories <dbl>, minutes_asleep <dbl>,
## #   minutes_in_bed <dbl>, minutes_awake_in_bed <dbl>

ggplot(data = mutate(combined_data, minutes_awake_in_bed = minutes_in_bed - minutes_asleep), mapping = aes(x = minutes_awake_in_bed, y = distance_in_km, color = minutes_awake_in_bed > 30)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90))+
  facet_wrap(~id) +
  labs(title = "Distance Completed and Minutes Awake in Bed by Individual") + 
  scale_color_viridis(discrete = TRUE, option = "cividis")

## Warning: Removed 530 rows containing missing values (geom_point).

Three respondents spent more than 100 minutes in bed awake. Of the total of 24 who are recording their sleep behaviors, that’s 12.5% of the sample population. Long times awake in bed seem to be associated with specific individuals rather than correlated with daily distance totals. I want to try one last plot with time in bed awake, and compare it to calories burned during the day.
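As an aside on the 12.5% figure (3 of 24), the share can be computed rather than read off the plot. A sketch on invented data, where a respondent counts if any night exceeds 100 minutes awake in bed (that rule is my reading of the text, not something stated in the data):

```r
library(dplyr)

# Invented nights for four respondents.
nights_toy <- tibble(
  id             = c(1, 1, 2, 2, 3, 4),
  minutes_in_bed = c(400, 520, 450, 430, 610, 380),
  minutes_asleep = c(380, 390, 430, 400, 600, 360)
)

share_long_awake <- nights_toy %>%
  mutate(awake_in_bed = minutes_in_bed - minutes_asleep) %>%
  group_by(id) %>%
  summarise(any_over_100 = any(awake_in_bed > 100)) %>%
  summarise(share = mean(any_over_100)) %>%
  pull(share)

share_long_awake   # 0.25: one of four toy respondents
```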

ggplot(data = mutate(combined_data, awake_in_bed = minutes_in_bed - minutes_asleep), mapping = aes(x = awake_in_bed, y = Calories)) +
  geom_point() +
  geom_smooth() +
  labs(title = "Calories Burned and Minutes Awake in Bed")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

## Warning: Removed 530 rows containing non-finite values (stat_smooth).

## Warning: Removed 530 rows containing missing values (geom_point).

And that’s the closest I can find to a connection between exercise and sleeping for this group. If you stay in bed too long, you just don’t have the time to burn as many calories as those who go ahead and get out of bed.

Exploring tracked, logged, and total distances

I would like to explore tracker distance, logged distance, and total distance in the daily activity data.

ggplot(data = daily_activity_ready, mapping = aes(x = distance_in_km, y = tracked_kms + logged_kms)) +
  geom_point() +
  labs(title = "Do Tracked and Logged Distances Add up to Total Distances?")

I thought that tracked and logged distance would always combine to equal total distance. The plot above shows that is not true. It looks like it is possible that respondents who need to log additional distance do not get that distance added into total distance.

ggplot(data = daily_activity_ready, mapping = aes(x = distance_in_km, y = tracked_kms)) +
  geom_point() +
  labs(title = "Does Tracked Distance by Itself Equal Total Distance?")

Tracked distance alone does not always equal the total distance either. Just to explore that further, I’ll make a new data frame of just the instances where the distance_in_km does not equal the tracked_kms plus the logged_kms.

distance_anomalies <- mutate(combined_data, calc_dist = tracked_kms + logged_kms) %>%
  select(id, new_date, calc_dist, distance_in_km) %>%
  filter(calc_dist != distance_in_km)
glimpse(distance_anomalies)

## Rows: 33
## Columns: 4
## $ id             <dbl> 6775888955, 6962181067, 6962181067, 6962181067, 7007744…
## $ new_date       <chr> "4/26/2016", "4/21/2016", "4/25/2016", "5/9/2016", "4/1…
## $ calc_dist      <dbl> 7.229596, 11.961692, 11.865175, 11.847822, 14.349782, 1…
## $ distance_in_km <dbl> 5.27, 9.71, 9.27, 8.72, 10.29, 9.65, 8.24, 10.98, 10.48…

In a total of 943 observations, there are 33 instances of what I’m calling distance anomalies (where the distances just don’t add up).
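Comparing decimal values with `!=` can flag differences that are only floating-point rounding. Whether that affects the count above was not checked; the sketch below just shows the effect, with dplyr's `near()` as a tolerant alternative and invented values:

```r
library(dplyr)

# Exact comparison treats tiny rounding differences as real differences.
(0.1 + 0.2) != 0.3        # TRUE

# near() compares within a small tolerance.
near(0.1 + 0.2, 0.3)      # TRUE

# Toy distances: only the second row differs by a meaningful amount.
dist_toy <- tibble(
  calc_dist      = c(0.1 + 0.2, 7.23),
  distance_in_km = c(0.3, 5.27)
)
anomalies_toy <- filter(dist_toy, !near(calc_dist, distance_in_km))
nrow(anomalies_toy)       # 1
```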

But what I really want to explore for the stakeholders is the way people are using their smart trackers, so I can look at the percentage of distance tracked versus distance logged.

library(tidyr)
# Compare against the number 0, not the string "0"
logging_occurance <- filter(combined_data_completed, logged_kms > 0)
tracking_occurance <- filter(combined_data_completed, tracked_kms > 0)
ggplot() +
  geom_bar(data = tracking_occurance, mapping = aes(x = tracked_kms), color = "yellow", position = "dodge") +
  geom_bar(data = logging_occurance, mapping = aes(x = logged_kms), color = "blue", position = "dodge") +
  labs(title = "Instances of Tracking and Logged Distances", x = "kilometers", y = "Count") +
  scale_color_viridis(discrete = TRUE, option = "viridis")

Clearly, the majority of distances are automatically tracked and recorded instead of being logged by the respondents themselves.

logging_occurance <- filter(combined_data_completed, logged_kms > 0)
logged <- summarize(logging_occurance, logged = sum(logged_kms))
tracking_occurance <- filter(combined_data_completed, tracked_kms > 0)
tracked <- summarize(tracking_occurance, tracked = sum(tracked_kms))

percent_logged_distance <- (logged)/(tracked + logged)
logged

## # A tibble: 1 × 1
##   logged
##    <dbl>
## 1   104.

tracked

## # A tibble: 1 × 1
##   tracked
##     <dbl>
## 1   5176.

percent_logged_distance

##       logged
## 1 0.01965591

Only about 2% of the total distance recorded (roughly 104 km of about 5,280 km) was logged manually. Respondents show a strong preference for tracking over logging their distances.
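The overall share does not show whether logging is spread across the group or concentrated in a few people. A per-respondent breakdown could answer that. The sketch below uses made-up numbers: the `toy` data frame and the names `logged_share_by_id` and `n_loggers` are hypothetical, and only the column names follow the report.

```r
library(dplyr)

# Hypothetical toy data; only the column names follow combined_data_completed.
toy <- data.frame(
  id          = c(1, 1, 2, 2, 3),
  tracked_kms = c(4, 6, 3, 5, 8),
  logged_kms  = c(0, 0, 2, 0, 0)
)

# Share of each respondent's distance that was manually logged
logged_share_by_id <- toy %>%
  group_by(id) %>%
  summarise(
    tracked = sum(tracked_kms),
    logged  = sum(logged_kms),
    .groups = "drop"
  ) %>%
  mutate(logged_share = logged / (tracked + logged))

# Number of respondents who logged any distance at all
n_loggers <- sum(logged_share_by_id$logged > 0)
```

In this toy example only respondent 2 logs any distance (2 km of 10 km, a share of 0.2), so `n_loggers` is 1.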

Looking at Daily Behaviors

Does the day of the week matter as far as how many steps the respondents took?

steps_by_day <- combined_data_completed %>%
  group_by(day_of_the_week) %>%
  summarise(total_steps = sum(total_steps))
# Bars are colored through the fill aesthetic, so the viridis scale is applied to fill
ggplot(data = steps_by_day) +
  geom_col(mapping = aes(x = day_of_the_week, y = total_steps, fill = day_of_the_week)) +
  labs(title = "Total Steps by Day") +
  scale_fill_viridis(discrete = TRUE, option = "viridis")

steps_by_day <- combined_data_completed %>%
  group_by(id,day_of_the_week) %>%
  summarise(total_steps = sum(total_steps))

## `summarise()` has grouped output by 'id'. You can override using the `.groups`
## argument.

# One panel per respondent; the viridis scale is applied to the fill aesthetic
ggplot(data = steps_by_day) +
  geom_col(mapping = aes(x = day_of_the_week, y = total_steps, fill = day_of_the_week)) +
  theme(axis.text.x = element_text(angle = 90)) +
  facet_wrap(~id) +
  labs(title = "Total Steps by Day by Respondent") +
  scale_fill_viridis(discrete = TRUE, option = "viridis")
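Total steps per weekday depend on how many records exist for each weekday, so a day with more observations can look more active than it is. Averaging per recorded day is one way to control for that. This is a minimal sketch on made-up data: the `toy` data frame and the summary column names are assumptions for illustration.

```r
library(dplyr)

# Hypothetical toy data; only the column names follow combined_data_completed.
toy <- data.frame(
  day_of_the_week = c("Monday", "Monday", "Monday", "Tuesday", "Tuesday"),
  total_steps     = c(4000, 6000, 8000, 9000, 9000)
)

# Compare the sum of steps with the mean per recorded day
steps_summary <- toy %>%
  group_by(day_of_the_week) %>%
  summarise(
    days_recorded = n(),
    steps_sum     = sum(total_steps),
    steps_mean    = mean(total_steps),
    .groups       = "drop"
  )

steps_summary
```

Here both days total 18,000 steps, but Monday averages 6,000 steps over three records while Tuesday averages 9,000 over two, so the two summaries can tell different stories.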

I notice that some days, individual users record 0 steps. I would like to explore that.

no_steps <- combined_data_completed %>%
  select(total_steps, sedentary_minutes,day_of_the_week, logged_kms )%>%
  group_by(day_of_the_week)%>%
  filter(total_steps == 0)
ggplot(data = no_steps, mapping = aes(x= day_of_the_week, fill = day_of_the_week))+
  geom_bar()+
  labs(title = "Days with No Steps Recorded")+
  scale_fill_viridis(discrete = TRUE, option = "viridis")

Friday is the day with the fewest zero-step records for this group of respondents.
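Raw counts of zero-step days can also be affected by how many records each weekday has. Expressing them as a rate per recorded day would make the weekdays directly comparable. The sketch below uses made-up data: the `toy` data frame and the names `zero_days` and `zero_rate` are hypothetical.

```r
library(dplyr)

# Hypothetical toy data; only the column names follow combined_data_completed.
toy <- data.frame(
  day_of_the_week = c("Friday", "Friday", "Friday", "Friday", "Sunday", "Sunday"),
  total_steps     = c(0, 5000, 7000, 6500, 0, 8000)
)

# Share of recorded days with zero steps, per weekday
zero_step_rate <- toy %>%
  group_by(day_of_the_week) %>%
  summarise(
    days_recorded = n(),
    zero_days     = sum(total_steps == 0),
    zero_rate     = mean(total_steps == 0),
    .groups       = "drop"
  )

zero_step_rate
```

In this toy example both days have one zero-step record, but the rate is 0.25 for Friday and 0.5 for Sunday because Friday has twice as many records.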

I noticed in a few observations that the total steps were 0 and the sedentary minutes on the same day were 1440. A whole day is 24 hours x 60 minutes per hour = 1440 minutes, which means the device recorded an entire day with no motion. I feel safe in assuming that the tracker was left behind at home those days. I’d like to see if there is a pattern between forgetting the device and the day of the week.

device_left_behind <- no_steps %>%
  filter(sedentary_minutes == 1440)

ggplot(data = device_left_behind, mapping = aes(x=day_of_the_week, fill = day_of_the_week))+
  geom_bar()+
  labs(title = "Days Device was Left Behind")+
  scale_fill_viridis(discrete = TRUE, option = "viridis")

These respondents were least likely to forget their tracker at home on Fridays. This plot looks strikingly similar to the one above it. I will check to see how many times the trackers recorded 0 steps but the sedentary minutes were not equal to an entire day.

left_behind_less_than_one_day  <- no_steps %>%
  filter(sedentary_minutes != 1440)
left_behind_less_than_one_day

## # A tibble: 5 × 4
## # Groups:   day_of_the_week [3]
##   total_steps sedentary_minutes day_of_the_week logged_kms
##         <dbl>             <dbl> <fct>                <dbl>
## 1           0               711 Thursday                 0
## 2           0               966 Thursday                 0
## 3           0              1407 Saturday                 0
## 4           0               111 Saturday                 0
## 5           0                48 Tuesday                  0

It’s not so surprising that those plots are so similar: filtering for sedentary minutes equal to 1440 removed only 5 observations. In 72 of the 77 zero-step instances, the device recorded a full day of no activity. I’m curious whether any of the respondents went back and manually logged distances for the days they forgot their devices.

forgot_device_but_logged_anyway <- no_steps %>%
  filter(logged_kms > 0)
forgot_device_but_logged_anyway

## # A tibble: 0 × 4
## # Groups:   day_of_the_week [0]
## # … with 4 variables: total_steps <dbl>, sedentary_minutes <dbl>,
## #   day_of_the_week <fct>, logged_kms <dbl>

Nope. No one logged distance for days when the device was left behind.
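The three checks on zero-step days (full day versus partial day, and whether any distance was logged) could also be tallied in a single table. This is a sketch on made-up data: the `toy` data frame and the names `record_type` and `manually_logged` are my own labels, not part of the original analysis.

```r
library(dplyr)

# Hypothetical toy data; only the column names follow combined_data_completed.
toy <- data.frame(
  total_steps       = c(0, 0, 0, 5200),
  sedentary_minutes = c(1440, 1440, 711, 800),
  logged_kms        = c(0, 0, 0, 0)
)

# Classify each zero-step day, then count every combination that occurs
zero_step_tally <- toy %>%
  filter(total_steps == 0) %>%
  mutate(
    record_type = if_else(sedentary_minutes == 1440, "full day", "partial day")
  ) %>%
  count(record_type, manually_logged = logged_kms > 0)

zero_step_tally
```

Each row of the result is one combination of record type and logging status, so a row with `manually_logged` equal to `TRUE` would appear only if someone had logged distance on a zero-step day.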

Act – Final Conclusions

The marketing team can emphasize long charge, fashion-conscious design, ease of tracking, and inspirational training stories in their digital marketing of the Ivy to find new growth opportunities.

The Ivy can be sold in a package with the Bellabeat Membership to make women feel more confident that they are getting the full benefit of their Ivy by using more of the metrics tracked to support their well being.

Bellabeat Capstone Project

Debbie Reasons

2022/02/07

Bellabeat Capstone Project

Ask–defining the task

Prepare–describing the data sources

Provided data set

Additional resources that I needed

Process-cleaning and manipulation of data

Uploading CSV files to R from kaggle.com/arashnic/fitbit

Installing and loading common packages and libraries

Loading CSV files to create data frames

Merging the daily sleep data frame with the daily activity data frame

Changing the date column from character to double and adding a weekday column

Analyze – Identifying Trends and Relationships

Using ggplot to visually explore the sleep and activity data

Exploring tracked, logged, and total distances

Looking at Daily Behaviors

Share – Providing Insights, Graphics, and Recommendations

Providing Insights, Graphics, and High Level Recommendations in slide presentation format:

Insights from Fitbit Data about How Smart Devices Are Used

Supporting Graphics

Applying Insights to Marketing

Act – Final Conclusions