In this case study for the Google Data Analytics Professional Certificate through Coursera, Urska Srsen, Chief Creative Officer of Bellabeat, a high-tech manufacturer of health-focused products for women, is asking for trends in the way consumers use fitness trackers that can inform Bellabeat’s marketing strategies. The data set provided for analysis is from a group of 30 FitBit users who voluntarily responded to a request for data. The case study guidelines ask that I focus on growth opportunities and on one of their fitness products. Using the Fitbit data, we should be able to look at trends showing how fitness trackers are used and apply those insights to make high level recommendations to inform Bellabeat’s marketing strategy. Stakeholders who would be presented with the recommendations include Bellabeat’s marketing team and cofounders.
Here is the information Srsen provided: FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius): This Kaggle data set contains personal fitness tracker from thirty Fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.
This data set can be found at http://kaggle.com/arashic/fitbit. The data is stored in several CSV files, some in both long or wide form. The data from individual FitBit users has been merged into separate CSV files according to steps, daily activity, sleep, heart rate, intensity, calories burned, MET, and weight. The users are only identified by an id number.
The Ivy product made by Bellabeat tracks activities, steps, sleep, heart rate, meditation, and burned calories according to their website, and is designed specifically for women. The FitBit data set does include four of these measurements: activity, sleep, heart rate, and calories. The 30 respondents whose data make up the FitBit data set have been anonymized and may or may not be women. Trends from the data set can still be applicable, although the it does have limitations. Not knowing the sex of the individuals, the lack of stress measurements, and the small sample size (30) are all limitations of the data.
The metadata for the FitBit Fitness Tracker Data doesn’t contain descriptions or explanations of the column names. However, FitBit does provide a database dictionary in a pdf online at http://fitabase.com/resources/knowledge-base/exporting-data/data-dictionaries/ This was very helpful in decrypting the meaning of the column names, the metrics, and measurement units used.
I downloaded a zipped file from Kaggle to my computer. From the files menu in RStudio, I clicked Upload and selected the zipped file.
install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.8
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 2.1.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Because Bellabeat products and FitBit products both record activity,sleep data, and heart rate data, I will load the sleep, activity, and heart rate CSV files into data frames so that I can view these files. The names for the data frames will be a shortened form of the CSV file names. The glimpse function lets me see information about each data frame.
sleep_day <- read_csv("Fitabase Data/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
## Rows: 0 Columns: 0
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(sleep_day)
## Rows: 0
## Columns: 0
minute_sleep <- read_csv("Fitabase Data/Fitabase Data 4.12.16-5.12.16/minuteSleep_merged.csv")
## Rows: 188521 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): date
## dbl (3): Id, value, logId
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(minute_sleep)
## Rows: 188,521
## Columns: 4
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960366, 1503…
## $ date <chr> "4/12/2016 2:47:30 AM", "4/12/2016 2:48:30 AM", "4/12/2016 2:49:…
## $ value <dbl> 3, 2, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 2, 1, 1, 1, 1, 1, 1…
## $ logId <dbl> 11380564589, 11380564589, 11380564589, 11380564589, 11380564589,…
daily_activity <- read_csv("Fitabase Data/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(daily_activity)
## Rows: 940
## Columns: 15
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ TotalSteps <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
heartrate_seconds_merged <- read_csv("Fitabase Data/Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv")
## Rows: 2483658 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Time
## dbl (2): Id, Value
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(heartrate_seconds_merged)
## Rows: 2,483,658
## Columns: 3
## $ Id <dbl> 2022484408, 2022484408, 2022484408, 2022484408, 2022484408, 2022…
## $ Time <chr> "4/12/2016 7:21:00 AM", "4/12/2016 7:21:05 AM", "4/12/2016 7:21:…
## $ Value <dbl> 97, 102, 105, 103, 101, 95, 91, 93, 94, 93, 92, 89, 83, 61, 60, …
The sleep_day data frame was devoid of data (the size of the file also told us that). I will go back to Kaggle and attempt to access that again.
The other data frame for sleep, minute_sleep, contains information including the user’s id, date, sleep state, and unique id for each sleep event.
The daily_activity data frame contains the user’s id, date, steps, distance, activity intensity level by distance and by time, and calories expended. There is a column to allow for logged (not tracked) activity distances, but in the entries shown by the glimpse function, no distances were logged. To see if there is any data for logged activities, I used the filter function to see entries with values greater than zero in that column.
(logging_occurances <- filter(daily_activity, LoggedActivitiesDistance > "0"))
## # A tibble: 32 × 15
## Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie…
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 6.78e9 4/26/2016 7091 5.27 5.27 1.96
## 2 6.96e9 4/21/2016 11835 9.71 7.88 4.08
## 3 6.96e9 4/25/2016 13239 9.27 9.08 2.79
## 4 6.96e9 5/9/2016 12342 8.72 8.68 3.17
## 5 7.01e9 4/12/2016 14172 10.3 9.48 4.87
## 6 7.01e9 4/13/2016 12862 9.65 8.60 4.85
## 7 7.01e9 4/14/2016 11179 8.24 7.48 3.29
## 8 7.01e9 4/18/2016 14816 11.0 9.91 4.93
## 9 7.01e9 4/19/2016 14194 10.5 9.5 4.94
## 10 7.01e9 4/20/2016 15566 11.3 10.4 4.92
## # … with 22 more rows, and 9 more variables: VeryActiveDistance <dbl>,
## # ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## # SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## # FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## # SedentaryMinutes <dbl>, Calories <dbl>
I downloaded the sleep day CSV file from Kaggle to my computer independently of the other files in a non zipped format and was able to get the data into a spreadsheet. Now I that I have the spreadsheet, I can load the files into RStudio. This gives me access to the data that didn’t load in the sleep set previously.
sleepDay_merged <- read_csv("Fitabase Data/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv%3FX-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com@kaggle-161607.iam.gserviceaccount.com%2F20220210%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20220210T174452Z&X-.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(sleepDay_merged)
## Rows: 413
## Columns: 5
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150…
## $ SleepDay <chr> "4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM", "…
## $ TotalSleepRecords <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ TotalMinutesAsleep <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2…
## $ TotalTimeInBed <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3…
###Getting a better idea of what is going on in each data frame with some statistical summaries
daily_activity %>%
select(TotalSteps,TotalDistance,SedentaryMinutes) %>%
summary()
## TotalSteps TotalDistance SedentaryMinutes
## Min. : 0 Min. : 0.000 Min. : 0.0
## 1st Qu.: 3790 1st Qu.: 2.620 1st Qu.: 729.8
## Median : 7406 Median : 5.245 Median :1057.5
## Mean : 7638 Mean : 5.490 Mean : 991.2
## 3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.:1229.5
## Max. :36019 Max. :28.030 Max. :1440.0
sleepDay_merged %>%
select(TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed) %>%
summary()
## TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## Min. :1.000 Min. : 58.0 Min. : 61.0
## 1st Qu.:1.000 1st Qu.:361.0 1st Qu.:403.0
## Median :1.000 Median :433.0 Median :463.0
## Mean :1.119 Mean :419.5 Mean :458.6
## 3rd Qu.:1.000 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :3.000 Max. :796.0 Max. :961.0
Sometimes there is more than one sleep record for a day – up to a max of 3 sleep sessions in a day.
heartrate_seconds_merged %>%
select(Value) %>%
summary()
## Value
## Min. : 36.00
## 1st Qu.: 63.00
## Median : 73.00
## Mean : 77.33
## 3rd Qu.: 88.00
## Max. :203.00
The statistics for heart rate value don’t seem very useful when grouped as a whole. Here is a look at the data in the heart rate file grouped by respondent Id.
by_id <- group_by(heartrate_seconds_merged, Id)
heart_rate_summary_by_id <- summarise(by_id, count = n(), mean = mean(Value, na.rm = TRUE), min = min(Value, na.rm = TRUE), max = max(Value, na.rm = TRUE))
print(heart_rate_summary_by_id)
## # A tibble: 14 × 5
## Id count mean min max
## <dbl> <int> <dbl> <dbl> <dbl>
## 1 2022484408 154104 80.2 38 203
## 2 2026352035 2490 93.8 63 125
## 3 2347167796 152683 76.7 49 195
## 4 4020332650 285461 82.3 46 191
## 5 4388161847 249748 66.1 39 180
## 6 4558609924 192168 81.7 44 199
## 7 5553957443 255174 68.6 47 165
## 8 5577150313 248560 69.6 36 174
## 9 6117666160 158899 83.7 52 189
## 10 6775888955 32771 92.0 55 177
## 11 6962181067 266326 77.7 47 184
## 12 7007744171 133592 91.1 54 166
## 13 8792009665 122841 72.5 43 158
## 14 8877689391 228841 83.6 46 180
The meta data for the FitBit data set tells us that there are 30 respondents. I’m checking to see how many unique user ids are in the data frames.
n_distinct(daily_activity $ Id)
## [1] 33
n_distinct(minute_sleep $ Id)
## [1] 24
n_distinct(sleepDay_merged $ Id)
## [1] 24
n_distinct(heartrate_seconds_merged $ Id)
## [1] 14
The daily_activity data frame shows 33 unique ids. There is a possibility that some of the respondents had more than one type of device that they used to record fitness data, and also a likely possibility that trackers were lost and replaced with new ones. To explain the lower number of unique user id numbers in the sleep data frames, it is possible that not all volunteers chose to record their sleep. Even fewer respondents(14) have recorded heart rate files.
Both of these data frames have the id column which can be used to join them. However, they each have date columns with different names and different configurations of date. Before joining them I will clean up the date column in the sleepDay_merged table, rename columns to fit convention and be more informative in some cases, and select only the data I want to join.
daily_activity_ready <- select(daily_activity, Id, ActivityDate, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDistance, VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, SedentaryMinutes, Calories) %>%
rename(id = Id) %>%
rename(distance_in_km = TotalDistance) %>%
rename(total_steps = TotalSteps) %>%
rename(sedentary_minutes = SedentaryMinutes) %>%
rename(new_date = ActivityDate) %>%
rename(tracked_kms = TrackerDistance) %>%
rename(logged_kms = LoggedActivitiesDistance) %>%
rename(very_active_minutes = VeryActiveMinutes) %>%
rename(fairly_active_minutes = FairlyActiveMinutes) %>%
rename(lightly_active_minutes = LightlyActiveMinutes)
daily_sleep_ready <- sleepDay_merged %>%
separate(SleepDay, into = c("new_date", "hour", "am_or_pm"), sep = " ") %>%
rename(id = Id) %>%
rename(minutes_asleep = TotalMinutesAsleep) %>%
rename(minutes_in_bed = TotalTimeInBed) %>%
select(id, new_date, minutes_asleep, minutes_in_bed)
Now that the tables for daily sleep and daily activity are compatible, I can join them.
combined_data <- left_join(daily_activity_ready, daily_sleep_ready)
## Joining, by = c("id", "new_date")
glimpse(combined_data)
## Rows: 943
## Columns: 13
## $ id <dbl> 1503960366, 1503960366, 1503960366, 1503960366,…
## $ new_date <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/20…
## $ total_steps <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019, …
## $ distance_in_km <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.88,…
## $ tracked_kms <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.88,…
## $ logged_kms <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ very_active_minutes <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 41,…
## $ fairly_active_minutes <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21, …
## $ lightly_active_minutes <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, 21…
## $ sedentary_minutes <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818, …
## $ Calories <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2035,…
## $ minutes_asleep <dbl> 327, 384, NA, 412, 340, 700, NA, 304, 360, 325,…
## $ minutes_in_bed <dbl> 346, 407, NA, 442, 367, 712, NA, 320, 377, 364,…
To check that we still have all 33 distinct user Id numbers:
n_distinct(combined_data $ id)
## [1] 33
head(combined_data)
## # A tibble: 6 × 13
## id new_date total_steps distance_in_km tracked_kms logged_kms
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 13162 8.5 8.5 0
## 2 1503960366 4/13/2016 10735 6.97 6.97 0
## 3 1503960366 4/14/2016 10460 6.74 6.74 0
## 4 1503960366 4/15/2016 9762 6.28 6.28 0
## 5 1503960366 4/16/2016 12669 8.16 8.16 0
## 6 1503960366 4/17/2016 9705 6.48 6.48 0
## # … with 7 more variables: very_active_minutes <dbl>,
## # fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
## # sedentary_minutes <dbl>, Calories <dbl>, minutes_asleep <dbl>,
## # minutes_in_bed <dbl>
The date column is a character data type. I would like to change it to make it more useful as a date. I will mutate the dataframe to change the date to a date type and add a new column to show the day of the week for the activity.
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
combined_data_completed <- combined_data %>%
mutate(new_date = mdy(new_date)) %>%
mutate(day_of_the_week = weekdays(new_date))
combined_data_completed$day_of_the_week <- factor(combined_data_completed$day_of_the_week, levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday" ))
head(combined_data_completed)
## # A tibble: 6 × 14
## id new_date total_steps distance_in_km tracked_kms logged_kms
## <dbl> <date> <dbl> <dbl> <dbl> <dbl>
## 1 1503960366 2016-04-12 13162 8.5 8.5 0
## 2 1503960366 2016-04-13 10735 6.97 6.97 0
## 3 1503960366 2016-04-14 10460 6.74 6.74 0
## 4 1503960366 2016-04-15 9762 6.28 6.28 0
## 5 1503960366 2016-04-16 12669 8.16 8.16 0
## 6 1503960366 2016-04-17 9705 6.48 6.48 0
## # … with 8 more variables: very_active_minutes <dbl>,
## # fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
## # sedentary_minutes <dbl>, Calories <dbl>, minutes_asleep <dbl>,
## # minutes_in_bed <dbl>, day_of_the_week <fct>
ggplot(data = combined_data, mapping = aes(x = total_steps, y = distance_in_km)) +
geom_point()
I find it interesting to see the difference in stride length become apparent as the distances get longer. Also, three of the data points show total daily distances equivalent to or greater than a half marathon - about 21.1km.
I want to look at the same plot, faceted by id to look at individual patterns of activity, and also see which users had a daily recorded distance of over 21km.
install.packages("viridis")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
library(viridis)
## Loading required package: viridisLite
ggplot(data = combined_data, mapping = aes(x = total_steps, y = distance_in_km, color = distance_in_km > 21 )) +
geom_point() +
theme(axis.text.x = element_text(angle = 90))+
facet_wrap(~id) +
scale_color_viridis(discrete = TRUE, option = "cividis")
I’m surprised to see that one of the respondents who had one of the daily totals of distance equal to a half marathon normally stays below 10km. This respondent is the second graph on the first row above. That kind of data could be inspiring to other users who want to try a new level of distance. The other two data points of greatest daily distance are both from the last respondent graphed. This respondent has a wide range of daily activity distances.
I’m interested to see if there is a correlations with sleep. First I’ll try sleep and total distance.
ggplot(data = combined_data, mapping = aes(x = minutes_asleep, y = distance_in_km)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 530 rows containing non-finite values (stat_smooth).
## Warning: Removed 530 rows containing missing values (geom_point).
There is very little correlation. Now I’ll try minutes in bed and minutes asleep.
ggplot(data = combined_data, mapping = aes(x = minutes_asleep, y = minutes_in_bed)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 530 rows containing non-finite values (stat_smooth).
## Warning: Removed 530 rows containing missing values (geom_point).
Since minutes asleep is a component of minutes in bed, there is a correlation, however there are some outliers. What about the time it takes to get to sleep and the distance completed during the day? Could there be a correlation there? I need to add a new column for time awake in bed by mutating the data frame and calculating that column from the minutes in bed and minutes asleep columns.
mutate(combined_data, awake_in_bed = minutes_in_bed - minutes_asleep)
## # A tibble: 943 × 14
## id new_date total_steps distance_in_km tracked_kms logged_kms
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 13162 8.5 8.5 0
## 2 1503960366 4/13/2016 10735 6.97 6.97 0
## 3 1503960366 4/14/2016 10460 6.74 6.74 0
## 4 1503960366 4/15/2016 9762 6.28 6.28 0
## 5 1503960366 4/16/2016 12669 8.16 8.16 0
## 6 1503960366 4/17/2016 9705 6.48 6.48 0
## 7 1503960366 4/18/2016 13019 8.59 8.59 0
## 8 1503960366 4/19/2016 15506 9.88 9.88 0
## 9 1503960366 4/20/2016 10544 6.68 6.68 0
## 10 1503960366 4/21/2016 9819 6.34 6.34 0
## # … with 933 more rows, and 8 more variables: very_active_minutes <dbl>,
## # fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
## # sedentary_minutes <dbl>, Calories <dbl>, minutes_asleep <dbl>,
## # minutes_in_bed <dbl>, awake_in_bed <dbl>
ggplot(data = mutate(combined_data, awake_in_bed = minutes_in_bed - minutes_asleep), mapping = aes(x = awake_in_bed, y = distance_in_km)) +
geom_point()+
labs(title = "Distance Completed and Minutes Awake in Bed")
## Warning: Removed 530 rows containing missing values (geom_point).
As much as I want there to be a correlation between getting exercise and being able to drift off to sleep quickly, I am not seeing it here. I think other variables could be affecting time in bed awake, like personal habits. Some of the data points are showing people laying in bed awake for more than two hours. I hope they’re reading or watching a movie.
Here is a faceted look at that same graph. I’ve also added color to make it easier to see longer times awake in bed with an arbitrary division at greater than 30 minutes awake in bed.
mutate(combined_data, minutes_awake_in_bed = minutes_in_bed - minutes_asleep)
## # A tibble: 943 × 14
## id new_date total_steps distance_in_km tracked_kms logged_kms
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 13162 8.5 8.5 0
## 2 1503960366 4/13/2016 10735 6.97 6.97 0
## 3 1503960366 4/14/2016 10460 6.74 6.74 0
## 4 1503960366 4/15/2016 9762 6.28 6.28 0
## 5 1503960366 4/16/2016 12669 8.16 8.16 0
## 6 1503960366 4/17/2016 9705 6.48 6.48 0
## 7 1503960366 4/18/2016 13019 8.59 8.59 0
## 8 1503960366 4/19/2016 15506 9.88 9.88 0
## 9 1503960366 4/20/2016 10544 6.68 6.68 0
## 10 1503960366 4/21/2016 9819 6.34 6.34 0
## # … with 933 more rows, and 8 more variables: very_active_minutes <dbl>,
## # fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
## # sedentary_minutes <dbl>, Calories <dbl>, minutes_asleep <dbl>,
## # minutes_in_bed <dbl>, minutes_awake_in_bed <dbl>
ggplot(data = mutate(combined_data, minutes_awake_in_bed = minutes_in_bed - minutes_asleep), mapping = aes(x = minutes_awake_in_bed, y = distance_in_km, color = minutes_awake_in_bed > 30)) +
geom_point() +
theme(axis.text.x = element_text(angle = 90))+
facet_wrap(~id) +
labs(title = "Distance Completed and Minutes Awake in Bed by Individual") +
scale_color_viridis(discrete = TRUE, option = "cividis")
## Warning: Removed 530 rows containing missing values (geom_point).
Three respondents spent more than 100 minutes in bed awake. Of the total of 24 who are recording their sleep behaviors, that’s 12.5% of the sample population. Long times awake in bed seems to be associated with specific individuals instead of correlated with daily distance totals. I want to try one last plot with time in bed awake, and compare it to calories burned during the day.
mutate(combined_data, awake_in_bed = minutes_in_bed - minutes_asleep)
## # A tibble: 943 × 14
## id new_date total_steps distance_in_km tracked_kms logged_kms
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 13162 8.5 8.5 0
## 2 1503960366 4/13/2016 10735 6.97 6.97 0
## 3 1503960366 4/14/2016 10460 6.74 6.74 0
## 4 1503960366 4/15/2016 9762 6.28 6.28 0
## 5 1503960366 4/16/2016 12669 8.16 8.16 0
## 6 1503960366 4/17/2016 9705 6.48 6.48 0
## 7 1503960366 4/18/2016 13019 8.59 8.59 0
## 8 1503960366 4/19/2016 15506 9.88 9.88 0
## 9 1503960366 4/20/2016 10544 6.68 6.68 0
## 10 1503960366 4/21/2016 9819 6.34 6.34 0
## # … with 933 more rows, and 8 more variables: very_active_minutes <dbl>,
## # fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
## # sedentary_minutes <dbl>, Calories <dbl>, minutes_asleep <dbl>,
## # minutes_in_bed <dbl>, awake_in_bed <dbl>
ggplot(data = mutate(combined_data, awake_in_bed = minutes_in_bed - minutes_asleep), mapping = aes(x = awake_in_bed, y = Calories)) +
geom_point() +
geom_smooth() +
labs(title = "Calories Burned and Minutes Awake in Bed")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 530 rows containing non-finite values (stat_smooth).
## Warning: Removed 530 rows containing missing values (geom_point).
And that’s the closest I can find to a connection between exercise and sleeping for this group. If you stay in bed too long, you just don’t have the time to burn as many calories as those who go ahead and get out of bed.
I would like to explore tracker distance, logged distance, and total distance in the daily activity data.
ggplot(data = daily_activity_ready, mapping = aes(x = distance_in_km, y = tracked_kms + logged_kms)) +
geom_point() +
labs(title = "Do Tracked and Logged Distances Add up to Total Distances?")
I thought that tracked and logged distance would always combine to equal total distance. The plot above shows that is not true. It looks like it is possible that respondents who need to log additional distance do not get that distance added into total distance.
ggplot(data = daily_activity_ready, mapping = aes(x = distance_in_km, y = tracked_kms)) +
geom_point() +
labs(title = "Does Tracked Distance by Itself Equal Total Distance?")
Tracked distance alone does not always equal the total distance either. Just to explore that further, I’ll make a new data frame of just the instances where the distance_in_km does not equal the tracked_kms plus the logged_kms.
distance_anomalies <- mutate(combined_data, calc_dist = tracked_kms + logged_kms) %>%
select(id, new_date, calc_dist, distance_in_km) %>%
filter(calc_dist != distance_in_km)
glimpse(distance_anomalies)
## Rows: 33
## Columns: 4
## $ id <dbl> 6775888955, 6962181067, 6962181067, 6962181067, 7007744…
## $ new_date <chr> "4/26/2016", "4/21/2016", "4/25/2016", "5/9/2016", "4/1…
## $ calc_dist <dbl> 7.229596, 11.961692, 11.865175, 11.847822, 14.349782, 1…
## $ distance_in_km <dbl> 5.27, 9.71, 9.27, 8.72, 10.29, 9.65, 8.24, 10.98, 10.48…
In a total of 943 observations, there are 33 instances of what I’m calling distance anomalies(where the distances just don’t add up).
But what I’m really wanting to explore for the stakeholders is the way people are using their smart trackers, and I can look and the percentage of distance tracked versus distance logged.
library(tidyr)
logging_occurance <- filter(combined_data_completed, logged_kms > "0")
tracking_occurance <- filter(combined_data_completed, tracked_kms > "0")
ggplot() +
geom_bar(data = tracking_occurance, mapping = aes(x = tracked_kms), color = "yellow", position = "dodge")+
geom_bar(data = logging_occurance, mapping = aes(x = logged_kms), color = "blue", position = "dodge")+
labs(title = "Instances of Tracking and Logged Distances", x= "kilometers", y = "Count") +
scale_color_viridis(discrete = TRUE, option = "viridis")
Clearly, the majority of distances are automatically tracked and recorded instead of being logged by the respondents themselves.
logging_occurance <- filter(combined_data_completed, logged_kms > "0")
logged <- summarize(logging_occurance, logged = sum(logged_kms))
tracking_occurance <- filter(combined_data_completed, tracked_kms > "0")
tracked <- summarize(tracking_occurance, tracked = sum(tracked_kms))
percent_logged_distance <- (logged)/(tracked + logged)
logged
## # A tibble: 1 × 1
## logged
## <dbl>
## 1 104.
tracked
## # A tibble: 1 × 1
## tracked
## <dbl>
## 1 5176.
percent_logged_distance
## logged
## 1 0.01965591
Only about 2% of respondents’ distances are logged. Respondents show a strong preference for tracking versus logging their distances.
Does the day of the week matter as far as how many steps the respondents took?
steps_by_day <- combined_data_completed %>%
group_by(day_of_the_week) %>%
summarise(total_steps = sum(total_steps))
ggplot(data = steps_by_day) +
geom_col(mapping = aes(x = day_of_the_week, y = total_steps, fill = day_of_the_week))+
labs(title = "Total Steps by Day")
scale_color_viridis(discrete = TRUE, option = "viridis")
## <ggproto object: Class ScaleDiscrete, Scale, gg>
## aesthetics: colour
## axis_order: function
## break_info: function
## break_positions: function
## breaks: waiver
## call: call
## clone: function
## dimension: function
## drop: TRUE
## expand: waiver
## get_breaks: function
## get_breaks_minor: function
## get_labels: function
## get_limits: function
## guide: legend
## is_discrete: function
## is_empty: function
## labels: waiver
## limits: NULL
## make_sec_title: function
## make_title: function
## map: function
## map_df: function
## n.breaks.cache: NULL
## na.translate: TRUE
## na.value: NA
## name: waiver
## palette: function
## palette.cache: NULL
## position: left
## range: <ggproto object: Class RangeDiscrete, Range, gg>
## range: NULL
## reset: function
## train: function
## super: <ggproto object: Class RangeDiscrete, Range, gg>
## rescale: function
## reset: function
## scale_name: viridis
## train: function
## train_df: function
## transform: function
## transform_df: function
## super: <ggproto object: Class ScaleDiscrete, Scale, gg>
steps_by_day <- combined_data_completed %>%
group_by(id,day_of_the_week) %>%
summarise(total_steps = sum(total_steps))
## `summarise()` has grouped output by 'id'. You can override using the `.groups`
## argument.
ggplot(data = steps_by_day) +
geom_col(mapping = aes(x = day_of_the_week, y = total_steps, fill = day_of_the_week)) +
theme(axis.text.x = element_text(angle = 90))+
facet_wrap(~id) +
labs(title = "Total Steps by Day by Respondent")+
scale_color_viridis(discrete = TRUE, option = "viridis")
I notice that some days, individual users record 0 steps. I would like to explore that.
no_steps <- combined_data_completed %>%
select(total_steps, sedentary_minutes,day_of_the_week, logged_kms )%>%
group_by(day_of_the_week)%>%
filter(total_steps == 0)
ggplot(data = no_steps, mapping = aes(x= day_of_the_week, fill = day_of_the_week))+
geom_bar()+
labs(title = "Days with No Steps Recorded")+
scale_color_viridis(discrete = TRUE, option = "viridis")
Friday is the day this group of respondents was least likely to not record steps.
I noticed in a few observations that the total steps were 0 and the sedentary minutes on the same day were 1440. A whole day, 24 hours, x 60 minutes per hour = 1440 minutes. That means the device recorded an entire day with no motion. I feel safe in assuming that the tracker was left behind at home those days. I’d like to see if there is a pattern in forgetting the device and the day of the week.
device_left_behind <- no_steps %>%
filter(sedentary_minutes == 1440)
ggplot(data = device_left_behind, mapping = aes(x=day_of_the_week, fill = day_of_the_week))+
geom_bar()+
labs(title = "Days Device was Left Behind")+
scale_color_viridis(discrete = TRUE, option = "viridis")
On any given day, these respondents were least likely to forget their tracker at home on Fridays. This plot looks stunningly similar to the one above it. I will check to see how many times the trackers recorded 0 steps but the sedentary minutes were not equal to an entire day.
left_behind_less_than_one_day <- no_steps %>%
filter(sedentary_minutes != 1440)
left_behind_less_than_one_day
## # A tibble: 5 × 4
## # Groups: day_of_the_week [3]
## total_steps sedentary_minutes day_of_the_week logged_kms
## <dbl> <dbl> <fct> <dbl>
## 1 0 711 Thursday 0
## 2 0 966 Thursday 0
## 3 0 1407 Saturday 0
## 4 0 111 Saturday 0
## 5 0 48 Tuesday 0
It’s not so surprising that those plots are so similar. Only 5 observations were removed by filtering out the sedentary minutes not equal to 1440. Most of the time (72 out of the 77 instances) that the device was left behind it recorded a full day of no activity. I’m curious to see if any of the respondents bothered with going back and manually logging distances for those days that they forgot their devices.
forgot_device_but_logged_anyway <- no_steps %>%
filter(logged_kms > 0)
forgot_device_but_logged_anyway
## # A tibble: 0 × 4
## # Groups: day_of_the_week [0]
## # … with 4 variables: total_steps <dbl>, sedentary_minutes <dbl>,
## # day_of_the_week <fct>, logged_kms <dbl>
Nope. No one logged distance for days when the device was left behind.
The marketing team can emphasize long charge, fashion-conscious design, ease of tracking, and inspirational training stories in their digital marketing of the Ivy to find new growth opportunities.
The Ivy can be sold in a package with the Bellabeat Membership to make women feel more confident that they are getting the full benefit of their Ivy by using more of the metrics tracked to support their well being.