Introduction to the case study scenario

In this case study, you will imagine you are working for Bellabeat, a high-tech manufacturer of health-focused products for women, and meet different characters and team members. You are a junior data analyst working on the marketing analyst team at Bellabeat. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market.

Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. You have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights you discover will then help guide marketing strategy for the company. You will present your analysis to the Bellabeat executive team along with your high-level recommendations for Bellabeat’s marketing strategy.

Characters

  • Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer

  • Sando Mur: Mathematician and Bellabeat cofounder; key member of the Bellabeat executive team

  • Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy. You joined this team six months ago and have been busy learning about Bellabeat’s mission and business goals, as well as how you, as a junior data analyst, can help Bellabeat achieve them.

Products

  • Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.

  • Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.

  • Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.

  • Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.

  • Bellabeat membership: Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.

Ask

Urška Sršen (Bellabeat’s cofounder and Chief Creative Officer) asks you to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices. She then wants you to select one Bellabeat product to apply these insights to in your presentation.

Questions

  1. What are some trends in smart device usage?
  2. How could these trends apply to Bellabeat customers?
  3. How could these trends help influence Bellabeat marketing strategy?

Business Task

Analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices and use those insights to create high-level recommendations for how these trends can inform Bellabeat marketing strategy.

Prepare

Description of all data used

In this scenario, Sršen encourages you to use public data that explores smart device users’ daily habits. Specifically, she requests that you use the FitBit Fitness Tracker Data (https://www.kaggle.com/datasets/arashnic/fitbit), which is licensed CC0: Public Domain and made available through Mobius (https://www.kaggle.com/arashnic). This Kaggle data set contains personal fitness tracker data from thirty FitBit users. Thirty eligible FitBit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.

Process

Load packages

Loaded the packages needed for the analysis using the library() function.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.7      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.0      ✔ stringr 1.4.0 
## ✔ readr   2.1.2      ✔ forcats 0.5.1 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(lubridate)
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggplot2)
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(dplyr)
library(skimr)
library(here)
## here() starts at C:/Users/yeiro/OneDrive/Desktop/Bellabeat Case Study data

Collect Data

Imported the CSV files from the Kaggle dataset that were used for the analysis.

activity_daily <- read.csv("dailyActivity_merged.csv")
calories_daily <- read.csv("dailyCalories_merged.csv")
daily_steps <- read.csv("dailySteps_merged.csv")
sleep_daily <- read.csv("sleepDay_merged.csv")
weight_log <- read.csv("weightLogInfo_merged.csv")

Inspect the data

Checked the data to see whether any structural changes were needed. Also checked how many observations were in each data frame and reviewed the column names for each.

head(activity_daily)
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366    4/12/2016      13162          8.50            8.50
## 2 1503960366    4/13/2016      10735          6.97            6.97
## 3 1503960366    4/14/2016      10460          6.74            6.74
## 4 1503960366    4/15/2016       9762          6.28            6.28
## 5 1503960366    4/16/2016      12669          8.16            8.16
## 6 1503960366    4/17/2016       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728
head(calories_daily)
##           Id ActivityDay Calories
## 1 1503960366   4/12/2016     1985
## 2 1503960366   4/13/2016     1797
## 3 1503960366   4/14/2016     1776
## 4 1503960366   4/15/2016     1745
## 5 1503960366   4/16/2016     1863
## 6 1503960366   4/17/2016     1728
head(daily_steps)
##           Id ActivityDay StepTotal
## 1 1503960366   4/12/2016     13162
## 2 1503960366   4/13/2016     10735
## 3 1503960366   4/14/2016     10460
## 4 1503960366   4/15/2016      9762
## 5 1503960366   4/16/2016     12669
## 6 1503960366   4/17/2016      9705
head(sleep_daily)
##           Id              SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304
##   TotalTimeInBed
## 1            346
## 2            407
## 3            442
## 4            367
## 5            712
## 6            320
head(weight_log)
##           Id                  Date WeightKg WeightPounds Fat   BMI
## 1 1503960366  5/2/2016 11:59:59 PM     52.6     115.9631  22 22.65
## 2 1503960366  5/3/2016 11:59:59 PM     52.6     115.9631  NA 22.65
## 3 1927972279  4/13/2016 1:08:52 AM    133.5     294.3171  NA 47.54
## 4 2873212765 4/21/2016 11:59:59 PM     56.7     125.0021  NA 21.45
## 5 2873212765 5/12/2016 11:59:59 PM     57.3     126.3249  NA 21.69
## 6 4319703577 4/17/2016 11:59:59 PM     72.4     159.6147  25 27.45
##   IsManualReport        LogId
## 1           True 1.462234e+12
## 2           True 1.462320e+12
## 3          False 1.460510e+12
## 4           True 1.461283e+12
## 5           True 1.463098e+12
## 6           True 1.460938e+12
str(activity_daily)
## 'data.frame':    940 obs. of  15 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ TotalSteps              : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
##  $ TotalDistance           : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ Calories                : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
str(calories_daily)
## 'data.frame':    940 obs. of  3 variables:
##  $ Id         : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDay: chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ Calories   : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
str(daily_steps)
## 'data.frame':    940 obs. of  3 variables:
##  $ Id         : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDay: chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ StepTotal  : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
str(sleep_daily)
## 'data.frame':    413 obs. of  5 variables:
##  $ Id                : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ SleepDay          : chr  "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
##  $ TotalSleepRecords : int  1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: int  327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : int  346 407 442 367 712 320 377 364 384 449 ...
str(weight_log)
## 'data.frame':    67 obs. of  8 variables:
##  $ Id            : num  1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
##  $ Date          : chr  "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
##  $ WeightKg      : num  52.6 52.6 133.5 56.7 57.3 ...
##  $ WeightPounds  : num  116 116 294 125 126 ...
##  $ Fat           : int  22 NA NA NA NA 25 NA NA NA NA ...
##  $ BMI           : num  22.6 22.6 47.5 21.5 21.7 ...
##  $ IsManualReport: chr  "True" "True" "False" "True" ...
##  $ LogId         : num  1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...

Looked at the number of distinct ids to confirm how many unique participants there were for each data frame.

n_distinct(activity_daily$Id)
## [1] 33
n_distinct(calories_daily$Id)
## [1] 33
n_distinct(daily_steps$Id)
## [1] 33
n_distinct(sleep_daily$Id)
## [1] 24
n_distinct(weight_log$Id)
## [1] 8

Also checked the consistency across the data by seeing how many observations there were in each data frame.

nrow(activity_daily)
## [1] 940
nrow(calories_daily)
## [1] 940
nrow(daily_steps)
## [1] 940
nrow(sleep_daily)
## [1] 413
nrow(weight_log)
## [1] 67
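
The two checks above (distinct Ids and row counts) can also be collected into a single overview table. The following is a minimal base R sketch of that idea; the `frames` list and its toy values are hypothetical stand-ins for the five data frames loaded earlier, not the real data.

```r
# Toy stand-ins for the real data frames (values are made up for illustration).
frames <- list(
  activity = data.frame(Id = c(1, 1, 2, 3), Value = c(10, 12, 9, 7)),
  sleep    = data.frame(Id = c(1, 2, 2),    Value = c(400, 380, 420))
)

# One row per data frame: number of observations and number of distinct Ids.
overview <- data.frame(
  rows         = sapply(frames, nrow),
  distinct_ids = sapply(frames, function(df) length(unique(df$Id)))
)
print(overview)
```

With the real data, the list would hold activity_daily, calories_daily, daily_steps, sleep_daily and weight_log, which makes differences in coverage between the files easy to spot at a glance.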

Format the data

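After looking at the column names, I renamed the ActivityDay, SleepDay and Date columns to ActivityDate so that every data frame uses the same name. I also renamed Calories to calories and StepTotal to TotalSteps where needed so that shared columns match across data frames, then converted the ActivityDate column in each data frame to the Date type.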

calories_daily <- calories_daily %>%
  rename(ActivityDate = ActivityDay) %>%
  rename(calories = Calories) %>%
  mutate(ActivityDate = as_date(ActivityDate, format = "%m/%d/%Y"))

daily_steps <- daily_steps %>%
  rename(ActivityDate = ActivityDay) %>%
  rename(TotalSteps = StepTotal) %>%
  mutate(ActivityDate = as_date(ActivityDate, format = "%m/%d/%Y"))

activity_daily <- activity_daily %>%
  rename(calories = Calories) %>%
  mutate(ActivityDate = as_date(ActivityDate, format = "%m/%d/%Y"))

sleep_daily <- sleep_daily %>%
  rename(ActivityDate = SleepDay) %>%
  mutate(ActivityDate = as_date(ActivityDate, format = "%m/%d/%Y"))

weight_log <- weight_log %>%
  rename(ActivityDate = Date) %>%
  mutate(ActivityDate = as_date(ActivityDate, format = "%m/%d/%Y"))
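
The SleepDay and Date columns hold full timestamps such as "4/12/2016 12:00:00 AM", yet the date-only format string still parses them, because the parser reads only as far as the format requires and ignores the trailing time. The sketch below illustrates that behavior and adds a check that nothing failed to parse. It uses base as.Date() rather than lubridate's as_date() purely to keep the example dependency-free; the first two sample strings are copied from the head() output above, and the third is a date-only string in the ActivityDate style.

```r
# Sample strings in the two formats found in the data.
raw <- c("4/12/2016 12:00:00 AM", "5/2/2016 11:59:59 PM", "4/13/2016")

# The date-only format consumes month/day/year and ignores the trailing time.
parsed <- as.Date(raw, format = "%m/%d/%Y")
print(parsed)

# A failed parse yields NA, so this guards against silent format mismatches.
stopifnot(!anyNA(parsed))
```

Note that the time of day is discarded, which is acceptable here because the analysis works at daily granularity.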

Remove bad data

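Removed duplicate rows and dropped rows containing NA values. Note that drop_na() with no column arguments drops a row if any column is NA; in weight_log the Fat column is mostly NA, so only the rows with a recorded Fat value remain after this step.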

activity_daily <- activity_daily %>%
  distinct() %>%
  drop_na()
calories_daily <- calories_daily %>%
  distinct() %>%
  drop_na()
daily_steps <- daily_steps %>%
  distinct() %>%
  drop_na()
sleep_daily <- sleep_daily %>%
  distinct() %>%
  drop_na()
weight_log <- weight_log %>%
  distinct() %>%
  drop_na()
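
Before dropping rows, it can help to count missing values per column, since drop_na() with no column arguments removes a row when any column is NA. The sketch below uses base R on a small sample: the Id, WeightKg and Fat values are the first six weight_log rows from the head() output above, and the alternative of requiring only selected columns is a suggestion, not what the analysis in this document does.

```r
# First six weight_log rows as shown by head(weight_log) (subset of columns).
weight_sample <- data.frame(
  Id       = c(1503960366, 1503960366, 1927972279, 2873212765, 2873212765, 4319703577),
  WeightKg = c(52.6, 52.6, 133.5, 56.7, 57.3, 72.4),
  Fat      = c(22, NA, NA, NA, NA, 25)
)

# Count missing values per column before deciding what to drop.
na_counts <- colSums(is.na(weight_sample))
print(na_counts)

# Strict drop: remove every row with an NA in any column.
all_complete <- weight_sample[complete.cases(weight_sample), ]

# Alternative: only require the columns the analysis actually needs.
weight_only <- weight_sample[complete.cases(weight_sample[, c("Id", "WeightKg")]), ]

print(nrow(all_complete))  # rows left after the strict drop
print(nrow(weight_only))   # rows left when Fat is allowed to be missing
```

In tidyr terms, the alternative corresponds to passing column names to drop_na(), for example drop_na(WeightKg).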

Analyze

Next, the data frames were summarized to take a high-level look at the data.

summary(activity_daily)
##        Id             ActivityDate          TotalSteps    TotalDistance   
##  Min.   :1.504e+09   Min.   :2016-04-12   Min.   :    0   Min.   : 0.000  
##  1st Qu.:2.320e+09   1st Qu.:2016-04-19   1st Qu.: 3790   1st Qu.: 2.620  
##  Median :4.445e+09   Median :2016-04-26   Median : 7406   Median : 5.245  
##  Mean   :4.855e+09   Mean   :2016-04-26   Mean   : 7638   Mean   : 5.490  
##  3rd Qu.:6.962e+09   3rd Qu.:2016-05-04   3rd Qu.:10727   3rd Qu.: 7.713  
##  Max.   :8.878e+09   Max.   :2016-05-12   Max.   :36019   Max.   :28.030  
##  TrackerDistance  LoggedActivitiesDistance VeryActiveDistance
##  Min.   : 0.000   Min.   :0.0000           Min.   : 0.000    
##  1st Qu.: 2.620   1st Qu.:0.0000           1st Qu.: 0.000    
##  Median : 5.245   Median :0.0000           Median : 0.210    
##  Mean   : 5.475   Mean   :0.1082           Mean   : 1.503    
##  3rd Qu.: 7.710   3rd Qu.:0.0000           3rd Qu.: 2.053    
##  Max.   :28.030   Max.   :4.9421           Max.   :21.920    
##  ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
##  Min.   :0.0000           Min.   : 0.000      Min.   :0.000000       
##  1st Qu.:0.0000           1st Qu.: 1.945      1st Qu.:0.000000       
##  Median :0.2400           Median : 3.365      Median :0.000000       
##  Mean   :0.5675           Mean   : 3.341      Mean   :0.001606       
##  3rd Qu.:0.8000           3rd Qu.: 4.782      3rd Qu.:0.000000       
##  Max.   :6.4800           Max.   :10.710      Max.   :0.110000       
##  VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
##  Min.   :  0.00    Min.   :  0.00      Min.   :  0.0        Min.   :   0.0  
##  1st Qu.:  0.00    1st Qu.:  0.00      1st Qu.:127.0        1st Qu.: 729.8  
##  Median :  4.00    Median :  6.00      Median :199.0        Median :1057.5  
##  Mean   : 21.16    Mean   : 13.56      Mean   :192.8        Mean   : 991.2  
##  3rd Qu.: 32.00    3rd Qu.: 19.00      3rd Qu.:264.0        3rd Qu.:1229.5  
##  Max.   :210.00    Max.   :143.00      Max.   :518.0        Max.   :1440.0  
##     calories   
##  Min.   :   0  
##  1st Qu.:1828  
##  Median :2134  
##  Mean   :2304  
##  3rd Qu.:2793  
##  Max.   :4900
summary(calories_daily)
##        Id             ActivityDate           calories   
##  Min.   :1.504e+09   Min.   :2016-04-12   Min.   :   0  
##  1st Qu.:2.320e+09   1st Qu.:2016-04-19   1st Qu.:1828  
##  Median :4.445e+09   Median :2016-04-26   Median :2134  
##  Mean   :4.855e+09   Mean   :2016-04-26   Mean   :2304  
##  3rd Qu.:6.962e+09   3rd Qu.:2016-05-04   3rd Qu.:2793  
##  Max.   :8.878e+09   Max.   :2016-05-12   Max.   :4900
summary(daily_steps)
##        Id             ActivityDate          TotalSteps   
##  Min.   :1.504e+09   Min.   :2016-04-12   Min.   :    0  
##  1st Qu.:2.320e+09   1st Qu.:2016-04-19   1st Qu.: 3790  
##  Median :4.445e+09   Median :2016-04-26   Median : 7406  
##  Mean   :4.855e+09   Mean   :2016-04-26   Mean   : 7638  
##  3rd Qu.:6.962e+09   3rd Qu.:2016-05-04   3rd Qu.:10727  
##  Max.   :8.878e+09   Max.   :2016-05-12   Max.   :36019
summary(sleep_daily)
##        Id             ActivityDate        TotalSleepRecords TotalMinutesAsleep
##  Min.   :1.504e+09   Min.   :2016-04-12   Min.   :1.00      Min.   : 58.0     
##  1st Qu.:3.977e+09   1st Qu.:2016-04-19   1st Qu.:1.00      1st Qu.:361.0     
##  Median :4.703e+09   Median :2016-04-27   Median :1.00      Median :432.5     
##  Mean   :4.995e+09   Mean   :2016-04-26   Mean   :1.12      Mean   :419.2     
##  3rd Qu.:6.962e+09   3rd Qu.:2016-05-04   3rd Qu.:1.00      3rd Qu.:490.0     
##  Max.   :8.792e+09   Max.   :2016-05-12   Max.   :3.00      Max.   :796.0     
##  TotalTimeInBed 
##  Min.   : 61.0  
##  1st Qu.:403.8  
##  Median :463.0  
##  Mean   :458.5  
##  3rd Qu.:526.0  
##  Max.   :961.0
summary(weight_log)
##        Id             ActivityDate           WeightKg      WeightPounds  
##  Min.   :1.504e+09   Min.   :2016-04-17   Min.   :52.60   Min.   :116.0  
##  1st Qu.:2.208e+09   1st Qu.:2016-04-20   1st Qu.:57.55   1st Qu.:126.9  
##  Median :2.912e+09   Median :2016-04-24   Median :62.50   Median :137.8  
##  Mean   :2.912e+09   Mean   :2016-04-24   Mean   :62.50   Mean   :137.8  
##  3rd Qu.:3.616e+09   3rd Qu.:2016-04-28   3rd Qu.:67.45   3rd Qu.:148.7  
##  Max.   :4.320e+09   Max.   :2016-05-02   Max.   :72.40   Max.   :159.6  
##       Fat             BMI        IsManualReport         LogId          
##  Min.   :22.00   Min.   :22.65   Length:2           Min.   :1.461e+12  
##  1st Qu.:22.75   1st Qu.:23.85   Class :character   1st Qu.:1.461e+12  
##  Median :23.50   Median :25.05   Mode  :character   Median :1.462e+12  
##  Mean   :23.50   Mean   :25.05                      Mean   :1.462e+12  
##  3rd Qu.:24.25   3rd Qu.:26.25                      3rd Qu.:1.462e+12  
##  Max.   :25.00   Max.   :27.45                      Max.   :1.462e+12

Merge the Data

After reviewing the summaries for each dataset, merged the data frames together.

merged_calories_activity <- merge(activity_daily, calories_daily, by=c("Id", "ActivityDate", "calories"))

user_activity <- merge(merged_calories_activity, daily_steps, by= c("Id", "ActivityDate", "TotalSteps"))

weight_and_activity <- merge(user_activity, weight_log, by=c("Id", "ActivityDate"))


str(user_activity)
## 'data.frame':    940 obs. of  15 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : Date, format: "2016-04-12" "2016-04-13" ...
##  $ TotalSteps              : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
##  $ calories                : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
##  $ TotalDistance           : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
str(weight_and_activity)
## 'data.frame':    2 obs. of  21 variables:
##  $ Id                      : num  1.50e+09 4.32e+09
##  $ ActivityDate            : Date, format: "2016-05-02" "2016-04-17"
##  $ TotalSteps              : int  14727 29
##  $ calories                : int  2004 1464
##  $ TotalDistance           : num  9.71 0.02
##  $ TrackerDistance         : num  9.71 0.02
##  $ LoggedActivitiesDistance: num  0 0
##  $ VeryActiveDistance      : num  3.21 0
##  $ ModeratelyActiveDistance: num  0.57 0
##  $ LightActiveDistance     : num  5.92 0.02
##  $ SedentaryActiveDistance : num  0 0
##  $ VeryActiveMinutes       : int  41 0
##  $ FairlyActiveMinutes     : int  15 0
##  $ LightlyActiveMinutes    : int  277 3
##  $ SedentaryMinutes        : int  798 1363
##  $ WeightKg                : num  52.6 72.4
##  $ WeightPounds            : num  116 160
##  $ Fat                     : int  22 25
##  $ BMI                     : num  22.6 27.5
##  $ IsManualReport          : chr  "True" "True"
##  $ LogId                   : num  1.46e+12 1.46e+12
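
The drop from 940 rows in user_activity to 2 rows in weight_and_activity follows from how merge() works: by default it performs an inner join, keeping only rows whose key values appear in both data frames. A minimal sketch of the difference between the default and a left join (all.x = TRUE); the `activity` and `weight` frames and their values are hypothetical toy data.

```r
# Toy data: three activity rows, but only one matching weight entry.
activity <- data.frame(Id = c(1, 1, 2), Day = c("d1", "d2", "d1"), Steps = c(100, 200, 300))
weight   <- data.frame(Id = 1, Day = "d2", WeightKg = 60)

# Default merge: inner join, unmatched activity rows are dropped.
inner <- merge(activity, weight, by = c("Id", "Day"))

# all.x = TRUE: left join, every activity row is kept and WeightKg is NA where missing.
left <- merge(activity, weight, by = c("Id", "Day"), all.x = TRUE)

print(nrow(inner))
print(nrow(left))
```

An inner join is a reasonable choice when only days with a weight entry matter; a left join would be the option if activity on days without a weight entry should be retained.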

Then checked the summary for the merged user_activity data frame.

summary(user_activity)
##        Id             ActivityDate          TotalSteps       calories   
##  Min.   :1.504e+09   Min.   :2016-04-12   Min.   :    0   Min.   :   0  
##  1st Qu.:2.320e+09   1st Qu.:2016-04-19   1st Qu.: 3790   1st Qu.:1828  
##  Median :4.445e+09   Median :2016-04-26   Median : 7406   Median :2134  
##  Mean   :4.855e+09   Mean   :2016-04-26   Mean   : 7638   Mean   :2304  
##  3rd Qu.:6.962e+09   3rd Qu.:2016-05-04   3rd Qu.:10727   3rd Qu.:2793  
##  Max.   :8.878e+09   Max.   :2016-05-12   Max.   :36019   Max.   :4900  
##  TotalDistance    TrackerDistance  LoggedActivitiesDistance VeryActiveDistance
##  Min.   : 0.000   Min.   : 0.000   Min.   :0.0000           Min.   : 0.000    
##  1st Qu.: 2.620   1st Qu.: 2.620   1st Qu.:0.0000           1st Qu.: 0.000    
##  Median : 5.245   Median : 5.245   Median :0.0000           Median : 0.210    
##  Mean   : 5.490   Mean   : 5.475   Mean   :0.1082           Mean   : 1.503    
##  3rd Qu.: 7.713   3rd Qu.: 7.710   3rd Qu.:0.0000           3rd Qu.: 2.053    
##  Max.   :28.030   Max.   :28.030   Max.   :4.9421           Max.   :21.920    
##  ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
##  Min.   :0.0000           Min.   : 0.000      Min.   :0.000000       
##  1st Qu.:0.0000           1st Qu.: 1.945      1st Qu.:0.000000       
##  Median :0.2400           Median : 3.365      Median :0.000000       
##  Mean   :0.5675           Mean   : 3.341      Mean   :0.001606       
##  3rd Qu.:0.8000           3rd Qu.: 4.782      3rd Qu.:0.000000       
##  Max.   :6.4800           Max.   :10.710      Max.   :0.110000       
##  VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
##  Min.   :  0.00    Min.   :  0.00      Min.   :  0.0        Min.   :   0.0  
##  1st Qu.:  0.00    1st Qu.:  0.00      1st Qu.:127.0        1st Qu.: 729.8  
##  Median :  4.00    Median :  6.00      Median :199.0        Median :1057.5  
##  Mean   : 21.16    Mean   : 13.56      Mean   :192.8        Mean   : 991.2  
##  3rd Qu.: 32.00    3rd Qu.: 19.00      3rd Qu.:264.0        3rd Qu.:1229.5  
##  Max.   :210.00    Max.   :143.00      Max.   :518.0        Max.   :1440.0

There are several interesting findings in the summary including:

  • Average daily sedentary time was 991.2 minutes (16 hours and 31 minutes), which is over two thirds of the day on average. This could be explained by work and sleep if the average user’s day typically involved 8 hours of each.

  • The weight_and_activity data frame has only 2 observations from two distinct Ids, matching the cleaned weight_log. This is largely a consequence of the drop_na() step: before cleaning, weight_log held 67 observations from 8 distinct Ids, and most of those rows were dropped because the Fat column was NA. Even on the uncleaned counts, only 8 of the 33 users logged their weight during the period when the data was collected.

    • This suggests users aren’t actively logging their weight, which could be an area of interest.

    • Obtaining user feedback on how they feel about using the feature and how likely they are to use it would be helpful. While weight plays a significant part in a fitness journey, users might not want to enter it, and insight into why can help improve the product.
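
The sedentary-time figures in the first finding can be reproduced with a short calculation. This sketch takes the mean of 991.2 minutes from the summary output and converts it to hours and minutes and to a share of a 1440-minute day.

```r
sedentary_minutes <- 991.2      # mean SedentaryMinutes from summary(user_activity)
minutes_per_day   <- 24 * 60    # 1440

hours   <- sedentary_minutes %/% 60   # whole hours
minutes <- sedentary_minutes %% 60    # remaining minutes
share   <- sedentary_minutes / minutes_per_day

cat(sprintf("%d h %.0f min, %.1f%% of the day\n",
            as.integer(hours), minutes, 100 * share))
```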

Additional columns were added using the ActivityDate column to include the day, month, year, and day of the week.

user_activity$date <- as.Date(user_activity$ActivityDate)
user_activity$month <- format(as.Date(user_activity$date), "%m")
user_activity$day <- format(as.Date(user_activity$date), "%d")
user_activity$year <- format(as.Date(user_activity$date), "%Y")
user_activity$day_of_week <- format(as.Date(user_activity$date), "%A")

user_activity$day_of_week <- ordered(user_activity$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
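
One caveat: format(..., "%A") returns weekday names in the session's locale, so the English levels passed to ordered() would yield NA values in a non-English locale. As a hedged alternative, the weekday index from as.POSIXlt() (0 = Sunday) does not depend on the locale. The `dates` vector below is a small sample of dates from the study period, not the full column.

```r
dates <- as.Date(c("2016-04-12", "2016-04-16", "2016-04-17"))

day_levels <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                "Thursday", "Friday", "Saturday")

# $wday runs from 0 (Sunday) to 6 (Saturday); add 1 to index into day_levels.
wday_index  <- as.POSIXlt(dates)$wday + 1
day_of_week <- factor(day_levels[wday_index], levels = day_levels, ordered = TRUE)
print(day_of_week)
```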

Average calories, steps and distance by day of week using user_activity dataframe:

# Mean steps, calories and distance per weekday
user_activity %>%
  mutate(weekday = wday(ActivityDate, label = TRUE)) %>%
  group_by(weekday) %>%
  summarise(avg_steps = mean(TotalSteps), avg_calories = mean(calories), avg_distance = mean(TotalDistance)) %>%
  arrange(weekday)
## # A tibble: 7 × 4
##   weekday avg_steps avg_calories avg_distance
##   <ord>       <dbl>        <dbl>        <dbl>
## 1 Sun         6933.        2263          5.03
## 2 Mon         7781.        2324.         5.55
## 3 Tue         8125.        2356.         5.83
## 4 Wed         7559.        2303.         5.49
## 5 Thu         7406.        2200.         5.31
## 6 Fri         7448.        2332.         5.31
## 7 Sat         8153.        2355.         5.85

Merge of user_activity with sleep_daily, followed by a summary:

user_sleep_and_activity <- merge(user_activity, sleep_daily, by= c("Id", "ActivityDate"))

str(user_sleep_and_activity)
## 'data.frame':    410 obs. of  23 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : Date, format: "2016-04-12" "2016-04-13" ...
##  $ TotalSteps              : int  13162 10735 9762 12669 9705 15506 10544 9819 14371 10039 ...
##  $ calories                : int  1985 1797 1745 1863 1728 2035 1786 1775 1949 1788 ...
##  $ TotalDistance           : num  8.5 6.97 6.28 8.16 6.48 ...
##  $ TrackerDistance         : num  8.5 6.97 6.28 8.16 6.48 ...
##  $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.14 2.71 3.19 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 1.26 0.41 0.78 ...
##  $ LightActiveDistance     : num  6.06 4.71 2.83 5.04 2.51 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : int  25 21 29 36 38 50 28 19 41 39 ...
##  $ FairlyActiveMinutes     : int  13 19 34 10 20 31 12 8 21 5 ...
##  $ LightlyActiveMinutes    : int  328 217 209 221 164 264 205 211 262 238 ...
##  $ SedentaryMinutes        : int  728 776 726 773 539 775 818 838 732 709 ...
##  $ date                    : Date, format: "2016-04-12" "2016-04-13" ...
##  $ month                   : chr  "04" "04" "04" "04" ...
##  $ day                     : chr  "12" "13" "15" "16" ...
##  $ year                    : chr  "2016" "2016" "2016" "2016" ...
##  $ day_of_week             : Ord.factor w/ 7 levels "Sunday"<"Monday"<..: 3 4 6 7 1 3 4 5 7 1 ...
##  $ TotalSleepRecords       : int  1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep      : int  327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed          : int  346 407 442 367 712 320 377 364 384 449 ...
summary(user_sleep_and_activity)
##        Id             ActivityDate          TotalSteps       calories   
##  Min.   :1.504e+09   Min.   :2016-04-12   Min.   :   17   Min.   : 257  
##  1st Qu.:3.977e+09   1st Qu.:2016-04-19   1st Qu.: 5189   1st Qu.:1841  
##  Median :4.703e+09   Median :2016-04-27   Median : 8913   Median :2207  
##  Mean   :4.995e+09   Mean   :2016-04-26   Mean   : 8515   Mean   :2389  
##  3rd Qu.:6.962e+09   3rd Qu.:2016-05-04   3rd Qu.:11370   3rd Qu.:2920  
##  Max.   :8.792e+09   Max.   :2016-05-12   Max.   :22770   Max.   :4900  
##                                                                         
##  TotalDistance    TrackerDistance  LoggedActivitiesDistance VeryActiveDistance
##  Min.   : 0.010   Min.   : 0.010   Min.   :0.0000           Min.   : 0.000    
##  1st Qu.: 3.592   1st Qu.: 3.592   1st Qu.:0.0000           1st Qu.: 0.000    
##  Median : 6.270   Median : 6.270   Median :0.0000           Median : 0.570    
##  Mean   : 6.012   Mean   : 6.007   Mean   :0.1089           Mean   : 1.446    
##  3rd Qu.: 8.005   3rd Qu.: 7.950   3rd Qu.:0.0000           3rd Qu.: 2.360    
##  Max.   :17.540   Max.   :17.540   Max.   :4.0817           Max.   :12.540    
##                                                                               
##  ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
##  Min.   :0.0000           Min.   :0.010       Min.   :0.0000000      
##  1st Qu.:0.0000           1st Qu.:2.540       1st Qu.:0.0000000      
##  Median :0.4200           Median :3.665       Median :0.0000000      
##  Mean   :0.7439           Mean   :3.791       Mean   :0.0009268      
##  3rd Qu.:1.0375           3rd Qu.:4.918       3rd Qu.:0.0000000      
##  Max.   :6.4800           Max.   :9.480       Max.   :0.1100000      
##                                                                      
##  VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
##  Min.   :  0.00    Min.   :  0.00      Min.   :  2.0        Min.   :   0.0  
##  1st Qu.:  0.00    1st Qu.:  0.00      1st Qu.:158.0        1st Qu.: 631.2  
##  Median :  9.00    Median : 11.00      Median :208.0        Median : 717.0  
##  Mean   : 25.05    Mean   : 17.92      Mean   :216.5        Mean   : 712.1  
##  3rd Qu.: 38.00    3rd Qu.: 26.75      3rd Qu.:263.0        3rd Qu.: 782.8  
##  Max.   :210.00    Max.   :143.00      Max.   :518.0        Max.   :1265.0  
##                                                                             
##       date               month               day                year          
##  Min.   :2016-04-12   Length:410         Length:410         Length:410        
##  1st Qu.:2016-04-19   Class :character   Class :character   Class :character  
##  Median :2016-04-27   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2016-04-26                                                           
##  3rd Qu.:2016-05-04                                                           
##  Max.   :2016-05-12                                                           
##                                                                               
##     day_of_week TotalSleepRecords TotalMinutesAsleep TotalTimeInBed 
##  Sunday   :55   Min.   :1.00      Min.   : 58.0      Min.   : 61.0  
##  Monday   :46   1st Qu.:1.00      1st Qu.:361.0      1st Qu.:403.8  
##  Tuesday  :65   Median :1.00      Median :432.5      Median :463.0  
##  Wednesday:66   Mean   :1.12      Mean   :419.2      Mean   :458.5  
##  Thursday :64   3rd Qu.:1.00      3rd Qu.:490.0      3rd Qu.:526.0  
##  Friday   :57   Max.   :3.00      Max.   :796.0      Max.   :961.0  
##  Saturday :57

Averages for all users (user_activity):

  • Calories: 2304

  • Distance: 5.49

  • Steps: 7638

Averages for users who logged sleep (user_sleep_and_activity):

  • Calories: 2389

  • Distance: 6.012

  • Steps: 8515

Users who logged their sleep averaged more calories burned, more steps taken, and a greater total distance traveled than the full set of users in the study. This could mean that using the sleep feature made users more active, or at least that logging sleep is correlated with higher activity.
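To put a size on that gap, the reported averages can be compared as percent differences. The sketch below is only an illustration: it retypes the rounded figures from the two lists above instead of recomputing them from the data frames, and the vector names `all_users` and `sleep_users` are made up for this example.

```r
# Averages reported above (rounded values copied from the text)
all_users   <- c(calories = 2304, distance = 5.49,  steps = 7638)
sleep_users <- c(calories = 2389, distance = 6.012, steps = 8515)

# Percent difference of the sleep-logging group relative to all users
pct_diff <- round((sleep_users - all_users) / all_users * 100, 1)
pct_diff
```

If the sleep-logging users are also counted in the all-users averages, these percentages would understate the gap between users who logged sleep and users who did not.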

Average calories, distance, steps, hours of sleep, and hours in bed by day of the week for the user_sleep_and_activity data frame:

user_sleep_and_activity %>% 
  mutate(weekday = wday(ActivityDate, label = TRUE)) %>%  
  group_by(weekday) %>% 
  summarise(avg_hours_of_sleep = mean(TotalMinutesAsleep) / 60,
            avg_hours_in_bed = mean(TotalTimeInBed) / 60,
            avg_steps = mean(TotalSteps),
            avg_calories_burned = mean(calories),
            avg_distance = mean(TotalDistance)) %>%
  arrange(weekday)  
## # A tibble: 7 × 6
##   weekday avg_hours_of_sleep avg_hours_in_bed avg_steps avg_calories_burned
##   <ord>                <dbl>            <dbl>     <dbl>               <dbl>
## 1 Sun                   7.55             8.39     7298.               2277.
## 2 Mon                   6.99             7.62     9273.               2432.
## 3 Tue                   6.74             7.39     9183.               2496.
## 4 Wed                   7.24             7.83     8023.               2378.
## 5 Thu                   6.69             7.25     8184.               2307.
## 6 Fri                   6.76             7.42     7901.               2330.
## 7 Sat                   6.98             7.66     9871.               2507.
## # … with 1 more variable: avg_distance <dbl>

Data Visualizations

The chart below shows the average number of hours spent in bed on each day of the week, with the bar color indicating the average number of hours asleep.

hours_in_bed_day_of_week <- user_sleep_and_activity %>% 
  mutate(weekday = wday(ActivityDate, label = TRUE)) %>%  
  group_by(weekday) %>% 
  summarise(avg_hours_of_sleep = mean(TotalMinutesAsleep) / 60,
            avg_hours_in_bed = mean(TotalTimeInBed) / 60,
            avg_steps = mean(TotalSteps),
            avg_calories_burned = mean(calories),
            avg_distance = mean(TotalDistance)) %>%
  arrange(weekday) %>%
  ggplot(aes(x = weekday, y = avg_hours_in_bed, fill = avg_hours_of_sleep)) +
  geom_col(position = "dodge") +
  labs(title = "Average Number of Hours Spent in Bed by Day of the Week",
       subtitle = "Color filled in by average amount of hours slept")

hours_in_bed_day_of_week

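Observations:

  • Based on the weekday averages in the table above, users spent the most time in bed and asleep on Sundays (8.39 and 7.55 hours) and the least on Thursdays (7.25 and 6.69 hours).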

Next, the average number of steps taken on each day of the week was visualized. Distance was used as the fill because steps and distance are correlated.

steps_day_of_week <- user_sleep_and_activity %>% 
  mutate(weekday = wday(ActivityDate, label = TRUE)) %>%  
  group_by(weekday) %>% 
  summarise(avg_hours_of_sleep = mean(TotalMinutesAsleep) / 60,
            avg_hours_in_bed = mean(TotalTimeInBed) / 60,
            avg_steps = mean(TotalSteps),
            avg_calories_burned = mean(calories),
            avg_distance = mean(TotalDistance)) %>%
  arrange(weekday) %>%
  ggplot(aes(x = weekday, y = avg_steps, fill = avg_distance)) +
  geom_col(position = "dodge") +
  labs(title = "Average Number of Steps Taken by Day of the Week",
       subtitle = "Color filled in by average total distance",
       caption = "The metrics above are representative of the 24 users who logged sleep information")

steps_day_of_week

Next, all users were included to check the average number of steps taken on each day of the week.

all_steps_day_of_week <- user_activity %>% 
  mutate(weekday = wday(ActivityDate, label = TRUE)) %>%  
  group_by(weekday) %>% 
  summarise(avg_steps = mean(TotalSteps),
            avg_calories_burned = mean(calories),
            avg_distance = mean(TotalDistance)) %>%
  arrange(weekday) %>%
  ggplot(aes(x = weekday, y = avg_steps, fill = avg_distance)) +
  geom_col(position = "dodge") +
  labs(title = "Average Number of Steps Taken by Day of the Week",
       subtitle = "Color filled in by average total distance",
       caption = "The metrics above are representative of all users who logged their information")

all_steps_day_of_week
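Reading two separate charts side by side makes the groups hard to compare, so one option is to join the weekday averages into a single table. The following is a minimal base R sketch, not the code used in this analysis: the two small data frames are invented stand-ins for user_activity and user_sleep_and_activity, and the helper `weekday_avg()` is a hypothetical name introduced for this example.

```r
# Hypothetical stand-ins for the two data frames; all values are invented
dates <- as.Date(c("2016-04-12", "2016-04-13", "2016-04-19", "2016-04-20"))
all_users   <- data.frame(ActivityDate = dates, TotalSteps = c(8000, 6000, 9000, 7000))
sleep_users <- data.frame(ActivityDate = dates, TotalSteps = c(9000, 7000, 10000, 7500))

# Average steps per weekday for one group (locale-independent weekday labels)
weekday_avg <- function(df, group_name) {
  labels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
  df$weekday <- labels[as.POSIXlt(df$ActivityDate)$wday + 1]
  out <- aggregate(TotalSteps ~ weekday, data = df, FUN = mean)
  names(out)[2] <- group_name
  out
}

# One row per weekday, one column per group, plus the gap between them
comparison <- merge(
  weekday_avg(all_users, "all_users"),
  weekday_avg(sleep_users, "sleep_loggers"),
  by = "weekday"
)
comparison$difference <- comparison$sleep_loggers - comparison$all_users
comparison
```

With the real data frames, the same helper could be applied to each one, and the `difference` column would show on which days the two groups diverge most.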

Observations from comparing both charts for each user group’s steps:

Relationship between calories and steps

cal_steps <- ggplot(data=user_activity, aes(x=calories, y=TotalSteps)) + 
  geom_point() + geom_smooth() + labs(title="Calories vs. Total Steps")

cal_steps
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

There was a positive relationship between calories and total steps, which was to be expected.
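The scatter plot shows the direction of the relationship; a correlation coefficient and a simple linear fit would also indicate its strength. The sketch below is a hedged illustration only: it runs on simulated data (the data frame `toy` and every number used to generate it are invented), so its output says nothing about the actual Fitbit data. To apply it to the study data, `toy` would be replaced with user_activity.

```r
set.seed(42)

# Simulated stand-in for user_activity; these values are invented
n <- 200
toy <- data.frame(TotalSteps = round(runif(n, min = 0, max = 20000)))
toy$calories <- 1800 + 0.06 * toy$TotalSteps + rnorm(n, sd = 300)

# Pearson correlation: direction and strength of the linear relationship
r <- cor(toy$calories, toy$TotalSteps)

# Linear fit: the slope estimates the extra calories per additional step
fit <- lm(calories ~ TotalSteps, data = toy)

round(r, 2)
coef(fit)
```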

Recommendations

  1. Market and encourage the use of features such as sleep tracking in addition to activity tracking.
    • Users who logged their sleep information were more active on average than the full set of users.
  2. Find ways to improve the weight log feature. Collect user feedback on that feature to understand why it is not being used.
    • Find out whether users are comfortable using it and how it could be improved.
  3. Add push notifications or reminders to walk on the days of the week when users tend to take fewer steps.
    • Marketing could focus on how the app can use your weekly information to suggest activity plans for each day of the week.
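As a rough illustration of recommendation 3, the days to target could be chosen by flagging weekdays whose average step count falls below the weekly mean. The sketch below retypes the rounded weekday averages from the sleep-logging users' table earlier in this report, so it covers only that subgroup. The cutoff rule (below the mean of the seven weekday averages) is an assumption made for this example, not a rule taken from the analysis.

```r
# Weekday step averages copied from the sleep-logging users' summary table
weekday_steps <- c(Sun = 7298, Mon = 9273, Tue = 9183, Wed = 8023,
                   Thu = 8184, Fri = 7901, Sat = 9871)

# Assumed rule: flag days below the mean of the weekday averages
threshold <- mean(weekday_steps)
low_days <- names(weekday_steps)[weekday_steps < threshold]
low_days
```

A different cutoff, such as the median or a fixed daily step goal, would flag a different set of days.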