Introduction and Background

Bellabeat is a data-centered wellness femtech company located in San Francisco, founded by Urška Sršen and Sandro Mur in 2013. It designs and engineers fashionable wellness trackers that help users manage their health and wellness while fitting women’s lifestyles on any occasion.

While the company’s growth has extended internationally, Sršen, the CEO, is looking for other growth opportunities. She’s particularly interested in analyzing usage data from users of non-Bellabeat devices, such as Fitbit. She believes that these analyses could reveal trends that point to growth opportunities and help Bellabeat shape its marketing plan. To this end, I have been asked to serve as a data analyst for Bellabeat’s Marketing team (fictional) and to gather insights by analyzing device usage data from Fitbit consumers.

The analysis will follow Google Data Analytics’ framework, namely the six phases of data analysis: Ask, Prepare, Process, Analyze, Share, and Act.

1. Ask

In this phase of the analysis, the focus will be on identifying the business task and stakeholders, and determining how the insights gained could help to drive business decisions forward.

Statement of Business Tasks

The task entails utilizing insights gained from the analysis to provide answers to the following questions:

What are some trends in smart device usage?
How could these trends apply to Bellabeat customers?
How could these trends help influence Bellabeat’s marketing strategy?

Key Stakeholders:

Urška Sršen: co-founder and CEO
Sandro Mur: mathematician and co-founder; key member of the executive team

2. Prepare

This stage begins by examining the raw data used for this analysis for integrity, security, and credibility. The suggested data set is the Fitbit Fitness Tracker Data by Mobius, which is publicly available (CC0: Public Domain) on Kaggle. According to the website, the datasets were generated by respondents to a survey distributed via Amazon Mechanical Turk over a period of two months, from 2016-03-12 to 2016-05-12. Thirty eligible Fitbit users consented to the submission of personal tracker data, including daily, hourly, and minute-level output for physical activity, heart rate (per second), and sleep monitoring. The data are spread across 18 .csv files arranged in long format.

ROCCC (Reliable, Original, Comprehensive, Current, and Cited) Analysis

Reliability: Low - data were collected for only 30 individuals of unknown gender, age, and ethnicity over a period of only about 60 days. It is unclear whether the sample and the duration of usage are representative. Because Bellabeat’s tracker devices are made for women’s health and wellness, this analysis assumes that all participants are women; this is an assumption only, since the data set does not record gender. Gender data would therefore matter mainly if Bellabeat planned to extend its offerings to men.
Original: Medium - although the data were collected by a third party, they were provided directly by actual Fitbit users, albeit via Amazon Mechanical Turk, and participation was voluntary.
Comprehensive: Medium-High - although the data contain the information needed for the analysis, some data tables do not have entries for all 30 individuals, and in some cases an entire column is blank.
Current: No - the data are almost seven years old as of 2023-02, so both the data and the device capabilities are outdated.
Cited: Yes - the source is well documented and contains metadata.

Data Selection

The tracker data were recorded at daily, hourly, and minute levels. For this study, daily-level activity data, which include calories burned, give a higher-level view for the analysis. Although the weightLogInfo_merged.csv file contains tracker data for only eight users, it would be worth examining how this group of users chose the intensity and duration of their activities, and the resulting calories burned. BMI (kg/m²) = weight/height², with weight in kilograms and height in meters, can be used as a screening tool to identify potential weight problems of an individual. It would also be interesting to find out whether those who logged weight data were underweight (BMI < 18.5), normal weight (BMI = 18.5-24.9), overweight (BMI = 25-29.9), or obese (BMI ≥ 30), according to the CDC. The files chosen for this study are as follows:

dailyActivity_merged.csv
sleepDay_merged.csv
weightLogInfo_merged.csv
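
The BMI formula and the CDC bands quoted in the Data Selection paragraph can be sketched in R as follows. This is only an illustration: compute_bmi() and classify_bmi() are hypothetical helper names that are not part of the original analysis, and the bands are implemented as half-open intervals so that values between 24.9 and 25 (or 29.9 and 30) fall into the lower band.

```r
# Hypothetical helper: BMI from weight in kilograms and height in meters
compute_bmi <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}

# Hypothetical helper: map BMI values to the CDC adult weight-status bands
classify_bmi <- function(bmi) {
  cut(
    bmi,
    breaks = c(-Inf, 18.5, 25, 30, Inf),
    labels = c("Underweight", "Normal weight", "Overweight", "Obese"),
    right = FALSE  # intervals are [lower, upper), so a BMI of exactly 30 is "Obese"
  )
}

# Example BMI values of the kind found in the bmi column of the weight log
bmi_examples <- c(22.65, 47.54, 27.45)
bmi_categories <- as.character(classify_bmi(bmi_examples))
print(bmi_categories)
```

Applied to the real data, the same function could be used on the bmi column of the weight log data frame once it has been imported.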

3. Process

Loading common packages and libraries

library(magrittr)
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::extract()   masks magrittr::extract()
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ purrr::set_names() masks magrittr::set_names()

library(dplyr)
library(lubridate)

## Loading required package: timechange
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(readr)
library(ggplot2)
library(janitor)

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(skimr)
library(hms)

## 
## Attaching package: 'hms'
## 
## The following object is masked from 'package:lubridate':
## 
##     hms

library(readr)
library(knitr)
library(scales)

## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

Import the data into R, creating one data frame per imported file, and then inspect the columns of each data frame.

daily_activity <- read.csv("~/Downloads/dailyActivity_merged.csv")

sleep_day <- read.csv("~/Downloads/sleepDay_merged.csv")

weight_log_info <- read.csv("~/Downloads/weightLogInfo_merged.csv")

Inspect the columns of each new data frame using the head() and glimpse() functions, starting with daily_activity.

head(daily_activity)

##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366    4/12/2016      13162          8.50            8.50
## 2 1503960366    4/13/2016      10735          6.97            6.97
## 3 1503960366    4/14/2016      10460          6.74            6.74
## 4 1503960366    4/15/2016       9762          6.28            6.28
## 5 1503960366    4/16/2016      12669          8.16            8.16
## 6 1503960366    4/17/2016       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728

glimpse(daily_activity)

## Rows: 940
## Columns: 15
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate             <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ TotalSteps               <int> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes        <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes      <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes     <int> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes         <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories                 <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…

head(sleep_day)

##           Id              SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304
##   TotalTimeInBed
## 1            346
## 2            407
## 3            442
## 4            367
## 5            712
## 6            320

glimpse(sleep_day)

## Rows: 413
## Columns: 5
## $ Id                 <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150…
## $ SleepDay           <chr> "4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM", "…
## $ TotalSleepRecords  <int> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ TotalMinutesAsleep <int> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2…
## $ TotalTimeInBed     <int> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3…

head(weight_log_info)

##           Id       Date WeightKg WeightPounds Fat   BMI IsManualReport
## 1 1503960366 2016-05-02     52.6     115.9631  22 22.65           TRUE
## 2 1503960366 2016-05-03     52.6     115.9631  NA 22.65           TRUE
## 3 1927972279 2016-04-13    133.5     294.3171  NA 47.54          FALSE
## 4 2873212765 2016-04-21     56.7     125.0021  NA 21.45           TRUE
## 5 2873212765 2016-05-12     57.3     126.3249  NA 21.69           TRUE
## 6 4319703577 2016-04-17     72.4     159.6147  25 27.45           TRUE
##         LogId
## 1 1.46223e+12
## 2 1.46232e+12
## 3 1.46051e+12
## 4 1.46128e+12
## 5 1.46310e+12
## 6 1.46094e+12

glimpse(weight_log_info)

## Rows: 67
## Columns: 8
## $ Id             <dbl> 1503960366, 1503960366, 1927972279, 2873212765, 2873212…
## $ Date           <chr> "2016-05-02", "2016-05-03", "2016-04-13", "2016-04-21",…
## $ WeightKg       <dbl> 52.6, 52.6, 133.5, 56.7, 57.3, 72.4, 72.3, 69.7, 70.3, …
## $ WeightPounds   <dbl> 115.9631, 115.9631, 294.3171, 125.0021, 126.3249, 159.6…
## $ Fat            <int> 22, NA, NA, NA, NA, 25, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ BMI            <dbl> 22.65, 22.65, 47.54, 21.45, 21.69, 27.45, 27.38, 27.25,…
## $ IsManualReport <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
## $ LogId          <dbl> 1.46223e+12, 1.46232e+12, 1.46051e+12, 1.46128e+12, 1.4…

Look at some summary statistics

Find the number of unique participants in each data frame

n_distinct(daily_activity$Id)

## [1] 33

n_distinct(sleep_day$Id)

## [1] 24

n_distinct(weight_log_info$Id)

## [1] 8

Data Cleaning

Checking each data frame for duplicates.

sum(duplicated(daily_activity))

## [1] 0

sum(duplicated(sleep_day))

## [1] 3

sum(duplicated(weight_log_info))

## [1] 0

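Removing duplicates and NA from the daily_activity and sleep_day data frames. Note that applying drop_na() to weight_log_info would discard most of its rows, since its Fat column is mostly NA.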

daily_activity <- daily_activity %>%
  distinct() %>%
  drop_na()

sleep_day <- sleep_day %>%
  distinct() %>%
  drop_na()

Check that duplicates were removed from sleep_day

sum(duplicated(sleep_day))

## [1] 0
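
The check above covers duplicates only. A minimal sketch of checking for both duplicates and missing values after cleaning is shown below; toy_sleep is a hypothetical stand-in for sleep_day with made-up values, not the real data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for sleep_day (values are illustrative only)
toy_sleep <- data.frame(
  Id = c(1, 1, 2, 2),
  TotalMinutesAsleep = c(327, 327, 384, NA)
)

# Same cleaning steps as above: drop exact duplicate rows, then rows with any NA
toy_clean <- toy_sleep %>%
  distinct() %>%
  drop_na()

# Both counts should be zero after cleaning
duplicates_left <- sum(duplicated(toy_clean))
missing_left <- sum(is.na(toy_clean))
print(c(duplicates = duplicates_left, missing = missing_left))
```

The same two counts could be computed on the cleaned sleep_day and daily_activity data frames.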

Cleaning column names

Calling clean_names() without assigning its result only prints the cleaned data frame, so the result is assigned back to the data frame.

daily_activity <- daily_activity %>% clean_names()

### Check if column names are now 'clean' - all columns are in a consistent format

colnames(daily_activity)

##  [1] "id"                         "activity_date"             
##  [3] "total_steps"                "total_distance"            
##  [5] "tracker_distance"           "logged_activities_distance"
##  [7] "very_active_distance"       "moderately_active_distance"
##  [9] "light_active_distance"      "sedentary_active_distance" 
## [11] "very_active_minutes"        "fairly_active_minutes"     
## [13] "lightly_active_minutes"     "sedentary_minutes"         
## [15] "calories"

glimpse(daily_activity)

## Rows: 940
## Columns: 15
## $ id                         <dbl> 1503960366, 1503960366, 1503960366, 1503960…
## $ activity_date              <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/1…
## $ total_steps                <int> 13162, 10735, 10460, 9762, 12669, 9705, 130…
## $ total_distance             <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9…
## $ tracker_distance           <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9…
## $ logged_activities_distance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_distance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3…
## $ moderately_active_distance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1…
## $ light_active_distance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5…
## $ sedentary_active_distance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_minutes        <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66,…
## $ fairly_active_minutes      <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, …
## $ lightly_active_minutes     <int> 328, 217, 181, 209, 221, 164, 233, 264, 205…
## $ sedentary_minutes          <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 8…
## $ calories                   <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2…

### Clean the sleep_day data frame
sleep_day <- sleep_day %>% clean_names()

### Check if column names are now clean
colnames(sleep_day)

## [1] "id"                   "sleep_day"            "total_sleep_records" 
## [4] "total_minutes_asleep" "total_time_in_bed"

glimpse(sleep_day)

## Rows: 410
## Columns: 5
## $ id                   <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1…
## $ sleep_day            <chr> "4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM",…
## $ total_sleep_records  <int> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ total_minutes_asleep <int> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430,…
## $ total_time_in_bed    <int> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449,…

### Clean the weight_log_info data frame

weight_log_info <- weight_log_info %>% clean_names()

### Check if column names are now 'clean'

colnames(weight_log_info)

## [1] "id"               "date"             "weight_kg"        "weight_pounds"   
## [5] "fat"              "bmi"              "is_manual_report" "log_id"

glimpse(weight_log_info)

## Rows: 67
## Columns: 8
## $ id               <dbl> 1503960366, 1503960366, 1927972279, 2873212765, 28732…
## $ date             <chr> "2016-05-02", "2016-05-03", "2016-04-13", "2016-04-21…
## $ weight_kg        <dbl> 52.6, 52.6, 133.5, 56.7, 57.3, 72.4, 72.3, 69.7, 70.3…
## $ weight_pounds    <dbl> 115.9631, 115.9631, 294.3171, 125.0021, 126.3249, 159…
## $ fat              <int> 22, NA, NA, NA, NA, 25, NA, NA, NA, NA, NA, NA, NA, N…
## $ bmi              <dbl> 22.65, 22.65, 47.54, 21.45, 21.69, 27.45, 27.38, 27.2…
## $ is_manual_report <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
## $ log_id           <dbl> 1.46223e+12, 1.46232e+12, 1.46051e+12, 1.46128e+12, 1…

Convert the date columns of daily_activity and sleep_day to the Date type (ISO format: yyyy-mm-dd), and rename the date column of every data frame to “Date”.

daily_activity <- daily_activity %>%
  rename(Date = activity_date) %>%
  mutate(Date = as_date(Date, format = "%m/%d/%Y"))

### Check to see if the data type of the dates has been changed 

head(daily_activity)

##           id       Date total_steps total_distance tracker_distance
## 1 1503960366 2016-04-12       13162           8.50             8.50
## 2 1503960366 2016-04-13       10735           6.97             6.97
## 3 1503960366 2016-04-14       10460           6.74             6.74
## 4 1503960366 2016-04-15        9762           6.28             6.28
## 5 1503960366 2016-04-16       12669           8.16             8.16
## 6 1503960366 2016-04-17        9705           6.48             6.48
##   logged_activities_distance very_active_distance moderately_active_distance
## 1                          0                 1.88                       0.55
## 2                          0                 1.57                       0.69
## 3                          0                 2.44                       0.40
## 4                          0                 2.14                       1.26
## 5                          0                 2.71                       0.41
## 6                          0                 3.19                       0.78
##   light_active_distance sedentary_active_distance very_active_minutes
## 1                  6.06                         0                  25
## 2                  4.71                         0                  21
## 3                  3.91                         0                  30
## 4                  2.83                         0                  29
## 5                  5.04                         0                  36
## 6                  2.51                         0                  38
##   fairly_active_minutes lightly_active_minutes sedentary_minutes calories
## 1                    13                    328               728     1985
## 2                    19                    217               776     1797
## 3                    11                    181              1218     1776
## 4                    34                    209               726     1745
## 5                    10                    221               773     1863
## 6                    20                    164               539     1728

glimpse(daily_activity)

## Rows: 940
## Columns: 15
## $ id                         <dbl> 1503960366, 1503960366, 1503960366, 1503960…
## $ Date                       <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-0…
## $ total_steps                <int> 13162, 10735, 10460, 9762, 12669, 9705, 130…
## $ total_distance             <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9…
## $ tracker_distance           <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9…
## $ logged_activities_distance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_distance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3…
## $ moderately_active_distance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1…
## $ light_active_distance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5…
## $ sedentary_active_distance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_minutes        <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66,…
## $ fairly_active_minutes      <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, …
## $ lightly_active_minutes     <int> 328, 217, 181, 209, 221, 164, 233, 264, 205…
## $ sedentary_minutes          <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 8…
## $ calories                   <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2…

sleep_day <- sleep_day %>%
  rename(Date = sleep_day) %>%
  mutate(Date = as_date(Date, format = "%m/%d/%Y %I:%M:%S %p"))

### Check to see if the data type of the dates has been changed

head(sleep_day)

##           id       Date total_sleep_records total_minutes_asleep
## 1 1503960366 2016-04-12                   1                  327
## 2 1503960366 2016-04-13                   2                  384
## 3 1503960366 2016-04-15                   1                  412
## 4 1503960366 2016-04-16                   2                  340
## 5 1503960366 2016-04-17                   1                  700
## 6 1503960366 2016-04-19                   1                  304
##   total_time_in_bed
## 1               346
## 2               407
## 3               442
## 4               367
## 5               712
## 6               320

colnames(sleep_day)

## [1] "id"                   "Date"                 "total_sleep_records" 
## [4] "total_minutes_asleep" "total_time_in_bed"

glimpse(sleep_day)

## Rows: 410
## Columns: 5
## $ id                   <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1…
## $ Date                 <date> 2016-04-12, 2016-04-13, 2016-04-15, 2016-04-16, …
## $ total_sleep_records  <int> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ total_minutes_asleep <int> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430,…
## $ total_time_in_bed    <int> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449,…

### The dates in the weight_log_info data frame were cleaned beforehand in Excel,
### so they are already in ISO format; only the column needs renaming, from
### "date" to "Date".

weight_log_info <- weight_log_info %>%
  rename("Date" = "date")

### Check that the column has been renamed; note that Date is still stored as character (<chr>)
head(weight_log_info)

##           id       Date weight_kg weight_pounds fat   bmi is_manual_report
## 1 1503960366 2016-05-02      52.6      115.9631  22 22.65             TRUE
## 2 1503960366 2016-05-03      52.6      115.9631  NA 22.65             TRUE
## 3 1927972279 2016-04-13     133.5      294.3171  NA 47.54            FALSE
## 4 2873212765 2016-04-21      56.7      125.0021  NA 21.45             TRUE
## 5 2873212765 2016-05-12      57.3      126.3249  NA 21.69             TRUE
## 6 4319703577 2016-04-17      72.4      159.6147  25 27.45             TRUE
##        log_id
## 1 1.46223e+12
## 2 1.46232e+12
## 3 1.46051e+12
## 4 1.46128e+12
## 5 1.46310e+12
## 6 1.46094e+12

glimpse(weight_log_info)

## Rows: 67
## Columns: 8
## $ id               <dbl> 1503960366, 1503960366, 1927972279, 2873212765, 28732…
## $ Date             <chr> "2016-05-02", "2016-05-03", "2016-04-13", "2016-04-21…
## $ weight_kg        <dbl> 52.6, 52.6, 133.5, 56.7, 57.3, 72.4, 72.3, 69.7, 70.3…
## $ weight_pounds    <dbl> 115.9631, 115.9631, 294.3171, 125.0021, 126.3249, 159…
## $ fat              <int> 22, NA, NA, NA, NA, 25, NA, NA, NA, NA, NA, NA, NA, N…
## $ bmi              <dbl> 22.65, 22.65, 47.54, 21.45, 21.69, 27.45, 27.38, 27.2…
## $ is_manual_report <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
## $ log_id           <dbl> 1.46223e+12, 1.46232e+12, 1.46051e+12, 1.46128e+12, 1…
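
The glimpse() output shows that the Date column of weight_log_info is still of type <chr>, unlike the Date columns of the other two data frames. If a true Date type is needed, for example before joining on Date, the conversion could be sketched as follows. toy_weight is a hypothetical stand-in for weight_log_info, and this step is not part of the original analysis.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data standing in for weight_log_info
toy_weight <- data.frame(
  id = c(1503960366, 1927972279),
  Date = c("2016-05-02", "2016-04-13"),
  stringsAsFactors = FALSE
)

# ISO-formatted strings (yyyy-mm-dd) are parsed by as_date() without a format argument
toy_weight <- toy_weight %>%
  mutate(Date = as_date(Date))

print(class(toy_weight$Date))
```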

4. Analyze

Look at some summary statistics

For the daily activity data frame:

daily_activity %>%
  select(total_steps, total_distance, sedentary_minutes) %>%
  summary()

##   total_steps    total_distance   sedentary_minutes
##  Min.   :    0   Min.   : 0.000   Min.   :   0.0   
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 729.8   
##  Median : 7406   Median : 5.245   Median :1057.5   
##  Mean   : 7638   Mean   : 5.490   Mean   : 991.2   
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:1229.5   
##  Max.   :36019   Max.   :28.030   Max.   :1440.0

For the sleep data frame:

sleep_day %>%
  select(total_sleep_records,
         total_minutes_asleep,
         total_time_in_bed) %>%
  summary()

##  total_sleep_records total_minutes_asleep total_time_in_bed
##  Min.   :1.00        Min.   : 58.0        Min.   : 61.0    
##  1st Qu.:1.00        1st Qu.:361.0        1st Qu.:403.8    
##  Median :1.00        Median :432.5        Median :463.0    
##  Mean   :1.12        Mean   :419.2        Mean   :458.5    
##  3rd Qu.:1.00        3rd Qu.:490.0        3rd Qu.:526.0    
##  Max.   :3.00        Max.   :796.0        Max.   :961.0

Exploring a few plots

First, we’ll examine the relationship between steps taken in a day and sedentary minutes. The result might suggest ways to encourage consumers to walk more.

ggplot(data=daily_activity, aes(x=total_steps, y=sedentary_minutes, color=calories)) + 
  geom_point() + 
  labs(title = "Figure 1: How Sedentary Time Varies with Step Counts") + 
  theme(plot.title=element_text(face="bold"))

Figure 1 Analysis

At first glance, it doesn’t seem to make sense that the plot of sedentary_minutes vs. total_steps contains two clusters of points. One clear observation is that, in both clusters, sedentary minutes decrease as step counts increase. This makes sense, since the more time one spends walking or running, the less time is left for sedentary activities.

The presence of two clusters of points with similar trends could be due to the intensity level of the activity (steps) one engages in. Light activity, e.g., slow walking, burns calories at a slower rate and takes more time to reach the same step count than a fast stride, leaving less time for sedentary activities. The opposite is true for fast walking (very_active_minutes), which leaves users with more time for sedentary activities. Under this interpretation, the more intense activities (e.g., fast walking) would form the upper cluster of points and the less intense activities (e.g., slow walking) the lower cluster.
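
The visual impression of a negative relationship could be quantified with a correlation coefficient. The sketch below uses only the six rows of daily_activity printed earlier by head(), so its result is not representative of the full data set; first_rows and steps_sedentary_cor are hypothetical names introduced here for illustration.

```r
# First six rows of daily_activity, as printed by head() earlier in this report
first_rows <- data.frame(
  total_steps = c(13162, 10735, 10460, 9762, 12669, 9705),
  sedentary_minutes = c(728, 776, 1218, 726, 773, 539)
)

# Pearson correlation between daily steps and sedentary minutes
steps_sedentary_cor <- cor(
  first_rows$total_steps,
  first_rows$sedentary_minutes,
  method = "pearson"
)
print(steps_sedentary_cor)
```

On the real data, first_rows would be replaced by daily_activity. Because the plot shows two clusters, a single coefficient for the whole data set may understate the trend within each cluster.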

Next, we’ll examine the relationship between minutes asleep and time in bed.

ggplot(data=sleep_day, aes(x=total_minutes_asleep, y=total_time_in_bed, color=total_time_in_bed)) +
  geom_point() + 
  labs(title = "Figure 2: Relation between Time in Bed and Time Asleep") + 
  theme(plot.title=element_text(face="bold"))

Figure 2 Analysis

In this plot, there are two groups of points parallel to each other, with the smaller group above the larger one between ~180 minutes (3 hours) and ~400 minutes (6.7 hours) of total_minutes_asleep. Based on the summary statistics, users spend an average of ~39 minutes awake in bed (mean total_time_in_bed of 458.5 minutes minus mean total_minutes_asleep of 419.2 minutes). This could be indicative of either poor sleepers or users doing other things in bed, such as reading a book or checking their smartphone.
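
The time spent awake in bed can also be computed per record rather than read off the plot. The sketch below uses the six rows of sleep_day printed earlier by head(); the column name minutes_awake_in_bed is a hypothetical addition, and six rows are not representative of the full data set.

```r
library(dplyr)

# First six rows of sleep_day, as printed by head() earlier in this report
first_sleep_rows <- data.frame(
  total_minutes_asleep = c(327, 384, 412, 340, 700, 304),
  total_time_in_bed = c(346, 407, 442, 367, 712, 320)
)

# Minutes awake in bed = time in bed minus time asleep, per record
first_sleep_rows <- first_sleep_rows %>%
  mutate(minutes_awake_in_bed = total_time_in_bed - total_minutes_asleep)

mean_awake <- mean(first_sleep_rows$minutes_awake_in_bed)
print(first_sleep_rows)
print(mean_awake)
```

Running the same mutate() on the full sleep_day data frame would give the per-night values behind the ~39 minute average.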

Merging the first two datasets and taking a look at all the columns of the merged data frame.

merged_data <- merge(sleep_day, daily_activity, by = c("id", "Date"))
head(merged_data)

##           id       Date total_sleep_records total_minutes_asleep
## 1 1503960366 2016-04-12                   1                  327
## 2 1503960366 2016-04-13                   2                  384
## 3 1503960366 2016-04-15                   1                  412
## 4 1503960366 2016-04-16                   2                  340
## 5 1503960366 2016-04-17                   1                  700
## 6 1503960366 2016-04-19                   1                  304
##   total_time_in_bed total_steps total_distance tracker_distance
## 1               346       13162           8.50             8.50
## 2               407       10735           6.97             6.97
## 3               442        9762           6.28             6.28
## 4               367       12669           8.16             8.16
## 5               712        9705           6.48             6.48
## 6               320       15506           9.88             9.88
##   logged_activities_distance very_active_distance moderately_active_distance
## 1                          0                 1.88                       0.55
## 2                          0                 1.57                       0.69
## 3                          0                 2.14                       1.26
## 4                          0                 2.71                       0.41
## 5                          0                 3.19                       0.78
## 6                          0                 3.53                       1.32
##   light_active_distance sedentary_active_distance very_active_minutes
## 1                  6.06                         0                  25
## 2                  4.71                         0                  21
## 3                  2.83                         0                  29
## 4                  5.04                         0                  36
## 5                  2.51                         0                  38
## 6                  5.03                         0                  50
##   fairly_active_minutes lightly_active_minutes sedentary_minutes calories
## 1                    13                    328               728     1985
## 2                    19                    217               776     1797
## 3                    34                    209               726     1745
## 4                    10                    221               773     1863
## 5                    20                    164               539     1728
## 6                    31                    264               775     2035

glimpse(merged_data)

## Rows: 410
## Columns: 18
## $ id                         <dbl> 1503960366, 1503960366, 1503960366, 1503960…
## $ Date                       <date> 2016-04-12, 2016-04-13, 2016-04-15, 2016-0…
## $ total_sleep_records        <int> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ total_minutes_asleep       <int> 327, 384, 412, 340, 700, 304, 360, 325, 361…
## $ total_time_in_bed          <int> 346, 407, 442, 367, 712, 320, 377, 364, 384…
## $ total_steps                <int> 13162, 10735, 9762, 12669, 9705, 15506, 105…
## $ total_distance             <dbl> 8.50, 6.97, 6.28, 8.16, 6.48, 9.88, 6.68, 6…
## $ tracker_distance           <dbl> 8.50, 6.97, 6.28, 8.16, 6.48, 9.88, 6.68, 6…
## $ logged_activities_distance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_distance       <dbl> 1.88, 1.57, 2.14, 2.71, 3.19, 3.53, 1.96, 1…
## $ moderately_active_distance <dbl> 0.55, 0.69, 1.26, 0.41, 0.78, 1.32, 0.48, 0…
## $ light_active_distance      <dbl> 6.06, 4.71, 2.83, 5.04, 2.51, 5.03, 4.24, 4…
## $ sedentary_active_distance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_minutes        <int> 25, 21, 29, 36, 38, 50, 28, 19, 41, 39, 73,…
## $ fairly_active_minutes      <int> 13, 19, 34, 10, 20, 31, 12, 8, 21, 5, 14, 2…
## $ lightly_active_minutes     <int> 328, 217, 209, 221, 164, 264, 205, 211, 262…
## $ sedentary_minutes          <int> 728, 776, 726, 773, 539, 775, 818, 838, 732…
## $ calories                   <int> 1985, 1797, 1745, 1863, 1728, 2035, 1786, 1…

Take a look at how many participants are in the merged_data (inner join)

n_distinct(merged_data$id)

## [1] 24

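There are only 24 distinct ids. By default, merge() performs an inner join, which keeps only the id and Date combinations present in both data frames, so the 9 participants in daily_activity who have no sleep records were dropped during the merge.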
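
The difference between an inner join and a full join in merge() can be sketched with toy data. The data frames toy_activity and toy_sleep below are hypothetical, with made-up values; only the merge() arguments mirror the ones used in this report.

```r
# Hypothetical toy data: three participants with activity, two with sleep records
toy_activity <- data.frame(
  id = c(1, 1, 2, 3),
  Date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12", "2016-04-12")),
  total_steps = c(13162, 10735, 8000, 5000)
)
toy_sleep <- data.frame(
  id = c(1, 2),
  Date = as.Date(c("2016-04-12", "2016-04-12")),
  total_minutes_asleep = c(327, 400)
)

# Default merge(): inner join, keeps only id/Date pairs found in both data frames
inner_rows <- merge(toy_sleep, toy_activity, by = c("id", "Date"))

# all = TRUE: full join, keeps every id/Date pair and fills the gaps with NA
full_rows <- merge(toy_activity, toy_sleep, by = c("id", "Date"), all = TRUE)

print(length(unique(inner_rows$id)))
print(length(unique(full_rows$id)))
```

In the full join, rows without a matching sleep record carry NA in the sleep columns, which needs to be kept in mind when summarizing or plotting the combined data.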

We’ll execute a full join to include all the participants from both datasets.

combined_data <- merge(daily_activity, sleep_day, by = c("id", "Date"), all = TRUE)

Check if all the columns are there using the head(), colnames() and glimpse() functions.

head(combined_data)

##           id       Date total_steps total_distance tracker_distance
## 1 1503960366 2016-04-12       13162           8.50             8.50
## 2 1503960366 2016-04-13       10735           6.97             6.97
## 3 1503960366 2016-04-14       10460           6.74             6.74
## 4 1503960366 2016-04-15        9762           6.28             6.28
## 5 1503960366 2016-04-16       12669           8.16             8.16
## 6 1503960366 2016-04-17        9705           6.48             6.48
##   logged_activities_distance very_active_distance moderately_active_distance
## 1                          0                 1.88                       0.55
## 2                          0                 1.57                       0.69
## 3                          0                 2.44                       0.40
## 4                          0                 2.14                       1.26
## 5                          0                 2.71                       0.41
## 6                          0                 3.19                       0.78
##   light_active_distance sedentary_active_distance very_active_minutes
## 1                  6.06                         0                  25
## 2                  4.71                         0                  21
## 3                  3.91                         0                  30
## 4                  2.83                         0                  29
## 5                  5.04                         0                  36
## 6                  2.51                         0                  38
##   fairly_active_minutes lightly_active_minutes sedentary_minutes calories
## 1                    13                    328               728     1985
## 2                    19                    217               776     1797
## 3                    11                    181              1218     1776
## 4                    34                    209               726     1745
## 5                    10                    221               773     1863
## 6                    20                    164               539     1728
##   total_sleep_records total_minutes_asleep total_time_in_bed
## 1                   1                  327               346
## 2                   2                  384               407
## 3                  NA                   NA                NA
## 4                   1                  412               442
## 5                   2                  340               367
## 6                   1                  700               712

colnames(combined_data)

##  [1] "id"                         "Date"                      
##  [3] "total_steps"                "total_distance"            
##  [5] "tracker_distance"           "logged_activities_distance"
##  [7] "very_active_distance"       "moderately_active_distance"
##  [9] "light_active_distance"      "sedentary_active_distance" 
## [11] "very_active_minutes"        "fairly_active_minutes"     
## [13] "lightly_active_minutes"     "sedentary_minutes"         
## [15] "calories"                   "total_sleep_records"       
## [17] "total_minutes_asleep"       "total_time_in_bed"

glimpse(combined_data)

## Rows: 940
## Columns: 18
## $ id                         <dbl> 1503960366, 1503960366, 1503960366, 1503960…
## $ Date                       <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-0…
## $ total_steps                <int> 13162, 10735, 10460, 9762, 12669, 9705, 130…
## $ total_distance             <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9…
## $ tracker_distance           <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9…
## $ logged_activities_distance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_distance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3…
## $ moderately_active_distance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1…
## $ light_active_distance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5…
## $ sedentary_active_distance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_minutes        <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66,…
## $ fairly_active_minutes      <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, …
## $ lightly_active_minutes     <int> 328, 217, 181, 209, 221, 164, 233, 264, 205…
## $ sedentary_minutes          <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 8…
## $ calories                   <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2…
## $ total_sleep_records        <int> 1, 2, NA, 1, 2, 1, NA, 1, 1, 1, NA, 1, 1, 1…
## $ total_minutes_asleep       <int> 327, 384, NA, 412, 340, 700, NA, 304, 360, …
## $ total_time_in_bed          <int> 346, 407, NA, 442, 367, 712, NA, 320, 377, …

Check total participants in combined_data - are all 33 participants included after the full join?

n_distinct(combined_data$id)

## [1] 33

Yes, combined_data contains all 33 participants.

Next, we’ll remove any NAs and duplicates that exist in the dataset.

Count the NAs in combined_data

sum(is.na(combined_data))

## [1] 1590
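To see where the NAs sit, they can also be counted per column. A minimal sketch on the first three rows shown in the head() output above (toy is an illustrative name); on combined_data one would presumably find the NAs in the three sleep columns, since those are the columns the full join leaves empty for activity days without a sleep record:

```r
# First three rows of three columns, copied from the head() output above
toy <- data.frame(
  total_steps          = c(13162, 10735, 10460),
  total_minutes_asleep = c(327, 384, NA),
  total_time_in_bed    = c(346, 407, NA)
)

# Number of NAs in each column
na_per_column <- colSums(is.na(toy))
na_per_column
```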

Replace NA with ‘0’

combined_data <- combined_data %>%
  mutate_if(is.numeric, ~replace(., is.na(.), 0))

### Check for NA again to make sure they are all gone
sum(is.na(combined_data))

## [1] 0

Great, no more NAs.
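Keep in mind that a zero in the sleep columns now means "no sleep record" rather than "zero minutes asleep", so averages over those columns are pulled down unless the zeros are excluded. A small sketch using the first four total_minutes_asleep values shown above (the variable names are illustrative only):

```r
# First four rows of total_minutes_asleep before the replacement
minutes_asleep <- c(327, 384, NA, 412)
zero_filled    <- replace(minutes_asleep, is.na(minutes_asleep), 0)

mean(minutes_asleep, na.rm = TRUE)  # average over nights that were actually recorded
mean(zero_filled)                   # lower, because the missing night counts as 0
mean(zero_filled[zero_filled > 0])  # excluding the zeros recovers the first figure
```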

Next, we’ll look at users’ activity patterns through the week, including the associated step counts, calories burnt, and time spent on various activities (overall, very intense, and sedentary). To do that, we first add a ‘weekday’ column derived from Date.

combined_data$weekday <- wday(combined_data$Date, label=TRUE, abbr=FALSE)
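wday() comes from the lubridate package; with label=TRUE and abbr=FALSE it returns an ordered factor of full day names. The base-R sketch below reproduces the idea without lubridate, purely as an illustration (day_names and weekday_of are hypothetical helpers, not part of the original analysis):

```r
# Full day names in Sunday-first order
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")

# as.POSIXlt()$wday gives 0 for Sunday through 6 for Saturday
weekday_of <- function(dates) {
  idx <- as.POSIXlt(dates)$wday + 1
  factor(day_names[idx], levels = day_names, ordered = TRUE)
}

weekday_of(as.Date(c("2016-04-12", "2016-04-17")))  # Tuesday, Sunday
```

Because the factor is ordered, grouping or plotting by it keeps the days in calendar order rather than alphabetical order.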

Check if the ‘weekday’ column is added using the head() and glimpse() functions

head(combined_data)

##           id       Date total_steps total_distance tracker_distance
## 1 1503960366 2016-04-12       13162           8.50             8.50
## 2 1503960366 2016-04-13       10735           6.97             6.97
## 3 1503960366 2016-04-14       10460           6.74             6.74
## 4 1503960366 2016-04-15        9762           6.28             6.28
## 5 1503960366 2016-04-16       12669           8.16             8.16
## 6 1503960366 2016-04-17        9705           6.48             6.48
##   logged_activities_distance very_active_distance moderately_active_distance
## 1                          0                 1.88                       0.55
## 2                          0                 1.57                       0.69
## 3                          0                 2.44                       0.40
## 4                          0                 2.14                       1.26
## 5                          0                 2.71                       0.41
## 6                          0                 3.19                       0.78
##   light_active_distance sedentary_active_distance very_active_minutes
## 1                  6.06                         0                  25
## 2                  4.71                         0                  21
## 3                  3.91                         0                  30
## 4                  2.83                         0                  29
## 5                  5.04                         0                  36
## 6                  2.51                         0                  38
##   fairly_active_minutes lightly_active_minutes sedentary_minutes calories
## 1                    13                    328               728     1985
## 2                    19                    217               776     1797
## 3                    11                    181              1218     1776
## 4                    34                    209               726     1745
## 5                    10                    221               773     1863
## 6                    20                    164               539     1728
##   total_sleep_records total_minutes_asleep total_time_in_bed   weekday
## 1                   1                  327               346   Tuesday
## 2                   2                  384               407 Wednesday
## 3                   0                    0                 0  Thursday
## 4                   1                  412               442    Friday
## 5                   2                  340               367  Saturday
## 6                   1                  700               712    Sunday

glimpse(combined_data)

## Rows: 940
## Columns: 19
## $ id                         <dbl> 1503960366, 1503960366, 1503960366, 1503960…
## $ Date                       <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-0…
## $ total_steps                <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 130…
## $ total_distance             <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9…
## $ tracker_distance           <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9…
## $ logged_activities_distance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_distance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3…
## $ moderately_active_distance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1…
## $ light_active_distance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5…
## $ sedentary_active_distance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_minutes        <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66,…
## $ fairly_active_minutes      <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, …
## $ lightly_active_minutes     <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205…
## $ sedentary_minutes          <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 8…
## $ calories                   <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2…
## $ total_sleep_records        <dbl> 1, 2, 0, 1, 2, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1…
## $ total_minutes_asleep       <dbl> 327, 384, 0, 412, 340, 700, 0, 304, 360, 32…
## $ total_time_in_bed          <dbl> 346, 407, 0, 442, 367, 712, 0, 320, 377, 36…
## $ weekday                    <ord> Tuesday, Wednesday, Thursday, Friday, Satur…

Yes, the ‘weekday’ column is added as the last column in the data frame.

Now we’ll take a look at the summary statistics of the final combined_data data frame

combined_data %>% 
  select(total_steps, total_distance, very_active_distance,
         moderately_active_distance,
         light_active_distance, sedentary_active_distance, 
         sedentary_minutes,very_active_minutes, fairly_active_minutes, 
         lightly_active_minutes,calories) %>% 
  summary()

##   total_steps    total_distance   very_active_distance
##  Min.   :    0   Min.   : 0.000   Min.   : 0.000      
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 0.000      
##  Median : 7406   Median : 5.245   Median : 0.210      
##  Mean   : 7638   Mean   : 5.490   Mean   : 1.503      
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.: 2.053      
##  Max.   :36019   Max.   :28.030   Max.   :21.920      
##  moderately_active_distance light_active_distance sedentary_active_distance
##  Min.   :0.0000             Min.   : 0.000        Min.   :0.000000         
##  1st Qu.:0.0000             1st Qu.: 1.945        1st Qu.:0.000000         
##  Median :0.2400             Median : 3.365        Median :0.000000         
##  Mean   :0.5675             Mean   : 3.341        Mean   :0.001606         
##  3rd Qu.:0.8000             3rd Qu.: 4.782        3rd Qu.:0.000000         
##  Max.   :6.4800             Max.   :10.710        Max.   :0.110000         
##  sedentary_minutes very_active_minutes fairly_active_minutes
##  Min.   :   0.0    Min.   :  0.00      Min.   :  0.00       
##  1st Qu.: 729.8    1st Qu.:  0.00      1st Qu.:  0.00       
##  Median :1057.5    Median :  4.00      Median :  6.00       
##  Mean   : 991.2    Mean   : 21.16      Mean   : 13.56       
##  3rd Qu.:1229.5    3rd Qu.: 32.00      3rd Qu.: 19.00       
##  Max.   :1440.0    Max.   :210.00      Max.   :143.00       
##  lightly_active_minutes    calories   
##  Min.   :  0.0          Min.   :   0  
##  1st Qu.:127.0          1st Qu.:1828  
##  Median :199.0          Median :2134  
##  Mean   :192.8          Mean   :2304  
##  3rd Qu.:264.0          3rd Qu.:2793  
##  Max.   :518.0          Max.   :4900

Next, we’ll aggregate combined_data by weekday to see how the corresponding activities vary through the week.

activity_summary <- combined_data %>% 
  group_by(weekday) %>% 
  summarise(avg_daily_steps = mean(total_steps), 
            avg_sedentary_hours = (mean(sedentary_minutes)/60), 
            avg_daily_calories = mean(calories), 
            avg_weekday_active_hr = (mean(very_active_minutes) + 
                                       mean(fairly_active_minutes) +
                                       mean(lightly_active_minutes))/60, 
            avg_very_active_hr = mean(very_active_minutes)/60, 
            percent_activity_hr = (mean(very_active_minutes)/(mean(very_active_minutes) +
                                                                       mean(fairly_active_minutes) +
                                                                       mean(lightly_active_minutes)))*100)

colnames(activity_summary)

## [1] "weekday"               "avg_daily_steps"       "avg_sedentary_hours"  
## [4] "avg_daily_calories"    "avg_weekday_active_hr" "avg_very_active_hr"   
## [7] "percent_activity_hr"

head(activity_summary)

## # A tibble: 6 × 7
##   weekday   avg_daily_steps avg_sedentary_hours avg_da…¹ avg_w…² avg_v…³ perce…⁴
##   <ord>               <dbl>               <dbl>    <dbl>   <dbl>   <dbl>   <dbl>
## 1 Sunday              6933.                16.5    2263     3.47   0.333    9.58
## 2 Monday              7781.                17.1    2324.    3.82   0.385   10.1 
## 3 Tuesday             8125.                16.8    2356.    3.91   0.383    9.78
## 4 Wednesday           7559.                16.5    2303.    3.73   0.346    9.29
## 5 Thursday            7406.                16.0    2200.    3.61   0.323    8.95
## 6 Friday              7448.                16.7    2332.    3.94   0.334    8.48
## # … with abbreviated variable names ¹avg_daily_calories,
## #   ²avg_weekday_active_hr, ³avg_very_active_hr, ⁴percent_activity_hr

glimpse(activity_summary)

## Rows: 7
## Columns: 7
## $ weekday               <ord> Sunday, Monday, Tuesday, Wednesday, Thursday, Fr…
## $ avg_daily_steps       <dbl> 6933.231, 7780.867, 8125.007, 7559.373, 7405.837…
## $ avg_sedentary_hours   <dbl> 16.50427, 17.13236, 16.78936, 16.49133, 16.03322…
## $ avg_daily_calories    <dbl> 2263.000, 2324.208, 2356.013, 2302.620, 2199.571…
## $ avg_weekday_active_hr <dbl> 3.474793, 3.819444, 3.910526, 3.728889, 3.613152…
## $ avg_very_active_hr    <dbl> 0.3330579, 0.3851389, 0.3825658, 0.3463333, 0.32…
## $ percent_activity_hr   <dbl> 9.584968, 10.083636, 9.782974, 9.287843, 8.95255…
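As a quick check of how percent_activity_hr is derived: it is the very_active share of all active time, so it can be recomputed from two of the other columns. A sketch using the Sunday values printed above:

```r
# Sunday values copied from the glimpse() output above
avg_very_active_hr    <- 0.3330579
avg_weekday_active_hr <- 3.474793

# Share of active time that is very_active, in percent
percent_very_active <- avg_very_active_hr / avg_weekday_active_hr * 100
round(percent_very_active, 2)
```

The result should agree with the Sunday value of percent_activity_hr shown above, up to the rounding of the printed inputs.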

Next, we’ll examine how the average steps, calories burnt, and time spent on various activities (overall, very intense (very_active), and sedentary) vary from Sunday through Saturday.

### Average step counts on each day of the week.

ggplot(data = activity_summary, mapping = aes(x = weekday, y = avg_daily_steps, fill = weekday)) +
  geom_bar(stat = "identity") +
  labs(title = "Figure 3: Average Step Counts on Each Weekday") +
  theme(plot.title = element_text(face = "bold"))

Figure 3 Analysis

This plot shows that users are most active on Tuesdays and Saturdays, averaging more than 8,000 steps, which falls within the recommended 7,500 to 10,000 daily steps (the target depends on age and sex). Sundays are the least active day, averaging ~7,000 steps. One possible explanation is that some users spend Sundays attending church services and relaxing before the start of another work week.

### Calories burnt on each weekday through the week

ggplot(data = activity_summary, mapping = aes(x = weekday, y = avg_daily_calories, fill = weekday)) +
  geom_bar(stat = "identity") +
  labs(title = "Figure 4: Average Calories Burnt Each Weekday") +
  theme(plot.title = element_text(face = "bold"))

Figure 4 Analysis

This plot shows that users burn the most calories on Tuesdays and Saturdays, ~2,400, the same days on which they take more than 8,000 steps. However, from Wednesdays to Fridays the number of calories burnt does not quite track the step counts. This might indicate that the intensity of activity (very_active, fairly_active, and lightly_active), together with the total time users spent on each type of activity, plays a major role in calorie consumption. So we’ll examine other activity factors that might affect calorie consumption.
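The visual impression can be backed by a number. The sketch below computes the Pearson correlation between average steps and average calories using only the six weekdays printed by head(activity_summary) above. Saturday is left out because its values are not shown there, so the result is indicative only (weekday_toy and steps_calories_cor are illustrative names):

```r
# Rounded values copied from the head(activity_summary) output above
weekday_toy <- data.frame(
  weekday            = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday"),
  avg_daily_steps    = c(6933, 7781, 8125, 7559, 7406, 7448),
  avg_daily_calories = c(2263, 2324, 2356, 2303, 2200, 2332)
)

# Pearson correlation between the two weekday averages
steps_calories_cor <- cor(weekday_toy$avg_daily_steps, weekday_toy$avg_daily_calories)
steps_calories_cor
```

A positive value that falls clearly short of 1 would be consistent with step counts explaining only part of the weekday variation in calories.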

As discussed for Figure 1 above, one needs to spend more time in lightly_active activities to achieve the same step counts and calories burnt as very_active activities, which attain the same effect in a shorter time. In general, calories burnt trend upward with step counts, and each type of activity burns calories at a different rate. However, while one would see a similar correlation when plotting overall calories against total step counts, the points would be widely scattered because the data aggregate all three activity intensities. Therefore, we’ll next examine how the time users spend on overall activities varies through the week.

### Total activity hours (very_active + fairly_active + lightly_active) through the week

ggplot(data = activity_summary, mapping = aes(x = weekday, y = avg_weekday_active_hr, fill = weekday)) +
  geom_bar(stat = "identity") +
  labs(title = "Figure 5: Average Weekday Overall Activity Hours") +
  theme(plot.title = element_text(face = "bold"))

Figure 5 Analysis

This plot shows that the amount of time spent on overall activities correlates very closely with calories burnt (Figure 4) and somewhat with step counts (Figure 3) through the week. Since very intense activities (very_active) are expected to burn calories at higher rates, we’ll next see how the time spent on them changes through the week.

### Activity Hours users spent on very_active activities each weekday.

ggplot(data = activity_summary, mapping = aes(x = weekday, y = avg_very_active_hr, fill = weekday)) +
  geom_bar(stat = "identity") +
  labs(title = "Figure 6a: Very_active Activity Hours on Each Weekday") +
  theme(plot.title = element_text(face = "bold"))

###  Percent Activity Hours users spent on very_active activities each weekday.

ggplot(data = activity_summary, mapping = aes(x = weekday, y = percent_activity_hr, fill = weekday)) +
  geom_bar(stat = "identity") +
  labs(title = "Figure 6b: Percent Very_active Activity Hours") +
  theme(plot.title = element_text(face = "bold"))

Figures 6a and 6b Analysis

Figures 6a and 6b show a small decrease in the time spent on very intense activities on Sundays and from Wednesdays through Fridays, but the number of calories burnt does not show a matching decrease. This suggests that other factors might account for the discrepancy. Still, these plots correlate fairly well with Figure 3 (avg_daily_steps vs. weekday) and Figure 5 (avg_weekday_active_hr vs. weekday), and they follow a weekday trend similar to Figure 4 (avg_daily_calories vs. weekday), apart from the small decrease on Sundays and Wednesdays through Fridays.

However, it’s unclear how the tracker app measures intensities of activity and sedentary time. Does it consider only step-accruing activities? Figure 1 indicates that sedentary time is at its maximum of 24 and ~17 hours when the step count is 0 for the upper and lower clusters, respectively. When the step count is ~1,500, the sedentary time decreases to ~17 and ~8.5 hours for the upper and lower clusters.

If the intensity or sedentary measurements were based only on accrued steps, as Figure 1 seems to imply, the tracker might have missed stationary high-intensity activities such as weightlifting, resulting in more sedentary time as shown in Figure 7, even though less time was spent on overall and very_active activities. Since activities such as sitting or standing while doing chores, learning new material, or working at a desk or computer consume calories at different rates, Bellabeat will need to clarify how sedentary time and intensities of activity are measured. This is especially important since users average roughly 16 to 17 hours of sedentary time a day.

###  How much average sedentary time do users spend each weekday

ggplot(data = activity_summary, mapping = aes(x = weekday, y = avg_sedentary_hours, fill = weekday)) +
  geom_bar(stat = "identity") +
  labs(title = "Figure 7: Average Weekday Sedentary Hours") +
  theme(plot.title = element_text(face = "bold"))

Figure 7 Analysis

It’s striking that this plot trends closely with Figure 4 (calories burnt through the week), except on Saturdays, when sedentary time is slightly lower than the Friday-like level one would expect. As discussed above, if sedentary time means that no steps were accumulated, then intense activities such as weightlifting would be counted as sedentary even though they burn many calories without accumulating steps. Further, for participants who did not record sleep, sleep time appears to be counted as sedentary time. Also, if sedentary time decreases on Saturdays, does it mean that users have more time for step-accruing activities, e.g. running?
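One way to probe the sleep-versus-sedentary question is to add up the four activity-minute columns and compare the total with the 1,440 minutes in a day. The sketch below uses the first six rows printed by head(combined_data) above; the data frame name and the two derived columns are illustrative only:

```r
# First six rows, copied from the head(combined_data) output above
first_rows <- data.frame(
  very_active_minutes    = c(25, 21, 30, 29, 36, 38),
  fairly_active_minutes  = c(13, 19, 11, 34, 10, 20),
  lightly_active_minutes = c(328, 217, 181, 209, 221, 164),
  sedentary_minutes      = c(728, 776, 1218, 726, 773, 539),
  total_time_in_bed      = c(346, 407, 0, 442, 367, 712)
)

# Minutes covered by the four activity categories
first_rows$tracked_minutes <- with(first_rows,
  very_active_minutes + fairly_active_minutes +
    lightly_active_minutes + sedentary_minutes)

# Adding time in bed shows whether sleep is already part of the day's total
first_rows$with_time_in_bed <- first_rows$tracked_minutes + first_rows$total_time_in_bed
first_rows[, c("tracked_minutes", "total_time_in_bed", "with_time_in_bed")]
```

In the third row, which has no sleep record, the activity minutes alone already total 1,440, whereas the first, second and fourth rows reach 1,440 only once time in bed is added. This is consistent with sleep being folded into sedentary time when it is not logged separately, although six rows are far too few to settle the question.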

Next, we’ll examine the BMI data using the weight_log_info data frame, after the following clean-up:

‘fat’ and ‘is_manual_report’ columns: Most users did not register these values, so they will not be used in the analysis.
‘log_id’ column: It contains only log identifiers and no activity data, so it will be removed prior to the analysis.

In the absence of ‘fat’ information, the analysis will be based on BMI (BMI in kg/m² = weight in kg / (height in m)²), since BMI is a health screening tool, especially for potential weight problems.
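The formula can be sketched as a small function. Height is not among the columns kept for this analysis, so the example instead recovers the height implied by one weight and BMI pair from the data (52.6 kg, BMI 22.65); bmi_from() and implied_height_m are hypothetical names introduced for illustration:

```r
# BMI in kg/m^2 from weight in kg and height in m
bmi_from <- function(weight_kg, height_m) weight_kg / height_m^2

# Height implied by a known weight and BMI (rearranging the formula)
implied_height_m <- sqrt(52.6 / 22.65)

round(implied_height_m, 2)
bmi_from(52.6, implied_height_m)  # gives back 22.65, up to floating-point rounding
```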

After removing the unwanted columns stated above, we’ll split the data by BMI category and rename the bmi column accordingly (e.g. “bmi_normal_weight”).

weight_log_info_new <- weight_log_info[c("id", "Date", "weight_kg", "weight_pounds", "bmi")]

colnames(weight_log_info_new)

## [1] "id"            "Date"          "weight_kg"     "weight_pounds"
## [5] "bmi"

head(weight_log_info_new)

##           id       Date weight_kg weight_pounds   bmi
## 1 1503960366 2016-05-02      52.6      115.9631 22.65
## 2 1503960366 2016-05-03      52.6      115.9631 22.65
## 3 1927972279 2016-04-13     133.5      294.3171 47.54
## 4 2873212765 2016-04-21      56.7      125.0021 21.45
## 5 2873212765 2016-05-12      57.3      126.3249 21.69
## 6 4319703577 2016-04-17      72.4      159.6147 27.45

glimpse(weight_log_info_new)

## Rows: 67
## Columns: 5
## $ id            <dbl> 1503960366, 1503960366, 1927972279, 2873212765, 28732127…
## $ Date          <chr> "2016-05-02", "2016-05-03", "2016-04-13", "2016-04-21", …
## $ weight_kg     <dbl> 52.6, 52.6, 133.5, 56.7, 57.3, 72.4, 72.3, 69.7, 70.3, 6…
## $ weight_pounds <dbl> 115.9631, 115.9631, 294.3171, 125.0021, 126.3249, 159.61…
## $ bmi           <dbl> 22.65, 22.65, 47.54, 21.45, 21.69, 27.45, 27.38, 27.25, …

Next, we’ll create three data frames, one for each BMI category:

bmi_df1 (BMI=18.5-24.9, normal weight), bmi_df2 (BMI=25-29.9, overweight) and bmi_df3 (BMI≥30, obese). Since the weight_log_info dataset contains no BMI values below 18.5, all analyses will cover the normal weight, overweight and obese categories only.
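As an alternative sketch (not the approach taken in this analysis), the same categories can be assigned in one step with base R's cut(). The break points follow the ranges listed above, with each interval closed on the left so that, for example, a BMI of exactly 25 counts as overweight; the vector name and labels are illustrative:

```r
# A few bmi values taken from weight_log_info_new
bmi_values <- c(22.65, 47.54, 21.45, 27.45)

bmi_category <- cut(
  bmi_values,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("underweight", "normal_weight", "overweight", "obese"),
  right  = FALSE  # intervals are [lower, upper)
)
table(bmi_category)
```

A single category column like this keeps all rows in one data frame, which can be convenient for grouped summaries; separate data frames per category work equally well for the comparisons made here.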

Data frame for the normal_weight bmi

### Create the normal weight bmi data frame (upper bound bmi < 25, so values between 24.9 and 25 are not dropped)
bmi_df1 <- subset(weight_log_info_new, bmi >= 18.5 & bmi < 25)

head(bmi_df1)

##            id       Date weight_kg weight_pounds   bmi
## 1  1503960366 2016-05-02      52.6      115.9631 22.65
## 2  1503960366 2016-05-03      52.6      115.9631 22.65
## 4  2873212765 2016-04-21      56.7      125.0021 21.45
## 5  2873212765 2016-05-12      57.3      126.3249 21.69
## 14 6962181067 2016-04-12      62.5      137.7889 24.39
## 15 6962181067 2016-04-13      62.1      136.9071 24.24

glimpse(bmi_df1)

## Rows: 34
## Columns: 5
## $ id            <dbl> 1503960366, 1503960366, 2873212765, 2873212765, 69621810…
## $ Date          <chr> "2016-05-02", "2016-05-03", "2016-04-21", "2016-05-12", …
## $ weight_kg     <dbl> 52.6, 52.6, 56.7, 57.3, 62.5, 62.1, 61.7, 61.5, 62.0, 61…
## $ weight_pounds <dbl> 115.9631, 115.9631, 125.0021, 126.3249, 137.7889, 136.90…
## $ bmi           <dbl> 22.65, 22.65, 21.45, 21.69, 24.39, 24.24, 24.10, 24.00, …

### Rename the bmi column of bmi_df1 to bmi_normal_weight, following the CDC definition

bmi_df1 <- bmi_df1  %>%  
  rename(bmi_normal_weight = bmi)

head(bmi_df1)

##            id       Date weight_kg weight_pounds bmi_normal_weight
## 1  1503960366 2016-05-02      52.6      115.9631             22.65
## 2  1503960366 2016-05-03      52.6      115.9631             22.65
## 4  2873212765 2016-04-21      56.7      125.0021             21.45
## 5  2873212765 2016-05-12      57.3      126.3249             21.69
## 14 6962181067 2016-04-12      62.5      137.7889             24.39
## 15 6962181067 2016-04-13      62.1      136.9071             24.24

glimpse(bmi_df1)

## Rows: 34
## Columns: 5
## $ id                <dbl> 1503960366, 1503960366, 2873212765, 2873212765, 6962…
## $ Date              <chr> "2016-05-02", "2016-05-03", "2016-04-21", "2016-05-1…
## $ weight_kg         <dbl> 52.6, 52.6, 56.7, 57.3, 62.5, 62.1, 61.7, 61.5, 62.0…
## $ weight_pounds     <dbl> 115.9631, 115.9631, 125.0021, 126.3249, 137.7889, 13…
## $ bmi_normal_weight <dbl> 22.65, 22.65, 21.45, 21.69, 24.39, 24.24, 24.10, 24.…

### Data frame for the overweight bmi

bmi_df2 <- subset(weight_log_info_new, bmi >= 25 & bmi < 30)

head(bmi_df2)

##            id       Date weight_kg weight_pounds   bmi
## 6  4319703577 2016-04-17      72.4      159.6147 27.45
## 7  4319703577 2016-05-04      72.3      159.3942 27.38
## 8  4558609924 2016-04-18      69.7      153.6622 27.25
## 9  4558609924 2016-04-25      70.3      154.9850 27.46
## 10 4558609924 2016-05-01      69.9      154.1031 27.32
## 11 4558609924 2016-05-02      69.2      152.5599 27.04

glimpse(bmi_df2)

## Rows: 32
## Columns: 5
## $ id            <dbl> 4319703577, 4319703577, 4558609924, 4558609924, 45586099…
## $ Date          <chr> "2016-04-17", "2016-05-04", "2016-04-18", "2016-04-25", …
## $ weight_kg     <dbl> 72.4, 72.3, 69.7, 70.3, 69.9, 69.2, 69.1, 90.7, 85.8, 84…
## $ weight_pounds <dbl> 159.6147, 159.3942, 153.6622, 154.9850, 154.1031, 152.55…
## $ bmi           <dbl> 27.45, 27.38, 27.25, 27.46, 27.32, 27.04, 27.00, 28.00, …

### Rename the bmi column of bmi_df2 to bmi_overweight

bmi_df2 <- bmi_df2  %>%  
  rename(bmi_overweight = bmi)

head(bmi_df2)

##            id       Date weight_kg weight_pounds bmi_overweight
## 6  4319703577 2016-04-17      72.4      159.6147          27.45
## 7  4319703577 2016-05-04      72.3      159.3942          27.38
## 8  4558609924 2016-04-18      69.7      153.6622          27.25
## 9  4558609924 2016-04-25      70.3      154.9850          27.46
## 10 4558609924 2016-05-01      69.9      154.1031          27.32
## 11 4558609924 2016-05-02      69.2      152.5599          27.04

glimpse(bmi_df2)

## Rows: 32
## Columns: 5
## $ id             <dbl> 4319703577, 4319703577, 4558609924, 4558609924, 4558609…
## $ Date           <chr> "2016-04-17", "2016-05-04", "2016-04-18", "2016-04-25",…
## $ weight_kg      <dbl> 72.4, 72.3, 69.7, 70.3, 69.9, 69.2, 69.1, 90.7, 85.8, 8…
## $ weight_pounds  <dbl> 159.6147, 159.3942, 153.6622, 154.9850, 154.1031, 152.5…
## $ bmi_overweight <dbl> 27.45, 27.38, 27.25, 27.46, 27.32, 27.04, 27.00, 28.00,…

Finally, we’ll create the data frame for the obese category, bmi_df3, i.e. where bmi >= 30.0

bmi_df3 <- subset(weight_log_info_new, bmi>=30.0)
  
head(bmi_df3)

##           id       Date weight_kg weight_pounds   bmi
## 3 1927972279 2016-04-13     133.5      294.3171 47.54

glimpse(bmi_df3)

## Rows: 1
## Columns: 5
## $ id            <dbl> 1927972279
## $ Date          <chr> "2016-04-13"
## $ weight_kg     <dbl> 133.5
## $ weight_pounds <dbl> 294.3171
## $ bmi           <dbl> 47.54

### Rename the bmi column of bmi_df3 to bmi_obese

bmi_df3 <- bmi_df3 %>% 
  rename(bmi_obese=bmi)

### Check to see if the bmi column has been renamed
head(bmi_df3)

##           id       Date weight_kg weight_pounds bmi_obese
## 3 1927972279 2016-04-13     133.5      294.3171     47.54

glimpse(bmi_df3)

## Rows: 1
## Columns: 5
## $ id            <dbl> 1927972279
## $ Date          <chr> "2016-04-13"
## $ weight_kg     <dbl> 133.5
## $ weight_pounds <dbl> 294.3171
## $ bmi_obese     <dbl> 47.54
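The subsets above split weight_log_info_new into one data frame per BMI range. As a hedged alternative, the same labelling can be done in a single pass with base R's cut(). This is a minimal sketch, not the author's method: `toy_weight_log` is a hypothetical stand-in built from three rows shown in the outputs, and the cut-offs of 18.5, 25 and 30 are assumed to match the ranges used elsewhere in this analysis.

```r
# Hypothetical stand-in for weight_log_info_new; values copied from the outputs above
toy_weight_log <- data.frame(
  id  = c(1503960366, 4319703577, 1927972279),
  bmi = c(22.65, 27.45, 47.54)
)

# right = FALSE makes each interval closed on the left:
# [18.5, 25), [25, 30), [30, Inf), so no BMI value falls between categories
toy_weight_log$bmi_type <- cut(
  toy_weight_log$bmi,
  breaks = c(18.5, 25, 30, Inf),
  labels = c("normal_weight_bmi", "overweight_bmi", "obese_bmi"),
  right  = FALSE
)

# split() returns one data frame per category, analogous to bmi_df1, bmi_df2 and bmi_df3
bmi_groups <- split(toy_weight_log, toy_weight_log$bmi_type)
print(table(toy_weight_log$bmi_type))
```

Keeping a single labelled column, rather than three separately renamed bmi columns, also makes it easier to compare the groups side by side.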




```r
### Examine some summary statistics of bmi_df1, bmi_df2 and bmi_df3

bmi_df1 %>% 
  select(weight_kg, weight_pounds, bmi_normal_weight) %>%
summary()

##    weight_kg     weight_pounds   bmi_normal_weight
##  Min.   :52.60   Min.   :116.0   Min.   :21.45    
##  1st Qu.:61.20   1st Qu.:134.9   1st Qu.:23.89    
##  Median :61.40   Median :135.4   Median :23.96    
##  Mean   :60.76   Mean   :134.0   Mean   :23.80    
##  3rd Qu.:61.70   3rd Qu.:136.0   3rd Qu.:24.10    
##  Max.   :62.50   Max.   :137.8   Max.   :24.39

bmi_df2 %>% 
  select(weight_kg, weight_pounds,bmi_overweight) %>%
summary()

##    weight_kg     weight_pounds   bmi_overweight 
##  Min.   :69.10   Min.   :152.3   Min.   :25.14  
##  1st Qu.:84.30   1st Qu.:185.8   1st Qu.:25.43  
##  Median :85.05   Median :187.5   Median :25.56  
##  Mean   :82.10   Mean   :181.0   Mean   :25.96  
##  3rd Qu.:85.42   3rd Qu.:188.3   3rd Qu.:26.01  
##  Max.   :90.70   Max.   :200.0   Max.   :28.00

bmi_df3 %>% 
  select(weight_kg, weight_pounds, bmi_obese) %>%
summary()

##    weight_kg     weight_pounds     bmi_obese    
##  Min.   :133.5   Min.   :294.3   Min.   :47.54  
##  1st Qu.:133.5   1st Qu.:294.3   1st Qu.:47.54  
##  Median :133.5   Median :294.3   Median :47.54  
##  Mean   :133.5   Mean   :294.3   Mean   :47.54  
##  3rd Qu.:133.5   3rd Qu.:294.3   3rd Qu.:47.54  
##  Max.   :133.5   Max.   :294.3   Max.   :47.54

### Merge bmi_df1, bmi_df2 and bmi_df3 with the daily_activity data frame on id and Date

bmi_daily1 <-  merge(daily_activity, bmi_df1, by = c ("id", "Date"))

colnames(bmi_daily1)

##  [1] "id"                         "Date"                      
##  [3] "total_steps"                "total_distance"            
##  [5] "tracker_distance"           "logged_activities_distance"
##  [7] "very_active_distance"       "moderately_active_distance"
##  [9] "light_active_distance"      "sedentary_active_distance" 
## [11] "very_active_minutes"        "fairly_active_minutes"     
## [13] "lightly_active_minutes"     "sedentary_minutes"         
## [15] "calories"                   "weight_kg"                 
## [17] "weight_pounds"              "bmi_normal_weight"

head(bmi_daily1)

##           id       Date total_steps total_distance tracker_distance
## 1 1503960366 2016-05-02       14727           9.71             9.71
## 2 1503960366 2016-05-03       15103           9.66             9.66
## 3 2873212765 2016-04-21        8859           5.98             5.98
## 4 2873212765 2016-05-12        7566           5.11             5.11
## 5 6962181067 2016-04-12       10199           6.74             6.74
## 6 6962181067 2016-04-13        5652           3.74             3.74
##   logged_activities_distance very_active_distance moderately_active_distance
## 1                          0                 3.21                       0.57
## 2                          0                 3.73                       1.05
## 3                          0                 0.13                       0.37
## 4                          0                 0.00                       0.00
## 5                          0                 3.40                       0.83
## 6                          0                 0.57                       1.21
##   light_active_distance sedentary_active_distance very_active_minutes
## 1                  5.92                      0.00                  41
## 2                  4.88                      0.00                  50
## 3                  5.47                      0.01                   2
## 4                  5.11                      0.00                   0
## 5                  2.51                      0.00                  50
## 6                  1.96                      0.00                   8
##   fairly_active_minutes lightly_active_minutes sedentary_minutes calories
## 1                    15                    277               798     2004
## 2                    24                    254               816     1990
## 3                    10                    371              1057     1970
## 4                     0                    268               720     1431
## 5                    14                    189               796     1994
## 6                    24                    142               548     1718
##   weight_kg weight_pounds bmi_normal_weight
## 1      52.6      115.9631             22.65
## 2      52.6      115.9631             22.65
## 3      56.7      125.0021             21.45
## 4      57.3      126.3249             21.69
## 5      62.5      137.7889             24.39
## 6      62.1      136.9071             24.24

glimpse(bmi_daily1)

## Rows: 34
## Columns: 18
## $ id                         <dbl> 1503960366, 1503960366, 2873212765, 2873212…
## $ Date                       <date> 2016-05-02, 2016-05-03, 2016-04-21, 2016-0…
## $ total_steps                <int> 14727, 15103, 8859, 7566, 10199, 5652, 1551…
## $ total_distance             <dbl> 9.71, 9.66, 5.98, 5.11, 6.74, 3.74, 1.03, 3…
## $ tracker_distance           <dbl> 9.71, 9.66, 5.98, 5.11, 6.74, 3.74, 1.03, 3…
## $ logged_activities_distance <dbl> 0.000000, 0.000000, 0.000000, 0.000000, 0.0…
## $ very_active_distance       <dbl> 3.21, 3.73, 0.13, 0.00, 3.40, 0.57, 0.00, 0…
## $ moderately_active_distance <dbl> 0.57, 1.05, 0.37, 0.00, 0.83, 1.21, 0.00, 0…
## $ light_active_distance      <dbl> 5.92, 4.88, 5.47, 5.11, 2.51, 1.96, 1.03, 3…
## $ sedentary_active_distance  <dbl> 0.00, 0.00, 0.01, 0.00, 0.00, 0.00, 0.00, 0…
## $ very_active_minutes        <int> 41, 50, 2, 0, 50, 8, 0, 0, 50, 5, 13, 35, 4…
## $ fairly_active_minutes      <int> 15, 24, 10, 0, 14, 24, 0, 0, 3, 13, 42, 41,…
## $ lightly_active_minutes     <int> 277, 254, 371, 268, 189, 142, 86, 217, 280,…
## $ sedentary_minutes          <int> 798, 816, 1057, 720, 796, 548, 862, 837, 74…
## $ calories                   <int> 2004, 1990, 1970, 1431, 1994, 1718, 1466, 1…
## $ weight_kg                  <dbl> 52.6, 52.6, 56.7, 57.3, 62.5, 62.1, 61.7, 6…
## $ weight_pounds              <dbl> 115.9631, 115.9631, 125.0021, 126.3249, 137…
## $ bmi_normal_weight          <dbl> 22.65, 22.65, 21.45, 21.69, 24.39, 24.24, 2…

bmi_daily2 <-  merge(daily_activity, bmi_df2, by = c ("id", "Date"))

colnames(bmi_daily2)

##  [1] "id"                         "Date"                      
##  [3] "total_steps"                "total_distance"            
##  [5] "tracker_distance"           "logged_activities_distance"
##  [7] "very_active_distance"       "moderately_active_distance"
##  [9] "light_active_distance"      "sedentary_active_distance" 
## [11] "very_active_minutes"        "fairly_active_minutes"     
## [13] "lightly_active_minutes"     "sedentary_minutes"         
## [15] "calories"                   "weight_kg"                 
## [17] "weight_pounds"              "bmi_overweight"

head(bmi_daily2)

##           id       Date total_steps total_distance tracker_distance
## 1 4319703577 2016-04-17          29           0.02             0.02
## 2 4319703577 2016-05-04       10429           7.02             7.02
## 3 4558609924 2016-04-18        8940           5.91             5.91
## 4 4558609924 2016-04-25        8095           5.35             5.35
## 5 4558609924 2016-05-01        3428           2.27             2.27
## 6 4558609924 2016-05-02        7891           5.22             5.22
##   logged_activities_distance very_active_distance moderately_active_distance
## 1                          0                 0.00                       0.00
## 2                          0                 0.59                       0.58
## 3                          0                 0.98                       0.93
## 4                          0                 0.59                       0.25
## 5                          0                 0.00                       0.00
## 6                          0                 0.00                       0.00
##   light_active_distance sedentary_active_distance very_active_minutes
## 1                  0.02                         0                   0
## 2                  5.85                         0                   8
## 3                  4.00                         0                  14
## 4                  4.51                         0                  18
## 5                  2.27                         0                   0
## 6                  5.22                         0                   0
##   fairly_active_minutes lightly_active_minutes sedentary_minutes calories
## 1                     0                      3              1363     1464
## 2                    13                    313              1106     2282
## 3                    15                    331              1080     2116
## 4                    10                    340               993     2225
## 5                     0                    190              1121     1692
## 6                     0                    383              1057     2066
##   weight_kg weight_pounds bmi_overweight
## 1      72.4      159.6147          27.45
## 2      72.3      159.3942          27.38
## 3      69.7      153.6622          27.25
## 4      70.3      154.9850          27.46
## 5      69.9      154.1031          27.32
## 6      69.2      152.5599          27.04

glimpse(bmi_daily2)

## Rows: 32
## Columns: 18
## $ id                         <dbl> 4319703577, 4319703577, 4558609924, 4558609…
## $ Date                       <date> 2016-04-17, 2016-05-04, 2016-04-18, 2016-0…
## $ total_steps                <int> 29, 10429, 8940, 8095, 3428, 7891, 11451, 1…
## $ total_distance             <dbl> 0.02, 7.02, 5.91, 5.35, 2.27, 5.22, 7.57, 9…
## $ tracker_distance           <dbl> 0.02, 7.02, 5.91, 5.35, 2.27, 5.22, 7.57, 9…
## $ logged_activities_distance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_distance       <dbl> 0.00, 0.59, 0.98, 0.59, 0.00, 0.00, 0.43, 5…
## $ moderately_active_distance <dbl> 0.00, 0.58, 0.93, 0.25, 0.00, 0.00, 1.62, 0…
## $ light_active_distance      <dbl> 0.02, 5.85, 4.00, 4.51, 2.27, 5.22, 5.52, 2…
## $ sedentary_active_distance  <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0…
## $ very_active_minutes        <int> 0, 8, 14, 18, 0, 0, 6, 200, 85, 108, 68, 94…
## $ fairly_active_minutes      <int> 0, 13, 15, 10, 0, 0, 30, 37, 7, 18, 13, 29,…
## $ lightly_active_minutes     <int> 3, 313, 331, 340, 190, 383, 339, 159, 312, …
## $ sedentary_minutes          <int> 1363, 1106, 1080, 993, 1121, 1057, 1065, 52…
## $ calories                   <int> 1464, 2282, 2116, 2225, 1692, 2066, 2223, 4…
## $ weight_kg                  <dbl> 72.4, 72.3, 69.7, 70.3, 69.9, 69.2, 69.1, 9…
## $ weight_pounds              <dbl> 159.6147, 159.3942, 153.6622, 154.9850, 154…
## $ bmi_overweight             <dbl> 27.45, 27.38, 27.25, 27.46, 27.32, 27.04, 2…

bmi_daily3 <-  merge(daily_activity, bmi_df3, by = c ("id", "Date"))

colnames(bmi_daily3)

##  [1] "id"                         "Date"                      
##  [3] "total_steps"                "total_distance"            
##  [5] "tracker_distance"           "logged_activities_distance"
##  [7] "very_active_distance"       "moderately_active_distance"
##  [9] "light_active_distance"      "sedentary_active_distance" 
## [11] "very_active_minutes"        "fairly_active_minutes"     
## [13] "lightly_active_minutes"     "sedentary_minutes"         
## [15] "calories"                   "weight_kg"                 
## [17] "weight_pounds"              "bmi_obese"

head(bmi_daily3)

##           id       Date total_steps total_distance tracker_distance
## 1 1927972279 2016-04-13         356           0.25             0.25
##   logged_activities_distance very_active_distance moderately_active_distance
## 1                          0                    0                          0
##   light_active_distance sedentary_active_distance very_active_minutes
## 1                  0.25                         0                   0
##   fairly_active_minutes lightly_active_minutes sedentary_minutes calories
## 1                     0                     32               986     2151
##   weight_kg weight_pounds bmi_obese
## 1     133.5      294.3171     47.54

glimpse(bmi_daily3)

## Rows: 1
## Columns: 18
## $ id                         <dbl> 1927972279
## $ Date                       <date> 2016-04-13
## $ total_steps                <int> 356
## $ total_distance             <dbl> 0.25
## $ tracker_distance           <dbl> 0.25
## $ logged_activities_distance <dbl> 0
## $ very_active_distance       <dbl> 0
## $ moderately_active_distance <dbl> 0
## $ light_active_distance      <dbl> 0.25
## $ sedentary_active_distance  <dbl> 0
## $ very_active_minutes        <int> 0
## $ fairly_active_minutes      <int> 0
## $ lightly_active_minutes     <int> 32
## $ sedentary_minutes          <int> 986
## $ calories                   <int> 2151
## $ weight_kg                  <dbl> 133.5
## $ weight_pounds              <dbl> 294.3171
## $ bmi_obese                  <dbl> 47.54
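The glimpse() outputs show Date stored as <chr> in the bmi_df data frames and as <date> in the merged results. The sketch below shows one cautious way to make the key types explicit before merging. It is an illustration under assumptions: `toy_activity` and `toy_weight` are hypothetical frames holding a few rows copied from the outputs above.

```r
# Hypothetical stand-in for daily_activity (Date already of class Date)
toy_activity <- data.frame(
  id          = c(1503960366, 1503960366, 2873212765),
  Date        = as.Date(c("2016-05-02", "2016-05-03", "2016-04-21")),
  total_steps = c(14727L, 15103L, 8859L)
)

# Hypothetical stand-in for a bmi_df frame (Date still stored as character)
toy_weight <- data.frame(
  id        = c(1503960366, 2873212765),
  Date      = c("2016-05-02", "2016-04-21"),
  weight_kg = c(52.6, 56.7),
  stringsAsFactors = FALSE
)

# Convert the character key so both frames use the same class for Date
toy_weight$Date <- as.Date(toy_weight$Date, format = "%Y-%m-%d")

# Inner join on both keys: only id/Date pairs present in both frames survive
toy_merged <- merge(toy_activity, toy_weight, by = c("id", "Date"))
print(toy_merged)
```

Converting first removes any reliance on how merge() compares keys of different classes.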

Check how many unique ids there are in each of bmi_daily1, bmi_daily2, and bmi_daily3

n_distinct(bmi_daily1$id)

## [1] 3

n_distinct(bmi_daily2$id)

## [1] 4

n_distinct(bmi_daily3$id)

## [1] 1

### Take a look at some summary statistics of bmi_daily1, bmi_daily2 and bmi_daily3

bmi_daily1 %>% 
  select(bmi_normal_weight, weight_kg, weight_pounds, total_steps, total_distance, very_active_minutes, fairly_active_minutes, lightly_active_minutes, sedentary_minutes, calories) %>%
  summary()

##  bmi_normal_weight   weight_kg     weight_pounds    total_steps   
##  Min.   :21.45     Min.   :52.60   Min.   :116.0   Min.   : 1551  
##  1st Qu.:23.89     1st Qu.:61.20   1st Qu.:134.9   1st Qu.: 6745  
##  Median :23.96     Median :61.40   Median :135.4   Median :10422  
##  Mean   :23.80     Mean   :60.76   Mean   :134.0   Mean   : 9984  
##  3rd Qu.:24.10     3rd Qu.:61.70   3rd Qu.:136.0   3rd Qu.:12556  
##  Max.   :24.39     Max.   :62.50   Max.   :137.8   Max.   :20031  
##  total_distance   very_active_minutes fairly_active_minutes
##  Min.   : 1.030   Min.   : 0.00       Min.   : 0.00        
##  1st Qu.: 4.455   1st Qu.: 0.50       1st Qu.: 4.75        
##  Median : 6.890   Median :16.50       Median :15.00        
##  Mean   : 6.698   Mean   :22.47       Mean   :18.12        
##  3rd Qu.: 8.675   3rd Qu.:40.25       3rd Qu.:32.50        
##  Max.   :13.240   Max.   :62.00       Max.   :42.00        
##  lightly_active_minutes sedentary_minutes    calories   
##  Min.   : 86.0          Min.   : 127.0    Min.   : 928  
##  1st Qu.:214.2          1st Qu.: 637.5    1st Qu.:1851  
##  Median :267.5          Median : 693.0    Median :2030  
##  Mean   :251.1          Mean   : 683.6    Mean   :1965  
##  3rd Qu.:293.2          3rd Qu.: 728.8    3rd Qu.:2148  
##  Max.   :371.0          Max.   :1057.0    Max.   :2571

bmi_daily2 %>% 
  select(bmi_overweight, weight_kg, weight_pounds, total_steps, total_distance, very_active_minutes, fairly_active_minutes, lightly_active_minutes, sedentary_minutes, calories) %>%
  summary()

##  bmi_overweight    weight_kg     weight_pounds    total_steps   
##  Min.   :25.14   Min.   :69.10   Min.   :152.3   Min.   :   29  
##  1st Qu.:25.43   1st Qu.:84.30   1st Qu.:185.8   1st Qu.:10622  
##  Median :25.56   Median :85.05   Median :187.5   Median :12608  
##  Mean   :25.96   Mean   :82.10   Mean   :181.0   Mean   :14720  
##  3rd Qu.:26.01   3rd Qu.:85.42   3rd Qu.:188.3   3rd Qu.:20018  
##  Max.   :28.00   Max.   :90.70   Max.   :200.0   Max.   :29326  
##  total_distance  very_active_minutes fairly_active_minutes
##  Min.   : 0.02   Min.   :  0.00      Min.   : 0.00        
##  1st Qu.: 7.42   1st Qu.: 18.00      1st Qu.: 4.00        
##  Median : 8.94   Median : 65.00      Median : 8.00        
##  Mean   :12.16   Mean   : 58.72      Mean   :10.66        
##  3rd Qu.:18.14   3rd Qu.: 89.25      3rd Qu.:13.50        
##  Max.   :26.72   Max.   :200.00      Max.   :37.00        
##  lightly_active_minutes sedentary_minutes    calories   
##  Min.   :  3.0          Min.   : 525      Min.   :1464  
##  1st Qu.:212.8          1st Qu.:1064      1st Qu.:2594  
##  Median :226.5          Median :1112      Median :3466  
##  Mean   :241.3          Mean   :1088      Mean   :3173  
##  3rd Qu.:298.5          3rd Qu.:1150      3rd Qu.:3804  
##  Max.   :429.0          Max.   :1363      Max.   :4552

bmi_daily3 %>% 
  select(bmi_obese, weight_kg, weight_pounds, total_steps, total_distance, very_active_minutes, fairly_active_minutes, lightly_active_minutes, sedentary_minutes, calories) %>%
  summary()

##    bmi_obese       weight_kg     weight_pounds    total_steps  total_distance
##  Min.   :47.54   Min.   :133.5   Min.   :294.3   Min.   :356   Min.   :0.25  
##  1st Qu.:47.54   1st Qu.:133.5   1st Qu.:294.3   1st Qu.:356   1st Qu.:0.25  
##  Median :47.54   Median :133.5   Median :294.3   Median :356   Median :0.25  
##  Mean   :47.54   Mean   :133.5   Mean   :294.3   Mean   :356   Mean   :0.25  
##  3rd Qu.:47.54   3rd Qu.:133.5   3rd Qu.:294.3   3rd Qu.:356   3rd Qu.:0.25  
##  Max.   :47.54   Max.   :133.5   Max.   :294.3   Max.   :356   Max.   :0.25  
##  very_active_minutes fairly_active_minutes lightly_active_minutes
##  Min.   :0           Min.   :0             Min.   :32            
##  1st Qu.:0           1st Qu.:0             1st Qu.:32            
##  Median :0           Median :0             Median :32            
##  Mean   :0           Mean   :0             Mean   :32            
##  3rd Qu.:0           3rd Qu.:0             3rd Qu.:32            
##  Max.   :0           Max.   :0             Max.   :32            
##  sedentary_minutes    calories   
##  Min.   :986       Min.   :2151  
##  1st Qu.:986       1st Qu.:2151  
##  Median :986       Median :2151  
##  Mean   :986       Mean   :2151  
##  3rd Qu.:986       3rd Qu.:2151  
##  Max.   :986       Max.   :2151

Next, we’ll merge the weight_log_info_new data frame with the daily_activity data frame

daily_bmi_activity  <- merge(daily_activity, weight_log_info_new, by=c ('id', 'Date'), all = TRUE) %>% 
  drop_na() %>% 
  select(-tracker_distance)

options(repr.plot.width=30)

head(daily_bmi_activity)

##           id       Date total_steps total_distance logged_activities_distance
## 1 1503960366 2016-05-02       14727           9.71                          0
## 2 1503960366 2016-05-03       15103           9.66                          0
## 3 1927972279 2016-04-13         356           0.25                          0
## 4 2873212765 2016-04-21        8859           5.98                          0
## 5 2873212765 2016-05-12        7566           5.11                          0
## 6 4319703577 2016-04-17          29           0.02                          0
##   very_active_distance moderately_active_distance light_active_distance
## 1                 3.21                       0.57                  5.92
## 2                 3.73                       1.05                  4.88
## 3                 0.00                       0.00                  0.25
## 4                 0.13                       0.37                  5.47
## 5                 0.00                       0.00                  5.11
## 6                 0.00                       0.00                  0.02
##   sedentary_active_distance very_active_minutes fairly_active_minutes
## 1                      0.00                  41                    15
## 2                      0.00                  50                    24
## 3                      0.00                   0                     0
## 4                      0.01                   2                    10
## 5                      0.00                   0                     0
## 6                      0.00                   0                     0
##   lightly_active_minutes sedentary_minutes calories weight_kg weight_pounds
## 1                    277               798     2004      52.6      115.9631
## 2                    254               816     1990      52.6      115.9631
## 3                     32               986     2151     133.5      294.3171
## 4                    371              1057     1970      56.7      125.0021
## 5                    268               720     1431      57.3      126.3249
## 6                      3              1363     1464      72.4      159.6147
##     bmi
## 1 22.65
## 2 22.65
## 3 47.54
## 4 21.45
## 5 21.69
## 6 27.45

n_distinct(daily_bmi_activity$id)

## [1] 8

### Check to see if all the NAs are removed

sum(is.na(daily_bmi_activity))

## [1] 0
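The merge above uses all = TRUE (a full outer join) and then drops every row containing an NA. When neither input has missing values of its own, that should leave the same rows as a plain inner join. The sketch below illustrates the equivalence on hypothetical toy frames: the rows are copied from the outputs above, and `toy_weight` deliberately omits one day so that the outer join produces an NA.

```r
# Hypothetical stand-in for daily_activity
toy_activity <- data.frame(
  id       = c(1503960366, 1503960366, 1927972279),
  Date     = as.Date(c("2016-05-02", "2016-05-03", "2016-04-13")),
  calories = c(2004L, 1990L, 2151L)
)

# Hypothetical stand-in for weight_log_info_new; 2016-05-03 is left out on purpose
toy_weight <- data.frame(
  id   = c(1503960366, 1927972279),
  Date = as.Date(c("2016-05-02", "2016-04-13")),
  bmi  = c(22.65, 47.54)
)

# Full outer join, then keep only complete rows (base R analogue of drop_na())
outer_then_drop <- merge(toy_activity, toy_weight, by = c("id", "Date"), all = TRUE)
outer_then_drop <- outer_then_drop[complete.cases(outer_then_drop), ]

# Plain inner join (merge's default)
inner <- merge(toy_activity, toy_weight, by = c("id", "Date"))

print(nrow(outer_then_drop) == nrow(inner))
```

One caveat: dropping NAs after the join also removes rows that were already incomplete in either input, which an inner join alone would keep.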

Take a look at the summary statistics of the daily_bmi_activity data frame

summary(daily_bmi_activity)

##        id                 Date             total_steps    total_distance  
##  Min.   :1.504e+09   Min.   :2016-04-12   Min.   :   29   Min.   : 0.020  
##  1st Qu.:6.962e+09   1st Qu.:2016-04-19   1st Qu.: 8477   1st Qu.: 5.945  
##  Median :6.962e+09   Median :2016-04-27   Median :11101   Median : 8.110  
##  Mean   :7.009e+09   Mean   :2016-04-26   Mean   :12102   Mean   : 9.211  
##  3rd Qu.:8.878e+09   3rd Qu.:2016-05-04   3rd Qu.:14996   3rd Qu.: 9.710  
##  Max.   :8.878e+09   Max.   :2016-05-12   Max.   :29326   Max.   :26.720  
##  logged_activities_distance very_active_distance moderately_active_distance
##  Min.   :0.0000             Min.   : 0.000       Min.   :0.000             
##  1st Qu.:0.0000             1st Qu.: 0.450       1st Qu.:0.115             
##  Median :0.0000             Median : 1.770       Median :0.380             
##  Mean   :0.1498             Mean   : 3.758       Mean   :0.651             
##  3rd Qu.:0.0000             3rd Qu.: 4.095       3rd Qu.:0.990             
##  Max.   :4.0817             Max.   :21.660       Max.   :2.390             
##  light_active_distance sedentary_active_distance very_active_minutes
##  Min.   : 0.020        Min.   :0.000000          Min.   :  0.00     
##  1st Qu.: 3.725        1st Qu.:0.000000          1st Qu.:  7.50     
##  Median : 4.890        Median :0.000000          Median : 29.00     
##  Mean   : 4.782        Mean   :0.004776          Mean   : 39.45     
##  3rd Qu.: 5.870        3rd Qu.:0.000000          3rd Qu.: 63.00     
##  Max.   :10.710        Max.   :0.110000          Max.   :200.00     
##  fairly_active_minutes lightly_active_minutes sedentary_minutes    calories   
##  Min.   : 0.00         Min.   :  3.0          Min.   : 127.0    Min.   : 928  
##  1st Qu.: 4.00         1st Qu.:212.5          1st Qu.: 686.0    1st Qu.:1998  
##  Median :12.00         Median :235.0          Median : 837.0    Median :2174  
##  Mean   :14.28         Mean   :243.1          Mean   : 881.4    Mean   :2545  
##  3rd Qu.:21.50         3rd Qu.:296.0          3rd Qu.:1105.0    3rd Qu.:3258  
##  Max.   :42.00         Max.   :429.0          Max.   :1363.0    Max.   :4552  
##    weight_kg      weight_pounds        bmi       
##  Min.   : 52.60   Min.   :116.0   Min.   :21.45  
##  1st Qu.: 61.40   1st Qu.:135.4   1st Qu.:23.96  
##  Median : 62.50   Median :137.8   Median :24.39  
##  Mean   : 72.04   Mean   :158.8   Mean   :25.19  
##  3rd Qu.: 85.05   3rd Qu.:187.5   3rd Qu.:25.56  
##  Max.   :133.50   Max.   :294.3   Max.   :47.54

Use mutate() to add a column of total daily active hours to the daily_bmi_activity data frame

daily_bmi_activity <- daily_bmi_activity %>% 
  mutate(total_daily_active_hours=(very_active_minutes + fairly_active_minutes + lightly_active_minutes)/60)

colnames(daily_bmi_activity)

##  [1] "id"                         "Date"                      
##  [3] "total_steps"                "total_distance"            
##  [5] "logged_activities_distance" "very_active_distance"      
##  [7] "moderately_active_distance" "light_active_distance"     
##  [9] "sedentary_active_distance"  "very_active_minutes"       
## [11] "fairly_active_minutes"      "lightly_active_minutes"    
## [13] "sedentary_minutes"          "calories"                  
## [15] "weight_kg"                  "weight_pounds"             
## [17] "bmi"                        "total_daily_active_hours"

head(daily_bmi_activity)

##           id       Date total_steps total_distance logged_activities_distance
## 1 1503960366 2016-05-02       14727           9.71                          0
## 2 1503960366 2016-05-03       15103           9.66                          0
## 3 1927972279 2016-04-13         356           0.25                          0
## 4 2873212765 2016-04-21        8859           5.98                          0
## 5 2873212765 2016-05-12        7566           5.11                          0
## 6 4319703577 2016-04-17          29           0.02                          0
##   very_active_distance moderately_active_distance light_active_distance
## 1                 3.21                       0.57                  5.92
## 2                 3.73                       1.05                  4.88
## 3                 0.00                       0.00                  0.25
## 4                 0.13                       0.37                  5.47
## 5                 0.00                       0.00                  5.11
## 6                 0.00                       0.00                  0.02
##   sedentary_active_distance very_active_minutes fairly_active_minutes
## 1                      0.00                  41                    15
## 2                      0.00                  50                    24
## 3                      0.00                   0                     0
## 4                      0.01                   2                    10
## 5                      0.00                   0                     0
## 6                      0.00                   0                     0
##   lightly_active_minutes sedentary_minutes calories weight_kg weight_pounds
## 1                    277               798     2004      52.6      115.9631
## 2                    254               816     1990      52.6      115.9631
## 3                     32               986     2151     133.5      294.3171
## 4                    371              1057     1970      56.7      125.0021
## 5                    268               720     1431      57.3      126.3249
## 6                      3              1363     1464      72.4      159.6147
##     bmi total_daily_active_hours
## 1 22.65                5.5500000
## 2 22.65                5.4666667
## 3 47.54                0.5333333
## 4 21.45                6.3833333
## 5 21.69                4.4666667
## 6 27.45                0.0500000

glimpse(daily_bmi_activity)

## Rows: 67
## Columns: 18
## $ id                         <dbl> 1503960366, 1503960366, 1927972279, 2873212…
## $ Date                       <date> 2016-05-02, 2016-05-03, 2016-04-13, 2016-0…
## $ total_steps                <int> 14727, 15103, 356, 8859, 7566, 29, 10429, 8…
## $ total_distance             <dbl> 9.71, 9.66, 0.25, 5.98, 5.11, 0.02, 7.02, 5…
## $ logged_activities_distance <dbl> 0.000000, 0.000000, 0.000000, 0.000000, 0.0…
## $ very_active_distance       <dbl> 3.21, 3.73, 0.00, 0.13, 0.00, 0.00, 0.59, 0…
## $ moderately_active_distance <dbl> 0.57, 1.05, 0.00, 0.37, 0.00, 0.00, 0.58, 0…
## $ light_active_distance      <dbl> 5.92, 4.88, 0.25, 5.47, 5.11, 0.02, 5.85, 4…
## $ sedentary_active_distance  <dbl> 0.00, 0.00, 0.00, 0.01, 0.00, 0.00, 0.00, 0…
## $ very_active_minutes        <int> 41, 50, 0, 2, 0, 0, 8, 14, 18, 0, 0, 6, 200…
## $ fairly_active_minutes      <int> 15, 24, 0, 10, 0, 0, 13, 15, 10, 0, 0, 30, …
## $ lightly_active_minutes     <int> 277, 254, 32, 371, 268, 3, 313, 331, 340, 1…
## $ sedentary_minutes          <int> 798, 816, 986, 1057, 720, 1363, 1106, 1080,…
## $ calories                   <int> 2004, 1990, 2151, 1970, 1431, 1464, 2282, 2…
## $ weight_kg                  <dbl> 52.6, 52.6, 133.5, 56.7, 57.3, 72.4, 72.3, …
## $ weight_pounds              <dbl> 115.9631, 115.9631, 294.3171, 125.0021, 126…
## $ bmi                        <dbl> 22.65, 22.65, 47.54, 21.45, 21.69, 27.45, 2…
## $ total_daily_active_hours   <dbl> 5.5500000, 5.4666667, 0.5333333, 6.3833333,…
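As a quick sanity check of the new column, the arithmetic can be reproduced by hand for the first three rows shown in head(): the three active-minute columns are summed and divided by 60, and sedentary minutes are left out. This is a small base R sketch using values copied from the output above.

```r
# Active minutes for the first three rows of daily_bmi_activity (from the output above)
very_active_minutes    <- c(41, 50, 0)
fairly_active_minutes  <- c(15, 24, 0)
lightly_active_minutes <- c(277, 254, 32)

# Same formula as the mutate() call: total active minutes converted to hours
total_daily_active_hours <-
  (very_active_minutes + fairly_active_minutes + lightly_active_minutes) / 60

print(total_daily_active_hours)
```

For the first row this is (41 + 15 + 277) / 60 = 333 / 60 = 5.55 hours, which agrees with the value in the output.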

Next, we’ll separate the types of bmi

data_bmi_type <- daily_bmi_activity  %>%
  # Half-open ranges (< 25, < 30) leave no gap between categories
  mutate(bmi_type = factor(case_when(bmi >= 18.5 & bmi < 25 ~ 'normal_weight_bmi', bmi >= 25 & bmi < 30 ~ 'overweight_bmi', bmi >= 30 ~ 'obese_bmi'), levels = c('normal_weight_bmi', 'overweight_bmi', 'obese_bmi'))) %>%
  # id is carried under the name .group, matching the output below
  select(bmi_type, .group = id, calories, total_steps, very_active_minutes, fairly_active_minutes, lightly_active_minutes, sedentary_minutes, total_daily_active_hours) %>%
  drop_na()

colnames(data_bmi_type)

## [1] "bmi_type"                 ".group"                  
## [3] "calories"                 "total_steps"             
## [5] "very_active_minutes"      "fairly_active_minutes"   
## [7] "lightly_active_minutes"   "sedentary_minutes"       
## [9] "total_daily_active_hours"

head(data_bmi_type)

##            bmi_type     .group calories total_steps very_active_minutes
## 1 normal_weight_bmi 1503960366     2004       14727                  41
## 2 normal_weight_bmi 1503960366     1990       15103                  50
## 3         obese_bmi 1927972279     2151         356                   0
## 4 normal_weight_bmi 2873212765     1970        8859                   2
## 5 normal_weight_bmi 2873212765     1431        7566                   0
## 6    overweight_bmi 4319703577     1464          29                   0
##   fairly_active_minutes lightly_active_minutes sedentary_minutes
## 1                    15                    277               798
## 2                    24                    254               816
## 3                     0                     32               986
## 4                    10                    371              1057
## 5                     0                    268               720
## 6                     0                      3              1363
##   total_daily_active_hours
## 1                5.5500000
## 2                5.4666667
## 3                0.5333333
## 4                6.3833333
## 5                4.4666667
## 6                0.0500000

glimpse(data_bmi_type)

## Rows: 67
## Columns: 9
## $ bmi_type                 <fct> normal_weight_bmi, normal_weight_bmi, obese_b…
## $ .group                   <dbl> 1503960366, 1503960366, 1927972279, 287321276…
## $ calories                 <int> 2004, 1990, 2151, 1970, 1431, 1464, 2282, 211…
## $ total_steps              <int> 14727, 15103, 356, 8859, 7566, 29, 10429, 894…
## $ very_active_minutes      <int> 41, 50, 0, 2, 0, 0, 8, 14, 18, 0, 0, 6, 200, …
## $ fairly_active_minutes    <int> 15, 24, 0, 10, 0, 0, 13, 15, 10, 0, 0, 30, 37…
## $ lightly_active_minutes   <int> 277, 254, 32, 371, 268, 3, 313, 331, 340, 190…
## $ sedentary_minutes        <int> 798, 816, 986, 1057, 720, 1363, 1106, 1080, 9…
## $ total_daily_active_hours <dbl> 5.5500000, 5.4666667, 0.5333333, 6.3833333, 4…

First, we’ll inspect daily step counts, calories burnt and time spent on all activities by each bmi type.

### In Boxplots

ggplot(data_bmi_type, aes(bmi_type, total_steps, fill = bmi_type)) + 
  geom_boxplot() + 
  labs(title="Figure 8: Daily Step Counts Accrued by Each BMI Type", x=NULL) + 
  theme(legend.position="none", text = element_text(size = 11), plot.title = element_text(hjust = 0.5))

ggplot(data_bmi_type, aes(bmi_type, calories, fill = bmi_type)) + 
  geom_boxplot() + 
  labs(title="Figure 9: Daily Calories Burnt by Each BMI Type", x=NULL) + 
  theme(legend.position="none", text = element_text(size = 11), plot.title = element_text(hjust = 0.5))

Figures 8 and 9 Analysis

In these two plots, both the normal_weight_bmi and overweight_bmi users take about 10,000 or more steps per day, as suggested by the WHO. For the normal_weight_bmi users, the distributions of calories and total steps taken are similar: <50% of normal_weight_bmi users take >10,000 steps and burn >2,000 calories.

However, for the overweight_bmi users, the distributions of step counts and calories burnt contradict each other, with >50% of overweight_bmi users taking >11,000 steps a day, but <50% of them burning >3,500 calories. One would expect the distributions of step counts and calories burnt to be similar, since accumulating more steps burns more calories. This raises the question of whether the amount of activity time and/or the intensity of activity has more effect on the number of calories burnt. Next, we’ll take a closer look at this by creating a boxplot of the total_daily_active_hours spent for each BMI type.
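
The percentages read off the boxplots can also be checked numerically. Below is a minimal sketch, assuming a data frame with the same columns as data_bmi_type; the toy values and the 10,000-step and 2,000-calorie thresholds are illustrative only, and each row stands for one recorded day rather than one user.

```r
library(dplyr)

# Toy stand-in for data_bmi_type; the values are made up for illustration.
toy_bmi <- tibble(
  bmi_type    = c("normal_weight_bmi", "normal_weight_bmi",
                  "overweight_bmi", "overweight_bmi"),
  total_steps = c(8000, 12000, 11000, 15000),
  calories    = c(1900, 2200, 3000, 3600)
)

# Median and share of recorded days above each threshold, per BMI type.
# mean() of a logical vector gives the proportion of TRUE values.
threshold_share <- toy_bmi %>%
  group_by(bmi_type) %>%
  summarise(
    median_steps       = median(total_steps),
    pct_days_over_10k  = mean(total_steps > 10000) * 100,
    median_calories    = median(calories),
    pct_days_over_2000 = mean(calories > 2000) * 100
  )

print(threshold_share)
```

Swapping toy_bmi for data_bmi_type would give the actual medians and shares behind Figures 8 and 9, which is a useful cross-check on values estimated by eye from a boxplot.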

### Total daily hours spent on overall activities for each BMI type

# Figure 10: distribution of total daily active hours for each BMI type
ggplot(data_bmi_type, aes(x = bmi_type, y = total_daily_active_hours, fill = bmi_type)) + 
  geom_boxplot() + 
  labs(title = "Figure 10: Daily Overall Activity Hours Spent by Each BMI Type", x = NULL) + 
  theme(legend.position = "none", text = element_text(size = 11), plot.title = element_text(hjust = 0.5))

Figure 10 Analysis

In this boxplot, the normal_weight_bmi users are very consistent in the distributions of calories, total_steps, and total_daily_active_hours, with >50% of them burning <2,000 calories (Figure 9), taking <10,000 total_steps (Figure 8), and spending <5.5 total_daily_active_hours (Figure 10).

For the overweight_bmi users, about 50% fall above and about 50% below 5.5 total_daily_active_hours. These observations do not account for the opposing distributions of total steps (Figure 8) and calories burnt (Figure 9). So, we’ll use bar plots to examine the average step counts, calories burnt, and overall activity hours.

Examine the calories burnt and total steps taken by each BMI type in bar plots

### Daily step counts accrued by Each bmi type in Bar_plots

data_bmi_type %>% 
  group_by(bmi_type) %>% 
  summarise(avg_total_steps = mean(total_steps)) %>%
  ggplot(aes(x = bmi_type, y = avg_total_steps, fill = bmi_type)) + 
  geom_col() + 
  labs(title = "Figure 11: Daily Steps Accrued by Each BMI Type", x = NULL) +
  theme(legend.position = "none", text = element_text(size = 11), plot.title = element_text(hjust = 0.5))

### Daily calories burnt by Each bmi type in Bar_plots

data_bmi_type %>% 
  group_by(bmi_type) %>% 
  summarise(avg_calories = mean(calories)) %>%
  ggplot(aes(x = bmi_type, y = avg_calories, fill = bmi_type)) + 
  geom_col() + 
  labs(title = "Figure 12: Daily Calories Burnt by Each BMI Type", x = NULL) +
  theme(legend.position = "none", text = element_text(size = 11), plot.title = element_text(hjust = 0.5))

### Daily Hours Spent on Overall Activities by each bmi type in Bar_plots

data_bmi_type %>%
  group_by(bmi_type) %>%
  summarise(avg_total_daily_active_hours = mean(total_daily_active_hours)) %>%
  ggplot(aes(x = bmi_type, y = avg_total_daily_active_hours, fill = bmi_type)) +
  geom_col() +
  labs(title = "Figure 13: Daily Overall Activity Hours by Each BMI Type", x = NULL) + 
  theme(legend.position = "none", text = element_text(size = 11), plot.title = element_text(hjust = 0.5))

Figures 11-13 Analysis

Figure 11 clearly shows that the overweight_bmi users accrued ~30% more steps than the normal_weight_bmi users, up to a maximum of ~15,000 daily steps, which is above the recommended 7,000 to 10,000 steps, depending on age and sex. This could be one of the reasons that the overweight_bmi users burnt ~30% more calories than the normal_weight_bmi users (Figure 12), reaching >3,000 calories/day. While this amount of calories burnt is also above the recommended 1,600-2,400 calories per day for adult women and 2,000-3,000 for adult men, it still doesn’t explain the opposing distributions of users in step counts and calories burnt observed in Figures 8 and 9, respectively.

Figure 13 shows the overweight_bmi users spent <30 min more in activity hours than the normal_weight_bmi users, but it still does not explain the opposing user distributions in Figures 8 and 9. So, we’ll do a more detailed analysis using grouped bar plots.

One striking observation from Figures 11-13 is that the obese_bmi user spends a total of only ~32 minutes on lightly_active activities but was able to burn 2,000 calories, whereas the normal_weight_bmi users have to spend ~5 activity hours to burn the same number of calories. One likely explanation is that the obese_bmi user carries a lot more weight: 160 lb/73 kg more than the normal_weight_bmi users and 113 lb/51 kg more than the overweight_bmi users. This is similar to carrying an extra load of that amount while performing any type of activity, thus burning calories at a higher rate even with lightly_active activities.

The same might be partly true when comparing the overweight_bmi users with the normal_weight_bmi users, depending on whether both types of users performed exactly the same type of activities. Therefore, without taking into account a user’s weight and the time engaged in each type of activity, the data might at first glance overstate the effectiveness of lightly active activities in burning calories.

Since the amount of daily activity hours does not completely correlate with the number of calories burnt, especially in the case of obese_bmi, we’ll examine if time and activity intensity have any effect on the number of calories burnt.
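
One way to make this mismatch concrete is to express calories burnt per active hour for each BMI type. The sketch below assumes columns named as in data_bmi_type; the toy values are illustrative and calories_per_active_hour is a hypothetical helper column, not part of the original analysis.

```r
library(dplyr)

# Toy stand-in for data_bmi_type; the values are made up for illustration.
toy_bmi <- tibble(
  bmi_type                 = c("normal_weight_bmi", "normal_weight_bmi", "obese_bmi"),
  calories                 = c(2000, 1800, 2100),
  total_daily_active_hours = c(5.0, 4.0, 0.5)
)

# Average calories and active hours per BMI type, then their ratio.
calories_per_hour <- toy_bmi %>%
  group_by(bmi_type) %>%
  summarise(
    avg_calories             = mean(calories),
    avg_active_hours         = mean(total_daily_active_hours),
    calories_per_active_hour = avg_calories / avg_active_hours
  )

print(calories_per_hour)
```

The ratio should be read with care: a tracker’s daily calorie total generally includes calories burnt at rest, so a user with very few active hours will show a very high value even without intense activity.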

### Calculate the average daily hours spent on each type of activities

data_bmi_type_long <- data_bmi_type %>% 
  group_by(bmi_type) %>% 
  # Average minutes per activity type, converted to hours
  summarise(very_active = mean(very_active_minutes)/60, fairly_active = mean(fairly_active_minutes)/60, lightly_active = mean(lightly_active_minutes)/60, sedentary = mean(sedentary_minutes)/60) %>%
  # The piped data frame is already the first argument, so only the columns to pivot are given
  pivot_longer(cols = very_active:lightly_active, names_to = "Activity_Type", values_to = "Hours")

head(data_bmi_type_long)

## # A tibble: 6 × 4
##   bmi_type          sedentary Activity_Type  Hours
##   <fct>                 <dbl> <chr>          <dbl>
## 1 normal_weight_bmi      11.4 very_active    0.375
## 2 normal_weight_bmi      11.4 fairly_active  0.302
## 3 normal_weight_bmi      11.4 lightly_active 4.18 
## 4 overweight_bmi         18.1 very_active    0.979
## 5 overweight_bmi         18.1 fairly_active  0.178
## 6 overweight_bmi         18.1 lightly_active 4.02

glimpse(data_bmi_type_long)

## Rows: 9
## Columns: 4
## $ bmi_type      <fct> normal_weight_bmi, normal_weight_bmi, normal_weight_bmi,…
## $ sedentary     <dbl> 11.39363, 11.39363, 11.39363, 18.13958, 18.13958, 18.139…
## $ Activity_Type <chr> "very_active", "fairly_active", "lightly_active", "very_…
## $ Hours         <dbl> 0.3745098, 0.3019608, 4.1843137, 0.9786458, 0.1776042, 4…

colnames(data_bmi_type_long)

## [1] "bmi_type"      "sedentary"     "Activity_Type" "Hours"

data_bmi_type_long

## # A tibble: 9 × 4
##   bmi_type          sedentary Activity_Type  Hours
##   <fct>                 <dbl> <chr>          <dbl>
## 1 normal_weight_bmi      11.4 very_active    0.375
## 2 normal_weight_bmi      11.4 fairly_active  0.302
## 3 normal_weight_bmi      11.4 lightly_active 4.18 
## 4 overweight_bmi         18.1 very_active    0.979
## 5 overweight_bmi         18.1 fairly_active  0.178
## 6 overweight_bmi         18.1 lightly_active 4.02 
## 7 obese_bmi              16.4 very_active    0    
## 8 obese_bmi              16.4 fairly_active  0    
## 9 obese_bmi              16.4 lightly_active 0.533

### Compare the daily hours each type of BMI user spends on each type of activity.

ggplot(data_bmi_type_long, aes(x = bmi_type, y = Hours, fill = Activity_Type)) + 
  geom_col(position = "dodge") + 
  labs(title = "Figure 14: Comparing Daily Very_, Fairly_ and Lightly_ Active Hours")

### Create another data frame to include time spent on sedentary activities


data_bmi_type_long_all <- data_bmi_type %>% 
  group_by(bmi_type) %>% 
  # Average minutes per activity type, converted to hours
  summarise(very_active = mean(very_active_minutes)/60, fairly_active = mean(fairly_active_minutes)/60, lightly_active = mean(lightly_active_minutes)/60, sedentary = mean(sedentary_minutes)/60) %>%
  # Pivot all four columns, including sedentary, into long format
  pivot_longer(cols = very_active:sedentary, names_to = "Activity_Type", values_to = "Hours")

head(data_bmi_type_long_all)

## # A tibble: 6 × 3
##   bmi_type          Activity_Type   Hours
##   <fct>             <chr>           <dbl>
## 1 normal_weight_bmi very_active     0.375
## 2 normal_weight_bmi fairly_active   0.302
## 3 normal_weight_bmi lightly_active  4.18 
## 4 normal_weight_bmi sedentary      11.4  
## 5 overweight_bmi    very_active     0.979
## 6 overweight_bmi    fairly_active   0.178

glimpse(data_bmi_type_long_all)

## Rows: 12
## Columns: 3
## $ bmi_type      <fct> normal_weight_bmi, normal_weight_bmi, normal_weight_bmi,…
## $ Activity_Type <chr> "very_active", "fairly_active", "lightly_active", "seden…
## $ Hours         <dbl> 0.3745098, 0.3019608, 4.1843137, 11.3936275, 0.9786458, …

colnames(data_bmi_type_long_all)

## [1] "bmi_type"      "Activity_Type" "Hours"

data_bmi_type_long_all

## # A tibble: 12 × 3
##    bmi_type          Activity_Type   Hours
##    <fct>             <chr>           <dbl>
##  1 normal_weight_bmi very_active     0.375
##  2 normal_weight_bmi fairly_active   0.302
##  3 normal_weight_bmi lightly_active  4.18 
##  4 normal_weight_bmi sedentary      11.4  
##  5 overweight_bmi    very_active     0.979
##  6 overweight_bmi    fairly_active   0.178
##  7 overweight_bmi    lightly_active  4.02 
##  8 overweight_bmi    sedentary      18.1  
##  9 obese_bmi         very_active     0    
## 10 obese_bmi         fairly_active   0    
## 11 obese_bmi         lightly_active  0.533
## 12 obese_bmi         sedentary      16.4

###  How each type of BMI users spend their daily hours on each activity type, and sedentariness

ggplot(data_bmi_type_long_all, aes(x = bmi_type, y = Hours, fill = Activity_Type)) +
  geom_col(position = "dodge") +
  labs(title = "Figure 15: Comparing Various Daily Activity and Sedentary Time")

Figures 14-15 Analysis

Figures 14 and 15 clearly show that the overweight_bmi users spent >2x the amount of time on very_active activities compared with the normal_weight_bmi users, and thus burnt calories at a higher rate. This and the fact that overweight_bmi users carry more weight (on average, 47 lb/21 kg) than the normal_weight_bmi users could be reasons why the overweight_bmi users burnt more calories than the normal_weight_bmi users (as explained above in Figures 11-13 Analysis).

This observation could also account for the opposing distributions observed in step counts (Figure 8) and calories burnt (Figure 9). The overweight_bmi users accrued most of their step counts during lightly_active time, when calories are burnt at a lower rate.

Many users did not register sleep data, and some seemed to count sleep as part of sedentary time.

This creates inconsistent sedentary hours among users. So, we’ll focus on the percentage of total daily activity hours that each type of BMI user spends on very_, fairly_ and lightly_active activities.

### Calculate percent daily activity time spent on very_, fairly_ and lightly_ active activities.

data_bmi_type_long_all_percent <- data_bmi_type %>% 
  group_by(bmi_type) %>%
  # Each percentage is the mean minutes of one intensity over the summed mean minutes of all three
  summarise(percent_very_active = mean(very_active_minutes)*100/(mean(very_active_minutes) +
                                                                 mean(fairly_active_minutes) + 
                                                                 mean(lightly_active_minutes)),
            percent_fairly_active = mean(fairly_active_minutes)*100/(mean(very_active_minutes) + 
                                                                     mean(fairly_active_minutes) +
                                                                     mean(lightly_active_minutes)),
            percent_lightly_active = mean(lightly_active_minutes)*100/(mean(very_active_minutes) + 
                                                                       mean(fairly_active_minutes) +
                                                                       mean(lightly_active_minutes))) %>%
  # The piped data frame is already the first argument, so only the columns to pivot are given
  pivot_longer(cols = percent_very_active:percent_lightly_active, names_to = "Percent_Activity_Type", values_to = "Percent")

colnames(data_bmi_type_long_all_percent)

## [1] "bmi_type"              "Percent_Activity_Type" "Percent"

head(data_bmi_type_long_all_percent)

## # A tibble: 6 × 3
##   bmi_type          Percent_Activity_Type  Percent
##   <fct>             <chr>                    <dbl>
## 1 normal_weight_bmi percent_very_active       7.70
## 2 normal_weight_bmi percent_fairly_active     6.21
## 3 normal_weight_bmi percent_lightly_active   86.1 
## 4 overweight_bmi    percent_very_active      18.9 
## 5 overweight_bmi    percent_fairly_active     3.43
## 6 overweight_bmi    percent_lightly_active   77.7

glimpse(data_bmi_type_long_all_percent)

## Rows: 9
## Columns: 3
## $ bmi_type              <fct> normal_weight_bmi, normal_weight_bmi, normal_wei…
## $ Percent_Activity_Type <chr> "percent_very_active", "percent_fairly_active", …
## $ Percent               <dbl> 7.704720, 6.212182, 86.083098, 18.899618, 3.4298…

data_bmi_type_long_all_percent

## # A tibble: 9 × 3
##   bmi_type          Percent_Activity_Type  Percent
##   <fct>             <chr>                    <dbl>
## 1 normal_weight_bmi percent_very_active       7.70
## 2 normal_weight_bmi percent_fairly_active     6.21
## 3 normal_weight_bmi percent_lightly_active   86.1 
## 4 overweight_bmi    percent_very_active      18.9 
## 5 overweight_bmi    percent_fairly_active     3.43
## 6 overweight_bmi    percent_lightly_active   77.7 
## 7 obese_bmi         percent_very_active       0   
## 8 obese_bmi         percent_fairly_active     0   
## 9 obese_bmi         percent_lightly_active  100

### Percent activity time each BMI type users spend on very, fairly and lightly active activities

ggplot(data_bmi_type_long_all_percent, aes(x = bmi_type, y = Percent, fill = Percent_Activity_Type)) +
  geom_col(position = "dodge") +
  labs(title = "Figure 16: Percent Time on Various Activity Intensities")

Figure 16 Analysis

In this figure, it’s clear that the overweight_bmi users spent ~20% of their activity time on very_active activities, whereas the normal_weight_bmi users spent <8% of their time on very_active activities and thus burnt calories at a lower rate. For lightly_active activities, the normal_weight_bmi users spent a somewhat larger share of their time (>85%), compared to the ~78% that the overweight_bmi users spent. This small increase in lightly_active time is not enough to make up for the calories burnt in even a short time on very_active activities. The figure also makes it clear that the obese_bmi user spent 100% of activity time on lightly_active activities. The calories burnt are not surprising considering the large extra body weight, which burns calories at a much higher rate.

```

5. Share

In this Share Phase the main tasks are to:

Answer the business questions based on findings from our analysis, along with effective visualizations to clearly communicate our findings.
Create a compelling story based on the insights found.
Relate the relevance of the findings to the business questions.

Recalling the business questions:

What are some trends in smart device usage?
How could these trends apply to Bellabeat customers?
How could these trends help influence Bellabeat marketing strategy?

Based on the visualizations created during the analysis Phase (Figures 1 - 16), below is a summary of the trends observed:

The rate of calories burnt is determined by:
1. Step counts - most users accrued between 7,000 and 8,000 daily steps, with Saturdays having the highest number of steps accrued.
2. The intensity of activity - there is a clear positive correlation between activity intensity and calories burnt. However, users tend to spend the majority of activity time in the order of lightly_active > fairly_active > very_active.
3. The amount of overall activity time - the more time spent, the more calories burnt.
A small group of users spends ~39 minutes awake while in bed. They could be bad sleepers or just simply reading a book or doing something with their smartphones.
Many users did not register weight and sleep data. Some may have been unclear on how to use the app, unaware of the significance of these data, or simply forgot to wear the device in bed.
Users’ time spent by activity type: sedentary > lightly_active > fairly_active > very_active.
While the weight_log_info data frame contains a relatively small dataset, the BMI data analysis shows that all types of BMI users spent most of their activity time on lightly_active activities, between 75% and 100% (Figure 16), with 50% of them burning between 2,500 and 3,500 calories a day and ~20% of users burning >3,500 calories. They spent more time on fitness activities (~4.9 to >5 hours) and accrued more step counts (10,000-15,000). All of these surpass those of the average overall users analyzed and discussed above. This might indicate that users who recorded weight_log_info data care more about their health, e.g. weight, and are more serious about getting in shape.
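
The link between activity intensity and calories burnt in the summary above can be quantified with a simple correlation. This is a minimal sketch in base R on made-up values; the column names follow the cleaned daily activity data, and toy_daily is a hypothetical stand-in for it.

```r
# Toy stand-in for the daily activity data; the values are made up for illustration.
toy_daily <- data.frame(
  very_active_minutes    = c(0, 10, 20, 40, 60),
  lightly_active_minutes = c(300, 250, 280, 240, 260),
  calories               = c(1800, 2000, 2300, 2700, 3100)
)

# Pearson correlation of each intensity column with daily calories.
intensity_cols <- c("very_active_minutes", "lightly_active_minutes")
calorie_correlations <- sapply(
  toy_daily[intensity_cols],
  function(minutes) cor(minutes, toy_daily$calories)
)

print(round(calorie_correlations, 2))
```

A coefficient close to 1 for very_active_minutes on the real data would support the trend stated in point 2; a correlation on its own does not show that intensity causes the higher calorie burn.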

One striking observation is that the obese_bmi user spends a total of only ~32 minutes on lightly_active activities but was able to burn 2,000 calories, whereas the normal_weight_bmi users have to spend ~5 activity hours to burn the same 2,000 calories. This is likely because the extra weight of the obese user acts like a weight vest and results in a higher rate of calorie burn when engaging in any type of activity, as shown in Figures 11-16.

Marketing Suggestions:

Since the BMI users seem to be really serious about their health, Bellabeat could emphasize its ability to monitor health, e.g. tracking users’ BMI, in its marketing message. After all, Bellabeat is created for women’s health and should invest more in this marketing aspect. In his well-known article “Marketing Myopia,” Theodore Levitt suggests that business leaders ask themselves, “What business are we really in?” That is, leaders should answer the question, “What are we really doing for the customer?” Since those who care about their health are serious about recording the weight_log_info data, Bellabeat should focus on health-conscious customers’ needs and fulfill its “Women’s #1 Health tracker” tagline.

6. Act

In this Act Phase the main tasks are to:

Prepare a final conclusion based on the analysis.
Determine how the analytics team and the business can apply the insights found.
Discuss the next steps that would be taken based on our findings.
See if there are additional data that could be used to expand on the findings.

Based on the insights discussed above in the Share and Analysis sections, Bellabeat is strongly encouraged to do the following:

Data’s age, size, duration of measurements, sampling bias, and lack of information on age, and ethnicity need improvement as suggested below:
1. Collect more recent and reliable data over a longer duration - e.g. make data collection automatic for at least 6 months and store the data on the Bellabeat server for all registered users.
2. Make it mandatory for registered users to record their age, sex (if not female), and ethnicity so Bellabeat could better advise users on their health and activity needs based on this information.
3. Require users to record their weight and height, along with waist and hip circumference, by attaching a photo of the readings, which would be saved on the Bellabeat server. This would allow Bellabeat to determine users’ BMI and the possibility of abdominal adiposity for predicting cardiovascular disease risk.
4. Blood pressure data are lacking - Bellabeat should be able to register these data automatically, like other health trackers, and save them. This is especially important for users with a history of high blood pressure, including some pregnant women, since high blood pressure might be a sign of preeclampsia that requires a bed rest prescription.
5. Possible data bias: Bellabeat only collects data from users who can afford the device, which lacks inclusiveness and introduces bias into the data. To circumvent this, Bellabeat could run a campaign of 6 months or longer, recruiting volunteers to use the device and collecting data similar to that of paying users. At the same time, the campaign would make volunteers aware of the technology and its advantages, and Bellabeat could consider marketing a simpler, lower-cost version to those who cannot afford the current one. Alternatively, Bellabeat could create a low-cost rental program that can also be marketed to clinics as a prescription for dangerously overweight patients.
Most users spend the majority of time on lightly_active activities and little or none on very_active activities. Since a recent study shows that one can significantly reduce the risk of developing cardiovascular diseases by doing at least moderate-intensity exercises, Bellabeat could alert users whose BMI and/or abdominal adiposity data are at risk to include some intense exercise in their daily schedule.
Many users did not record sleep and weight data. Some appear to combine sleep data with sedentary time and ended up with a total of 24 hours when summing sedentary time and total activity time. It’s likely that some forgot to wear the device to bed or were not aware of the significance of wearing it all day. To mitigate the lack of sleep and weight data, it’s recommended that Bellabeat do the following:
1. Clearly define sedentary time - e.g. does it include sleep time, or is it any time when no steps are accrued? The inconsistency of sedentary time data makes it difficult to analyze the data and creates a false appearance that most users spend too much time being inactive.
2. Create short videos in various languages to show users what data are important to input for a healthy fitness program. For example, the videos could cover recommendations for the amount of daily sleep time, step counts, and calorie counts, as well as users’ BMI and abdominal adiposity risk based on the data they entered.
BMI analysis shows that all users who record the weight_log_info data (including all normal_weight, overweight, and obese bmi users) are those who care about their health the most, compared to the average users analyzed. BMI users are observed to accrue more daily steps, burn more calories, and spend more time in activities, especially in very_active ones. Therefore, it’s recommended that Bellabeat do the following:
1. Make recording the weight_log_info data mandatory so that the data are more reliable, since <30% of users registered their data in this dataset. Also, require all users to include waist and hip circumference measurements in the weight_log_info dataset.
2. Market Bellabeat’s ability to use weight_log_info data to alert users if they are at risk of cardiovascular disease, and to compliment those whose weight_log_info data are normal.
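
The 24-hour pattern described above, where sedentary time appears to absorb sleep, can be flagged programmatically. The following is a minimal sketch, assuming daily activity columns named as in the cleaned data plus a total_minutes_asleep column from the sleep data; that column name, the toy values, and the flag name sedentary_may_include_sleep are assumptions made for illustration.

```r
library(dplyr)

# Toy stand-in for the combined activity and sleep data; values are made up.
toy_daily <- tibble(
  id                     = c(1, 1, 2),
  very_active_minutes    = c(30, 0, 20),
  fairly_active_minutes  = c(20, 0, 10),
  lightly_active_minutes = c(200, 100, 250),
  sedentary_minutes      = c(700, 1340, 1160),
  total_minutes_asleep   = c(420, NA, NA)
)

# A day has 1,440 minutes. If activity plus sedentary minutes already fill the
# whole day and no sleep was logged, sedentary time probably includes sleep.
flagged <- toy_daily %>%
  mutate(
    tracked_minutes = very_active_minutes + fairly_active_minutes +
      lightly_active_minutes + sedentary_minutes,
    sedentary_may_include_sleep = tracked_minutes == 1440 & is.na(total_minutes_asleep)
  )

print(flagged)
```

Days flagged this way could be excluded from sedentary-time comparisons, or reported separately, until sedentary time is defined consistently.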

Bellabeat Case Analysis

Martina Green

2023-02-10