Bellabeat Time Analysis

Theresa 2022-08-31

Scenario

Product: Time - This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.

Stakeholders:

Business Task: Analyze and define smartwatch trends for the new marketing strategy of the Time.

Data

Data Source:

FitBit Fitness Tracker Data, collected through a survey distributed via Amazon Mechanical Turk. The study covered 30 Fitbit users between March 12th, 2016 and May 12th, 2016. It was provided by Google through Kaggle (https://www.kaggle.com/datasets/arashnic/fitbit).

Data Sets:

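5 data sets were analyzed: dailySteps_merged.csv, sleepDay_merged.csv, weightLogInfo_merged.csv, dailyCalories_merged.csv, and dailyActivity_merged.csv.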

Each data set was in CSV format and was imported into RStudio for analysis.

Limitations of Data:

Reliability:

This data has limited reliability because the sample contains only 30 participants. A quick search showed that the global estimate of smartwatches sold in 2016 was about 19 million. For a population of that size, a sample of about 385 users would be needed for a 95% confidence level and a 5% margin of error. With such a small sample, sampling error will be large, making it harder to generalize results from the analysis to the population as a whole and increasing the risk of skewed or inaccurate conclusions.
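
The 385 figure can be reproduced with the standard sample-size formula for a proportion. The sketch below is an illustration rather than part of the original analysis: it assumes the most conservative proportion (p = 0.5) and applies a finite population correction for the 19 million estimate. The variable names are chosen here for illustration only.

```r
# Sample size for estimating a proportion (illustrative assumptions).
confidence <- 0.95   # confidence level
margin     <- 0.05   # margin of error
p          <- 0.5    # assumed proportion; 0.5 gives the largest required sample
population <- 19e6   # estimated smartwatches sold in 2016 (figure from the text)

z  <- qnorm(1 - (1 - confidence) / 2)    # two-sided critical value, about 1.96
n0 <- z^2 * p * (1 - p) / margin^2       # sample size for an unlimited population
n  <- n0 / (1 + (n0 - 1) / population)   # finite population correction

required_n <- ceiling(n)
print(required_n)
```

With a population this large the correction barely changes the result, which is why the required sample stays near 385 regardless of the exact sales estimate.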

Original:

This data set is not original: it is third-party data collected through Amazon Mechanical Turk.

Comprehensive:

This data set is not very comprehensive, as some participants have no records on certain dates, giving an incomplete picture of their data. Demographic information about the participants, such as age, sex, ethnicity, and location, is also missing. This further limits the analysis and rules out any consideration of cultural or biological influences.

Current:

This data set is 6 years old. Since 2016, the usage and popularity of smartwatches have increased. Therefore, the trends in the health metrics measured in this study may no longer be accurate or relevant.

Cited:

This data set was collected through Amazon Mechanical Turk and shared on Zenodo. Its citation is below.

Furberg, R., Brinton, J., Keating, M., & Ortiz, A. (2016). Crowd-sourced Fitbit datasets 03.12.2016-05.12.2016 [Data set]. Zenodo. https://doi.org/10.5281/zenodo.53894

Cleaning Data

The entire analysis was conducted in RStudio. The first step was to install and load the packages needed for the analysis.

install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("ggplot2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("janitor")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("rmarkdown")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("lubridate")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("tidyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("ggpmisc")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("here")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("skimr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("stringr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.0      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ggplot2)
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(dplyr)
library(rmarkdown)
library(lubridate)
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(tidyr)
library(ggpmisc)
## Loading required package: ggpp
## 
## Attaching package: 'ggpp'
## 
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(here)
## here() starts at /cloud/project
library(skimr)
library(stringr)

With the necessary packages installed and loaded, the data sets were imported into R. They were given short names to make the code simpler and easier to read.

steps <- read.csv("dailySteps_merged.csv")
sleep <- read.csv("sleepDay_merged.csv")
weight <- read.csv("weightLogInfo_merged.csv")
calories <- read.csv("dailyCalories_merged.csv")
activity <- read.csv("dailyActivity_merged.csv")

Steps

Steps was the first data set to be cleaned. To verify the import, the column names and the first six rows were checked.

colnames(steps)
## [1] "Id"          "ActivityDay" "StepTotal"
head(steps)
##           Id ActivityDay StepTotal
## 1 1503960366   4/12/2016     13162
## 2 1503960366   4/13/2016     10735
## 3 1503960366   4/14/2016     10460
## 4 1503960366   4/15/2016      9762
## 5 1503960366   4/16/2016     12669
## 6 1503960366   4/17/2016      9705

The length of each Id was checked to confirm that all Ids were recorded correctly. Ids are 10 characters long, so the number of entries with any other length was counted. The count was zero, so all Ids had the expected length.

id_length_steps <- nchar(steps$Id)
sum(id_length_steps !=10)
## [1] 0

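The steps data set was checked for duplicates. None appeared in the first six rows. Because head() shows only those rows, sum(duplicated(steps)) would be needed to confirm this for the whole data set.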

head(duplicated(steps))
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
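
Since head() returns only the first six values, a count over every row gives a fuller check. A minimal sketch on a small made-up data frame follows; toy_steps is hypothetical and only stands in for the imported data.

```r
# Hypothetical stand-in for the imported steps data; row 3 repeats row 2.
toy_steps <- data.frame(
  Id          = c(1503960366, 1503960366, 1503960366, 1927972279),
  ActivityDay = c("4/12/2016", "4/13/2016", "4/13/2016", "4/12/2016"),
  StepTotal   = c(13162, 10735, 10735, 678)
)

# duplicated() marks every row that repeats an earlier row; sum() counts them.
n_duplicates <- sum(duplicated(toy_steps))
print(n_duplicates)

# Keep only the first occurrence of each row.
toy_steps_unique <- toy_steps[!duplicated(toy_steps), ]
print(nrow(toy_steps_unique))
```

The same two lines can be pointed at any of the imported data frames to check all rows at once.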

It was then checked for observations with zero steps. Taking zero steps in a day is very unlikely, so these records are probably days when the device was not worn or had run out of charge. Therefore, the 77 observations with zero steps were excluded from a new version of the data set, steps2. The original steps data set was kept unchanged, in line with good analysis practice.

nrow(steps[steps$StepTotal == 0, ])
## [1] 77
steps2 <- steps[steps$StepTotal != 0, ]

Next, the columns were renamed and formatted to be consistent across all 5 imported data sets. The “ActivityDay” column was converted to yyyy-mm-dd (the international date format) to make the analysis easier to review. It was also renamed “Day” to match the name of this column in the other data sets. The “StepTotal” column was renamed “TotalSteps” for the same reason.

# %Y parses the four-digit year in the source data (%y would read only "20")
steps2 <- steps2 %>% 
  mutate(ActivityDay = as.Date(ActivityDay, format = "%m/%d/%Y")) %>% 
  rename(Day = ActivityDay, TotalSteps = StepTotal)

To ensure all changes were made correctly, the new version was inspected.

head(steps2)
##           Id        Day TotalSteps
## 1 1503960366 2016-04-12      13162
## 2 1503960366 2016-04-13      10735
## 3 1503960366 2016-04-14      10460
## 4 1503960366 2016-04-15       9762
## 5 1503960366 2016-04-16      12669
## 6 1503960366 2016-04-17       9705
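
The format string passed to as.Date() deserves a close look, because %y matches a two-digit year while %Y matches a four-digit year. The sketch below compares the two on a single sample value taken from the data. It also prints the ISO weekday number, since any later day-of-week grouping depends on the year being parsed correctly.

```r
raw_day <- "4/12/2016"

# %y reads only the first two year digits ("20"), which R interprets as 2020.
two_digit_year  <- as.Date(raw_day, format = "%m/%d/%y")
# %Y reads the full four-digit year.
four_digit_year <- as.Date(raw_day, format = "%m/%d/%Y")

print(two_digit_year)
print(four_digit_year)

# ISO weekday numbers (1 = Monday, 7 = Sunday) differ between the two results.
print(format(two_digit_year, "%u"))
print(format(four_digit_year, "%u"))
```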

Weight

To verify the import, the column names and the first six rows were checked.

colnames(weight)
## [1] "Id"             "Date"           "WeightKg"       "WeightPounds"  
## [5] "Fat"            "BMI"            "IsManualReport" "LogId"
head(weight)
##           Id                  Date WeightKg WeightPounds Fat   BMI
## 1 1503960366  5/2/2016 11:59:59 PM     52.6     115.9631  22 22.65
## 2 1503960366  5/3/2016 11:59:59 PM     52.6     115.9631  NA 22.65
## 3 1927972279  4/13/2016 1:08:52 AM    133.5     294.3171  NA 47.54
## 4 2873212765 4/21/2016 11:59:59 PM     56.7     125.0021  NA 21.45
## 5 2873212765 5/12/2016 11:59:59 PM     57.3     126.3249  NA 21.69
## 6 4319703577 4/17/2016 11:59:59 PM     72.4     159.6147  25 27.45
##   IsManualReport        LogId
## 1           True 1.462234e+12
## 2           True 1.462320e+12
## 3          False 1.460510e+12
## 4           True 1.461283e+12
## 5           True 1.463098e+12
## 6           True 1.460938e+12

The length of each Id was checked to confirm that all Ids were recorded correctly. Ids are 10 characters long, so the number of entries with any other length was counted. The count was zero, so all Ids had the expected length.

id_length_weight <- nchar(weight$Id)
sum(id_length_weight !=10)
## [1] 0

The data was checked for duplicates. None were found in the first six rows.

head(duplicated(weight))
## [1] FALSE FALSE FALSE FALSE FALSE FALSE

Observations with a weight of zero were checked for in both the kilograms and pounds columns. It is impossible to weigh 0 kg or 0 lb, so any such observations would need to be removed. No zeros were found in either column.

nrow(weight[weight$WeightPounds == 0, ])
## [1] 0
nrow(weight[weight$WeightKg == 0, ])
## [1] 0

Next, the “Date” column was converted to the international date format and renamed “Day” to match the other data sets.

# %Y parses the four-digit year; the time of day after the date is ignored
weight2 <- weight %>% 
  mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>% 
  rename(Day = Date)

The changes were then checked for accuracy.

head(weight2)
##           Id        Day WeightKg WeightPounds Fat   BMI IsManualReport
## 1 1503960366 2016-05-02     52.6     115.9631  22 22.65           True
## 2 1503960366 2016-05-03     52.6     115.9631  NA 22.65           True
## 3 1927972279 2016-04-13    133.5     294.3171  NA 47.54          False
## 4 2873212765 2016-04-21     56.7     125.0021  NA 21.45           True
## 5 2873212765 2016-05-12     57.3     126.3249  NA 21.69           True
## 6 4319703577 2016-04-17     72.4     159.6147  25 27.45           True
##          LogId
## 1 1.462234e+12
## 2 1.462320e+12
## 3 1.460510e+12
## 4 1.461283e+12
## 5 1.463098e+12
## 6 1.460938e+12

Activity

The import of this data set was verified by checking the column names and the first six rows.

colnames(activity)
##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"
head(activity)
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366    4/12/2016      13162          8.50            8.50
## 2 1503960366    4/13/2016      10735          6.97            6.97
## 3 1503960366    4/14/2016      10460          6.74            6.74
## 4 1503960366    4/15/2016       9762          6.28            6.28
## 5 1503960366    4/16/2016      12669          8.16            8.16
## 6 1503960366    4/17/2016       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728

The Id length check was run and returned zero observations with a character length other than 10.

id_length_activity <- nchar(activity$Id)
sum(id_length_activity !=10)
## [1] 0

Duplicates were checked for in the first six rows and none were found.

head(duplicated(activity))
## [1] FALSE FALSE FALSE FALSE FALSE FALSE

It is very unlikely that someone takes zero steps in a day, so the data set was checked for observations with zero steps. 77 entries were found and excluded from a new version of the activity data set, activity2. The original data set was kept unchanged, in line with good analysis practice.

nrow(activity[activity$TotalSteps == 0, ])
## [1] 77
activity2 <- activity[activity$TotalSteps != 0, ]

Next, the date column in the new version was reformatted and renamed to be consistent with the other data sets.

# %Y parses the four-digit year in the source data
activity2 <- activity2 %>% 
  mutate(ActivityDate = as.Date(ActivityDate, format = "%m/%d/%Y")) %>% 
  rename(Day = ActivityDate)

The changes were then checked for accuracy.

head(activity2)
##           Id        Day TotalSteps TotalDistance TrackerDistance
## 1 1503960366 2016-04-12      13162          8.50            8.50
## 2 1503960366 2016-04-13      10735          6.97            6.97
## 3 1503960366 2016-04-14      10460          6.74            6.74
## 4 1503960366 2016-04-15       9762          6.28            6.28
## 5 1503960366 2016-04-16      12669          8.16            8.16
## 6 1503960366 2016-04-17       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728

Calories

The import was verified by checking the column names and the first six rows.

colnames(calories)
## [1] "Id"          "ActivityDay" "Calories"
head(calories)
##           Id ActivityDay Calories
## 1 1503960366   4/12/2016     1985
## 2 1503960366   4/13/2016     1797
## 3 1503960366   4/14/2016     1776
## 4 1503960366   4/15/2016     1745
## 5 1503960366   4/16/2016     1863
## 6 1503960366   4/17/2016     1728

All Id lengths were confirmed to be 10 characters.

id_length_calories <- nchar(calories$Id)
sum(id_length_calories !=10)
## [1] 0

Duplicates were checked for in the first six rows and none were found.

head(duplicated(calories))
## [1] FALSE FALSE FALSE FALSE FALSE FALSE

Burning zero calories is unlikely, as the body burns some calories throughout the day even without exercise. Therefore, the 4 entries with zero calories were excluded from a new version of the data set, calories2.

nrow(calories[calories$Calories == 0, ])
## [1] 4
calories2 <- calories[calories$Calories != 0, ]

The date column was reformatted and renamed to match the other data sets.

# %Y parses the four-digit year in the source data
calories2 <- calories2 %>% 
  mutate(ActivityDay = as.Date(ActivityDay, format = "%m/%d/%Y")) %>% 
  rename(Day = ActivityDay)

The changes were then checked for accuracy.

head(calories2)
##           Id        Day Calories
## 1 1503960366 2016-04-12     1985
## 2 1503960366 2016-04-13     1797
## 3 1503960366 2016-04-14     1776
## 4 1503960366 2016-04-15     1745
## 5 1503960366 2016-04-16     1863
## 6 1503960366 2016-04-17     1728

Sleep

The import was verified by checking the column names and the first six rows.

colnames(sleep)
## [1] "Id"                 "SleepDay"           "TotalSleepRecords" 
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
head(sleep)
##           Id              SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304
##   TotalTimeInBed
## 1            346
## 2            407
## 3            442
## 4            367
## 5            712
## 6            320

All Id lengths were confirmed to be 10 characters.

id_length_sleep <- nchar(sleep$Id)
sum(id_length_sleep !=10)
## [1] 0

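The first six rows were checked for duplicates and none were found. Because head() shows only six rows, this does not rule out duplicates further down the data set.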

head(duplicated(sleep))
## [1] FALSE FALSE FALSE FALSE FALSE FALSE

It is very unlikely that someone has zero minutes asleep, zero minutes in bed, or zero sleep records in a day. Therefore, all 3 of those columns were searched for zeros. None were found.

nrow(sleep[sleep$TotalMinutesAsleep == 0, ])
## [1] 0
nrow(sleep[sleep$TotalSleepRecords ==0, ])
## [1] 0
nrow(sleep[sleep$TotalTimeInBed ==0, ])
## [1] 0

Next, the date column was reformatted and renamed to align with the other data sets.

# %Y parses the four-digit year; the time of day after the date is ignored
sleep2 <- sleep %>% 
  mutate(SleepDay = as.Date(SleepDay, format = "%m/%d/%Y")) %>% 
  rename(Day = SleepDay)

The changes were then checked for accuracy.

head(sleep2)
##           Id        Day TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1 1503960366 2016-04-12                 1                327            346
## 2 1503960366 2016-04-13                 2                384            407
## 3 1503960366 2016-04-15                 1                412            442
## 4 1503960366 2016-04-16                 2                340            367
## 5 1503960366 2016-04-17                 1                700            712
## 6 1503960366 2016-04-19                 1                304            320

Distinct Entries

The final step in the cleaning process was checking that each data set contained at least 75% of the study's participants. The weight2 data set contained only 8 distinct participants and was therefore removed from further analysis. The activity2, steps2, and calories2 data sets each contained 33 distinct Ids, and sleep2 contained 24, which is 80% of the 30 participants described for the study (about 73% of the 33 Ids actually present in the data). These four data sets were used in the remaining analysis.

n_distinct(activity2$Id)
## [1] 33
n_distinct(steps2$Id)
## [1] 33
n_distinct(sleep2$Id)
## [1] 24
n_distinct(calories2$Id)
## [1] 33
n_distinct(weight2$Id)
## [1] 8
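
The data contain 33 distinct Ids while the study description mentions 30 users, so the 75% threshold can be read against either total. The sketch below uses the counts printed above to compute both shares. The vector and column names are illustrative and not part of the original analysis.

```r
# Distinct participant counts copied from the n_distinct() output above.
distinct_ids <- c(activity2 = 33, steps2 = 33, sleep2 = 24,
                  calories2 = 33, weight2 = 8)

coverage <- data.frame(
  n_ids       = unname(distinct_ids),
  share_of_30 = round(unname(distinct_ids) / 30, 2),  # against the stated 30 users
  share_of_33 = round(unname(distinct_ids) / 33, 2),  # against the 33 Ids in the data
  row.names   = names(distinct_ids)
)
print(coverage)
```

Against the stated 30 users, sleep2 clears the 75% threshold, while against the 33 Ids in the data it falls slightly short. In both readings weight2 is far below the threshold.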

Exploring Data

The summary() function was run on the data sets to gain a general overview of the data.

summary(activity2 %>% 
        select(-Id, -Day, -TrackerDistance, -LoggedActivitiesDistance))
##    TotalSteps    TotalDistance   VeryActiveDistance ModeratelyActiveDistance
##  Min.   :    4   Min.   : 0.00   Min.   : 0.000     Min.   :0.0000          
##  1st Qu.: 4923   1st Qu.: 3.37   1st Qu.: 0.000     1st Qu.:0.0000          
##  Median : 8053   Median : 5.59   Median : 0.410     Median :0.3100          
##  Mean   : 8319   Mean   : 5.98   Mean   : 1.637     Mean   :0.6182          
##  3rd Qu.:11092   3rd Qu.: 7.90   3rd Qu.: 2.275     3rd Qu.:0.8650          
##  Max.   :36019   Max.   :28.03   Max.   :21.920     Max.   :6.4800          
##  LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
##  Min.   : 0.000      Min.   :0.00000         Min.   :  0.00   
##  1st Qu.: 2.345      1st Qu.:0.00000         1st Qu.:  0.00   
##  Median : 3.580      Median :0.00000         Median :  7.00   
##  Mean   : 3.639      Mean   :0.00175         Mean   : 23.02   
##  3rd Qu.: 4.895      3rd Qu.:0.00000         3rd Qu.: 35.00   
##  Max.   :10.710      Max.   :0.11000         Max.   :210.00   
##  FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes    Calories   
##  Min.   :  0.00      Min.   :  0.0        Min.   :   0.0   Min.   :  52  
##  1st Qu.:  0.00      1st Qu.:146.5        1st Qu.: 721.5   1st Qu.:1856  
##  Median :  8.00      Median :208.0        Median :1021.0   Median :2220  
##  Mean   : 14.78      Mean   :210.0        Mean   : 955.8   Mean   :2361  
##  3rd Qu.: 21.00      3rd Qu.:272.0        3rd Qu.:1189.0   3rd Qu.:2832  
##  Max.   :143.00      Max.   :518.0        Max.   :1440.0   Max.   :4900
summary(steps2)
##        Id                 Day               TotalSteps   
##  Min.   :1.504e+09   Min.   :2016-04-12   Min.   :    4  
##  1st Qu.:2.320e+09   1st Qu.:2016-04-18   1st Qu.: 4923  
##  Median :4.445e+09   Median :2016-04-26   Median : 8053  
##  Mean   :4.858e+09   Mean   :2016-04-26   Mean   : 8319  
##  3rd Qu.:6.962e+09   3rd Qu.:2016-05-03   3rd Qu.:11092  
##  Max.   :8.878e+09   Max.   :2016-05-12   Max.   :36019
summary(sleep2)
##        Id                 Day             TotalSleepRecords TotalMinutesAsleep
##  Min.   :1.504e+09   Min.   :2016-04-12   Min.   :1.000     Min.   : 58.0     
##  1st Qu.:3.977e+09   1st Qu.:2016-04-19   1st Qu.:1.000     1st Qu.:361.0     
##  Median :4.703e+09   Median :2016-04-27   Median :1.000     Median :433.0     
##  Mean   :5.001e+09   Mean   :2016-04-26   Mean   :1.119     Mean   :419.5     
##  3rd Qu.:6.962e+09   3rd Qu.:2016-05-04   3rd Qu.:1.000     3rd Qu.:490.0     
##  Max.   :8.792e+09   Max.   :2016-05-12   Max.   :3.000     Max.   :796.0     
##  TotalTimeInBed 
##  Min.   : 61.0  
##  1st Qu.:403.0  
##  Median :463.0  
##  Mean   :458.6  
##  3rd Qu.:526.0  
##  Max.   :961.0
summary(calories2)
##        Id                 Day                Calories   
##  Min.   :1.504e+09   Min.   :2016-04-12   Min.   :  52  
##  1st Qu.:2.320e+09   1st Qu.:2016-04-19   1st Qu.:1834  
##  Median :4.445e+09   Median :2016-04-26   Median :2144  
##  Mean   :4.850e+09   Mean   :2016-04-26   Mean   :2313  
##  3rd Qu.:6.962e+09   3rd Qu.:2016-05-04   3rd Qu.:2794  
##  Max.   :8.878e+09   Max.   :2016-05-12   Max.   :4900

When exploring the data, the need for two new variables in the activity data set became apparent. Total active minutes gives a good picture of participants' activity throughout the day because it covers everything from walking around the office or house to working out. A second variable, intentional exercise, was needed to separate working out (exercise that raises the heart rate substantially) from activity that happens naturally throughout the day. It was created by adding the very active and fairly active minute variables.

activity2 <- activity2 %>% 
  mutate(total_active_mins = LightlyActiveMinutes + FairlyActiveMinutes + VeryActiveMinutes) %>% 
  mutate(total_intentional_mins = FairlyActiveMinutes + VeryActiveMinutes)

Exploring the sleep data set showed that a new variable was needed here as well. The difference between “time in bed” and “time asleep” is the time participants spent trying to fall asleep or lying in bed before getting up. This difference could reveal relationships with other variables, so “time awake in bed” was created.

sleep2 <- sleep2 %>% 
  mutate(time_awake_in_bed = TotalTimeInBed - TotalMinutesAsleep)

An outlier check was run with boxplot.stats() to determine whether the median or the mean would better summarize each variable. Because of the many outliers, the median was judged the better measure for “Total Steps”, “Sedentary Minutes”, “Time Asleep”, “Time in Bed”, and “Time Awake in Bed”. The mean was considered sufficient for all other variables, which had fewer outliers.

boxplot.stats(sleep2$TotalMinutesAsleep)$out
##  [1] 700 119 124 796 137 722 750 166  61 152  77  59 692  99  82  62  98 106 126
## [20] 103 115 123 775  74  79  58  74
boxplot.stats(sleep2$TotalTimeInBed)$out
##  [1] 712 127 142 961 154 961 961 961 775 178  69  77  65 722 104  85  65 107 108
## [20] 137 121 179 129 134 725 843  78  82  61  75
boxplot.stats(sleep2$time_awake_in_bed)$out
##  [1] 165 317 239 371 195 161 106 132 227 185 110 176 153 180 121 137 216 205 121
## [20] 191 162 145 145 243 154 208 123 162 189 206 197 140  76  94  87  78  87
boxplot.stats(activity2$Calories)$out
## [1]   52  257 4552 4392 4501 4546 4900 4547 4398
boxplot.stats(activity2$total_active_mins)$out
## [1] 540 552
boxplot.stats(activity2$SedentaryMinutes)$out
## [1]  2 13  0
boxplot.stats(activity2$TotalDistance)$out
##  [1] 28.03 15.08 17.54 15.01 15.97 16.24 17.19 17.95 15.69 15.67 17.65 20.40
## [13] 18.98 25.29 17.40 18.11 17.62 16.31 15.74 20.65 26.72 16.30 19.34 18.25
## [25] 19.56
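
For context on what these lists contain: with its default coef = 1.5, boxplot.stats() reports values lying more than 1.5 times the hinge spread (roughly the interquartile range) beyond the lower or upper hinge. A minimal sketch on made-up numbers shows the rule and why a single extreme value pulls the mean much further than the median. The vector toy_minutes is hypothetical.

```r
# Hypothetical values with one extreme observation.
toy_minutes <- c(1, 2, 3, 4, 5, 100)

# Hinges are 2 and 5, so the upper fence is 5 + 1.5 * (5 - 2) = 9.5.
flagged <- boxplot.stats(toy_minutes)$out
print(flagged)

# The extreme value shifts the mean far more than the median.
toy_mean   <- mean(toy_minutes)
toy_median <- median(toy_minutes)
print(c(mean = toy_mean, median = toy_median))
```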

Questions that arose during exploration.

Does the number of steps fluctuate across the week?

activity2$week_day <- weekdays(activity2$Day)
activity_by_day <- aggregate(activity2$TotalSteps, list(activity2$week_day), mean)

ggplot(data=activity_by_day, aes(x=Group.1, y=x)) +
  geom_bar(stat="identity", color = "black", fill="cadetblue2") +
  scale_x_discrete(limits=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')) +
  labs(title="Number of Steps by Day of the Week", x="Day of the Week", y = "Number of Steps") 
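
The weekday grouping used for this chart can also be written with an ordered factor, so the days sort Monday to Sunday without listing limits in the plot. The sketch below is a hedged illustration on made-up data. The data frame toy_activity and the vector day_names are hypothetical, and FUN = median is an assumption that follows the earlier outlier check. Swapping in mean reproduces the aggregation used for the charts here.

```r
# Hypothetical two weeks of step counts, starting on Monday 2016-04-11.
toy_activity <- data.frame(
  Day        = as.Date("2016-04-11") + 0:13,
  TotalSteps = c(8000, 9000, 7000, 10000, 6000, 12000, 4000,
                 8200, 9400, 7600,  9000, 6400, 11000, 5000)
)

day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")

# %u is the ISO weekday number (1 = Monday), which avoids locale-dependent names.
toy_activity$week_day <- factor(
  day_names[as.integer(format(toy_activity$Day, "%u"))],
  levels = day_names
)

# One row per weekday, already in calendar order.
steps_by_day <- aggregate(TotalSteps ~ week_day, data = toy_activity, FUN = median)
print(steps_by_day)
```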

The number of steps did not fluctuate much throughout the week. This led to the question of whether other variables fluctuated or stayed consistent throughout the week.

Do active minutes fluctuate throughout the week?

activity2$week_day <- weekdays(activity2$Day)
activity_by_day_mins <- aggregate(activity2$total_active_mins, list(activity2$week_day), mean)

ggplot(data=activity_by_day_mins, aes(x=Group.1, y=x)) +
  geom_bar(stat="identity", color = "black", fill="cadetblue2") +
  scale_x_discrete(limits=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')) +
  labs(title="Number of Active Minutes by Day of The Week", x="Day of the Week", y = "Number of Active Minutes") 

There is a slight fluctuation throughout the week, with Friday showing the fewest active minutes in the chart. This could be due to tiredness after the work week. This chart does not show how intentional exercise varies through the week, so the new intentional-minutes variable was used to see when people work out the most.

activity2$week_day <- weekdays(activity2$Day)
activity_by_day_inten <- aggregate(activity2$total_intentional_mins, list(activity2$week_day), mean)

ggplot(data=activity_by_day_inten, aes(x=Group.1, y=x)) +
  geom_bar(stat="identity", color = "black", fill="cadetblue2") +
  scale_x_discrete(limits=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')) +
  labs(title="Number of Intentional Active Minutes vs. Day", x="Day of the Week", y = "Intentional Active Minutes") 

The number of intentional active minutes was relatively low. The activity summary shows that sedentary time is by far the largest category: a mean of 955.8 minutes (about 16 hours), compared with about 38 minutes of intentional exercise and about 4 hours of active minutes altogether. It was therefore worth checking whether sedentary minutes fluctuated throughout the week.
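
The comparison rests on simple arithmetic with the means reported by summary(activity2). As a quick check, the sketch below copies those four means by hand into an illustrative vector (mean_minutes is not an object from the original analysis) and recomputes the totals.

```r
# Mean daily minutes copied from the summary(activity2) output.
mean_minutes <- c(VeryActive = 23.02, FairlyActive = 14.78,
                  LightlyActive = 210.0, Sedentary = 955.8)

# Intentional exercise = very active + fairly active minutes.
intentional_mins <- sum(mean_minutes[c("VeryActive", "FairlyActive")])

# Total active minutes add lightly active minutes on top.
total_active_mins <- intentional_mins + mean_minutes[["LightlyActive"]]

print(intentional_mins)                  # minutes of intentional exercise
print(total_active_mins / 60)            # total active time in hours
print(mean_minutes[["Sedentary"]] / 60)  # sedentary time in hours
```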

Does the number of sedentary minutes fluctuate throughout the week?

activity2$week_day <- weekdays(activity2$Day)
activity_by_day_sed <- aggregate(activity2$SedentaryMinutes, list(activity2$week_day), mean)

ggplot(data=activity_by_day_sed, aes(x=Group.1, y=x)) +
  geom_bar(stat="identity", color = "black", fill="cadetblue2") +
  scale_x_discrete(limits=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')) +
  labs(title="Number of Sedentary Minutes by Day of The Week", x="Day of the Week", y = "Number of Sedentary Minutes") 

It is not surprising that the weekdays have many sedentary minutes, since those are workdays for most people and many jobs are done at a desk. However, the weekend sedentary minutes are a little surprising, since most people are not at work on those days and have the opportunity to be more active.

Noticing this trend in sedentary minutes led to an inquiry into whether minutes asleep and minutes awake in bed also fluctuate throughout the week.

sleep2$week_day <- weekdays(sleep2$Day)
sleep_by_day <- aggregate(sleep2$TotalMinutesAsleep, list(sleep2$week_day), mean)

ggplot(data=sleep_by_day, aes(x=Group.1, y=x)) +
  geom_bar(stat="identity", color = "black", fill="cadetblue2") +
  scale_x_discrete(limits=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')) +
  labs(title="Minutes Asleep by Day of the Week", x="Day of the Week", y = "Number of Minutes Asleep") 

# week_day was already added to sleep2 above, so only the aggregated column changes
sleep_by_day <- aggregate(sleep2$time_awake_in_bed, list(sleep2$week_day), mean)

ggplot(data=sleep_by_day, aes(x=Group.1, y=x)) +
  geom_bar(stat="identity", color = "black", fill="cadetblue2") +
  scale_x_discrete(limits=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')) +
  labs(title="Minutes Awake in Bed by Day of the Week", x="Day of the Week", y = "Number of Minutes Awake in Bed") 

There were some noticeable differences in several variables across the days of the week. This led to an inquiry into the relationships among the variables and whether any of them were correlated.
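The two sleep summaries by weekday could also be computed in a single call. The sketch below assumes a toy data frame in place of `sleep2` (invented values; `toy_sleep` and `sleep_means` are illustrative names) with the same `TotalMinutesAsleep` and `time_awake_in_bed` columns used above:

```r
# Toy data standing in for sleep2 (values are invented for illustration)
toy_sleep <- data.frame(
  week_day = c("Monday", "Monday", "Sunday", "Sunday"),
  TotalMinutesAsleep = c(400, 420, 450, 470),
  time_awake_in_bed = c(30, 40, 50, 60)
)

# cbind() on the left-hand side of the formula averages both columns at once,
# and the result keeps the original column names instead of Group.1 and x
sleep_means <- aggregate(cbind(TotalMinutesAsleep, time_awake_in_bed) ~ week_day,
                         data = toy_sleep, FUN = mean)
print(sleep_means)
```

This avoids overwriting one summary object with the next and keeps both averages side by side for comparison.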

Is there a correlation between step count and minutes awake in bed? Minutes asleep?

act_sleep_merge <- merge(activity2, sleep2, by.x=c('Id', 'Day'),
                        by.y=c('Id', 'Day'))


ggplot(data=act_sleep_merge, aes(x=TotalSteps, y=time_awake_in_bed)) + geom_point(color="blue2")+
  labs(title = "Step Count vs. Minutes Awake in Bed", x="Total Steps a Day", y="Minutes Awake") +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

x <- act_sleep_merge$TotalSteps
y <- act_sleep_merge$time_awake_in_bed
cor.test(x, y, method = c("pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = 0.54976, df = 411, p-value = 0.5828
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.06956893  0.12327966
## sample estimates:
##        cor 
## 0.02710758
ggplot(data=act_sleep_merge, aes(x=TotalMinutesAsleep, y=TotalSteps)) + geom_point(color="blue2")+
  labs(title = "Step Count vs. Sleep", x="Minutes Asleep", y="Total Steps a Day") +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

x <- act_sleep_merge$TotalMinutesAsleep
y <- act_sleep_merge$TotalSteps
cor.test(x, y, method = c("pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = -3.8563, df = 411, p-value = 0.0001336
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.27834209 -0.09203143
## sample estimates:
##        cor 
## -0.1868665

It was revealed that there was a very weak, positive correlation (0.027) between step count and minutes awake in bed, which is not statistically significant (p = 0.58), and a weak, negative correlation (-0.187, p < 0.001) between step count and time asleep. The weakness of these correlations could be due to the small sample size. However, it led to another question about active minutes and the effect they have on minutes asleep.
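When reporting results like these, it can help to read the estimate, p-value, and confidence interval directly from the object that `cor.test()` returns rather than copying them from the printed output. A minimal sketch, assuming simulated data in place of `act_sleep_merge` (the vectors `steps` and `asleep` are random and purely illustrative):

```r
# Simulated stand-ins for TotalSteps and TotalMinutesAsleep (not the real data)
set.seed(42)
steps  <- rnorm(100, mean = 8000, sd = 3000)
asleep <- rnorm(100, mean = 420, sd = 60)

res <- cor.test(steps, asleep, method = "pearson")

r  <- unname(res$estimate)  # correlation coefficient
p  <- res$p.value           # p-value for H0: true correlation is 0
ci <- res$conf.int          # 95 percent confidence interval

cat(sprintf("r = %.3f, p = %.4f, 95%% CI [%.3f, %.3f]\n", r, p, ci[1], ci[2]))

# A correlation is usually called statistically significant when p < 0.05,
# which is equivalent to the 95 percent interval excluding 0
significant <- p < 0.05
```

Pulling the numbers from `res` keeps the text and the test output in agreement when the analysis is rerun.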

Is there a correlation between active minutes and minutes asleep? Awake in bed?

ggplot(data=act_sleep_merge, aes(x=total_active_mins, y=TotalMinutesAsleep)) + geom_point(color="blue2")+
  labs(title = "Active Minutes vs. Minutes Asleep", x="Active Minutes", y="Minutes Asleep") +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

x <- act_sleep_merge$total_active_mins
y <- act_sleep_merge$TotalMinutesAsleep
cor.test(x, y, method = c("pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = -1.2953, df = 411, p-value = 0.196
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.15927519  0.03293659
## sample estimates:
##        cor 
## -0.0637606
ggplot(data=act_sleep_merge, aes(x=total_active_mins, y=time_awake_in_bed)) + geom_point(color="blue2")+
  labs(title = "Active Minutes vs. Minutes Awake", x="Active Minutes", y="Minutes Awake") +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

x <- act_sleep_merge$total_active_mins
y <- act_sleep_merge$time_awake_in_bed
cor.test(x, y, method = c("pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = -1.8879, df = 411, p-value = 0.05974
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.187539870  0.003805292
## sample estimates:
##        cor 
## -0.0927233

These visuals and correlation tests revealed a weak, negative correlation between active minutes and time asleep (-0.064) as well as a weak, negative correlation between active minutes and time awake in bed (-0.093). However, neither correlation is statistically significant at the 5% level (p = 0.196 and p = 0.060, and both confidence intervals include 0), so either could be an artifact of the small sample and might even be positive if the same analysis were done on a larger population. These tests and visuals used all active minutes, including both intentional and natural activity, which led to a further inquiry focusing on intentional active minutes from working out and their effect on sleep.

Is there a correlation between intentional active minutes and minutes asleep?

ggplot(data=act_sleep_merge, aes(x=total_intentional_mins, y=TotalMinutesAsleep)) + geom_point(color="blue2")+
  labs(title = "Intentionally Active Minutes vs. Sleep", x="Total Intentional, Highly Active Minutes", y="Minutes Asleep") +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

x <- act_sleep_merge$total_intentional_mins
y <- act_sleep_merge$TotalMinutesAsleep
cor.test(x, y, method = c("pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = -3.7358, df = 411, p-value = 0.0002136
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.27294186 -0.08623358
## sample estimates:
##        cor 
## -0.1812202

This focus on intentional active minutes revealed a weak, negative correlation (-0.181, p < 0.001) between the number of minutes spent being very or fairly active and the number of minutes asleep. Because the correlation is negative, more intentionally active minutes were associated with slightly fewer minutes asleep in this sample, so these data alone do not show that being more active increases time asleep. This negative correlation led to a further inquiry into whether minutes spent being sedentary would have a similar effect on minutes asleep.
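To make the direction of a correlation concrete, the sketch below uses a few invented data points (not taken from the Fitbit data) and shows that a negative coefficient goes together with a downward-sloping fitted line, meaning higher values of one variable accompany lower values of the other:

```r
# Invented values for illustration only
active <- c(10, 20, 30, 40, 50)       # intentionally active minutes
asleep <- c(480, 470, 450, 455, 430)  # minutes asleep

r <- cor(active, asleep)

# Slope of the least-squares line: change in minutes asleep per active minute
slope <- unname(coef(lm(asleep ~ active))[2])

cat(sprintf("r = %.3f, slope = %.2f minutes asleep per active minute\n", r, slope))

# The sign of r always matches the sign of the fitted slope
same_direction <- sign(r) == sign(slope)
```

In this toy example each additional active minute goes with about 1.15 fewer minutes asleep, which is what a negative correlation describes. It does not by itself say which variable influences the other.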

Is there a correlation between sedentary minutes and minutes asleep?

ggplot(data=act_sleep_merge, aes(x=SedentaryMinutes, y=TotalMinutesAsleep)) + geom_point(color="blue2")+
  labs(title = "Sedentary Minutes vs. Minutes Asleep", x="Sedentary Minutes", y="Minutes Asleep") +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

x <- act_sleep_merge$SedentaryMinutes
y <- act_sleep_merge$TotalMinutesAsleep
cor.test(x, y, method = c("pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = -15.181, df = 411, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6578402 -0.5337719
## sample estimates:
##       cor 
## -0.599394

This test and visualization showed that sedentary minutes relate to time asleep in the same direction as intentionally active minutes, but much more strongly. The investigation revealed a strong, negative correlation (-0.599, p < 2.2e-16), so more sitting throughout the day was associated with less time asleep.
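Since several pairs of variables were tested one at a time, a correlation matrix can give the same coefficients in one view. A minimal sketch, assuming simulated columns in place of `act_sleep_merge` (random values; only the column names mirror the ones used above):

```r
# Simulated stand-in for act_sleep_merge (random values, not the real data)
set.seed(1)
n <- 50
toy_merge <- data.frame(
  TotalSteps         = rnorm(n, mean = 8000, sd = 2500),
  SedentaryMinutes   = rnorm(n, mean = 950,  sd = 150),
  TotalMinutesAsleep = rnorm(n, mean = 420,  sd = 60)
)

# Pairwise Pearson correlations; complete.obs drops rows with missing values
cor_mat <- cor(toy_merge, method = "pearson", use = "complete.obs")
print(round(cor_mat, 3))
```

The matrix reports coefficients only, so `cor.test()` is still needed for p-values and confidence intervals on any pair of interest.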

Conclusions

Recommendations

  1. Market sleep quality functions of the watch and habits that increase time asleep.

    1. Advertise the positive effect exercise has on sleep and ways the watch can encourage more intentionally active minutes each day.

    2. Advertise the adverse effect of sedentary time periods and ways the watch can help minimize or break up prolonged sedentary time periods.

    3. Advertise step tracking features as a way to help users meet 10,000 steps throughout the day as a positive way to improve sleep.

  2. Market how the watch helps remind and keep track of metrics during the work day but also transitions to life outside of work.

    1. The stylish design allows it to go from the office to the gym to the ballpark and everywhere in between.

    2. Recommend continual usage, which gives a comprehensive look into users’ habits and trends and supports healthier habits overall.

    3. Promote full usage of functions for the most accurate and least biased insights and suggestions from the watch, including weight information.

    4. Encourage a positive outlook on weight as just one small part of the picture, not the whole picture (e.g., muscle is denser than fat, so the scale can show the same or a higher number when one is actually healthier). Weight is often a sore spot for women, so encouragement is important for a women’s wellness company.