Theresa 2022-08-31
Product: Time - This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.
Stakeholders:
Urška Sršen - Co-Founder and Chief Creative Officer
Sando Mur - Co-Founder and Mathematician
Business Task: Analyze and define smartwatch trends for the new marketing strategy of the Time.
Data Source:
FitBit Fitness Tracker Data, collected through Amazon Mechanical Turk. The study covered 30 Fitbit users between March 12th, 2016 and May 12th, 2016. The data set was made available by Google through Kaggle (https://www.kaggle.com/datasets/arashnic/fitbit).
Data Sets:
5 data sets were analyzed:
Sleep (sleepDay_merged.csv)
Activity (dailyActivity_merged.csv)
Calories (dailyCalories_merged.csv)
Steps (dailySteps_merged.csv)
Weight (weightLogInfo_merged.csv)
Each data set was in CSV format and was imported into RStudio for analysis.
Limitations of Data:
Reliability:
This data is not reliable because the sample of 30 participants is small and may not be representative. A quick search showed that an estimated 19 million smartwatches were sold globally in 2016. For a population of that size, a sample of about 385 users would be needed for a 95% confidence level and a 5% margin of error. With such a small sample, sampling error and bias are likely to be large, making it harder to generalize results from the analysis to the population as a whole and raising the risk of skewed or inaccurate conclusions.
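The figure of 385 can be reproduced with Cochran's sample-size formula plus a finite population correction. This is a minimal sketch, not necessarily the calculator the analysis used: the function name `required_sample_size` is hypothetical, and the proportion p = 0.5 is an assumed, most conservative value.

```r
# Cochran's formula for estimating a proportion, with a finite population correction.
# Assumptions: p = 0.5 (most conservative); the function name is hypothetical.
required_sample_size <- function(population, confidence = 0.95, margin = 0.05, p = 0.5) {
  z  <- qnorm(1 - (1 - confidence) / 2)    # two-sided critical value (about 1.96)
  n0 <- z^2 * p * (1 - p) / margin^2       # sample size for an infinite population
  n  <- n0 / (1 + (n0 - 1) / population)   # finite population correction
  ceiling(n)
}

n_needed <- required_sample_size(population = 19e6)
print(n_needed)  # 385
```

For a population this large the correction barely changes the result, which is why 385 is the usual answer for 95% confidence and a 5% margin of error.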
Original:
This data set is not original as it was collected by Amazon Mechanical Turk.
Comprehensive:
This data set is not very comprehensive: some participants are missing on certain dates, giving an incomplete picture of their data. Demographic information about the participants, such as age, sex, ethnicity, and location, is also missing. This further limits the analysis and rules out any consideration of cultural or biological influences.
Current:
This data set is 6 years out of date. Since 2016, the usage and popularity of smartwatches have increased. Therefore, the health-metric trends measured in this study may no longer be accurate or relevant.
Cited:
This data set was collected and shared through Amazon Mechanical Turk. Its citation is below.
Furberg, R., Brinton, J., Keating, M., & Ortiz, A. (2016). Crowd-sourced Fitbit
datasets 03.12.2016-05.12.2016 [Data set]. Zenodo.
https://doi.org/10.5281/zenodo.53894
The entire analysis process was conducted through RStudio. Therefore, the first step taken was to install and load the necessary packages for this analysis.
install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("ggplot2")
install.packages("janitor")
install.packages("dplyr")
install.packages("rmarkdown")
install.packages("lubridate")
install.packages("tidyr")
install.packages("ggpmisc")
install.packages("here")
install.packages("skimr")
install.packages("stringr")
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggplot2)
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(dplyr)
library(rmarkdown)
library(lubridate)
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(tidyr)
library(ggpmisc)
## Loading required package: ggpp
##
## Attaching package: 'ggpp'
##
## The following object is masked from 'package:ggplot2':
##
## annotate
library(here)
## here() starts at /cloud/project
library(skimr)
library(stringr)
Now that the necessary packages were installed and loaded, the data sets were imported into R. They were given shorter names to make the code simpler and easier to read.
steps <- read.csv("dailySteps_merged.csv")
sleep <- read.csv("sleepDay_merged.csv")
weight <- read.csv("weightLogInfo_merged.csv")
calories <- read.csv("dailyCalories_merged.csv")
activity <- read.csv("dailyActivity_merged.csv")
Steps was the first data set to be cleaned. To confirm the import was accurate, the column names and the first six rows were checked.
colnames(steps)
## [1] "Id" "ActivityDay" "StepTotal"
head(steps)
## Id ActivityDay StepTotal
## 1 1503960366 4/12/2016 13162
## 2 1503960366 4/13/2016 10735
## 3 1503960366 4/14/2016 10460
## 4 1503960366 4/15/2016 9762
## 5 1503960366 4/16/2016 12669
## 6 1503960366 4/17/2016 9705
The ID lengths were checked to ensure every ID was recorded correctly. IDs are 10 characters long, so the number of entries with a length other than 10 was counted. The count was zero, so all IDs were accurate.
id_length_steps <- nchar(steps$Id)
sum(id_length_steps != 10)
## [1] 0
The steps data set was then checked for duplicate rows. Note that head(duplicated(steps)) inspects only the first six rows; none of those were flagged as duplicates.
head(duplicated(steps))
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
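A check that covers every row, rather than the first six, could look like the sketch below. The data frame `toy` and its values are hypothetical and only illustrate the technique; they are not taken from the Fitbit files.

```r
# Toy data frame (hypothetical values) with one exact duplicate in row 8,
# beyond the six rows that head() displays.
toy <- data.frame(
  Id        = c(1, 1, 1, 2, 2, 2, 3, 3),
  StepTotal = c(100, 200, 300, 400, 500, 600, 700, 700)
)

first_six  <- head(duplicated(toy))      # all FALSE: the duplicate sits past row six
n_dupes    <- sum(duplicated(toy))       # counts duplicated rows across the whole table
toy_unique <- toy[!duplicated(toy), ]    # keeps the first copy of each row

print(n_dupes)  # 1
```

Summing the logical vector gives a single count for the whole table, so a result of zero is a stronger guarantee than six FALSE values.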
It was then checked for any observations with zero steps. Taking zero steps in a day is very unlikely, so these records probably reflect a participant forgetting to wear the device or the device running out of charge. Therefore, the 77 observations with zero steps were removed in a new version of the data set, steps2. Creating a new version keeps the original data intact, which is good analysis practice.
nrow(steps[steps$StepTotal == 0, ])
## [1] 77
steps2 <- steps[steps$StepTotal != 0, ]
Next, the columns were renamed and formatted to be consistent across all 5 of the imported data sets. The column “ActivityDay” was reformatted into yyyy-mm-dd (international date format) to make the analysis easier to follow for anyone reviewing it. It was also renamed to “Day” to match the name of this column in the other data sets. The “StepTotal” column was renamed to “TotalSteps” for the same reason.
steps2 <- steps2 %>%
  # %Y parses the four-digit year; %y would read only "20" and return 2020
  mutate(ActivityDay = as.Date(ActivityDay, format = "%m/%d/%Y")) %>%
  rename(Day = ActivityDay) %>%
  rename(TotalSteps = StepTotal)
To ensure all changes were made correctly, the new version was viewed and inspected.
head(steps2)
## Id Day TotalSteps
## 1 1503960366 2016-04-12 13162
## 2 1503960366 2016-04-13 10735
## 3 1503960366 2016-04-14 10460
## 4 1503960366 2016-04-15 9762
## 5 1503960366 2016-04-16 12669
## 6 1503960366 2016-04-17 9705
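The choice between `%y` and `%Y` matters for dates written like 4/12/2016, because the year feeds directly into any later weekday calculation. A minimal sketch of the difference, using only base R (the variable names are illustrative):

```r
raw <- "4/12/2016"

two_digit  <- as.Date(raw, format = "%m/%d/%y")  # %y consumes only "20", so the year becomes 2020
four_digit <- as.Date(raw, format = "%m/%d/%Y")  # %Y consumes the full four-digit year

# as.POSIXlt()$wday is locale-independent: 0 = Sunday, 1 = Monday, ..., 6 = Saturday
wday_two  <- as.POSIXlt(two_digit)$wday   # 0: 2020-04-12 was a Sunday
wday_four <- as.POSIXlt(four_digit)$wday  # 2: 2016-04-12 was a Tuesday

print(c(format(two_digit), format(four_digit)))
```

Because the two years put the same calendar date on different weekdays, a wrong year silently shifts every day-of-week summary built on the column.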
The weight data set was cleaned next. To confirm the import was accurate, the column names and the first six rows were checked.
colnames(weight)
## [1] "Id" "Date" "WeightKg" "WeightPounds"
## [5] "Fat" "BMI" "IsManualReport" "LogId"
head(weight)
## Id Date WeightKg WeightPounds Fat BMI
## 1 1503960366 5/2/2016 11:59:59 PM 52.6 115.9631 22 22.65
## 2 1503960366 5/3/2016 11:59:59 PM 52.6 115.9631 NA 22.65
## 3 1927972279 4/13/2016 1:08:52 AM 133.5 294.3171 NA 47.54
## 4 2873212765 4/21/2016 11:59:59 PM 56.7 125.0021 NA 21.45
## 5 2873212765 5/12/2016 11:59:59 PM 57.3 126.3249 NA 21.69
## 6 4319703577 4/17/2016 11:59:59 PM 72.4 159.6147 25 27.45
## IsManualReport LogId
## 1 True 1.462234e+12
## 2 True 1.462320e+12
## 3 False 1.460510e+12
## 4 True 1.461283e+12
## 5 True 1.463098e+12
## 6 True 1.460938e+12
The ID lengths were checked in the same way: the number of IDs with a length other than 10 characters was counted, and the count was zero.
id_length_weight <- nchar(weight$Id)
sum(id_length_weight != 10)
## [1] 0
A check for duplicate rows was performed. None of the first six rows, which is all that head(duplicated(weight)) inspects, were flagged.
head(duplicated(weight))
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
Observations with a weight of zero were checked for in both the kilograms and pounds columns. It is impossible to weigh 0 kg or 0 lb, so any such observations would need to be removed. No observations with zero were found in either column.
nrow(weight[weight$WeightPounds == 0, ])
## [1] 0
nrow(weight[weight$WeightKg == 0, ])
## [1] 0
Next, the column “Date” was converted to international date format and renamed “Day” to match the other data sets.
weight2 <- weight %>%
  # %Y parses the four-digit year; %y would read only "20" and return 2020
  mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
  rename(Day = Date)
The changes were then checked for accuracy.
head(weight2)
## Id Day WeightKg WeightPounds Fat BMI IsManualReport
## 1 1503960366 2016-05-02 52.6 115.9631 22 22.65 True
## 2 1503960366 2016-05-03 52.6 115.9631 NA 22.65 True
## 3 1927972279 2016-04-13 133.5 294.3171 NA 47.54 False
## 4 2873212765 2016-04-21 56.7 125.0021 NA 21.45 True
## 5 2873212765 2016-05-12 57.3 126.3249 NA 21.69 True
## 6 4319703577 2016-04-17 72.4 159.6147 25 27.45 True
## LogId
## 1 1.462234e+12
## 2 1.462320e+12
## 3 1.460510e+12
## 4 1.461283e+12
## 5 1.463098e+12
## 6 1.460938e+12
The activity data set was cleaned next. Import accuracy was verified by looking at the column names and first six rows.
colnames(activity)
## [1] "Id" "ActivityDate"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories"
head(activity)
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 4/12/2016 13162 8.50 8.50
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 1.88 0.55
## 2 0 1.57 0.69
## 3 0 2.44 0.40
## 4 0 2.14 1.26
## 5 0 2.71 0.41
## 6 0 3.19 0.78
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 6.06 0 25
## 2 4.71 0 21
## 3 3.91 0 30
## 4 2.83 0 29
## 5 5.04 0 36
## 6 2.51 0 38
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 13 328 728 1985
## 2 19 217 776 1797
## 3 11 181 1218 1776
## 4 34 209 726 1745
## 5 10 221 773 1863
## 6 20 164 539 1728
The ID length check was run and returned zero observations with lengths other than 10 characters.
id_length_activity <- nchar(activity$Id)
sum(id_length_activity != 10)
## [1] 0
Duplicates were checked for in the first six rows and none were found.
head(duplicated(activity))
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
It is very unlikely that someone takes zero steps in a day, so the data set was checked for observations with zero steps. 77 entries were found and removed in a new version of the activity data set, activity2. The original data set was kept, in line with good data analysis practice.
nrow(activity[activity$TotalSteps == 0, ])
## [1] 77
activity2 <- activity[activity$TotalSteps != 0, ]
Next, the new version’s columns were renamed and formatted to be consistent with the other data sets.
activity2 <- activity2 %>%
  # %Y parses the four-digit year; %y would read only "20" and return 2020
  mutate(ActivityDate = as.Date(ActivityDate, format = "%m/%d/%Y")) %>%
  rename(Day = ActivityDate)
The changes were then checked for accuracy.
head(activity2)
## Id Day TotalSteps TotalDistance TrackerDistance
## 1 1503960366 2016-04-12 13162 8.50 8.50
## 2 1503960366 2016-04-13 10735 6.97 6.97
## 3 1503960366 2016-04-14 10460 6.74 6.74
## 4 1503960366 2016-04-15 9762 6.28 6.28
## 5 1503960366 2016-04-16 12669 8.16 8.16
## 6 1503960366 2016-04-17 9705 6.48 6.48
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 1.88 0.55
## 2 0 1.57 0.69
## 3 0 2.44 0.40
## 4 0 2.14 1.26
## 5 0 2.71 0.41
## 6 0 3.19 0.78
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 6.06 0 25
## 2 4.71 0 21
## 3 3.91 0 30
## 4 2.83 0 29
## 5 5.04 0 36
## 6 2.51 0 38
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 13 328 728 1985
## 2 19 217 776 1797
## 3 11 181 1218 1776
## 4 34 209 726 1745
## 5 10 221 773 1863
## 6 20 164 539 1728
The calories data set was cleaned next, starting with the accuracy of the import.
colnames(calories)
## [1] "Id" "ActivityDay" "Calories"
head(calories)
## Id ActivityDay Calories
## 1 1503960366 4/12/2016 1985
## 2 1503960366 4/13/2016 1797
## 3 1503960366 4/14/2016 1776
## 4 1503960366 4/15/2016 1745
## 5 1503960366 4/16/2016 1863
## 6 1503960366 4/17/2016 1728
The accuracy of the ID lengths was confirmed.
id_length_calories <- nchar(calories$Id)
sum(id_length_calories != 10)
## [1] 0
Duplicates were checked for in the first six rows and none were found.
head(duplicated(calories))
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
Zero calories burned is unlikely, as the body burns some calories throughout the day even without exercise. Therefore, the 4 entries with zero calories were removed in a new version of the data set, calories2.
nrow(calories[calories$Calories == 0, ])
## [1] 4
calories2 <- calories[calories$Calories != 0, ]
The columns were reformatted and renamed to match the columns in the other data sets.
calories2 <- calories2 %>%
  # %Y parses the four-digit year; %y would read only "20" and return 2020
  mutate(ActivityDay = as.Date(ActivityDay, format = "%m/%d/%Y")) %>%
  rename(Day = ActivityDay)
The accuracy of all changes was checked.
head(calories2)
## Id Day Calories
## 1 1503960366 2016-04-12 1985
## 2 1503960366 2016-04-13 1797
## 3 1503960366 2016-04-14 1776
## 4 1503960366 2016-04-15 1745
## 5 1503960366 2016-04-16 1863
## 6 1503960366 2016-04-17 1728
Next, the accuracy of the sleep data set import was checked.
colnames(sleep)
## [1] "Id" "SleepDay" "TotalSleepRecords"
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
head(sleep)
## Id SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM 1 327
## 2 1503960366 4/13/2016 12:00:00 AM 2 384
## 3 1503960366 4/15/2016 12:00:00 AM 1 412
## 4 1503960366 4/16/2016 12:00:00 AM 2 340
## 5 1503960366 4/17/2016 12:00:00 AM 1 700
## 6 1503960366 4/19/2016 12:00:00 AM 1 304
## TotalTimeInBed
## 1 346
## 2 407
## 3 442
## 4 367
## 5 712
## 6 320
All ID lengths were confirmed to be 10 characters long.
id_length_sleep <- nchar(sleep$Id)
sum(id_length_sleep != 10)
## [1] 0
The first six rows of this data set were checked for duplicates. None were found.
head(duplicated(sleep))
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
It is very unlikely that someone has zero minutes asleep, zero minutes in bed, or zero sleep records in a day. Therefore, all 3 of those columns were searched for zeros. None were found.
nrow(sleep[sleep$TotalMinutesAsleep == 0, ])
## [1] 0
nrow(sleep[sleep$TotalSleepRecords == 0, ])
## [1] 0
nrow(sleep[sleep$TotalTimeInBed == 0, ])
## [1] 0
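The three zero checks can also be run in one pass. The sketch below uses a small hypothetical data frame, `toy_sleep`, with one zero planted on purpose so the count is visible; it is an illustration of the idea, not a rerun of the real check.

```r
# Hypothetical sleep records; the zero in TotalMinutesAsleep is planted for illustration
toy_sleep <- data.frame(
  TotalSleepRecords  = c(1, 2, 1),
  TotalMinutesAsleep = c(327, 0, 412),
  TotalTimeInBed     = c(346, 407, 442)
)

# Comparing the whole data frame to 0 gives a logical matrix;
# colSums() then counts the zeros in each column.
zero_counts <- colSums(toy_sleep == 0)
print(zero_counts)
```

A named vector of counts makes it easy to see at a glance which column, if any, needs attention.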
Next, the columns were reformatted and renamed to align with the other data sets.
sleep2 <- sleep %>%
  # %Y parses the four-digit year; %y would read only "20" and return 2020
  mutate(SleepDay = as.Date(SleepDay, format = "%m/%d/%Y")) %>%
  rename(Day = SleepDay)
The accuracy of all changes was checked.
head(sleep2)
## Id Day TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1 1503960366 2016-04-12 1 327 346
## 2 1503960366 2016-04-13 2 384 407
## 3 1503960366 2016-04-15 1 412 442
## 4 1503960366 2016-04-16 2 340 367
## 5 1503960366 2016-04-17 1 700 712
## 6 1503960366 2016-04-19 1 304 320
The final step in the cleaning process was checking that each data set covered all, or at least 75%, of the participants in the study. The weight2 data set contained only 8 distinct participants and was therefore removed from further analysis. Note that the data contain 33 distinct IDs rather than the 30 participants described in the data source; sleep2 covers 24 of them, which is about 73% of 33 or 80% of 30. The remaining data sets were used in the rest of the analysis.
n_distinct(activity2$Id)
## [1] 33
n_distinct(steps2$Id)
## [1] 33
n_distinct(sleep2$Id)
## [1] 24
n_distinct(calories2$Id)
## [1] 33
n_distinct(weight2$Id)
## [1] 8
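The coverage threshold can be made explicit by turning the counts into percentages. This sketch copies the five distinct-ID counts from the output above into a vector; using the largest count (33) as the denominator is an assumption made for illustration.

```r
# Distinct-ID counts copied from the n_distinct() output above
ids <- c(activity2 = 33, steps2 = 33, sleep2 = 24, calories2 = 33, weight2 = 8)

# Share of the 33 observed IDs covered by each data set, in percent
coverage <- round(100 * ids / max(ids), 1)
print(coverage)

# Data sets falling below a 75% threshold
below_threshold <- names(coverage)[coverage < 75]
print(below_threshold)
```

Measured against 33 IDs, both sleep2 and weight2 fall under 75%, whereas against the 30 participants in the study description sleep2 would reach 80%.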
The summary function was run on the data sets to gain a general overview of the data.
summary(activity2 %>%
  select(-Id, -Day, -TrackerDistance, -LoggedActivitiesDistance))
## TotalSteps TotalDistance VeryActiveDistance ModeratelyActiveDistance
## Min. : 4 Min. : 0.00 Min. : 0.000 Min. :0.0000
## 1st Qu.: 4923 1st Qu.: 3.37 1st Qu.: 0.000 1st Qu.:0.0000
## Median : 8053 Median : 5.59 Median : 0.410 Median :0.3100
## Mean : 8319 Mean : 5.98 Mean : 1.637 Mean :0.6182
## 3rd Qu.:11092 3rd Qu.: 7.90 3rd Qu.: 2.275 3rd Qu.:0.8650
## Max. :36019 Max. :28.03 Max. :21.920 Max. :6.4800
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## Min. : 0.000 Min. :0.00000 Min. : 0.00
## 1st Qu.: 2.345 1st Qu.:0.00000 1st Qu.: 0.00
## Median : 3.580 Median :0.00000 Median : 7.00
## Mean : 3.639 Mean :0.00175 Mean : 23.02
## 3rd Qu.: 4.895 3rd Qu.:0.00000 3rd Qu.: 35.00
## Max. :10.710 Max. :0.11000 Max. :210.00
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 52
## 1st Qu.: 0.00 1st Qu.:146.5 1st Qu.: 721.5 1st Qu.:1856
## Median : 8.00 Median :208.0 Median :1021.0 Median :2220
## Mean : 14.78 Mean :210.0 Mean : 955.8 Mean :2361
## 3rd Qu.: 21.00 3rd Qu.:272.0 3rd Qu.:1189.0 3rd Qu.:2832
## Max. :143.00 Max. :518.0 Max. :1440.0 Max. :4900
summary(steps2)
## Id Day TotalSteps
## Min. :1.504e+09 Min. :2016-04-12 Min. : 4
## 1st Qu.:2.320e+09 1st Qu.:2016-04-18 1st Qu.: 4923
## Median :4.445e+09 Median :2016-04-26 Median : 8053
## Mean :4.858e+09 Mean :2016-04-26 Mean : 8319
## 3rd Qu.:6.962e+09 3rd Qu.:2016-05-03 3rd Qu.:11092
## Max. :8.878e+09 Max. :2016-05-12 Max. :36019
summary(sleep2)
## Id Day TotalSleepRecords TotalMinutesAsleep
## Min. :1.504e+09 Min. :2016-04-12 Min. :1.000 Min. : 58.0
## 1st Qu.:3.977e+09 1st Qu.:2016-04-19 1st Qu.:1.000 1st Qu.:361.0
## Median :4.703e+09 Median :2016-04-27 Median :1.000 Median :433.0
## Mean :5.001e+09 Mean :2016-04-26 Mean :1.119 Mean :419.5
## 3rd Qu.:6.962e+09 3rd Qu.:2016-05-04 3rd Qu.:1.000 3rd Qu.:490.0
## Max. :8.792e+09 Max. :2016-05-12 Max. :3.000 Max. :796.0
## TotalTimeInBed
## Min. : 61.0
## 1st Qu.:403.0
## Median :463.0
## Mean :458.6
## 3rd Qu.:526.0
## Max. :961.0
summary(calories2)
## Id Day Calories
## Min. :1.504e+09 Min. :2016-04-12 Min. : 52
## 1st Qu.:2.320e+09 1st Qu.:2016-04-19 1st Qu.:1834
## Median :4.445e+09 Median :2016-04-26 Median :2144
## Mean :4.850e+09 Mean :2016-04-26 Mean :2313
## 3rd Qu.:6.962e+09 3rd Qu.:2016-05-04 3rd Qu.:2794
## Max. :8.878e+09 Max. :2016-05-12 Max. :4900
When exploring the data, the need for two new variables in the activity data set became apparent. Total active minutes gives a good overview of participants’ activity throughout the day because it includes everything from walking around the office or house to working out. A second variable, intentional exercise, was needed to separate working out (exercise that raises the heart rate substantially) from activity that happens naturally throughout the day. It was created by combining the very active and fairly active minute variables.
activity2 <- activity2 %>%
  mutate(total_active_mins = LightlyActiveMinutes + FairlyActiveMinutes + VeryActiveMinutes) %>%
  mutate(total_intentional_mins = FairlyActiveMinutes + VeryActiveMinutes)
When exploring the sleep data set, it became apparent that a new variable was needed here as well. The difference between “time in bed” and “time asleep” is the time participants spent trying to fall asleep or lying in bed before getting up. This difference could reveal correlations with other variables, so “time awake in bed” was created.
sleep2 <- sleep2 %>%
  mutate(time_awake_in_bed = TotalTimeInBed - TotalMinutesAsleep)
An outlier test was run to determine whether the median or the mean would be a better measure of the data. Because of their many outliers, the median was judged the better measure for “Total Steps”, “Sedentary Minutes”, “Time Asleep”, “Time in Bed”, and “Time in Bed Awake”. The mean is sufficient for all other variables, which have few outliers.
boxplot.stats(sleep2$TotalMinutesAsleep)$out
## [1] 700 119 124 796 137 722 750 166 61 152 77 59 692 99 82 62 98 106 126
## [20] 103 115 123 775 74 79 58 74
boxplot.stats(sleep2$TotalTimeInBed)$out
## [1] 712 127 142 961 154 961 961 961 775 178 69 77 65 722 104 85 65 107 108
## [20] 137 121 179 129 134 725 843 78 82 61 75
boxplot.stats(sleep2$time_awake_in_bed)$out
## [1] 165 317 239 371 195 161 106 132 227 185 110 176 153 180 121 137 216 205 121
## [20] 191 162 145 145 243 154 208 123 162 189 206 197 140 76 94 87 78 87
boxplot.stats(activity2$Calories)$out
## [1] 52 257 4552 4392 4501 4546 4900 4547 4398
boxplot.stats(activity2$total_active_mins)$out
## [1] 540 552
boxplot.stats(activity2$SedentaryMinutes)$out
## [1] 2 13 0
boxplot.stats(activity2$TotalDistance)$out
## [1] 28.03 15.08 17.54 15.01 15.97 16.24 17.19 17.95 15.69 15.67 17.65 20.40
## [13] 18.98 25.29 17.40 18.11 17.62 16.31 15.74 20.65 26.72 16.30 19.34 18.25
## [25] 19.56
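For readers unfamiliar with `boxplot.stats()`, it flags values lying more than 1.5 times the hinge spread beyond the lower or upper hinge. The sketch below uses a hypothetical vector of sleep minutes to show the rule, and why a single extreme value pulls the mean more than the median.

```r
# Hypothetical minutes-asleep values with one extreme night (900)
x <- c(300, 350, 400, 420, 430, 440, 450, 480, 500, 900)

hinges <- fivenum(x)[c(2, 4)]                 # lower and upper hinges: 400 and 480
fence  <- 1.5 * diff(hinges)                  # 1.5 * 80 = 120
limits <- c(hinges[1] - fence, hinges[2] + fence)   # values outside 280..600 are flagged

out <- boxplot.stats(x)$out                   # values beyond the fences
print(out)                                    # 900
print(c(mean = mean(x), median = median(x)))  # mean 467, median 435
```

Dropping the 900 would move the mean from 467 to about 419 while the median only moves from 435 to 430, which is the reasoning behind preferring the median for outlier-heavy variables.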
activity2$week_day <- weekdays(activity2$Day)
activity_by_day <- aggregate(activity2$TotalSteps, list(activity2$week_day), mean)
ggplot(data=activity_by_day, aes(x=Group.1, y=x)) +
geom_bar(stat="identity", color = "black", fill="cadetblue2") +
scale_x_discrete(limits=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')) +
labs(title="Number of Steps by Day of the Week", x="Day of the Week", y = "Number of Steps")
The number of steps did not fluctuate much throughout the week. This finding prompted a check of whether other variables fluctuated or stayed fairly consistent throughout the week.
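Since the outlier check favored the median for Total Steps, one variation on the weekday aggregation is to summarize with `median` and to store the weekday as an ordered factor, so the days sort Monday to Sunday without listing them in the plot. This is a sketch on hypothetical data, not a rerun of the analysis; the data frame `toy` and its values are invented.

```r
# Toy daily step data (hypothetical values) covering two weeks in April 2016
toy <- data.frame(
  Day = as.Date("2016-04-11") + 0:13,   # 2016-04-11 was a Monday
  TotalSteps = c(8000, 9000, 7000, 8500, 6000, 12000, 4000,
                 8200, 9100, 7300, 8700, 6400, 30000, 4200)
)

# as.POSIXlt()$wday is locale-independent: 0 = Sunday, ..., 6 = Saturday
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
toy$week_day <- factor(
  day_names[as.POSIXlt(toy$Day)$wday + 1],
  levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")
)

# The median is less affected by the single 30000-step Saturday than the mean
steps_by_day <- aggregate(TotalSteps ~ week_day, data = toy, FUN = median)
print(steps_by_day)
```

With the factor levels set once, the same `week_day` column orders correctly in every chart and table that uses it.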
activity2$week_day <- weekdays(activity2$Day)
activity_by_day_mins <- aggregate(activity2$total_active_mins, list(activity2$week_day), mean)
ggplot(data=activity_by_day_mins, aes(x=Group.1, y=x)) +
geom_bar(stat="identity", color = "black", fill="cadetblue2") +
scale_x_discrete(limits=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')) +
labs(title="Number of Active Minutes by Day of The Week", x="Day of the Week", y = "Number of Active Minutes")
There is a slight fluctuation throughout the week, with Friday being the day with the fewest active minutes. This could possibly be due to being tired after a work week. This chart does not isolate intentional exercise, so the new variable was used next to see when people intentionally work out the most.
activity2$week_day <- weekdays(activity2$Day)
activity_by_day_inten <- aggregate(activity2$total_intentional_mins, list(activity2$week_day), mean)
ggplot(data=activity_by_day_inten, aes(x=Group.1, y=x)) +
geom_bar(stat="identity", color = "black", fill="cadetblue2") +
scale_x_discrete(limits=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')) +
labs(title="Number of Intentional Active Minutes vs. Day", x="Day of the Week", y = "Intentional Active Minutes")
The number of intentional active minutes was relatively low. Going back to the summary of activity, sedentary minutes make up the largest share of the day by far (a mean of 955.8 minutes, or about 16 hours, compared to about 40 minutes of intentional exercise and about 4 hours of active minutes altogether). Therefore, the next step was to see whether sedentary minutes fluctuated throughout the week.
activity2$week_day <- weekdays(activity2$Day)
activity_by_day_sed <- aggregate(activity2$SedentaryMinutes, list(activity2$week_day), mean)
ggplot(data=activity_by_day_sed, aes(x=Group.1, y=x)) +
geom_bar(stat="identity", color = "black", fill="cadetblue2") +
scale_x_discrete(limits=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')) +
labs(title="Number of Sedentary Minutes by Day of The Week", x="Day of the Week", y = "Number of Sedentary Minutes")
It is not surprising that the weekdays have many sedentary minutes, since those are workdays for most people and many jobs are done at a desk. However, the weekend sedentary minutes are a little surprising, since most people are not at work on those days and have the opportunity to be more active.
This trend in sedentary minutes led to an inquiry about minutes asleep and minutes in bed, and whether they fluctuate throughout the week as well.
sleep2$week_day <- weekdays(sleep2$Day)
sleep_by_day <- aggregate(sleep2$TotalMinutesAsleep, list(sleep2$week_day), mean)
ggplot(data=sleep_by_day, aes(x=Group.1, y=x)) +
geom_bar(stat="identity", color = "black", fill="cadetblue2") +
scale_x_discrete(limits=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')) +
labs(title="Minutes Asleep by Day of the Week", x="Day of the Week", y = "Number of Minutes Asleep")
sleep2$week_day <- weekdays(sleep2$Day)
sleep_by_day <- aggregate(sleep2$time_awake_in_bed, list(sleep2$week_day), mean)
ggplot(data=sleep_by_day, aes(x=Group.1, y=x)) +
geom_bar(stat="identity", color = "black", fill="cadetblue2") +
scale_x_discrete(limits=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')) +
labs(title="Minutes Awake in Bed by Day of the Week", x="Day of the Week", y = "Number of Minutes Awake in Bed")
There were some noticeable differences in the variables across the days of the week. This led to an inquiry into the relationships among variables and whether any were correlated.
act_sleep_merge <- merge(activity2, sleep2, by.x=c('Id', 'Day'),
by.y=c('Id', 'Day'))
ggplot(data=act_sleep_merge, aes(x=TotalSteps, y=time_awake_in_bed)) + geom_point(color="blue2")+
labs(title = "Step Count vs. Minutes Awake in Bed", x="Total Steps a Day", y="Minutes Awake") +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
x <- act_sleep_merge$TotalSteps
y <- act_sleep_merge$time_awake_in_bed
cor.test(x, y, method = c("pearson"))
##
## Pearson's product-moment correlation
##
## data: x and y
## t = 0.54976, df = 411, p-value = 0.5828
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.06956893 0.12327966
## sample estimates:
## cor
## 0.02710758
ggplot(data=act_sleep_merge, aes(x=TotalMinutesAsleep, y=TotalSteps)) + geom_point(color="blue2")+
labs(title = "Step Count vs. Sleep", x="Minutes Asleep", y="Total Steps a Day") +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
x <- act_sleep_merge$TotalMinutesAsleep
y <- act_sleep_merge$TotalSteps
cor.test(x, y, method = c("pearson"))
##
## Pearson's product-moment correlation
##
## data: x and y
## t = -3.8563, df = 411, p-value = 0.0001336
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.27834209 -0.09203143
## sample estimates:
## cor
## -0.1868665
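The numbers in a `cor.test()` report are tied together by a simple formula, which helps when judging how much weight a correlation can bear. The sketch below recomputes the test statistic and p-value from the reported correlation and degrees of freedom for minutes asleep versus total steps; the variable names are illustrative.

```r
r  <- -0.1868665   # reported correlation between minutes asleep and total steps
df <- 411          # reported degrees of freedom (number of pairs minus 2)

t_stat    <- r * sqrt(df) / sqrt(1 - r^2)   # test statistic for H0: correlation = 0
p_value   <- 2 * pt(-abs(t_stat), df)       # two-sided p-value
r_squared <- r^2                            # share of variance a straight line explains

print(round(c(t = t_stat, p = p_value, r_squared = r_squared), 4))
```

The result is statistically significant, yet the squared correlation is only about 0.035, so a linear relationship accounts for roughly 3.5% of the variation in step counts.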
The tests showed a very weak positive correlation (0.027, p = 0.58) between step count and minutes awake in bed, which is not statistically significant, and a weak negative correlation (-0.187, p < 0.001) between step count and time asleep. These weak correlations could be due to the small sample. However, they led to another question about active minutes and their effect on minutes asleep.
ggplot(data=act_sleep_merge, aes(x=total_active_mins, y=TotalMinutesAsleep)) + geom_point(color="blue2")+
labs(title = "Active Minutes vs. Minutes Asleep", x="Active Minutes", y="Minutes Asleep") +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
x <- act_sleep_merge$total_active_mins
y <- act_sleep_merge$TotalMinutesAsleep
cor.test(x, y, method = c("pearson"))
##
## Pearson's product-moment correlation
##
## data: x and y
## t = -1.2953, df = 411, p-value = 0.196
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.15927519 0.03293659
## sample estimates:
## cor
## -0.0637606
ggplot(data=act_sleep_merge, aes(x=total_active_mins, y=time_awake_in_bed)) + geom_point(color="blue2")+
labs(title = "Active Minutes vs. Minutes Awake", x="Active Minutes", y="Minutes Awake") +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
x <- act_sleep_merge$total_active_mins
y <- act_sleep_merge$time_awake_in_bed
cor.test(x, y, method = c("pearson"))
##
## Pearson's product-moment correlation
##
## data: x and y
## t = -1.8879, df = 411, p-value = 0.05974
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.187539870 0.003805292
## sample estimates:
## cor
## -0.0927233
These visuals and correlation tests revealed a weak negative correlation between active minutes and time asleep (-0.064, p = 0.196) as well as a weak negative correlation between active minutes and time awake in bed (-0.093, p = 0.060). Neither is statistically significant at the 0.05 level, and both confidence intervals include zero, so the true correlations could be zero or even positive in a larger sample. These tests and visuals used all active minutes, including both intentional and natural activity, which led to a further inquiry focused on intentional active minutes spent working out and their effect on sleep.
ggplot(data=act_sleep_merge, aes(x=total_intentional_mins, y=TotalMinutesAsleep)) + geom_point(color="blue2")+
labs(title = "Intentionally Active Minutes vs. Sleep", x="Total Intentional, Highly Active Minutes", y="Minutes Asleep") +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
x <- act_sleep_merge$total_intentional_mins
y <- act_sleep_merge$TotalMinutesAsleep
cor.test(x, y, method = c("pearson"))
##
## Pearson's product-moment correlation
##
## data: x and y
## t = -3.7358, df = 411, p-value = 0.0002136
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.27294186 -0.08623358
## sample estimates:
## cor
## -0.1812202
This focus on intentional active minutes revealed a weak negative correlation (-0.1812, p < 0.001) between the number of minutes spent being very or fairly active and the number of minutes asleep. Because the correlation is negative, more intentional activity was associated with slightly fewer minutes asleep in this sample, not more, and a correlation alone cannot show that one causes the other. This result led to a further inquiry about minutes spent being sedentary and whether they have a similar relationship with minutes asleep.
ggplot(data=act_sleep_merge, aes(x=SedentaryMinutes, y=TotalMinutesAsleep)) + geom_point(color="blue2")+
labs(title = "Sedentary Minutes vs. Minutes Asleep", x="Sedentary Minutes", y="Minutes Asleep") +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
x <- act_sleep_merge$SedentaryMinutes
y <- act_sleep_merge$TotalMinutesAsleep
cor.test(x, y, method = c("pearson"))
##
## Pearson's product-moment correlation
##
## data: x and y
## t = -15.181, df = 411, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.6578402 -0.5337719
## sample estimates:
## cor
## -0.599394
This test and visualization showed a moderately strong negative correlation (-0.5994, p < 0.001), in the same direction as the one for intentionally active minutes but much stronger. In this sample, more sedentary time throughout the day was associated with less time asleep.
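In this sample, more activity throughout the day was not associated with more minutes asleep: the correlations were weak and negative.
Total active minutes showed almost no relationship with minutes asleep (-0.064).
Intentional and higher activity levels (i.e. working out or going on a brisk walk) showed a stronger, though still weak, relationship with minutes asleep (-0.181) than all activity combined, which includes naturally occurring activity (i.e. walking around the house or office).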
50% of participants did only 26 minutes or less of higher-intensity activity, and 25% did none.
Being sedentary has an adverse effect on sleep.
Average sedentary time was 955.8 minutes (~16 hours)
Important to be more active than sedentary throughout the day.
Breaking up the prolonged sedentary periods can help minimize effects.
In 50% of observations, participants slept 433 minutes (~7.2 hours) or less, close to the 7-hour minimum recommended by the CDC.
In 25% of observations, participants slept 361 minutes (~6 hours) or less.
Median time lying awake in bed was 25 minutes but 75% of participants experienced up to 40 minutes lying awake
The lack of weight data may suggest that participants are reluctant to record weight as a health metric. (Only 8 of the 33 participant IDs, about 24%, reported weight metrics.)
Market sleep quality functions of the watch and habits that increase time asleep.
Advertise the positive effect exercise has on sleep and ways the watch can encourage more intentionally active minutes each day.
Advertise the adverse effect of sedentary time periods and ways the watch can help minimize or break up prolonged sedentary time periods.
Advertise step tracking features as a way to help users meet 10,000 steps throughout the day as a positive way to improve sleep.
Market how the watch helps remind and keep track of metrics during the work day but also transitions to life outside of work.
Stylish design allows for it to go from the office to the gym to the ballpark and everywhere in between.
Recommend continual usage, which gives a comprehensive look into the user’s habits and trends and supports healthier habits overall.
Promote full usage of functions for the most accurate and least biased insights and suggestions from the watch, including weight information.
Encourage a positive outlook on weight as just one small part of the picture, not the whole picture (i.e. muscle weighs more than fat, so the scale can show the same or a higher number when one is actually healthier). Weight is often a sore spot for women, so offering encouragement is important for a women’s wellness company.