Bellabeat Health Technology Case Study- Hope Owens, Jan 1 2022

Project Background

This is an example case study to demonstrate my knowledge and skills in statistical science. With a master’s degree in biology, professional research experience, and experience in R, SAS, SQL, BigQuery, and Python, I approach complex systems with a scientific approach to analysis. In the goal of working as a junior data analyst, I show my skills in this capstone relating to Bellabeat smart device use.

Summary

I am a (pretend) junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. I have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices to help guide marketing strategy for the company.

Characters:

Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team
Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy.

Products:

Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.
Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.
Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.
Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.

The project is divided into three questions to guide marketing analysis:

What are some trends in smart device usage?
How could these trends apply to Bellabeat customers?
How could these trends help influence Bellabeat marketing strategy?

Data Analysis:

Statement of purpose: I am going to analyze open source fitness data to understand health variables in a subset of smart device users. The relationships between different health metrics and product use can give a lifestyle context for the marketing of Bellabeat smart devices, and can show areas of growth for Bellabeat. I am going to understand what factors of health can be most supported by Bellabeat Leaf, Time and application products, including sleep quality, falling asleep time, activity level, weight, BMI, fat levels, distance traveled, steps walked, as well as frequency of application use.

Stage 1: Ask

“The scientist is not a person who gives the right answers, he is one who asks the right questions.”- Claude Lévi-Strauss

Our analysis needs to be specifically designed to address our business objective. Data was found through CC0: Public Domain data, made available through Mobius. This dataset generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.

This data source consists of 18 CSV documents that show different analytical perspectives about user health. Each document represents different quantitative data tracked by Fitbit. With each row being one time point per subject and data being tracked by day and time, these datasets are considered long.

List of datasets

Table Name	Type	Description
dailyActivity_merged	Microsoft Excel CSV	id (user id number), ActivityDate (activity date and time in datetime), TotalSteps (total number of steps taken), TotalDistance (total distance traveled Km), LoggedActivitiesDistance (logged distance traveled Km), VeryActiveDistance (distance traveled- active state), ModeratelyActiveDistance (distance traveled- moderate state), LightActiveDistance (distance traveled- light state), SedentaryActiveDistance (distance traveled- sedentary state), VeryActiveMinutes (minutes active- active state), FairlyActiveMinutes (minutes active- moderate state), LightlyActiveMinutes (minutes active- light state), SedentaryMinutes (minutes active- sedentary state), Calories (calories burned)
dailyCalories_merged	Microsoft Excel CSV	Id (user id number), ActivityDate (activity date and time in datetime), Calories (calories burned)
dailyIntensities_merged	Microsoft Excel CSV	id (user id number), ActivityDate (activity date and time in datetime), SedentaryMinutes(minutes active- sedentary state), LightlyActiveMinutes (minutes active- light state), FairlyActiveMinutes (minutes active- moderate state), VeryActiveMinutes (minutes active- active state), SedentaryActiveDistance (distance traveled- sedentary state), LightActiveDistance (distance traveled- light state), ModeratelyActiveDistance (distance traveled- moderate state), VeryActiveDistance (distance traveled- active state)
dailySteps_merged	Microsoft Excel CSV	Id (user id number), ActivityDate (activity date and time in datetime), StepTotal (total number of steps taken)
heartrate_seconds_merged	Microsoft Excel CSV	Id (user id number), Time (activity date and time in datetime), Value (heartrate)
hourlyCalories_merged	Microsoft Excel CSV	Id (user id number), ActivityHour (activity date and time in datetime), Calories (calories burned)
hourlyIntensities_merged	Microsoft Excel CSV	Id (user id number), ActivityHour (activity date and time in datetime), TotalIntensity (total intensity level), AverageIntensity (average intensity level)
hourlySteps_merged	Microsoft Excel CSV	Id (user id number), ActivityHour (activity date and time in datetime), StepTotal (total number of steps taken)
minuteCaloriesNarrow_merged	Microsoft Excel CSV	Id (user id number), ActivityMinute (activity date and time in datetime), Calories (calories burned)
minuteCaloriesWide_merged	Microsoft Excel CSV	Id (user id number), ActivityHour (activity date and time in datetime), Calories00 (calories burned minute 00), Calories01 (calories burned minute 01), Calories02 (calories burned minute 02)… Calories59 (calories burned minute 59)
minuteIntensitiesNarrow_merged	Microsoft Excel CSV	Id (user id number), ActivityMinute (activity date and time in datetime), Intensity (intensity level)
minuteIntensitiesWide_merged	Microsoft Excel CSV	Id (user id number), ActivityHour (activity date and time in datetime), Intensity00 (intensity level minute 00), Intensity01 (intensity level minute 01), Intensity02 (intensity level minute 02)… Intensity59 (intensity level minute 59)
minuteMETsNarrow_merged	Microsoft Excel CSV	Id (user id number), ActivityMinute (activity date and time in datetime), METs (metabolic equivalent of task)
minuteSleep_merged	Microsoft Excel CSV	Id (user id number), date(activity date and time in datetime), value (time of sleep, hours), logId (data log id)
minuteStepsNarrow_merged	Microsoft Excel CSV	Id (user id number), ActivityMinute (activity date and time in datetime), Steps (total number of steps taken)
minuteStepsWide_merged	Microsoft Excel CSV	Id (user id number), ActivityHour (activity date and time in datetime), Steps00 (total number of steps taken minute 00), Steps01 (total number of steps taken minute 01), Steps02 (total number of steps taken minute 02)… Steps59 (total number of steps taken minute 59)
sleepDay_merged	Microsoft Excel CSV	Id (user id number), SleepDay(activity date and time in datetime), TotalSleepRecords (time of sleep, hours), TotalMinutesAsleep (time of sleep, minutes), TotalTimeInBed (time in bed, hours)
weightLogInfo_merged	Microsoft Excel CSV	Id (user id number), Date(activity date and time in datetime), WeightKg (weight measured in kg), WeightPounds (weight measured in lbs), Fat (percet fat), BMI (body mass index), IsManualReport (weight entered manually), LogId (data log id)

Task 1: Choose data- For this business objective, I chose to use the following datasets: dailyActivity_merged, dailyCalories_merged, dailyIntensities_merged, dailySteps_merged, hourlyCalories_merged, hourlyIntensities_merged, sleepDay_merged, and weightLogInfo_merged. The data showed health variables on a larger scale that was most applicable to my business objective, being hourly and diel measures of health and not measures at the minute level.

Task 2: Know your audience- I am performing analysis for Urška Sršen, Sando Mur and the Bellabeat analytics team. They have a working understanding of the fitness smart devices, and I need to tailor my results to help inform their conclusions.

Task 3: Aims to address objective- In the aims of answering the business objective, I decided future analysis should focus on these variables: sleep quality, falling asleep time, activity level, weight, BMI, fat levels, distance traveled, steps walked, as well as frequency of application use.

Stage 2: Prepare

“It is a capital mistake to theorize before one has data.” -Sherlock Holmes

Task 1: Data source validation- Verifying the metadata of the dataset confirmed it as open-source. The owner has dedicated the work to the public domain by waiving all of his or her rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law. We can copy, modify, and distribute the work without asking permission.

Due to the limitation of size (30 users) and not having any demographic information, our findings could be effected by a sampling bias. The population of fitbit users in this study might not be representative of the population as a whole. Additionally, the dataset is historic (2016) and has a limited duration of time (2 months long). This case study will pursue our business objective as an operational approach.

Task 2: Data exploration- I needed to prepare my workspace and load the packages that I would be using in my analysis. I then uploaded the files to R, and made sure they were in a usable format. I used the summary() function on each dataset to find general trends and irregularities in the data. In terms of R syntax, I did not have a problem with the original column names or data distribution, so I kept them the same.

#load packages used
require(lubridate)    # package to work with dates and times
require(tidyverse)    # package to clean and process data
require(plyr)         # package to work with ddply
require(ggplot2)      # package to graph values
require(GGally)       # package extends function of 'ggplot2'
require(reshape2)     # package long to wide datasets in r
require(car)          # package to run anova in r
require(calendR)      # package to plot calendar dates
require(openair)      # package to plot for calender variables in r
require(magrittr)     # package to use pipe %>%
require(tidyr)        # package transform wide dataset to long dataset
require(hexbin)       # package density plots

#load activity datasets provided by the CC0: Public Domain
daily_activity <- read.csv("C:/Users/hoper/OneDrive/Documents/Cycling case study/bellabeat/dailyActivity_merged.csv", header=TRUE)
daily_calories <- read.csv("C:/Users/hoper/OneDrive/Documents/Cycling case study/bellabeat/dailyCalories_merged.csv", header=TRUE)
daily_intensities <- read.csv("C:/Users/hoper/OneDrive/Documents/Cycling case study/bellabeat/dailyIntensities_merged.csv", header=TRUE)
daily_steps <- read.csv("C:/Users/hoper/OneDrive/Documents/Cycling case study/bellabeat/dailySteps_merged.csv", header=TRUE)
heartrate <- read.csv("C:/Users/hoper/OneDrive/Documents/Cycling case study/bellabeat/heartrate_seconds_merged.csv", header=TRUE)
hourly_calories <- read.csv("C:/Users/hoper/OneDrive/Documents/Cycling case study/bellabeat/hourlyCalories_merged.csv", header=TRUE)
hourly_intensities <- read.csv("C:/Users/hoper/OneDrive/Documents/Cycling case study/bellabeat/hourlyIntensities_merged.csv", header=TRUE)
hourly_steps <- read.csv("C:/Users/hoper/OneDrive/Documents/Cycling case study/bellabeat/hourlySteps_merged.csv", header=TRUE)
MET <- read.csv("C:/Users/hoper/OneDrive/Documents/Cycling case study/bellabeat/minuteMETsNarrow_merged.csv", header=TRUE)
sleep_day <- read.csv("C:/Users/hoper/OneDrive/Documents/Cycling case study/bellabeat/sleepDay_merged.csv", header=TRUE)
weight <- read.csv("C:/Users/hoper/OneDrive/Documents/Cycling case study/bellabeat/weightLogInfo_merged.csv", header=TRUE)

### Proofread data, prepare for analysis
#summary (daily_activity)
#summary (daily_calories)
#summary (daily_intensities)
#summary (daily_steps)
#summary (heartrate)
#summary (hourly_calories)
#summary (hourly_intensities)
#summary (hourly_steps)
#summary (MET)
#summary (sleep_day)
#summary (weight)

Stage 3: Process

“No data is clean, but most is useful” - Dean Abbott

Task 1: Clean and process data- Before I could perform analysis, I needed my values to be in a usable format. I knew that I was going to be interested in date, time, and day of the week, so I transformed the date and time values to integer form, listed in separate columns. I also wanted to strip the hour from the time format, as well as day of week so I could broadly classify by these variables.

With a single row (as either an hour or day) serving as a unit of replication, I wanted rows to have all future classification variables listed in separate columns so I could group them later on. I performed these functions for all datasets.

#create date format to strip values
daily_activity$date <- mdy(daily_activity$ActivityDate)  # import date field in correct format
daily_activity$month<- month(daily_activity$date)        # get month of year
daily_activity$doy<- yday(daily_activity$date)           # get day of year
daily_activity$year<- year(daily_activity$date)          # get year
daily_activity$wday<- wday(daily_activity$date, 
                           label = FALSE, 
                           abbr = TRUE, 
                           week_start = getOption("lubridate.week.start", 7), 
                           locale = Sys.getlocale("LC_TIME")) # extract day of week from calendar date

Stage 4-5: Analysis and Figures

“If you torture the data long enough, it will confess.” - Ronald Coase

Task 1: Analyze- Now with my datasets prepared for analysis, I was ready to group by data by different attributes. I was interested in distance and activity intensity levels during different time periods as my response variable. I set my function to find the average distance traveled by users for each intensity class, and well as the frequency of users, standard deviation and standard error.

For this test, with a single row (or day) serving as a unit of replication, I classified individual predictor variables by intensity level and time so I could run statistical tests to test for variable significance. I did not think any of the variables would be colinear, being different metrics of time and intensity, but I could have graphed two predictor variables against each other to test for colinearity if they seemed closely related.

Task 2: Visualization- I was then able to graph my function using a ggplot bar chart to visualize data, and run a type III ANOVA using the car() package. My business objective was to find differences health metrics by time, intensity and use type, so I used a type III sums of squares. I wanted to find presence of both a main effect and a main effect and interaction effect between my two predictor variables. With the possible presence of significant interactions, this approach to analysis was the most valid for my business objective.

4.1 Distance travelled by activity type:

### (1a) Daily activity by weekday- total
weekday_daily_activity <- ddply(daily_activity, c("wday"), summarise,    
                                         N    = sum(!is.na(TotalDistance) ), #use this version if missing values present
                                         mean = mean(TotalDistance,na.rm=TRUE),
                                         sd   = sd(TotalDistance,na.rm=TRUE),
                                         se  = sd / sqrt(N))
# Create classifier variable
weekday_daily_activity$activity <- paste("Total")
### (2a) Active distance by weekday- intensity subset active
weekday_active_distance <- ddply(daily_activity, c("wday"), summarise,    
                                        N    = sum(!is.na(VeryActiveDistance) ), #use this version if missing values present
                                        mean = mean(VeryActiveDistance,na.rm=TRUE),
                                        sd   = sd(VeryActiveDistance,na.rm=TRUE),
                                        se  = sd / sqrt(N))
weekday_active_distance$activity <- paste("Active")
### (3a) Moderate distance by weekday
weekday_moderate_distance <- ddply(daily_activity, c("wday"), summarise,    
                                        N    = sum(!is.na(ModeratelyActiveDistance) ), #use this version if missing values present
                                        mean = mean(ModeratelyActiveDistance,na.rm=TRUE),
                                        sd   = sd(ModeratelyActiveDistance,na.rm=TRUE),
                                        se  = sd / sqrt(N))
weekday_moderate_distance$activity <- paste("Moderate")
### (4a) Light distance by weekday
weekday_light_distance <- ddply(daily_activity, c("wday"), summarise,    
                                        N    = sum(!is.na(LightActiveDistance) ), #use this version if missing values present
                                        mean = mean(LightActiveDistance,na.rm=TRUE),
                                        sd   = sd(LightActiveDistance,na.rm=TRUE),
                                        se  = sd / sqrt(N))
weekday_light_distance$activity <- paste("Light")
### (5a) Sedentary distance by weekday
weekday_sedentary_distance <- ddply(daily_activity, c("wday"), summarise,    
                                        N    = sum(!is.na(SedentaryActiveDistance) ), #use this version if missing values present
                                        mean = mean(SedentaryActiveDistance,na.rm=TRUE),
                                        sd   = sd(SedentaryActiveDistance,na.rm=TRUE),
                                        se  = sd / sqrt(N))
weekday_sedentary_distance$activity <- paste("Sedentary")
# Bind 2a-5a
weekday_distance <- rbind(weekday_sedentary_distance, weekday_light_distance, weekday_moderate_distance, 
                          weekday_active_distance)

##### Figure: Distance traveled by activity type
graph1<-ggplot() +
           geom_histogram(data = weekday_sedentary_distance, 
                          aes(x = wday, y = mean), 
                          stat = "identity", 
                          fill = "#004c6d", 
                          alpha = 0.4, 
                          width = 0.95) +
           geom_histogram(data = weekday_light_distance, 
                          aes(x = wday, y = mean), 
                          stat = "identity", 
                          fill = "#004c6d", 
                          alpha = 0.4, 
                          width = 0.95) +
           geom_histogram(data = weekday_active_distance, 
                          aes(x = wday, y = mean), 
                          stat = "identity", 
                          fill = "#004c6d", 
                          alpha = 0.6, 
                          width = 0.95) +
           geom_histogram(data = weekday_moderate_distance, 
                          aes(x = wday, y = mean), 
                          stat = "identity", 
                          fill = "#004c6d", 
                          alpha = 1, 
                          width = 0.95) +
           labs(title = "Distance traveled by activity type",x="Day of week",y="Kilometers") +
           theme_minimal() +
           theme(legend.position="bottom") +
           theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
           theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
           theme(axis.title.x = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
           theme(axis.title.y = element_text(color = '#554C6C', size = 12, hjust = 0.5)) + 
           theme(axis.line = element_line(colour = "black"),
                panel.grid.major = element_blank(),
                panel.grid.minor = element_blank(),
                panel.border = element_blank(),
                panel.background = element_blank()) 

graph1 + geom_errorbar(data = weekday_light_distance, 
                       aes(x=wday, ymin=mean-se, 
                           ymax=mean+se), width=0.3, 
                       colour="black", 
                       alpha=0.9, 
                       size=0.8) +
         geom_errorbar(data = weekday_active_distance, 
                       aes(x=wday, ymin=mean-se, 
                           ymax=mean+se), 
                       width=0.3, 
                       colour="black", 
                       alpha=1, 
                       size=0.8) +
         geom_errorbar(data = weekday_moderate_distance, 
                       aes(x=wday, ymin=mean-se, 
                           ymax=mean+se), 
                       width=0.3, 
                       colour="black", 
                       alpha=1.2, 
                       size=0.8)

Figure 1. Distance traveled by users (km) at different fitness intensity levels (sedentary, light, moderate and active), shown through increasing color intensity with active being indicated by the dark navy bars (n = 30, ± 1 SE). The data used in analysis was obtained as CC0: Public Domain data, made available through Mobius.

######### ANOVA: Distance: intensity/weekday #########
daily_activity_subset <- daily_activity[,(names(daily_activity) %in% c("wday",
                                                                       "ModeratelyActiveDistance",
                                                                       "SedentaryActiveDistance",
                                                                       "LightActiveDistance",
                                                                       "VeryActiveDistance"))]
# convert wide format to long format dataset
# wday= weekday, variable= intensity, value= distance
long <- melt(daily_activity_subset, id.vars = c("wday"))
# type III anova
anova.dist<-lm(value~factor(variable)*factor(wday), data= long)
Anova(anova.dist, type="III")

4.2 Proportional frequency of smart device use:

# Tally frequency of users within the sleep_day dataset
# One row = one day/one unit of replication
device_use <- sleep_day %>%
  group_by(Id) %>%
  tally()

# Create classifier variable for device use based on number of days used
device_use_class<- device_use %>% 
                  mutate(Use_Level = case_when(n >= 1 & n<= 10 ~ "Rare Use",
                                        n >= 11 & n<= 20 ~ "Moderate Use",
                                        n >= 21 & n<= 31 ~ "Continuing Use"))

# Subset dataset to frequency (n) and device use level
device_use_class <- device_use_class[,(names(device_use_class) %in% c("n",
                                                           "Use_Level"))]
# Count frequency of use classes, prepare for figure
device_use_class <- data.frame(device_use_class) %>% 
                    group_by(Use_Level) %>% 
                    tally()

# Omit device use class = na
device_use_class <- na.omit(device_use_class)

#### Figure: Proportional frequency of smart device use
device_use_class %>%
  ggplot(aes(x = "", y = Use_Level, fill = Use_Level)) +
  geom_bar(stat = "identity", width = 1, col="black")+
  coord_polar("y", start=0)+
  theme_minimal()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold")) +
  scale_fill_manual(values = c("#7da2c0", "#5a84a3", "#366788","#004c6d")) +
  geom_label(aes(label = Use_Level),
             color = "Black",
             position = position_stack(vjust = 0.39),
             show.legend = FALSE) +
  labs(title="Proportional frequency of smart device use") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        panel.background = element_blank(),) +  
  theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
  theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
  theme(axis.title.x = element_text(color = 'white', size = 12, hjust = 0.5)) +
  theme(axis.title.y = element_text(color = 'white', size = 12, hjust = 0.5))

Figure 2. Proportional frequency of smart device use for users of the FitBit App from the data subset, shown through increasing color intensity with rare use indicated by the dark navy color (n = 30). The data used in analysis was obtained as CC0: Public Domain data, made available through Mobius.

sleep_day$SleepDay=as.POSIXct(sleep_day$SleepDay, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
sleep_day$time <- format(sleep_day$SleepDay, format = "%H:%M:%S")
sleep_day$date <- format(sleep_day$SleepDay, format = "%m/%d/%y")
sleep_day$date <- mdy(sleep_day$date)  # import date field in correct format
sleep_day$month <- month(sleep_day$date)        # get month of year
sleep_day$doy <- yday(sleep_day$date)           # get day of year
sleep_day$year <- year(sleep_day$date)          # get year
sleep_day$wday <- wday(sleep_day$date, label = FALSE, abbr = TRUE, week_start = getOption("lubridate.week.start", 7), locale = Sys.getlocale("LC_TIME"))


device_use <- sleep_day %>%
  group_by(date) %>%
  tally() %>%
  summarise( max_user = max(n))

device_use <- sleep_day %>%
  group_by(date) %>%
  tally() %>%
  mutate(usage_rate = n / 17)

            
calendarPlot(device_use, pollutant = "usage_rate",
             year = 2016,
             month = 4:5,
             annotate = "date",
             main = "Frequency of smart device users", 
             cols = "viridis"
)

Figure 3. Proportional frequency of smart device use for users of the FitBit App April 12th, 2016 to May 12th, 2016, shown through increasing color intensity with rare use indicated by the dark navy color, and frequent use indicated by the yellow color (n = 30). The data used in analysis was obtained as CC0: Public Domain data, made available through Mobius.

######### ANOVA: Frequency: use class #########
# Anova- are means different?
use.lm <- lm(n ~ Use_Level, data = device_use_class)
use.av <- aov(use.lm)
summary(use.av)

##             Df Sum Sq Mean Sq
## Use_Level    2  34.67   17.33

## Post hoc- how are means different?
tukey.test <- TukeyHSD(use.av)
tukey.test

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = use.lm)
## 
## $Use_Level
##                             diff lwr upr p adj
## Moderate Use-Continuing Use   -8 NaN NaN   NaN
## Rare Use-Continuing Use       -2 NaN NaN   NaN
## Rare Use-Moderate Use          6 NaN NaN   NaN

4.3 Proportional duration of activity type:

### (1b) Active minutes by weekday
weekday_active_minutes <- ddply(daily_activity, c("wday"), summarise,    
                                        N    = sum(!is.na(VeryActiveMinutes) ), #use this version if missing values present
                                        mean = mean(VeryActiveMinutes,na.rm=TRUE),
                                        sd   = sd(VeryActiveMinutes,na.rm=TRUE),
                                        se  = sd / sqrt(N))
weekday_active_minutes$activity <- paste("Active")
### (2b) Moderate minutes by weekday
weekday_moderate_minutes <- ddply(daily_activity, c("wday"), summarise,    
                                        N    = sum(!is.na(FairlyActiveMinutes) ), #use this version if missing values present
                                        mean = mean(FairlyActiveMinutes,na.rm=TRUE),
                                        sd   = sd(FairlyActiveMinutes,na.rm=TRUE),
                                        se  = sd / sqrt(N))
weekday_moderate_minutes$activity <- paste("Moderate")
### (3b) Light minutes by weekday
weekday_light_minutes <- ddply(daily_activity, c("wday"), summarise,    
                                        N    = sum(!is.na(LightlyActiveMinutes) ), #use this version if missing values present
                                        mean = mean(LightlyActiveMinutes,na.rm=TRUE),
                                        sd   = sd(LightlyActiveMinutes,na.rm=TRUE),
                                        se  = sd / sqrt(N))
weekday_light_minutes$activity <- paste("Light")
### (4b) Sedentary minutes by weekday
weekday_sedentary_minutes <- ddply(daily_activity, c("wday"), summarise,    
                                        N    = sum(!is.na(SedentaryMinutes) ), #use this version if missing values present
                                        mean = mean(SedentaryMinutes,na.rm=TRUE),
                                        sd   = sd(SedentaryMinutes,na.rm=TRUE),
                                        se  = sd / sqrt(N))
weekday_sedentary_minutes$activity <- paste("Sedentary")
# Bind 1b-4b
weekday_minutes <- rbind(weekday_sedentary_minutes, weekday_light_minutes, weekday_moderate_minutes, weekday_active_minutes)

### (5b) pivot table- minutes by activity type, by weekday
minutes_mean <- ddply(weekday_minutes, c("activity"), summarise,    
                                   N    = sum(!is.na(mean) ), #use this version if missing values present
                                   mean = mean(mean,na.rm=TRUE),
                                   sd   = sd(mean,na.rm=TRUE),
                                   se  = sd / sqrt(mean))

### Figure: Proportional duration of activity type
minutes_mean %>%
  ggplot(aes(x = "", y = activity, fill = activity)) +
  geom_bar(stat = "identity", width = 1, col="black")+
  coord_polar("y", start=0)+
  theme_minimal()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold")) +
  scale_fill_manual(values = c("#7da2c0", "#5a84a3", "#366788","#004c6d")) +
  geom_label(aes(label = activity),
             color = "Black",
             position = position_stack(vjust = 0.5),
             show.legend = FALSE) +
  labs(title="Proportional duration of activity type") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        panel.background = element_blank(),) +  
  theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
  theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
  theme(axis.title.x = element_text(color = 'white', size = 12, hjust = 0.5)) +
  theme(axis.title.y = element_text(color = 'white', size = 12, hjust = 0.5))

Figure 4. Proportional duration of activity level (sedentary, light, moderate, active), shown through increasing color intensity with sedentary use indicated by the dark navy color (n = 30). The data used in analysis was obtained as CC0: Public Domain data, made available through Mobius.

######### ANOVA: Duration: activity type #########
# Anova- are means different?
activity.lm <- lm(mean ~ activity, data = minutes_mean)
activity.av <- aov(activity.lm)
summary(activity.av)

##             Df Sum Sq Mean Sq
## activity     3 649548  216516

## Post hoc- how are means different?
tukey.test <- TukeyHSD(activity.av)
tukey.test

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = activity.lm)
## 
## $activity
##                           diff lwr upr p adj
## Light-Active        171.683642 NaN NaN   NaN
## Moderate-Active      -7.567495 NaN NaN   NaN
## Sedentary-Active    970.487980 NaN NaN   NaN
## Moderate-Light     -179.251137 NaN NaN   NaN
## Sedentary-Light     798.804338 NaN NaN   NaN
## Sedentary-Moderate  978.055475 NaN NaN   NaN

4.4 Fitness Intensity by Hour:

#create date format to strip values
hourly_intensities$ActivityHour=as.POSIXct(hourly_intensities$ActivityHour, 
                                           format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone()) # set variable type as date time format
hourly_intensities$time <- format(hourly_intensities$ActivityHour, 
                                  format = "%H:%M:%S")                                       # extract time
hourly_intensities$time_obj <- strptime(hourly_intensities$time, 
                                        format = "%H:%M:%OS", tz = "EST")                    # create time format to strip hour as subclass
hourly_intensities$hour<- hour(hourly_intensities$time_obj)                                  # strip time

### (1c) Daily steps by weekday
hour_activity <- ddply(hourly_intensities, c("hour"), summarise,    
                                N    = sum(!is.na(TotalIntensity) ), #use this version if missing values present
                                mean = mean(TotalIntensity,na.rm=TRUE),
                                sd   = sd(TotalIntensity,na.rm=TRUE),
                                se  = sd / sqrt(N))

graph2 <- ggplot() +
             geom_histogram(data = hour_activity, aes(x = hour, y = mean), stat = "identity", fill = "#004c6d", color="#004c6d" , alpha = 0.6, width = 0.95) +
             labs(title = "Fitness intensity level by hour",x="Hour",y="Intensity Level") +
             theme_minimal() +
             theme(legend.position="bottom") +
             theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
             theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
             theme(axis.title.x = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
             theme(axis.title.y = element_text(color = '#554C6C', size = 12, hjust = 0.5)) + 
             theme(axis.line = element_line(colour = "black"),
                 panel.grid.major = element_blank(),
                 panel.grid.minor = element_blank(),
                 panel.border = element_blank(),
                 panel.background = element_blank()) 

graph2 + geom_errorbar(data = hour_activity, aes(x=hour, ymin=mean-se, ymax=mean+se), width=0.3, colour="black", alpha=0.9, size=0.8)

head (hour_activity)

Figure 5. Fitness activity level by hour throughout a 24 hour period (± 1 SE) as shown by the blue bars (n = 30). The data used in analysis was obtained as CC0: Public Domain data, made available through Mobius.

######### ANOVA: Intensity: hour #########
hour.lm <- lm(mean ~ hour, data = hour_activity)
hour.av <- aov(hour.lm)
summary(hour.av)

##             Df Sum Sq Mean Sq F value  Pr(>F)   
## hour         1    457   457.0   12.23 0.00204 **
## Residuals   22    822    37.4                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

4.5 Distance traveled compared with calories burned:

#create date format to strip values
daily_calories$ActivityHour=as.POSIXct(daily_calories$ActivityDay, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
daily_calories$time <- format(daily_calories$ActivityDay, format = "%H:%M:%S")
daily_calories$date <- format(daily_calories$ActivityDay, format = "%m/%d/%y")
daily_calories$date <- mdy(daily_calories$date)  # import date field in correct format
daily_calories$month <- month(daily_calories$date)        # get month of year
daily_calories$doy <- yday(daily_calories$date)           # get day of year
daily_calories$year <- year(daily_calories$date)          # get year
daily_calories$wday <- wday(daily_calories$date, label = FALSE, abbr = TRUE, week_start = getOption("lubridate.week.start", 7), locale = Sys.getlocale("LC_TIME"))
#create date format to strip values
sleep_day$SleepDay=as.POSIXct(sleep_day$SleepDay, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
sleep_day$time <- format(sleep_day$SleepDay, format = "%H:%M:%S")
sleep_day$date <- format(sleep_day$SleepDay, format = "%m/%d/%y")
sleep_day$date <- mdy(sleep_day$date)  # import date field in correct format
sleep_day$month <- month(sleep_day$date)        # get month of year
sleep_day$doy <- yday(sleep_day$date)           # get day of year
sleep_day$year <- year(sleep_day$date)          # get year
sleep_day$wday <- wday(sleep_day$date, label = FALSE, abbr = TRUE, week_start = getOption("lubridate.week.start", 7), locale = Sys.getlocale("LC_TIME"))

sleep_calories <- merge(sleep_day,daily_calories,by="doy")
sleep_calories_subset <- sleep_calories[,(names(sleep_calories) %in% c("TotalMinutesAsleep","doy",
                                                                       "TotalTimeInBed","Calories"))]
sleep_calories_subset$sleep_hours <- sleep_calories_subset$TotalMinutesAsleep/60
sleep_calories_subset$bed_hours <- sleep_calories_subset$TotalTimeInBed/60
sleep_calories_subset$fall_asleep_time <- sleep_calories_subset$bed_hours - sleep_calories_subset$sleep_hours

# Bin size control + color palette
ggplot(sleep_calories_subset, aes(x=sleep_hours, y=Calories) ) +
  geom_hex(bins = 70) +
  scale_fill_continuous(type = "viridis") +
  theme_bw() +
  labs(title = "Daily distance travelled compared with calories burned",x="Kilometers Travelled",y="Calories Burned") +
  theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
  theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
  theme(axis.title.x = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
  theme(axis.title.y = element_text(color = '#554C6C', size = 12, hjust = 0.5)) + 
  theme(axis.line = element_line(colour = "black"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank())

Figure 6. Distance traveled (Km) compared with calories burned in fitness smart device users. Density of count shown by color, with brighter colors showing more frequent values (n = 30). The data used in analysis was obtained as CC0: Public Domain data, made available through Mobius.

# cont...
#
ggplot(sleep_calories_subset, aes(x=fall_asleep_time, y=Calories) ) +
  geom_hex(bins = 70) +
  scale_fill_continuous(type = "viridis") +
  theme_bw() +
  labs(title = "Daily falling asleep time compared with calories burned",x="Time Falling Asleep",y="Calories Burned") +
  theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
  theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
  theme(axis.title.x = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
  theme(axis.title.y = element_text(color = '#554C6C', size = 12, hjust = 0.5)) + 
  theme(axis.line = element_line(colour = "black"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank())

Figure 7. Daily falling asleep time compared with calories burned. Density of count shown by color, with brighter colors showing more frequent values (n = 30). The data used in analysis was obtained as CC0: Public Domain data, made available through Mobius.

4.6 Correlogram of sleep level vs. bed time (hours) and fitness intensity level (hours):

sleep_day$SleepDay=as.POSIXct(sleep_day$SleepDay, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
sleep_day$time <- format(sleep_day$SleepDay, format = "%H:%M:%S")
sleep_day$date <- format(sleep_day$SleepDay, format = "%m/%d/%y")
sleep_day$date <- mdy(sleep_day$date)  # import date field in correct format
sleep_day$month <- month(sleep_day$date)        # get month of year
sleep_day$doy <- yday(sleep_day$date)           # get day of year
sleep_day$year <- year(sleep_day$date)          # get year
sleep_day$wday <- wday(sleep_day$date, label = FALSE, abbr = TRUE, week_start = getOption("lubridate.week.start", 7), locale = Sys.getlocale("LC_TIME"))

daily_intensities$date <- mdy(daily_intensities$ActivityDay)  # import date field in correct format
daily_intensities$month <- month(daily_intensities$date)        # get month of year
daily_intensities$doy <- yday(daily_intensities$date)           # get day of year
daily_intensities$year <- year(daily_intensities$date)          # get year
daily_intensities$wday <- wday(daily_intensities$date, label = FALSE, abbr = TRUE, week_start = getOption("lubridate.week.start", 7), locale = Sys.getlocale("LC_TIME"))

sleep_intensity <- merge(sleep_day,daily_intensities,by="doy")
sleep_intensity_subset <- sleep_intensity[,(names(sleep_intensity) %in% c("TotalMinutesAsleep","doy",
                                                                        "TotalTimeInBed","LightlyActiveMinutes",
                                                                        "SedentaryMinutes","FairlyActiveMinutes",
                                                                        "VeryActiveMinutes"))]
### Sleep as hour
sleep_intensity_subset$sleep_hours <- sleep_intensity_subset$TotalMinutesAsleep/60
sleep_intensity_subset$bed_hours <- sleep_intensity_subset$TotalTimeInBed/60
sleep_intensity_subset$fall_asleep_time <- sleep_intensity_subset$bed_hours - sleep_intensity_subset$sleep_hours
### Intensity as hour
sleep_intensity_subset$seden_hours <- sleep_intensity_subset$SedentaryMinutes/60
sleep_intensity_subset$light_hours <- sleep_intensity_subset$LightlyActiveMinutes/60
sleep_intensity_subset$moder_hours <- sleep_intensity_subset$FairlyActiveMinutes/60
sleep_intensity_subset$activ_hours <- sleep_intensity_subset$VeryActiveMinutes/60

sleep_intensity_subset_class<- sleep_intensity_subset %>% 
  mutate(Sleep_Level = case_when(sleep_hours >= 0 & sleep_hours<= 7 ~ "Low Sleep",
                               sleep_hours >= 7.01 & sleep_hours<= 8.5 ~ "Healthy Sleep",
                               sleep_hours >= 8.51 & sleep_hours<= 24 ~ "Over Sleep"))

# create correlogram for activities and calories
correlogram <- sleep_intensity_subset_class %>%
  select(bed_hours, activ_hours, moder_hours, 
         light_hours, seden_hours, Sleep_Level) %>%
  ggpairs(mapping = ggplot2::aes(colour = Sleep_Level),
          lower = list(continuous = wrap("smooth", alpha = 0.5, size=0.9,alignPercent=0.8)),
          diag = list(discrete="barDiag", continuous = wrap("densityDiag", alpha=0.5 )),
          upper = list(combo = wrap("box_no_facet", alpha=0.5)),
          columns = 1:5) +
  scale_fill_manual(values=c( "#366788","#7da2c0","#004c6d")) +
  scale_color_manual(values=c("#366788","#7da2c0","#004c6d")) +
  theme_bw() +
  labs(title = "Correlogram of Sleep Level vs. Bed Time (hours) and Fitness Intensity Level (hours)",
       subtitle = "Categorized by low, healthy, and oversleep sleep levels",
       caption = "Data source: Public Domain through Mobius") +
  theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
  theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
  theme(axis.title.x = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
  theme(axis.title.y = element_text(color = '#554C6C', size = 12, hjust = 0.5)) + 
  theme(axis.line = element_line(colour = "black"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank())

correlogram # display the correlogram

Figure 8. Correlogram of sleep level vs. bed time (hours) and fitness intensity level (hours), categorized by sleep level. Sleep level is with increased sleep time (low, healthy or over sleep) indicated by the light to dark navy points respectively (n = 30). The data used in analysis was obtained as CC0: Public Domain data, made available through Mobius.

4.7 Correlogram of sedentary activity compared with sleep and duration in bed:

sleep_intensity_subset <- sleep_intensity_subset_class[,
                           (names(sleep_intensity_subset_class) %in% c("sleep_hours",
                                                                       "bed_hours",
                                                                       "seden_hours"))]

### correlogram
# Set palette for correlogram
change_palette <- function(correlogram){
  for(i in 1:correlogram$nrow) {
    for(j in 1:correlogram$ncol){
      correlogram[i,j] <- correlogram[i,j] +
        scale_fill_manual(values=c( "#0582CA","#004c6d")) +
        scale_color_manual(values=c("#0582CA","#004c6d"))
    }
  }
  return(correlogram)
}

change_palette_single <- function(data, mapping, method = "lm", ...) {
  correlogram <- ggplot(data = data, mapping = mapping) +
    geom_point(colour = "#004c6d") +
    geom_smooth(method = method, color = "black", ...)
  return(correlogram)
}

correlogram <- sleep_intensity_subset %>%
  select(sleep_hours,
         bed_hours,
         seden_hours) %>%
  ggpairs(lower = list(continuous = wrap(change_palette_single, method = "lm")),
          diag = list(continuous = wrap("barDiag", fill = "#004c6d", alpha = 0.6, colour = "#004c6d")),
          upper = list(continuous = wrap("cor", size = 4))) +
  theme() +
  labs(title = "Correlogram of sedentary activity compared with sleep and duration in bed",
       caption = "Data source: Public Domain through Mobius")+
  theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
  theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
  theme(axis.title.x = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
  theme(axis.title.y = element_text(color = '#554C6C', size = 12, hjust = 0.5)) + 
  theme(axis.line = element_line(colour = "black"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank())

correlogram

Figure 9. Correlogram of sedentary activity compared with sleep and duration in bed. Fitness activity level by hour throughout a 24 hour period (± 1 SE) as shown by the blue bars (n = 30). The data used in analysis was obtained as CC0: Public Domain data, made available through Mobius.

4.8 Correlogram of BMI level vs. fat, weight(lbs) and fitness intensity level (hours):

daily_intensities$date <- mdy(daily_intensities$ActivityDay)  # import date field in correct format
daily_intensities$month <- month(daily_intensities$date)        # get month of year
daily_intensities$doy <- yday(daily_intensities$date)           # get day of year
daily_intensities$year <- year(daily_intensities$date)          # get year
daily_intensities$wday <- wday(daily_intensities$date, label = FALSE, abbr = TRUE, week_start = getOption("lubridate.week.start", 7), locale = Sys.getlocale("LC_TIME"))

weight$WeightDay=as.POSIXct(weight$Date, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
weight$time <- format(weight$WeightDay, format = "%H:%M:%S")
weight$date <- format(weight$WeightDay, format = "%m/%d/%y")
weight$date <- mdy(weight$date)  # import date field in correct format
weight$month <- month(weight$date)        # get month of year
weight$doy <- yday(weight$date)           # get day of year
weight$year <- year(weight$date)          # get year
weight$wday <- wday(weight$date, label = FALSE, abbr = TRUE, week_start = getOption("lubridate.week.start", 7), locale = Sys.getlocale("LC_TIME"))

intensities_weight <- merge(daily_intensities,weight,by="doy")
intensities_weight <- intensities_weight[,(names(intensities_weight) %in% c("WeightPounds","Fat","doy",
                                                                          "BMI","LightlyActiveMinutes",
                                                                          "SedentaryMinutes","FairlyActiveMinutes",
                                                                          "VeryActiveMinutes"))]
### Intensity as hour
intensities_weight$seden_hours <- intensities_weight$SedentaryMinutes/60
intensities_weight$light_hours <- intensities_weight$LightlyActiveMinutes/60
intensities_weight$moder_hours <- intensities_weight$FairlyActiveMinutes/60
intensities_weight$activ_hours <- intensities_weight$VeryActiveMinutes/60

intensities_weight_class <- intensities_weight %>% 
  mutate(BMI_class = case_when(BMI >= 0 & BMI<= 18.5 ~ "Underweight",
                                 BMI >= 18.51 & BMI<= 24.9 ~ "Healthy",
                                 BMI >= 25 & BMI<= 29.9 ~ "Overweight",
                                 BMI<= 30 ~ "Obese"))

intensities_weight_class <- na.omit(intensities_weight_class)
# create correlogram for activities and calories
correlogram2 <- intensities_weight_class %>%
  select(Fat, WeightPounds, activ_hours, seden_hours, BMI_class) %>%
  ggpairs(mapping = ggplot2::aes(colour = BMI_class),
          lower = list(continuous = wrap("smooth", alpha = 0.5, size=0.9,alignPercent=0.8)),
          diag = list(discrete="barDiag", continuous = wrap("densityDiag", alpha=0.5 )),
          upper = list(combo = wrap("box_no_facet", alpha=0.5)),
          columns = 1:5) +
  scale_fill_manual(values=c( "#7FB3D5","#004c6d")) +
  scale_color_manual(values=c("#7FB3D5","#004c6d")) +
  theme_bw() +
  labs(title = "Correlogram of BMI Level vs. Fat, Weight(lbs) and Fitness Intensity Level (hours)",
       subtitle = "Categorized by healthy and overweight BMI levels",
       caption = "Data source: Public Domain through Mobius") +
  theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
  theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
  theme(axis.title.x = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
  theme(axis.title.y = element_text(color = '#554C6C', size = 12, hjust = 0.5)) + 
  theme(axis.line = element_line(colour = "black"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank())

correlogram2 # display the correlogram

Figure 10. Correlogram of BMI level vs. fat, weight (lbs) and fitness intensity level (hours), categorized by healthy and overweight users, indicated by the blue and navy bars (n = 30). The data used in analysis was obtained as CC0: Public Domain data, made available through Mobius.

4.9 Hourly activity intensity compared with calories burned:

#create date time format to strip values
hourly_calories$Activity_Day=as.POSIXct(hourly_calories$ActivityHour, 
                                        format="%m/%d/%Y %I:%M:%S %p", 
                                        tz=Sys.timezone())     # set variable type as date time format
hourly_calories$time <- format(hourly_calories$Activity_Day, 
                               format = "%H:%M:%S")            # create time column
hourly_calories$date <- format(hourly_calories$Activity_Day, 
                               format = "%m/%d/%y")            # create date column
hourly_calories$date <- mdy(hourly_calories$date)              # import date field in correct format
hourly_calories$month <- month(hourly_calories$date)           # get month of year
hourly_calories$doy <- yday(hourly_calories$date)              # get day of year
hourly_calories$year <- year(hourly_calories$date)             # get year
hourly_calories$wday <- wday(hourly_calories$date, 
                             label = FALSE, abbr = TRUE, 
                             week_start = getOption("lubridate.week.start", 7), 
                             locale = Sys.getlocale("LC_TIME"))# extract day of week from calendar date
hourly_calories$time_obj <- strptime(hourly_calories$time, 
                                     format = "%H:%M:%OS", 
                                     tz = "EST")               # create time format to strip hour as subclass
hourly_calories$hour<- hour(hourly_calories$time_obj)          # strip time

# Convert from integer time to diel categorical (Day/Night) 
hourly_calories$diel <- hourly_calories$hour <- mapvalues(hourly_calories$hour, 
                                                                from = c(1, 2, 3, 4, 5,
                                                                         6, 7, 8, 9, 10,
                                                                         11, 12, 13, 14,
                                                                         15, 16, 17, 18,
                                                                         19, 20, 21, 22,
                                                                         23, 0), to = c("Night", "Night", "Night",
                                                                                        "Night", "Night", "Night",
                                                                                        "Night", "Day", "Day", 
                                                                                        "Day", "Day", "Day",
                                                                                        "Day", "Day", "Day",
                                                                                        "Day", "Day", "Day",
                                                                                        "Day", "Night", "Night",
                                                                                        "Night", "Night", "Night"))

#create date time format to strip values
hourly_calories$Activity_Day=as.POSIXct(hourly_calories$ActivityHour, 
                                        format="%m/%d/%Y %I:%M:%S %p", 
                                        tz=Sys.timezone())     # set variable type as date time format
hourly_calories$time <- format(hourly_calories$Activity_Day, 
                               format = "%H:%M:%S")            # create time column
hourly_calories$date <- format(hourly_calories$Activity_Day, 
                               format = "%m/%d/%y")            # create date column
hourly_calories$date <- mdy(hourly_calories$date)              # import date field in correct format
hourly_calories$month <- month(hourly_calories$date)           # get month of year
hourly_calories$doy <- yday(hourly_calories$date)              # get day of year
hourly_calories$year <- year(hourly_calories$date)             # get year
hourly_calories$wday <- wday(hourly_calories$date, 
                             label = FALSE, abbr = TRUE, 
                             week_start = getOption("lubridate.week.start", 7), 
                             locale = Sys.getlocale("LC_TIME"))# extract day of week from calendar date
hourly_calories$time_obj <- strptime(hourly_calories$time, 
                                     format = "%H:%M:%OS", 
                                     tz = "EST")               # create time format to strip hour as subclass
hourly_calories$hour<- hour(hourly_calories$time_obj)          # strip time

# Subset data to daytime calories
hourly_calories_day <- hourly_calories[ which(hourly_calories$diel=='Day'
                                              ), ]
# Subset dataset to hour, calories, sedentary time... very active time
hourly_calories_day_subset <- hourly_calories_day[,(names(hourly_calories_day) %in% c("hour",
                                                                                      "Calories",
                                                                                      "SedentaryMinutes",
                                                                                      "LightlyActiveMinutes",
                                                                                      "FairlyActiveMinutes",
                                                                                      "VeryActiveMinutes"))]
#create date time format to strip values
hourly_intensities$Activity_Day=as.POSIXct(hourly_intensities$ActivityHour, 
                                           format="%m/%d/%Y %I:%M:%S %p", 
                                           tz=Sys.timezone())     # set variable type as date time format
hourly_intensities$time <- format(hourly_intensities$Activity_Day, 
                                  format = "%H:%M:%S")            # create time column
hourly_intensities$date <- format(hourly_intensities$Activity_Day, 
                                  format = "%m/%d/%y")            # create date column
hourly_intensities$date <- mdy(hourly_intensities$date)           # import date field in correct format
hourly_intensities$month <- month(hourly_intensities$date)        # get month of year
hourly_intensities$doy <- yday(hourly_intensities$date)           # get day of year
hourly_intensities$year <- year(hourly_intensities$date)          # get year
hourly_intensities$wday <- wday(hourly_intensities$date, 
                                label = FALSE, 
                                abbr = TRUE, 
                                week_start = getOption("lubridate.week.start", 7), 
                                locale = Sys.getlocale("LC_TIME"))# extract day of week from calendar date
hourly_intensities$time_obj <- strptime(hourly_intensities$time, 
                                        format = "%H:%M:%OS", 
                                        tz = "EST")               # create time format to strip hour as subclass
hourly_intensities$hour<- hour(hourly_intensities$time_obj)       # strip time

#  Convert from integer time to diel categorical (Day/Night) 
hourly_intensities$diel <- hourly_intensities$hour <- mapvalues(hourly_intensities$hour, 
                                                                    from = c(1, 2, 3, 4, 5,
                                                                             6, 7, 8, 9, 10,
                                                                             11, 12, 13, 14,
                                                                             15, 16, 17, 18,
                                                                             19, 20, 21, 22,
                                                                             23, 0), to = c("Night", "Night", "Night",
                                                                                         "Night", "Night", "Night",
                                                                                         "Night", "Day", "Day", 
                                                                                         "Day", "Day", "Day",
                                                                                         "Day", "Day", "Day",
                                                                                         "Day", "Day", "Day",
                                                                                         "Day", "Night", "Night",
                                                                                         "Night", "Night", "Night"))

#create date time format to strip values
hourly_intensities$Activity_Day=as.POSIXct(hourly_intensities$ActivityHour, 
                                           format="%m/%d/%Y %I:%M:%S %p", 
                                           tz=Sys.timezone())     # set variable type as date time format
hourly_intensities$time <- format(hourly_intensities$Activity_Day, 
                                  format = "%H:%M:%S")            # create time column
hourly_intensities$date <- format(hourly_intensities$Activity_Day, 
                                  format = "%m/%d/%y")            # create date column
hourly_intensities$date <- mdy(hourly_intensities$date)           # import date field in correct format
hourly_intensities$month <- month(hourly_intensities$date)        # get month of year
hourly_intensities$doy <- yday(hourly_intensities$date)           # get day of year
hourly_intensities$year <- year(hourly_intensities$date)          # get year
hourly_intensities$wday <- wday(hourly_intensities$date, 
                                label = FALSE, 
                                abbr = TRUE, 
                                week_start = getOption("lubridate.week.start", 7), 
                                locale = Sys.getlocale("LC_TIME"))# extract day of week from calendar date
hourly_intensities$time_obj <- strptime(hourly_intensities$time, 
                                        format = "%H:%M:%OS", 
                                        tz = "EST")               # create time format to strip hour as subclass
hourly_intensities$hour<- hour(hourly_intensities$time_obj)       # strip time

# Subset data to daytime intensities
hourly_intensities_day <- hourly_intensities[ which(hourly_intensities$diel=='Day'
), ]
# Subset dataset to hour and total intensity
hourly_intensities_day_subset <- hourly_intensities_day[,(names(hourly_intensities_day) %in% c("hour",
                                                                             "TotalIntensity"))]
# Create classifier variable for intensity level
hourly_intensities_day_subset_active <- hourly_intensities_day_subset %>% 
  mutate(activity = case_when(TotalIntensity <= 20 ~ "Sedentary",
                              TotalIntensity <= 35  ~ "Light",
                              TotalIntensity <= 45  ~ "Moderate",
                              TotalIntensity > 45  ~ "Active"))
#<20 is sedentary
#20-35 is light
#35-45 is moderate
#>40 is active

# Horizontal bind datasets (hour is unit of replication)
calorie_intensities <- cbind(hourly_intensities_day_subset_active,hourly_calories_day_subset)

# Subset data by total intensity, hour, activity, calories
calorie_intensities <- calorie_intensities[,(names(calorie_intensities) %in% c("TotalIntensity",
                                                        "hour",
                                                        "activity",
                                                        "Calories"))]

#### Figure: Hourly activity intesity compared with calories burned
w<-ggplot(calorie_intensities, aes(x=Calories, y=TotalIntensity, color=activity)) + 
  geom_point() +
  ylim(0, 180) +
  labs(title = "Hourly activity intensity compared with calories burned",x="Hourly Calories Burned",y="Hourly Intensity Level") +
  theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
  theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
  theme(axis.title.x = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
  theme(axis.title.y = element_text(color = '#554C6C', size = 12, hjust = 0.5)) + 
  theme(axis.line = element_line(colour = "black"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank())


linear <- stat_smooth(method = "lm", col = "#013045") 
w + scale_color_manual(values=c("#7da2c0", "#366788", "#5a84a3", "#004c6d")) + 
    linear

\(\mathbf{\displaystyle \normalsize LM\: : \:Intensity\:Level\:(\,hours\,) = 0.31 \star Calories\:Burned\:(\,hours\,) - 17.03 ,\:p < 0.001,\:R^2 = 0.796}\)

Figure 11. Hourly activity intensity compared with calories burned. Fitness intensity level (hours) is shown by the blue points, with decreasing activity level shown by increased color intensity (n = 30). A sedentary fitness level is shown by the navy points. The data used in analysis was obtained as CC0: Public Domain data, made available through Mobius.

Stage 6: Share

“The goal is to turn data into information, and information into insight.” - Carly Fiorina

Task 1: Results: With the objective of understanding potential growth for Bellabeat, I was able to find factors of health that can be supported with fitness device use. These values were sourced from public dataobtained as CC0: Public Domain data, made available through Mobius. Data contained fitness metrics of thirty Fitbit users, including minute-level output for physical activity, heart rate, and sleep monitoring. The data souce was verified and sufficiently robust in size to offer strength to the statistical tests.

Bellabeat products include the Leaf, Time and App

Distance travelled by users in kilometers was found significantly vary by weekdays, and was found to significantly vary by activity intensity, with sedentary activity travelling the shortest distances.
Application use was found to significantly vary by user, with half of users using the fitbit app features rarely.
The duration of activity intensity was found to vary significantly, with sedentary and moderate intensities being the bulk of time throughout a 24 hour period.
Active intensity was found to occur at the smallest proportion throughout a 24 hour period.
Activity intensity was found to increase significantly during daytime hours throughout a 24 hour period. The peak periods for activity intensity were 5-7 pm.
Most frequent counts of distance travelled (km) vs. caloried burned were users travelling 7.5 km per day and burning 2,000 calories.
Most users burned over 1,200 calories per day and had a distance travelled range from 4 to 11 kilometers.
The frequency of falling asleep time was found to be highest for users burning 2,000 calories, with falling asleep being 15 minutes.
The bulk of users fell asleep within 1.5 hours of being in bed.
Sleep time did not have a significant effect of duration of active and moderate activity intensity. However, it was statistically significant for healthy sleepers to have lower duration of sedentary activity intensity.
Healthy BMI users were found to have significantly less sedentary time than overweight BMI users.
Hourly activity intensity level was found to significantly effect calories burned, with sedentary, light, moderate, and active intensity levels increasing hourly calories burned.
50% of the users use their device on a daily basis and that just 36% of the users wear the device continuously.

Task 2: Business insights from analysis: Based on my statistical analysis, I have found relationships between several health variables that give context for Bellabeat product marketing.

Product	Features	Recommendation
1. Bellabeat Leaf	Tracks activity, heartrate, sleep, and stress.	Activity intensity was found to be statistically significant correlated with calories burned, but also inversely correlated with distance traveled. I recommend having different pathways of measuring fitness in Bellabeat products, allowing for focus on both longer duration and lower intensity activities, or short duration and higher intensity activities. Most users were more frequently found to burn 2,000 calories per day, or the daily intake recommendation by the FDA. I recommend focusing on health metrics for users of average fitness, with most users falling within average metabolic limits.
2. Bellabeat Time	Tracks activity, heartrate, sleep, and stress.	Sleep was found to range from healthy to unhealthy levels; healthy sleep habits were associated with a lower duration of sedentary activity. I recommend an emphasis on healthy (7-8 hours) sleep patterns, to assist users achieve increased activity levels. Activity intensity is also found to increase during daylight hours, with a peak from 5-7 pm. I recommend rewarding users for daytime activity, especially during the 5-7 pm period. Distance traveled was also found to vary significantly by weekday, with peak distance traveled occurring Friday, and Saturday, and lowest distances traveled on Sunday. I recommend creating fitness challenges for Friday and Saturday users, and incorporating more mindfulness based activities on Sundays.
3. Bellabeat App	Provides health data related activity, sleep, stress, menstrual cycle, and mindfulness habits.	BMI level was found to significantly effect sedentary activity levels, I recommend having a specific app focus on sleep health and increasing light/moderate intensity activities for users with an overweight BMI. App use was found to significantly vary within the statistical population. App use is how users interact with their health statistics, and there is an opportunity for growth with app interest. I recommend offering rewards to users who use the app more frequently, unlocking specific fitness dashboards and recommendations.

Analysis Limitations: The nature of smart device software allows for a complex understanding of user health statistics over a long period of time. However, the limitations of smart device statistics are tied to the study duration and the sample size of the population. The sample size of this population was statistically sound for study, but was small (n=30). As a preliminary study, information about the business objective was able to be obtained, however; for a more robust study, a larger sample size and more consistent duration of use could provide more information on health benefits from a lifestyle change. The range in use data did show areas of growth for Bellabeat products to become more interesting for users, and to be used on a daily scale.