Bellabeat, a high-tech manufacturer of health-focused products for women, including Leaf, Bellabeat’s classic wellness tracker, and Time, a luxury hybrid wellness watch, believes that analyzing non-Bellabeat smart device fitness data could help unlock new growth opportunities for the company. The goal of this analysis project is to analyze smart device data to gain insight into how consumers are using their smart devices and apply this knowledge to Bellabeat’s products and guide marketing strategy for the company. The results of the project will then be presented to the Bellabeat executive team along with high-level recommendations for the company’s marketing strategy.
Analyze smart device usage data for trends and apply insights to both Bellabeat customers and marketing strategy.
The data sets were downloaded from Kaggle and belong to the public domain. They contain the personal fitness tracking data from thirty Fitbit users who consented to the submission of their personal tracker data.
Installed and loaded the packages necessary for the analysis:
# install.packages("tidyverse")
# install.packages("skimr")
# install.packages("janitor")
# install.packages("compare")
# install.packages("ggpubr")
# install.packages("scales")
# install.packages("gridExtra")
# install.packages("reshape2")
library("tidyverse")
library("skimr")
library("janitor")
library("compare")
library("ggpubr")
library("scales")
library("gridExtra")
library("reshape2")
Set the working directory, then loaded each smart device usage data set into its own data frame.
setwd("C:/Users/nredw/OneDrive/Documents/Nicholas/Learning/Data Analytics/Google Data Analytics Certificate/Capstone/Bellabeat Capstone/Data/Fitabase Data 4.12.16-5.12.16")
daily_activity <- read.csv("dailyActivity_merged.csv")
daily_calories <- read.csv("dailyCalories_merged.csv")
daily_intensities <- read.csv("dailyIntensities_merged.csv")
daily_steps <- read.csv("dailySteps_merged.csv")
daily_sleep <- read.csv("sleepDay_merged.csv")
weight_log <- read.csv("weightLogInfo_merged.csv")
heartrate <- read.csv("heartrate_seconds_merged.csv")
Compactly displayed the structure of the imported data sets and assessed them.
str(daily_activity)
## 'data.frame': 940 obs. of 15 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ TotalSteps : int 13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
## $ TotalDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : int 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : int 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : int 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : int 728 776 1218 726 773 539 1149 775 818 838 ...
## $ Calories : int 1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
str(daily_calories)
## 'data.frame': 940 obs. of 3 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDay: chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ Calories : int 1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
str(daily_intensities)
## 'data.frame': 940 obs. of 10 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDay : chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ SedentaryMinutes : int 728 776 1218 726 773 539 1149 775 818 838 ...
## $ LightlyActiveMinutes : int 328 217 181 209 221 164 233 264 205 211 ...
## $ FairlyActiveMinutes : int 13 19 11 34 10 20 16 31 12 8 ...
## $ VeryActiveMinutes : int 25 21 30 29 36 38 42 50 28 19 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LightActiveDistance : num 6.06 4.71 3.91 2.83 5.04 ...
## $ ModeratelyActiveDistance: num 0.55 0.69 0.4 1.26 0.41 ...
## $ VeryActiveDistance : num 1.88 1.57 2.44 2.14 2.71 ...
str(daily_steps)
## 'data.frame': 940 obs. of 3 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDay: chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ StepTotal : int 13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
str(daily_sleep)
## 'data.frame': 413 obs. of 5 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ SleepDay : chr "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
## $ TotalSleepRecords : int 1 2 1 2 1 1 1 1 1 1 ...
## $ TotalMinutesAsleep: int 327 384 412 340 700 304 360 325 361 430 ...
## $ TotalTimeInBed : int 346 407 442 367 712 320 377 364 384 449 ...
str(weight_log)
## 'data.frame': 67 obs. of 8 variables:
## $ Id : num 1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
## $ Date : chr "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
## $ WeightKg : num 52.6 52.6 133.5 56.7 57.3 ...
## $ WeightPounds : num 116 116 294 125 126 ...
## $ Fat : int 22 NA NA NA NA 25 NA NA NA NA ...
## $ BMI : num 22.6 22.6 47.5 21.5 21.7 ...
## $ IsManualReport: chr "True" "True" "False" "True" ...
## $ LogId : num 1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
str(heartrate)
## 'data.frame': 2483658 obs. of 3 variables:
## $ Id : num 2.02e+09 2.02e+09 2.02e+09 2.02e+09 2.02e+09 ...
## $ Time : chr "4/12/2016 7:21:00 AM" "4/12/2016 7:21:05 AM" "4/12/2016 7:21:10 AM" "4/12/2016 7:21:20 AM" ...
## $ Value: int 97 102 105 103 101 95 91 93 94 93 ...
Observing the structure of the data sets above, it is apparent that the daily_calories, daily_intensities, and daily_steps data frames will not be necessary as the same data is contained within the daily_activity data frame. This can be seen from the snippets of data in the str() outputs, but also confirmed with the following comparisons:
compare(daily_activity$Calories, daily_calories$Calories)
## TRUE
compare(daily_activity$TotalSteps, daily_steps$StepTotal, ignoreNames = TRUE)
## TRUE
compare(daily_activity$VeryActiveDistance, daily_intensities$VeryActiveDistance)
## TRUE
compare(daily_activity$ModeratelyActiveDistance, daily_intensities$ModeratelyActiveDistance)
## TRUE
compare(daily_activity$LightActiveDistance, daily_intensities$LightActiveDistance)
## TRUE
compare(daily_activity$SedentaryActiveDistance, daily_intensities$SedentaryActiveDistance)
## TRUE
compare(daily_activity$VeryActiveMinutes, daily_intensities$VeryActiveMinutes)
## TRUE
compare(daily_activity$FairlyActiveMinutes, daily_intensities$FairlyActiveMinutes)
## TRUE
compare(daily_activity$LightlyActiveMinutes, daily_intensities$LightlyActiveMinutes)
## TRUE
compare(daily_activity$SedentaryMinutes, daily_intensities$SedentaryMinutes)
## TRUE
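The column-by-column checks above could also be written as a loop over the shared column names. The following is a minimal base-R sketch rather than the method used in this analysis; `activity_toy` and `intensities_toy` are hypothetical stand-ins with made-up values, and `identical()` is used in place of `compare()`.

```r
# Toy stand-ins for daily_activity and daily_intensities (hypothetical values)
activity_toy <- data.frame(
  Id = c(1, 1, 2),
  VeryActiveMinutes = c(25L, 21L, 30L),
  SedentaryMinutes = c(728L, 776L, 1218L),
  Calories = c(1985L, 1797L, 1776L)
)
intensities_toy <- activity_toy[, c("Id", "SedentaryMinutes", "VeryActiveMinutes")]

# Columns present in both data frames, excluding the key column
shared_cols <- setdiff(intersect(names(activity_toy), names(intensities_toy)), "Id")

# identical() is TRUE only when values and types match exactly, row by row
same <- vapply(
  shared_cols,
  function(col) identical(activity_toy[[col]], intensities_toy[[col]]),
  logical(1)
)
print(same)
```

Like the individual comparisons, this assumes both data frames list their rows in the same order; if they did not, the data would need to be sorted or joined on Id and date first.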
Additionally, the weight_log data frame contains only 67 observations, far fewer than the 940 in the daily_activity and the 413 in the daily_sleep data frames. To investigate further, the number of unique Ids in each data frame was calculated:
#Calculating the number of unique Ids in each of the data frames to be used
weight_log_ids <- weight_log$Id %>% unique %>% length
daily_activity_ids <- daily_activity$Id %>% unique %>% length
daily_sleep_ids <- daily_sleep$Id %>% unique %>% length
heartrate_ids <- heartrate$Id %>% unique %>% length
#Setting up the columns for the data frame to be used in the plot
Data_Frame <- c("weight_log", "daily_activity", "daily_sleep", "heartrate")
Unique_Ids <- c(weight_log_ids, daily_activity_ids, daily_sleep_ids, heartrate_ids)
#Creating the plot
data.frame(Data_Frame, Unique_Ids) %>%
#Setting the fill with the as.factor() function forces discrete colors between the different data frames
ggplot(aes(x = Data_Frame, y = Unique_Ids, fill = as.factor(Data_Frame))) +
geom_bar(stat = "identity") +
ylab("Unique Ids") +
xlab("Data Frame") +
#Renames the title on the legend (otherwise it will read "as.factor(Data_Frame)")
scale_fill_discrete(name = "Legend") +
#Add title and subtitle and adjusts the size
labs(title = "Unique Ids in Each Data Frame", subtitle = "Most of the data frames contain fewer unique Ids than the total number of participants in the total data set (n=30).") +
theme(plot.subtitle = element_text(size = 10))
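The four separate count variables above could also be produced in one step by keeping the data frames in a named list. This is a minimal base-R sketch under that assumption; the `frames` list and its values are hypothetical and stand in for the imported data.

```r
# Hypothetical stand-ins for the imported data frames
frames <- list(
  weight_log = data.frame(Id = c(1, 1, 2)),
  daily_sleep = data.frame(Id = c(1, 2, 2, 3)),
  daily_activity = data.frame(Id = c(1, 2, 3, 4, 4))
)

# Count the distinct Ids in each data frame
unique_ids <- vapply(frames, function(df) length(unique(df$Id)), integer(1))

# Arrange the counts in the two-column shape used for the bar plot
id_counts <- data.frame(
  Data_Frame = names(unique_ids),
  Unique_Ids = unname(unique_ids)
)
print(id_counts)
```

The resulting table has the same Data_Frame and Unique_Ids columns as the one passed to ggplot above, so adding another data set only requires adding it to the list.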
The above demonstrates that only 8 individuals contributed to the
weight_log data. A glance through this data frame in tabular form shows
that much of the data is for only two distinct values in the Id column
(i.e., most of the data only consists of two individuals). The remaining
Ids only have entries on 2-4 dates each. As such, it would be difficult
to use this data to conduct analyses, as the results would essentially
reflect just two individuals. The following shows that much of the
data belongs to just two unique Ids:
weight_log_entries <- weight_log %>%
group_by(Id) %>%
count
weight_log_entries
#Establish simpler names for Ids for presenting
#paste() is vectorized and already inserts a single space, so no loop is needed
users <- paste("User", seq_along(weight_log_entries$Id))
#Calculating percentage of entries per Id
weight_log_chart <- weight_log_entries %>%
mutate(percent = round(100 * n / sum(weight_log_entries$n),1))
weight_log_chart$user <- users
#Creating pie chart of entries per Id
weight_log_chart %>%
ggplot(aes(x = "", y = percent, fill = as.factor(user))) +
geom_col() +
# Establishes the circular graph
coord_polar(theta = "y") +
# Changes color palette
scale_fill_brewer(palette = "Set3", name = "Legend") +
#Removes extraneous elements
theme_void() +
#Adds labels for the percentages and positions them
geom_text(aes(x = 1.6, label = percent),
position = position_stack(vjust = 0.5)) +
labs(title = "Weight Log Entries by User", subtitle = "More than 75% of weight log entries came from just two distinct users.")
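The claim in the subtitle can be checked numerically by adding up the shares of the most frequent Ids. The following base-R sketch shows one way to do so; the `ids` vector is made up for illustration and does not reproduce the real weight log.

```r
# Hypothetical Id column of a weight log (not the real data)
ids <- c(rep("A", 5), rep("B", 3), "C", "D")

# Entries per Id, largest first
entries <- sort(table(ids), decreasing = TRUE)

# Percentage of all entries contributed by the two most frequent Ids
top_two_share <- 100 * sum(entries[1:2]) / sum(entries)
print(top_two_share)
```

Applied to weight_log$Id, the same calculation would give the combined share behind the "more than 75%" statement.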
Only 8 users out of ~30 in the total data set logged any weight entries.
With 2 of those 8 logging most of the entries, only 2 out of 30 (~7%)
consistently logged their weight. The lack of quality data around weight
highlights an opportunity for encouraging users to be diligent and
consistent with weight tracking through notifications. A possible
marketing strategy could be to highlight the ease of tracking health
data with the Bellabeat app, emphasizing the simplicity of logging
weight entries. Future analysis could involve exploring if those who
track their activity, sleep, weight, and other health data more
consistently tend to lose more weight. Demonstrating this relationship
would provide a marketing strategy for selling Bellabeat products (i.e.,
using the app and the smart wellness products leads to losing weight).
It would also be an effective means of encouraging users to continue
using the Bellabeat app and the Leaf and Time products.
After exploring the daily_activity data in tabular form, it was observed that many of the rows contain “zero” across all or many of the columns. Sorting by either TotalSteps or TotalDistance made this apparent:
daily_activity %>% arrange(TotalSteps)
It is likely that rows containing 0 in the TotalSteps column indicate days on which the individual did not wear their smart device, as it is not realistic for a person to take no steps during their waking hours. Taking this further, many entries with small non-zero values likely indicate days on which the user did not wear the device consistently. Therefore, a cutoff of 1000 steps was chosen as the threshold for inclusion in the analysis:
daily_activity <- daily_activity %>% filter(TotalSteps >= 1000)
The data now contains only rows with TotalSteps greater than or equal to 1000, which still accounts for most of the rows (831) of the original data set:
daily_activity %>% arrange(TotalSteps)
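Because the 1000-step cutoff is a judgment call, it can help to see how many rows survive under a few alternative thresholds before settling on one. A minimal base-R sketch follows; the `steps` values and the candidate `thresholds` are hypothetical.

```r
# Hypothetical daily step totals, including zero and near-zero days
steps <- c(0, 0, 150, 980, 1000, 4200, 8700, 13162)

# Candidate cutoffs to compare
thresholds <- c(0, 500, 1000, 2000)

# Number of rows that would be kept at each cutoff
rows_kept <- vapply(thresholds, function(t) sum(steps >= t), integer(1))

threshold_table <- data.frame(threshold = thresholds, rows_kept = rows_kept)
print(threshold_table)
```

If the number of retained rows changes sharply around the chosen cutoff, later results may be sensitive to that choice.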
While the daily_activity data frame contains only dates in its time data (i.e., MM/DD/YYYY), the daily_sleep data frame’s time data contains a date along with a time component (i.e., MM/DD/YYYY HH:MM:SS). This time component does not actually contain any useful data (it is always 12:00:00 AM) and needed to be removed to match up with the daily_activity data.
# Code using pipes:
daily_sleep$SleepDay <- daily_sleep$SleepDay %>%
substr(1, nchar(daily_sleep$SleepDay)-3) %>% #Removing the "AM" from the column
mdy_hms %>% #Converting the string to a timestamp
date ##Removing the time component of the timestamp
# Original Code that was written:
# daily_sleep$SleepDay <- paste0("0", daily_sleep$SleepDay)
# daily_sleep$SleepDay <- substr(daily_sleep$SleepDay, 1, nchar(daily_sleep$SleepDay)-3)
# daily_sleep$SleepDay <- mdy_hms(daily_sleep$SleepDay)
# daily_sleep$SleepDay <- date(daily_sleep$SleepDay)
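As an alternative to trimming the string and parsing a full timestamp, base R's `as.Date()` can read only the date portion, since characters after the matched format are ignored. This is a sketch of that alternative, not the approach used above; the sample strings follow the format shown in the str() output.

```r
# Sample values in the same format as daily_sleep$SleepDay
sleep_day <- c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM")

# Only the month/day/year part is matched; the trailing time text is ignored
sleep_date <- as.Date(sleep_day, format = "%m/%d/%Y")
print(sleep_date)
```

The result is a Date vector, the same type that the lubridate pipeline above produces.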
The SleepDay column in the daily_sleep data frame is now just a date; however, in the process it was converted from a string to a Date type. For consistency, the ActivityDate column in the daily_activity data frame was also converted.
daily_activity$ActivityDate <- daily_activity$ActivityDate %>%
mdy %>%
date
The daily_activity data and the daily_sleep data are now ready to be combined. A right_join was used so that the result is keyed to the rows of daily_sleep, which has far fewer observations (413 vs. the original 940 in daily_activity). Keeping every daily_activity row instead (i.e., a left_join) would have produced many NA entries in the sleep columns, which would have inhibited analyses of correlations between the activity data and the sleep data. As a last step, any rows containing NA as a result of the merge are removed, so the final table contains only Id and date combinations present in both data sets:
daily_activity_sleep <- right_join(daily_activity, daily_sleep, by=c('Id','ActivityDate'='SleepDay')) %>%
drop_na
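When neither source table has missing values of its own, a right join followed by drop_na keeps the same rows as an inner join: only the Id and date pairs found in both tables. The base-R sketch below illustrates this with `merge()`, which performs an inner join by default; `activity_toy` and `sleep_toy` are hypothetical stand-ins. Note that drop_na would additionally remove rows with NA in any original column, which an inner join alone would not.

```r
# Toy stand-ins for the cleaned daily_activity and daily_sleep tables
activity_toy <- data.frame(
  Id = c(1, 1, 2),
  ActivityDate = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  TotalSteps = c(13162L, 10735L, 4200L)
)
sleep_toy <- data.frame(
  Id = c(1, 2, 3),
  SleepDay = as.Date(c("2016-04-12", "2016-04-12", "2016-04-12")),
  TotalMinutesAsleep = c(327L, 412L, 380L)
)

# merge() keeps only keys that appear in both tables (an inner join)
joined <- merge(
  activity_toy, sleep_toy,
  by.x = c("Id", "ActivityDate"),
  by.y = c("Id", "SleepDay")
)
print(joined)
```

Here Id 3 has sleep data but no activity data, and Id 1 has an activity day with no sleep record, so both of those rows drop out.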
The Time column in the heartrate data frame was converted from a string to POSIXct. Additionally, the time-of-day component was broken out into its own column to facilitate analysis.
## Converts the data into a POSIXct format
heartrate$Time <- heartrate$Time %>%
parse_date_time2('%m:%d:%Y %I:%M:%S %p')
## This creates a new column in which only the time element of the Time column is retained; the date is defaulted "0000-01-01", which is not an issue as the aim is to aggregate based on the time and not the date.
heartrate$DayTime <- format(as.POSIXct(heartrate$Time,format="%Y:%m:%d %H:%M:%S"),"%H:%M:%S") %>%
parse_date_time2('%H:%M:%S')
# Previous attempts at cleaning the data:
# heartrate$Date <- heartrate$Time %>%
# as.Date()
#
# heartrate$DayTime <- heartrate$Time %>%
# as.Time()
# heartrate$Time <- heartrate$Time %>%
# substr(1, nchar(heartrate$Time)-3)
# heartrate$DayTime <- format(as.POSIXct(heartrate$Time,format="%Y:%m:%d %H:%M:%S"),"%H:%M:%S")
# heartrate$Date <- format(as.POSIXct(heartrate$Time,format="%Y:%m:%d %H:%M:%S"),"%Y:%m:%d")
# heartrate <- heartrate %>%
# separate(Time, into = c('Date','Time'), sep = ' ')
#
# heartrate$Date <- heartrate$Date %>% ymd
# heartrate$Time <- heartrate$Time %>% as.POSIXct(format = '%H:%M:%S')
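Another way to isolate the time of day, shown here only as a sketch, is to store it as seconds since midnight instead of a timestamp with a placeholder date. The example uses base R; the sample strings follow the format of the Time column, and parsing of `%p` (AM/PM) is assumed to run under an English or C locale.

```r
# Sample timestamps in the same format as heartrate$Time
stamps <- c("4/12/2016 7:21:00 AM", "4/12/2016 7:21:05 PM")

# Parse the 12-hour clock strings; UTC avoids daylight-saving shifts
parsed <- as.POSIXct(stamps, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Seconds elapsed since midnight, independent of the calendar date
secs <- as.numeric(format(parsed, "%H")) * 3600 +
  as.numeric(format(parsed, "%M")) * 60 +
  as.numeric(format(parsed, "%S"))
print(secs)
```

A numeric column like this can be grouped and averaged in the same way as DayTime, though axis labels would then need to be formatted back into clock times for plotting.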
Both the original Time column and the new DayTime column in the heart rate data frame are now POSIXct types, and the date has essentially been scrubbed in the DayTime column:
str(heartrate)
## 'data.frame': 2483658 obs. of 4 variables:
## $ Id : num 2.02e+09 2.02e+09 2.02e+09 2.02e+09 2.02e+09 ...
## $ Time : POSIXct, format: "2016-04-12 07:21:00" "2016-04-12 07:21:05" ...
## $ Value : int 97 102 105 103 101 95 91 93 94 93 ...
## $ DayTime: POSIXct, format: "0000-01-01 07:21:00" "0000-01-01 07:21:05" ...
Next, the data in the heart rate data frame was used to create a new data frame holding, for each unique Id, the average heart rate at each time of day:
heartrate_agg <- heartrate %>%
group_by(Id, DayTime) %>%
summarise(Avg_Value = mean(Value))
A new data frame, heartrate_agg_stats, was created to hold summary statistics for the newly created aggregated heart rate data frame. The aggregated heart rate data was then used in exploratory visualizations.
heartrate_agg_stats <- heartrate_agg %>%
group_by(Id) %>%
summarize(min = min(Avg_Value),
q1 = quantile(Avg_Value, 0.25),
median = median(Avg_Value),
mean = mean(Avg_Value),
q3 = quantile(Avg_Value, 0.75),
max = max(Avg_Value)) %>%
arrange(-mean)
heartrate_agg_stats
heartrate_agg %>%
ggplot(aes(DayTime,Avg_Value)) +
#as.factor(Id) forces the color to be individual for each plot, rather than a gradient across each Id; alpha is set outside aes() so that it is applied as a fixed transparency rather than mapped to a scale
geom_point(aes(color = as.factor(Id)), alpha = 1/10) +
#scales = "free" allows for each plot to have its own scale on the y- and x-axis; however, the axis limits are rebound below
facet_wrap(~Id, scales = "free", ncol = 2) +
#hides the legend that is unnecessary
guides(color = "none") +
#Adjusts the tick marks to consistently show between 00:00:00 and 24:00:00 on the x-axis and between 40 and 200 on the y-axis
scale_x_datetime(date_labels = '%T',
limits = c(as.POSIXct("0000-01-01 00:00:00", tz = 'UTC'),
as.POSIXct("0000-01-01 24:00:00", tz = 'UTC'))
) +
scale_y_continuous(limits = c(40,200)) +
#Adds title, subtitle, and y-axis label
labs(title = "Averaged Heart Rate Throughout the Day",
subtitle = "The heart rate of 14 FitBit users was averaged over the course of a day for approximately one month.",
caption = "The dotted lines represent the 25th and 75th percentiles and the red dashed line represents the mean for all heart rate values across all individuals.") +
ylab("Heart Rate (bpm)") +
#Adjusting the angle of the x-axis tick mark labels, the size of the title and subtitle, and removes the x-axis label
theme(axis.text.x = element_text(angle = 30),
plot.title = element_text(size = 20),
plot.subtitle = element_text(size = 10),
axis.title.x = element_blank()) +
#Adds trendline
geom_smooth() +
#Adds in the three horizontal lines for the 25th and 75th percentiles and the mean; these are computed across all Ids, so the same lines appear in every facet
geom_hline(aes(yintercept=quantile(Avg_Value, 0.75)), linetype = 'dotted') +
geom_hline(aes(yintercept=mean(Avg_Value)), color = 'red', linetype = 'dashed') +
geom_hline(aes(yintercept=quantile(Avg_Value, 0.25)), linetype = 'dotted')
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
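Expressions such as `mean(Avg_Value)` inside `aes()` are evaluated on the whole data set passed to the layer, so the reference lines in the plot above are identical in every facet. If lines specific to each Id were wanted instead, one option would be to compute a small per-Id summary table first. A minimal base-R sketch, with a hypothetical `hr_toy` data frame:

```r
# Toy stand-in for heartrate_agg (hypothetical values)
hr_toy <- data.frame(
  Id = rep(c("A", "B"), each = 4),
  Avg_Value = c(60, 70, 80, 90, 100, 110, 120, 130)
)

# One row of reference statistics per Id
per_id <- do.call(rbind, lapply(split(hr_toy, hr_toy$Id), function(d) {
  data.frame(
    Id = d$Id[1],
    q1 = unname(quantile(d$Avg_Value, 0.25)),
    mean = mean(d$Avg_Value),
    q3 = unname(quantile(d$Avg_Value, 0.75))
  )
}))
print(per_id)

# With ggplot2, a table like this could be supplied to a layer, e.g.
# geom_hline(data = per_id, aes(yintercept = mean)),
# so that each facet draws the line for its own Id.
```

Whether global or per-Id lines are more useful depends on the question: global lines make it easy to compare individuals against the group, while per-Id lines show each person relative to their own baseline.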
The graphical representation of each individual’s average heart rate throughout the day led to several observations.
First, several of the individuals have pronounced peaks in their data, likely representing times of either vigorous labor or deliberate exercise. Additionally, most of the individuals have periods of markedly lower heart rates, whether compared to the average or to the 25th percentile of users’ data, likely indicating sleep. This data could be used to generate notifications that prompt users either to consider getting a workout in or to begin winding down for bed, based on their historic average heart rate data.
Second, there are gaps in some of the individuals’ data, in most cases occurring sometime during the morning. Some individuals appear not to wear the device when they sleep, while others have small gaps, perhaps during morning routines such as showering and eating breakfast, and others still have only sporadic spurts of averaged data. If the intent is for the device to be worn continuously in order to provide accurate data, then notifications could be triggered by the hours in which a user's data is typically missing, encouraging them to wear the device during those hours and explaining the benefits of doing so.
Lastly, while the population is small (n = 14), there appear to be a few “archetypes” of user based on the overall shape of their average daily heart rate data. The first archetype shows a relatively level trend with obvious peaks, sometimes several, at particular times of the day. This likely reflects a user who is not particularly active during most of the day but makes a concerted effort to work out and exercise routinely. A second archetype shows a definite, sustained period of increased heart rate throughout the day, but without peaks as dramatic as those of the first archetype. This likely reflects a user who performs some form of physical labor throughout their day, likely as part of a job. A last archetype to consider is a user whose data shows neither peaks nor periods of sustained elevated heart rate. This would likely indicate someone who is largely sedentary during the day and does not routinely exercise.
While further data would be necessary to flesh out these categories, test the assumptions behind them, and explore whether others exist, this preliminary analysis could prove valuable for Bellabeat’s marketing strategy, as each archetype could be targeted in a distinct way to provide better support and services.
Next, the daily_activity_sleep data frame was analyzed.
summary(daily_activity_sleep)
## Id ActivityDate TotalSteps TotalDistance
## Min. :1.504e+09 Min. :2016-04-12 Min. : 1202 Min. : 0.780
## 1st Qu.:3.977e+09 1st Qu.:2016-04-19 1st Qu.: 5450 1st Qu.: 3.723
## Median :4.703e+09 Median :2016-04-27 Median : 9105 Median : 6.380
## Mean :5.047e+09 Mean :2016-04-26 Mean : 8720 Mean : 6.166
## 3rd Qu.:6.962e+09 3rd Qu.:2016-05-04 3rd Qu.:11420 3rd Qu.: 8.063
## Max. :8.792e+09 Max. :2016-05-12 Max. :22770 Max. :17.540
## TrackerDistance LoggedActivitiesDistance VeryActiveDistance
## Min. : 0.780 Min. :0.0000 Min. : 0.000
## 1st Qu.: 3.723 1st Qu.:0.0000 1st Qu.: 0.000
## Median : 6.380 Median :0.0000 Median : 0.630
## Mean : 6.160 Mean :0.1157 Mean : 1.483
## 3rd Qu.: 8.053 3rd Qu.:0.0000 3rd Qu.: 2.430
## Max. :17.540 Max. :4.0817 Max. :12.540
## ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
## Min. :0.000 Min. :0.350 Min. :0.0000000
## 1st Qu.:0.000 1st Qu.:2.625 1st Qu.:0.0000000
## Median :0.440 Median :3.760 Median :0.0000000
## Mean :0.767 Mean :3.884 Mean :0.0009406
## 3rd Qu.:1.052 3rd Qu.:5.000 3rd Qu.:0.0000000
## Max. :6.480 Max. :9.480 Max. :0.1100000
## VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
## Min. : 0.00 Min. : 0.00 Min. : 34.0 Min. : 125.0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:161.8 1st Qu.: 631.8
## Median : 10.00 Median : 12.00 Median :210.5 Median : 716.5
## Mean : 25.75 Mean : 18.44 Mean :220.9 Mean : 713.1
## 3rd Qu.: 38.25 3rd Qu.: 28.00 3rd Qu.:264.2 3rd Qu.: 781.0
## Max. :210.00 Max. :143.00 Max. :518.0 Max. :1265.0
## Calories TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## Min. : 741 Min. :1.000 Min. : 58.0 Min. : 61.0
## 1st Qu.:1860 1st Qu.:1.000 1st Qu.:362.5 1st Qu.:405.2
## Median :2225 Median :1.000 Median :433.0 Median :463.0
## Mean :2421 Mean :1.114 Mean :419.0 Mean :458.3
## 3rd Qu.:2933 3rd Qu.:1.000 3rd Qu.:492.0 3rd Qu.:527.8
## Max. :4900 Max. :3.000 Max. :796.0 Max. :961.0
First, the average daily total steps taken was 8,720, which is below the commonly cited 10,000-step goal; however, research has indicated significant health benefits from taking even 8,000 steps daily (as compared to only 4,000 steps). If this data is indicative of the average Leaf user as well, then marketing could cite this research in promotional materials to state that the average Leaf user can receive these health benefits.
Next, the average total distance of 6.2 (with a 1st quartile of 3.7) is much greater than the 1.5 - 2 miles the average American walks daily. The data set does not state the distance unit, however; the first row (13,162 steps for a distance of 8.5) suggests kilometers rather than miles, in which case the mean is roughly 3.8 miles and the 1st quartile roughly 2.3 miles, both still above the typical figure. With that caveat, marketing information could promote that the average user of Leaf devices covers a greater distance than the typical American, and that even casual users still cover more distance on average.
Continuing, the average very active, fairly active, and lightly active minutes were 25.75, 18.44, and 220.9 respectively. The CDC recommends 150 minutes of “moderate-intensity” or 75 minutes of “vigorous-intensity” activity each week, or an equivalent combination. Assuming very active corresponds to vigorous-intensity and fairly active to moderate-intensity, the average user in this data set gets approximately 180 minutes of vigorous-intensity and 130 minutes of moderate-intensity activity each week. The vigorous-intensity figure alone is more than double the 75-minute recommendation, so the combined total is well above the CDC’s guidance even though the moderate-intensity figure on its own falls short of 150 minutes. This is yet another boast that could be used in marketing promotions.
A final metric to consider is sleep: the average user gets 419 minutes daily, or 6.98 hours per night, which is just shy of the recommended 7+ hours of sleep for most adults. There are several very low observations in the data (entries with as little as 1, 2, or 3 hours of sleep) that could arguably be outliers. Additionally, with more data specific to Leaf users, the average could very well be over the 7-hour mark the CDC recommends. If considered indicative of average Leaf users, then stating that the average Leaf user gets roughly the recommended amount of sleep each night could be used for marketing purposes.
The medians of these variables, however, tell a different story: the very active and fairly active minutes columns have much lower medians than means. This was explored further in the following box plots:
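The weekly activity figures and the sleep conversion above follow directly from the means in the summary() output. The sketch below reproduces that arithmetic; treating very active minutes as vigorous-intensity and fairly active minutes as moderate-intensity is the assumption already stated in the text, and the variable names are illustrative.

```r
# Mean daily minutes taken from the summary() output
daily_means <- c(very_active = 25.75, fairly_active = 18.44)

# Scale daily means to a 7-day week
weekly_minutes <- daily_means * 7

# CDC weekly recommendations, matched to the assumed intensity levels
cdc_weekly <- c(very_active = 75, fairly_active = 150)

# TRUE where the weekly mean alone meets the corresponding recommendation
meets_alone <- weekly_minutes >= cdc_weekly

# Mean minutes asleep converted to hours
sleep_hours <- 419 / 60

print(weekly_minutes)
print(meets_alone)
print(round(sleep_hours, 2))
```

Taken separately, only the vigorous-intensity figure clears its recommendation; the moderate-intensity figure of about 129 minutes is below 150.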
daily_activity_sleep_m <- daily_activity_sleep %>%
#Converting the data into a long format to facilitate plotting
melt(id.vars = "Id", measure.vars = c("TotalSteps", "TotalDistance", "VeryActiveMinutes", "FairlyActiveMinutes", "LightlyActiveMinutes", "TotalMinutesAsleep"))
daily_activity_sleep_m %>%
ggplot(aes(x = variable, y = value, fill = variable)) +
geom_boxplot() +
facet_wrap(~variable, scales = "free") +
labs(title = "Daily Activity Box Plots", subtitle = "The breakdown of the activity data into box plots reveals many outliers for very active and fairly active minutes.") +
theme(plot.title = element_text(size = 20),
plot.subtitle = element_text(size = 10),
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.title.y = element_blank()) +
guides(fill = "none")
The box plots revealed that there appear to be many outliers in the very
active and fairly active minutes data. As such, these were further
broken down by individual and examined:
daily_activity_sleep %>%
ggplot(aes(x = reorder(as.factor(Id), -VeryActiveMinutes) , y = VeryActiveMinutes, color = as.factor(Id))) +
guides(color = "none") +
theme(axis.text.x = element_text(angle = 30),
plot.title = element_text(size = 20),
plot.subtitle = element_text(size = 10),
axis.title.x = element_blank()) +
labs(title = "Very Active Minutes in a Day by Id",
subtitle = "Only a few individuals had days with greater than 100 very active minutes.",
caption = "The dotted lines represent the 25th and 75th percentiles and the red dashed line represents the median."
) +
geom_point() +
geom_hline(aes(yintercept=quantile(VeryActiveMinutes, 0.75)), linetype = 'dotted') +
geom_hline(aes(yintercept=median(VeryActiveMinutes)), color = 'red', linetype = 'dashed') +
geom_hline(aes(yintercept=quantile(VeryActiveMinutes, 0.25)), linetype = 'dotted')
daily_activity_sleep %>%
ggplot(aes(x = reorder(as.factor(Id), -FairlyActiveMinutes) , y = FairlyActiveMinutes, color = as.factor(Id))) +
guides(color = "none") +
theme(axis.text.x = element_text(angle = 30),
plot.title = element_text(size = 20),
plot.subtitle = element_text(size = 10),
axis.title.x = element_blank()) +
labs(title = "Fairly Active Minutes in a Day by Id",
subtitle = "Only a few individuals had days with greater than 75 fairly active minutes.",
caption = "The dotted lines represent the 25th and 75th percentiles and the red dashed line represents the median."
) +
geom_point() +
geom_hline(aes(yintercept=quantile(FairlyActiveMinutes, 0.75)), linetype = 'dotted') +
geom_hline(aes(yintercept=median(FairlyActiveMinutes)), color = 'red', linetype = 'dashed') +
geom_hline(aes(yintercept=quantile(FairlyActiveMinutes, 0.25)), linetype = 'dotted')
For either variable, very or fairly active minutes, the dot plots
demonstrate that only a few individuals were contributing to the
outliers shown in their respective box plots. These
individuals would therefore be contributing towards the larger
discrepancy between the median and mean for these variables, as their
days with a significantly higher number of very active or fairly active
minutes would pull the average higher as compared to the median.
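The effect described above, in which a few very high days raise the mean while leaving the median almost unchanged, can be illustrated with a small made-up vector. The `minutes` values below are hypothetical and chosen only to mimic the shape of the data.

```r
# Hypothetical very active minutes: mostly low days plus two extreme days
minutes <- c(0, 0, 5, 10, 12, 15, 20, 150, 210)

# The same data with the two extreme days removed
typical <- minutes[minutes < 100]

comparison <- data.frame(
  data = c("all days", "without extreme days"),
  mean = c(mean(minutes), mean(typical)),
  median = c(median(minutes), median(typical))
)
print(comparison)
```

Removing the two extreme days moves the mean far more than the median, which is the same pattern that separates the means and medians in the summary table.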
The variation between individuals’ very active and fairly active minutes in the data set highlights the differences in each person's exercise and activity habits. While marketing efforts can still make claims based on averages across all users, tailoring notifications, messages, or analytics to each user would provide personalized feedback that could improve their health. For example, based on a user’s very active and fairly active minutes, a message or notification could indicate whether the user is reaching the CDC-recommended 150 minutes of “moderate-intensity” or 75 minutes of “vigorous-intensity” activity, along with a comparison of their activity against this threshold.
The relationships between variables in the activity data frame and calories burned were investigated next:
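A per-user check against the CDC guideline could look something like the following sketch. The function name `meets_cdc_guideline` is hypothetical, and the rule that one vigorous minute counts as two moderate minutes is an assumption used here to express the "equivalent combination" as a single number.

```r
# Returns TRUE when a week's activity meets the guideline, counting each
# vigorous minute as two moderate minutes (assumed equivalence)
meets_cdc_guideline <- function(moderate_min, vigorous_min) {
  equivalent_moderate <- moderate_min + 2 * vigorous_min
  equivalent_moderate >= 150
}

# Hypothetical weekly totals for three users
weekly <- data.frame(
  user = c("User 1", "User 2", "User 3"),
  moderate = c(150, 60, 90),
  vigorous = c(0, 30, 30)
)
weekly$meets <- meets_cdc_guideline(weekly$moderate, weekly$vigorous)
print(weekly)
```

In an app, the same comparison could drive the wording of a weekly notification, for example by reporting how many equivalent minutes remain before the threshold is reached.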
totalsteps_calories <- daily_activity_sleep %>%
ggplot(aes(TotalSteps, Calories)) +
geom_point() +
geom_smooth(method = "lm") +
stat_cor()
totalsteps_calories_f <- daily_activity_sleep %>%
filter(TotalSteps < quantile(TotalSteps, 0.975)) %>%
ggplot(aes(TotalSteps, Calories)) +
geom_point() +
geom_smooth(method = "lm") +
stat_cor()
trackerdistance_calories <- daily_activity_sleep %>%
filter(TrackerDistance < quantile(TrackerDistance, 0.975)) %>%
ggplot(aes(TrackerDistance, Calories)) +
geom_point() +
geom_smooth(method = "lm") +
stat_cor()
veryactivedist_calories <- daily_activity_sleep %>%
filter(VeryActiveDistance < quantile(VeryActiveDistance, 0.975)) %>%
ggplot(aes(VeryActiveDistance, Calories)) +
geom_point() +
geom_smooth(method = "lm") +
stat_cor()
modactivedist_calories <- daily_activity_sleep %>%
filter(ModeratelyActiveDistance < quantile(ModeratelyActiveDistance, 0.975)) %>%
ggplot(aes(ModeratelyActiveDistance, Calories)) +
geom_point() +
geom_smooth(method = "lm") +
stat_cor()
lightactivedist_calories <- daily_activity_sleep %>%
filter(LightActiveDistance < quantile(LightActiveDistance, 0.975)) %>%
ggplot(aes(LightActiveDistance, Calories)) +
geom_point() +
geom_smooth(method = "lm") +
stat_cor()
veryactivemin_calories <- daily_activity_sleep %>%
filter(VeryActiveMinutes < quantile(VeryActiveMinutes, 0.975)) %>%
ggplot(aes(VeryActiveMinutes, Calories)) +
geom_point() +
geom_smooth(method = "lm") +
stat_cor()
grid.arrange(totalsteps_calories, totalsteps_calories_f, trackerdistance_calories,
veryactivedist_calories, modactivedist_calories, lightactivedist_calories,
veryactivemin_calories,
ncol = 2,
top = text_grob("Daily Activity vs. Calories Scatter Plots", hjust = -0.125, vjust = 0.5, x = 0, size = 20)
)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
The first total steps vs. calories plot reveals a weak correlation between total steps taken and calories; however, there appear to be outliers (above roughly 15,000 steps) that may be skewing the correlation. After filtering out outliers, using the criterion that they lie beyond the 0.975 quantile, the correlation becomes even weaker. So while a positive relationship exists between total steps and calories burned, it is not very strong and therefore would not be useful for further considerations.
A stronger relationship exists between daily tracked distance and calories burned; however, the correlation is still relatively weak.
Additionally, only poor correlations could be demonstrated between the active distance measurements and calories burned.
A moderate correlation exists between calories burned and very active minutes; the Bellabeat app could use this to notify users, when they begin logging very active minutes, that they are burning calories at a high rate.
Because exploring scatter plots one by one was inefficient, a correlation matrix was generated to summarize the relationships between variables at a glance:
#Picking the columns in the daily_activity_sleep data frame to correlate
daily_activity_sleep_numerics <- daily_activity_sleep[,c(3,4,5,6,7,8,9,11,12,13,14,15,17,18)]
#Generating the correlation matrix
activity_cor_mat <- cor(daily_activity_sleep_numerics)
#Converting the matrix into a data frame and transforming its arrangement such that each correlation variable and the correlation itself has its own column
activity_cor_table <- activity_cor_mat %>%
as.data.frame %>%
rownames_to_column("Row_Variable") %>%
pivot_longer(-"Row_Variable")
#Renaming columns
activity_cor_table <- activity_cor_table %>%
rename("Column_Variable" = "name") %>%
rename("Correlation" = "value")
#Removing the diagonal (i.e., where the same variable was correlated with itself)
activity_cor_table_f <- filter(activity_cor_table, Column_Variable != Row_Variable)
#Removing mirrored duplicates (i.e., Row_Var A with Column_Var B and Row_Var B with Column_Var A); note that this relies on each pair having a unique correlation value
activity_cor_table_f <- activity_cor_table_f %>%
distinct(Correlation, .keep_all = TRUE)
#Removing any weak correlations
activity_cor_table_f <- filter(activity_cor_table_f, Correlation >= 0.40 | Correlation <= -0.40)
activity_cor_table_f <- arrange(activity_cor_table_f, -Correlation)
activity_cor_table_f
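Deduplicating with `distinct(Correlation)` works when every pair has a unique correlation value, but two different pairs that share the same value would be collapsed into one. As an alternative, here is a minimal sketch that keeps only the cells above the diagonal of the matrix, so each pair appears once and the diagonal is dropped in the same step. The `toy` data frame is hypothetical and its values are made up; only the column names mirror the report.

```r
# Hypothetical toy data; column names mirror the report but values are made up
set.seed(1)
toy <- data.frame(
  TotalSteps = rnorm(50),
  Calories = rnorm(50),
  VeryActiveMinutes = rnorm(50)
)
toy$TotalDistance <- toy$TotalSteps * 0.7 + rnorm(50, sd = 0.1)

cor_mat <- cor(toy)

# Positions of the cells strictly above the diagonal: each pair appears once
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)

cor_pairs <- data.frame(
  Row_Variable = rownames(cor_mat)[idx[, "row"]],
  Column_Variable = colnames(cor_mat)[idx[, "col"]],
  Correlation = cor_mat[idx]
)

# Sort from the strongest positive to the strongest negative correlation
cor_pairs <- cor_pairs[order(-cor_pairs$Correlation), ]
cor_pairs
```

With four variables this yields six rows, one per unordered pair. The same threshold filter used in the report could then be applied to `cor_pairs`.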
activity_cor_table %>%
ggplot(aes(Row_Variable, Column_Variable, fill = Correlation)) +
geom_tile() +
#Sets the colors of the heat map and forces the limits of color gradient to match correlations of -1 to 1.
scale_fill_gradient2(low = "blue", mid = "white", high = "orange", limits = c(-1, 1)) +
theme(
#Adjusts the angle, margin, and position of the x-axis texts
axis.text.x = element_text(angle = -30,
margin = margin(t = -0),
hjust = 0
),
axis.ticks.length.x = unit(0.5, "cm"),
axis.title = element_blank(),
#Adjusting the position of the title and subtitle
plot.title = element_text(hjust = -2),
plot.subtitle = element_text(hjust = 0.935)
) +
labs(title="Daily Activity, Calories, and Sleep Correlations", subtitle = "The correlation heat map reveals that many of the activity variables have weak correlations.")
The three highest correlations (R > 0.98) were between pairs of
TotalDistance, TrackerDistance, and TotalSteps. The strong correlation
between these variables highlights how rarely users enter logged
activities that contribute to the total distance, as opposed to relying
on the tracked distance their device calculates. In fact, of the
original 831 observations in the raw daily_activity data frame, only
~30 entries contain non-zero values for the LoggedActivitiesDistance
column. This suggests that users prefer to rely on the device to track
their numbers rather than on manual entry. Marketing strategy can make
use of this preference by emphasizing how easily the Leaf or Time
devices track user activity.
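The count of manually logged entries can be obtained with a simple logical sum. The sketch below uses a small hypothetical data frame, `toy_activity`, with made-up values; the same expressions should apply to the raw daily_activity data frame, assuming its LoggedActivitiesDistance column is numeric.

```r
# Hypothetical stand-in for the raw daily_activity data frame (made-up values)
toy_activity <- data.frame(
  Id = c(1, 1, 2, 2, 3),
  LoggedActivitiesDistance = c(0, 2.1, 0, 0, 0)
)

# Rows where a distance was entered manually
has_log <- toy_activity$LoggedActivitiesDistance > 0

# Number and share of rows with a manually logged distance
n_logged <- sum(has_log)
share_logged <- mean(has_log)

# Number of distinct users who logged at least once
users_logged <- length(unique(toy_activity$Id[has_log]))

cat("Logged rows:", n_logged, "\n")
cat("Share of rows:", share_logged, "\n")
cat("Users with any log:", users_logged, "\n")
```

Reporting the number of distinct users alongside the row count helps show whether the few logged entries come from many users or from a handful of them.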
The remaining positive correlations that were at least weak (R > 0.40) involved combinations of the distances or minutes in the four activity categories (Very Active, Moderately Active, Light Active, and Sedentary), the three aforementioned highly correlated variables (TotalDistance, TrackerDistance, and TotalSteps), and Calories. Of these, Calories was most highly correlated with VeryActiveMinutes (R > 0.61). The scatter plot analysis showed a weaker correlation after removing outliers (R > 0.51), but this still constitutes a moderate correlation. While additional analysis with weight logging data would be ideal, users viewing their very active minutes in the app could be told that this metric is the one most strongly correlated with calories burned in a day.
There were several limitations with the data used in the analysis as well as other hurdles: