The objective was to report, through a data-driven analysis, on how non-Bellabeat products are being used. Based on the results, a product from the Bellabeat product line was selected to assist the marketing team with its marketing strategy. The data was crowd-sourced FitBit fitness tracker data downloaded from Kaggle. Unfortunately, multiple datasets were incomplete and critical data was missing, i.e. sex, age, weight, fitness level and height. Taking these limitations into consideration, this project focused on ‘daily data’, as these datasets were complete and allowed for the most comprehensive analysis of how the product was used.
Based on the analysis, the fitness tracker was used primarily as an activity tracker, with all participants wearing the trackers during the day. Although the sleep data was incomplete, it allowed for insights into the participants’ sleeping behaviors. The relationships between the various activity levels and calories used were analyzed. A strong positive correlation was found between very active minutes and calories used, while a negative correlation was observed between sedentary minutes and calories used. The CDC recommends that a person exercise for a minimum of 22 minutes per day at moderate to vigorous intensity. The participants met this requirement; however, on closer observation the majority of the activity minutes were of light intensity.
The recommendation from the data is that the Bellabeat marketing team should focus on marketing the ‘Bellabeat Time’ and ‘Bellabeat Leaf’, as both products are able to perform activity tracking. Based on the observations, it is recommended that the app include goals and rewards specific to the user, e.g. if a user aims to lose weight, the app should reward the user each time they perform vigorous activities, as this would correspond to more calories burned.
The data source used during this analysis was FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius). The dataset contains personal fitness information from thirty (30) FitBit users. Thirty eligible FitBit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to identify users’ habits.
The metadata provided on Kaggle can be found at https://www.kaggle.com/arashnic/fitbit/metadata. To verify the source I navigated to https://zenodo.org/record/53894#.X9oeh3Uzaao and noticed that the data was collected through another platform, ‘Amazon Mechanical Turk’, as part of a survey. Additional verification was not possible: the original source was unavailable and the eligibility criteria were not disclosed.
Unfortunately, the reliability of the data and the information it contains could not be established. Metadata for the original data is unavailable, which makes it difficult to verify the data’s quality. It cannot be established whether the data is biased, as the data source only states that there are thirty eligible Fitbit users. Critical information such as sex, age, fitness level, weight and height is absent, which prevents a more comprehensive analysis. Furthermore, multiple data tables are incomplete. The data appears to be pre-processed, as the file names contain ‘merged’ (e.g. dailyActivity_merged). The data is sourced from a third party and cannot be validated, as the original data is unavailable. The data was collected in 2016; to my understanding of the fitness device market, there have been multiple improvements over the past five (5) years. The data source is cited.
The reference for the sourced data is as follows:
Furberg, R., Brinton, J., Keating, M., & Ortiz, A. (2016). Crowd-sourced Fitbit datasets 03.12.2016-05.12.2016 [Data set]. Zenodo. https://doi.org/10.5281/zenodo.53894
Because information is missing, the credibility of the data cannot be verified. This makes it difficult to draw informed conclusions, as the data has glaring weaknesses. If these limitations are not acknowledged, decisions based on the analysis may be incorrect and cost the company revenue.
The data sets were downloaded from Kaggle at https://www.kaggle.com/arashnic/fitbit.
The CSV files were imported into R.
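# Packages used below. The load order is an assumption: plyr is attached last because rename(c("old" = "new")) and summarise(count(...)) rely on its versions.
library(dplyr)
library(tidyr)
library(janitor)
library(data.table)
library(plyr)
# Import data from csv to data frames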
activity_daily <- read.csv(file = "C:/Users/pc/Documents/Capstone_projects/fitness_project/dailyActivity_merged.csv", header = TRUE, sep = ",")
heartrate_seconds <- read.csv(file = "C:/Users/pc/Documents/Capstone_projects/fitness_project/heartrate_seconds_merged.csv", header = TRUE, sep = ",")
mets_minute <- read.csv(file = "C:/Users/pc/Documents/Capstone_projects/fitness_project/minuteMETsNarrow_merged.csv", header = TRUE, sep = ",")
sleep_daily <- read.csv(file = "C:/Users/pc/Documents/Capstone_projects/fitness_project/sleepDay_merged.csv", header = TRUE, sep = ",")
sleep_minute <- read.csv(file = "C:/Users/pc/Documents/Capstone_projects/fitness_project/minuteSleep_merged.csv", header = TRUE, sep = ",")
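The absolute paths above only resolve on the machine where the analysis was run. A minimal sketch of a more portable import, assuming all CSV files sit in one folder; the helper name `read_fitbit_csv`, the `data_dir` argument and the demo file are hypothetical and not part of the original analysis:

```r
# Hypothetical helper: build the path from a folder and a file name,
# and fail early with a clear message if the file is missing
read_fitbit_csv <- function(data_dir, file_name) {
  path <- file.path(data_dir, file_name)
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  read.csv(path, header = TRUE, sep = ",", stringsAsFactors = FALSE)
}

# Self-contained demonstration with a temporary folder and a tiny made-up file
demo_dir <- tempdir()
demo_rows <- data.frame(
  Id = c(1, 1),
  ActivityDate = c("4/12/2016", "4/13/2016"),
  TotalSteps = c(100L, 200L)
)
write.csv(demo_rows, file.path(demo_dir, "dailyActivity_demo.csv"), row.names = FALSE)

activity_demo <- read_fitbit_csv(demo_dir, "dailyActivity_demo.csv")
str(activity_demo)
```

With the real files, the same call would take the project folder and a name such as "dailyActivity_merged.csv".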
The focus was on ‘daily data’, with the exception of the sleep, heart rate and METs data. The data was viewed to confirm that the import was successful.
# View Data to ensure import was successful. The focus will be on the daily data with the exception of the heart rate and sleep data.
tibble(activity_daily)
## # A tibble: 940 x 15
## Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie~
## <dbl> <chr> <int> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2016 13162 8.5 8.5 0
## 2 1.50e9 4/13/2016 10735 6.97 6.97 0
## 3 1.50e9 4/14/2016 10460 6.74 6.74 0
## 4 1.50e9 4/15/2016 9762 6.28 6.28 0
## 5 1.50e9 4/16/2016 12669 8.16 8.16 0
## 6 1.50e9 4/17/2016 9705 6.48 6.48 0
## 7 1.50e9 4/18/2016 13019 8.59 8.59 0
## 8 1.50e9 4/19/2016 15506 9.88 9.88 0
## 9 1.50e9 4/20/2016 10544 6.68 6.68 0
## 10 1.50e9 4/21/2016 9819 6.34 6.34 0
## # ... with 930 more rows, and 9 more variables: VeryActiveDistance <dbl>,
## # ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## # SedentaryActiveDistance <dbl>, VeryActiveMinutes <int>,
## # FairlyActiveMinutes <int>, LightlyActiveMinutes <int>,
## # SedentaryMinutes <int>, Calories <int>
tibble(heartrate_seconds)
## # A tibble: 2,483,658 x 3
## Id Time Value
## <dbl> <chr> <int>
## 1 2022484408 4/12/2016 7:21:00 AM 97
## 2 2022484408 4/12/2016 7:21:05 AM 102
## 3 2022484408 4/12/2016 7:21:10 AM 105
## 4 2022484408 4/12/2016 7:21:20 AM 103
## 5 2022484408 4/12/2016 7:21:25 AM 101
## 6 2022484408 4/12/2016 7:22:05 AM 95
## 7 2022484408 4/12/2016 7:22:10 AM 91
## 8 2022484408 4/12/2016 7:22:15 AM 93
## 9 2022484408 4/12/2016 7:22:20 AM 94
## 10 2022484408 4/12/2016 7:22:25 AM 93
## # ... with 2,483,648 more rows
tibble(mets_minute)
## # A tibble: 1,325,580 x 3
## Id ActivityMinute METs
## <dbl> <chr> <int>
## 1 1503960366 4/12/2016 12:00:00 AM 10
## 2 1503960366 4/12/2016 12:01:00 AM 10
## 3 1503960366 4/12/2016 12:02:00 AM 10
## 4 1503960366 4/12/2016 12:03:00 AM 10
## 5 1503960366 4/12/2016 12:04:00 AM 10
## 6 1503960366 4/12/2016 12:05:00 AM 12
## 7 1503960366 4/12/2016 12:06:00 AM 12
## 8 1503960366 4/12/2016 12:07:00 AM 12
## 9 1503960366 4/12/2016 12:08:00 AM 12
## 10 1503960366 4/12/2016 12:09:00 AM 12
## # ... with 1,325,570 more rows
tibble(sleep_daily)
## # A tibble: 413 x 5
## Id SleepDay TotalSleepRecor~ TotalMinutesAsle~ TotalTimeInBed
## <dbl> <chr> <int> <int> <int>
## 1 1503960366 4/12/2016 12:00~ 1 327 346
## 2 1503960366 4/13/2016 12:00~ 2 384 407
## 3 1503960366 4/15/2016 12:00~ 1 412 442
## 4 1503960366 4/16/2016 12:00~ 2 340 367
## 5 1503960366 4/17/2016 12:00~ 1 700 712
## 6 1503960366 4/19/2016 12:00~ 1 304 320
## 7 1503960366 4/20/2016 12:00~ 1 360 377
## 8 1503960366 4/21/2016 12:00~ 1 325 364
## 9 1503960366 4/23/2016 12:00~ 1 361 384
## 10 1503960366 4/24/2016 12:00~ 1 430 449
## # ... with 403 more rows
tibble(sleep_minute)
## # A tibble: 188,521 x 4
## Id date value logId
## <dbl> <chr> <int> <dbl>
## 1 1503960366 4/12/2016 2:47:30 AM 3 11380564589
## 2 1503960366 4/12/2016 2:48:30 AM 2 11380564589
## 3 1503960366 4/12/2016 2:49:30 AM 1 11380564589
## 4 1503960366 4/12/2016 2:50:30 AM 1 11380564589
## 5 1503960366 4/12/2016 2:51:30 AM 1 11380564589
## 6 1503960366 4/12/2016 2:52:30 AM 1 11380564589
## 7 1503960366 4/12/2016 2:53:30 AM 1 11380564589
## 8 1503960366 4/12/2016 2:54:30 AM 2 11380564589
## 9 1503960366 4/12/2016 2:55:30 AM 2 11380564589
## 10 1503960366 4/12/2016 2:56:30 AM 2 11380564589
## # ... with 188,511 more rows
With the tibbles I noticed a few things. First was the use of unconventional column names, such as ‘ActivityDate’ rather than just ‘Date’, or ‘Time’ when the column in fact holds a date and a time. To give all headers the same format, the date and time columns were renamed and the clean_names function was applied.
# Cleaning process with janitor
# Make the date and time column names consistent across tables.
# rename(c("old" = "new")) is the plyr form of rename().
activity_daily_cln <- activity_daily %>%
  rename(c("ActivityDate" = "date")) %>%
  clean_names()
# Heart rate data: rename the Time column to date_time
heartrate_seconds_cln <- heartrate_seconds %>%
  rename(c("Time" = "date_time")) %>%
  clean_names()
# METs data: rename ActivityMinute to date_time and METs to mets
mets_minute_cln <- mets_minute %>%
  rename(c("ActivityMinute" = "date_time", "METs" = "mets")) %>%
  clean_names()
# Daily sleep data: rename SleepDay to date
sleep_daily_cln <- sleep_daily %>%
  rename(c("SleepDay" = "date")) %>%
  clean_names()
# Minute-level sleep data: rename date to date_time
sleep_minute_cln <- sleep_minute %>%
  rename(c("date" = "date_time")) %>%
  clean_names()
The tabyl function was used to verify the completeness of the data and provided summaries of each data frame.
# Count the records per participant id.
tabyl(activity_daily_cln, id)
## id n percent
## 1503960366 31 0.032978723
## 1624580081 31 0.032978723
## 1644430081 30 0.031914894
## 1844505072 31 0.032978723
## 1927972279 31 0.032978723
## 2022484408 31 0.032978723
## 2026352035 31 0.032978723
## 2320127002 31 0.032978723
## 2347167796 18 0.019148936
## 2873212765 31 0.032978723
## 3372868164 20 0.021276596
## 3977333714 30 0.031914894
## 4020332650 31 0.032978723
## 4057192912 4 0.004255319
## 4319703577 31 0.032978723
## 4388161847 31 0.032978723
## 4445114986 31 0.032978723
## 4558609924 31 0.032978723
## 4702921684 31 0.032978723
## 5553957443 31 0.032978723
## 5577150313 30 0.031914894
## 6117666160 28 0.029787234
## 6290855005 29 0.030851064
## 6775888955 26 0.027659574
## 6962181067 31 0.032978723
## 7007744171 26 0.027659574
## 7086361926 31 0.032978723
## 8053475328 31 0.032978723
## 8253242879 19 0.020212766
## 8378563200 31 0.032978723
## 8583815059 31 0.032978723
## 8792009665 29 0.030851064
## 8877689391 31 0.032978723
# The outcome: only 21 participants wore the fitness device on all 31 days
tabyl(activity_daily_cln, id) %>%
summarise(count(n))
## count(n).x count(n).freq
## 1 4 1
## 2 18 1
## 3 19 1
## 4 20 1
## 5 26 2
## 6 28 1
## 7 29 2
## 8 30 3
## 9 31 21
# The heart rate data was more incomplete than the activity data, with only 14 participants
tabyl(heartrate_seconds_cln,id) %>%
summarise(count(n))
## count(n).x count(n).freq
## 1 2490 1
## 2 32771 1
## 3 122841 1
## 4 133592 1
## 5 152683 1
## 6 154104 1
## 7 158899 1
## 8 192168 1
## 9 228841 1
## 10 248560 1
## 11 249748 1
## 12 255174 1
## 13 266326 1
## 14 285461 1
# There were 33 participants in total, with records of varying lengths
tabyl(mets_minute_cln, id) %>%
summarise(count(n))
## count(n).x count(n).freq
## 1 5280 1
## 2 24840 1
## 3 25860 1
## 4 28320 1
## 5 36060 1
## 6 36600 1
## 7 39600 1
## 8 39900 1
## 9 40320 1
## 10 41760 1
## 11 42480 2
## 12 43020 1
## 13 43080 1
## 14 43440 1
## 15 43800 1
## 16 43860 2
## 17 43920 2
## 18 43980 3
## 19 44040 1
## 20 44100 4
## 21 44160 5
# There were 24 participants, with between 1 and 32 daily sleep records each.
tabyl(sleep_daily_cln, id) %>%
summarise(count(n))
## count(n).x count(n).freq
## 1 1 1
## 2 2 1
## 3 3 3
## 4 4 1
## 5 5 2
## 6 8 1
## 7 15 2
## 8 18 1
## 9 24 2
## 10 25 1
## 11 26 2
## 12 28 4
## 13 31 2
## 14 32 1
# There were 24 participants, with between 69 and 15682 minutes of sleep tracked each. This is consistent with the observations noted above.
tabyl(sleep_minute_cln, id) %>%
summarise(count(n))
## count(n).x count(n).freq
## 1 69 1
## 2 143 1
## 3 700 1
## 4 905 1
## 5 1107 1
## 6 1384 1
## 7 2189 1
## 8 2883 1
## 9 3038 1
## 10 6807 1
## 11 7370 1
## 12 9183 1
## 13 9580 1
## 14 9734 1
## 15 11194 1
## 16 11671 1
## 17 11976 1
## 18 12375 1
## 19 12912 1
## 20 13051 1
## 21 14450 1
## 22 15054 1
## 23 15064 1
## 24 15682 1
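The completeness checks above boil down to two counts: rows per participant, and the number of participants sharing each row count. A minimal base-R sketch of the same idea, offered as an alternative to the tabyl and count calls rather than as the method used in this analysis; the ids and rows are made up:

```r
# Made-up records: one row per participant-day (ids are illustrative only)
demo <- data.frame(id = c(1, 1, 1, 2, 2, 3, 3, 3))

# Rows per participant (the n column that tabyl(df, id) reports)
rows_per_id <- table(demo$id)

# How many participants share each row count
participants_per_count <- table(as.vector(rows_per_id))

print(rows_per_id)
print(participants_per_count)

# Number of distinct participants in the table
n_participants <- length(rows_per_id)
```

Summing the frequencies of the second table gives the participant count, which is how the figures of 33, 14 and 24 participants can be read off the outputs above.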
The next step was to convert the date and time columns from character format to date formats, as the focus was on ‘daily data’.
# To be able to plot the data, the date formats of all the data frames need to be changed
# The date is converted for the two data frames below
activity_daily_cln[[2]] <- as.Date(activity_daily_cln[[2]], "%m/%d/%Y")
tibble(activity_daily_cln)
## # A tibble: 940 x 15
## id date total_steps total_distance tracker_distance
## <dbl> <date> <int> <dbl> <dbl>
## 1 1503960366 2016-04-12 13162 8.5 8.5
## 2 1503960366 2016-04-13 10735 6.97 6.97
## 3 1503960366 2016-04-14 10460 6.74 6.74
## 4 1503960366 2016-04-15 9762 6.28 6.28
## 5 1503960366 2016-04-16 12669 8.16 8.16
## 6 1503960366 2016-04-17 9705 6.48 6.48
## 7 1503960366 2016-04-18 13019 8.59 8.59
## 8 1503960366 2016-04-19 15506 9.88 9.88
## 9 1503960366 2016-04-20 10544 6.68 6.68
## 10 1503960366 2016-04-21 9819 6.34 6.34
## # ... with 930 more rows, and 10 more variables:
## # logged_activities_distance <dbl>, very_active_distance <dbl>,
## # moderately_active_distance <dbl>, light_active_distance <dbl>,
## # sedentary_active_distance <dbl>, very_active_minutes <int>,
## # fairly_active_minutes <int>, lightly_active_minutes <int>,
## # sedentary_minutes <int>, calories <int>
sleep_daily_cln[[2]] <- as.Date(sleep_daily_cln[[2]], "%m/%d/%Y")
tibble(sleep_daily_cln)
## # A tibble: 413 x 5
## id date total_sleep_records total_minutes_asl~ total_time_in_b~
## <dbl> <date> <int> <int> <int>
## 1 1503960366 2016-04-12 1 327 346
## 2 1503960366 2016-04-13 2 384 407
## 3 1503960366 2016-04-15 1 412 442
## 4 1503960366 2016-04-16 2 340 367
## 5 1503960366 2016-04-17 1 700 712
## 6 1503960366 2016-04-19 1 304 320
## 7 1503960366 2016-04-20 1 360 377
## 8 1503960366 2016-04-21 1 325 364
## 9 1503960366 2016-04-23 1 361 384
## 10 1503960366 2016-04-24 1 430 449
## # ... with 403 more rows
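# The data frames below need a date-time format
# Note: %p only takes effect together with %I (12-hour clock); with %H the AM/PM indicator is not applied.
# Only the date part is used in the later steps, so the daily summaries are unaffected.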
heartrate_seconds_cln[[2]] <- as.POSIXct(heartrate_seconds_cln[[2]], format = "%m/%d/%Y %H:%M:%S %p")
tibble(heartrate_seconds_cln)
## # A tibble: 2,483,658 x 3
## id date_time value
## <dbl> <dttm> <int>
## 1 2022484408 2016-04-12 07:21:00 97
## 2 2022484408 2016-04-12 07:21:05 102
## 3 2022484408 2016-04-12 07:21:10 105
## 4 2022484408 2016-04-12 07:21:20 103
## 5 2022484408 2016-04-12 07:21:25 101
## 6 2022484408 2016-04-12 07:22:05 95
## 7 2022484408 2016-04-12 07:22:10 91
## 8 2022484408 2016-04-12 07:22:15 93
## 9 2022484408 2016-04-12 07:22:20 94
## 10 2022484408 2016-04-12 07:22:25 93
## # ... with 2,483,648 more rows
mets_minute_cln[[2]] <- as.POSIXct(mets_minute_cln[[2]], format = "%m/%d/%Y %H:%M:%S %p")
tibble(mets_minute_cln)
## # A tibble: 1,325,580 x 3
## id date_time mets
## <dbl> <dttm> <int>
## 1 1503960366 2016-04-12 12:00:00 10
## 2 1503960366 2016-04-12 12:01:00 10
## 3 1503960366 2016-04-12 12:02:00 10
## 4 1503960366 2016-04-12 12:03:00 10
## 5 1503960366 2016-04-12 12:04:00 10
## 6 1503960366 2016-04-12 12:05:00 12
## 7 1503960366 2016-04-12 12:06:00 12
## 8 1503960366 2016-04-12 12:07:00 12
## 9 1503960366 2016-04-12 12:08:00 12
## 10 1503960366 2016-04-12 12:09:00 12
## # ... with 1,325,570 more rows
# Separating column for the date time in tables for heart rate and METs
heartrate_seconds_cln_v2 <- separate(heartrate_seconds_cln, col = date_time, c("date","time"), sep = " ")
heartrate_seconds_cln_v2[[2]] <- as.Date(heartrate_seconds_cln_v2[[2]], "%Y-%m-%d")
tibble(heartrate_seconds_cln_v2)
## # A tibble: 2,483,658 x 4
## id date time value
## <dbl> <date> <chr> <int>
## 1 2022484408 2016-04-12 07:21:00 97
## 2 2022484408 2016-04-12 07:21:05 102
## 3 2022484408 2016-04-12 07:21:10 105
## 4 2022484408 2016-04-12 07:21:20 103
## 5 2022484408 2016-04-12 07:21:25 101
## 6 2022484408 2016-04-12 07:22:05 95
## 7 2022484408 2016-04-12 07:22:10 91
## 8 2022484408 2016-04-12 07:22:15 93
## 9 2022484408 2016-04-12 07:22:20 94
## 10 2022484408 2016-04-12 07:22:25 93
## # ... with 2,483,648 more rows
mets_minute_cln_v2 <- separate(mets_minute_cln, col = date_time, c("date","time"), sep = " ")
mets_minute_cln_v2[[2]] <- as.Date(mets_minute_cln_v2[[2]], "%Y-%m-%d")
tibble(mets_minute_cln_v2)
## # A tibble: 1,325,580 x 4
## id date time mets
## <dbl> <date> <chr> <int>
## 1 1503960366 2016-04-12 12:00:00 10
## 2 1503960366 2016-04-12 12:01:00 10
## 3 1503960366 2016-04-12 12:02:00 10
## 4 1503960366 2016-04-12 12:03:00 10
## 5 1503960366 2016-04-12 12:04:00 10
## 6 1503960366 2016-04-12 12:05:00 12
## 7 1503960366 2016-04-12 12:06:00 12
## 8 1503960366 2016-04-12 12:07:00 12
## 9 1503960366 2016-04-12 12:08:00 12
## 10 1503960366 2016-04-12 12:09:00 12
## # ... with 1,325,570 more rows
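In the METs output above, rows recorded at 12:00:00 AM are printed as 12:00:00, because the `%H` specifier does not apply the AM/PM indicator. The following is a minimal sketch of parsing the same 12-hour layout with `%I`, assuming an English or C locale for the AM/PM text and UTC as the time zone (the raw files do not state one); the three timestamps are illustrative:

```r
# Timestamps in the same 12-hour layout as the raw files
stamps <- c("4/12/2016 12:00:00 AM", "4/12/2016 7:21:00 AM", "4/12/2016 1:05:30 PM")

# %I reads the hour on a 12-hour clock and %p applies the AM/PM indicator
parsed <- as.POSIXct(stamps, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Split into a Date column and a 24-hour time string without going through text splitting
dates <- as.Date(parsed, tz = "UTC")
times <- format(parsed, "%H:%M:%S")

print(data.frame(date = dates, time = times))
```

Because only the date part feeds the daily summaries, this affects the time column rather than the results reported here.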
Because the focus was on ‘daily data’, the heart rate and METs data had to be summarized per day.
# Summarize the minute- and second-level data per id and date for use in the daily insights.
heartrate_seconds_cln_dt <- data.table(heartrate_seconds_cln_v2)
heartrate_seconds_cln_dt_2 <- heartrate_seconds_cln_dt[,list(mean_hrt = mean(value), max_hrt = max(value), min_hrt = min(value)), by = c("id,date")]
mets_minute_cln_dt <- data.table(mets_minute_cln_v2)
mets_minute_cln_dt_2 <- mets_minute_cln_dt[,list(mean_mets = mean(mets), max_mets = max(mets), min_mets = min(mets)), by = c("id,date")]
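The grouping step above collapses many readings into one row per participant and day. A minimal base-R sketch of the same aggregation, shown as an alternative to the data.table syntax and not as the code used in this analysis; the readings are made up:

```r
# Made-up heart rate readings for two participants
demo <- data.frame(
  id    = c(1, 1, 1, 2, 2),
  date  = as.Date(c("2016-04-12", "2016-04-12", "2016-04-13",
                    "2016-04-12", "2016-04-12")),
  value = c(60, 80, 70, 90, 110)
)

# One row per id and date with the mean, maximum and minimum reading
daily <- aggregate(
  value ~ id + date,
  data = demo,
  FUN = function(v) c(mean = mean(v), max = max(v), min = min(v))
)

# aggregate() stores the three statistics in a matrix column; flatten it
daily <- do.call(data.frame, daily)
names(daily) <- c("id", "date", "mean_hrt", "max_hrt", "min_hrt")
print(daily)
```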
The data was then checked for duplicates, missing values and unusual values.
# Looking for any duplication or missing values
# No duplicates
activity_daily_cln %>%
get_dupes() %>%
summarise(count(id))
## [1] count(id)
## <0 rows> (or 0-length row.names)
# Multiple duplicates
heartrate_seconds_cln_v2 %>%
get_dupes() %>%
summarise(count(id))
## count(id).x count(id).freq
## 1 2022484408 260
## 2 2347167796 1862
## 3 4020332650 876
## 4 4388161847 3218
## 5 4558609924 878
## 6 5553957443 4684
## 7 5577150313 1508
## 8 6117666160 940
## 9 6775888955 28
## 10 6962181067 2490
## 11 7007744171 462
## 12 8792009665 1142
## 13 8877689391 320
# Remove the duplicates from heart rate data
heartrate_seconds_cln_v3 <- heartrate_seconds_cln_v2 %>%
distinct()
# Duplicates observed, so they need to be removed
mets_minute_cln_v2 %>%
get_dupes() %>%
summarise(count(id))
## count(id).x count(id).freq
## 1 1503960366 10572
## 2 1624580081 33966
## 3 1644430081 17290
## 4 1844505072 28874
## 5 1927972279 35982
## 6 2022484408 10120
## 7 2026352035 11290
## 8 2320127002 23562
## 9 2347167796 4908
## 10 2873212765 25368
## 11 3372868164 15662
## 12 3977333714 18014
## 13 4020332650 30258
## 14 4057192912 4142
## 15 4319703577 17532
## 16 4388161847 10206
## 17 4445114986 13766
## 18 4558609924 9598
## 19 4702921684 16180
## 20 5553957443 18544
## 21 5577150313 9424
## 22 6117666160 14316
## 23 6290855005 26456
## 24 6775888955 29664
## 25 6962181067 12778
## 26 7007744171 8978
## 27 7086361926 15902
## 28 8053475328 16742
## 29 8253242879 20226
## 30 8378563200 13924
## 31 8583815059 22228
## 32 8792009665 25810
## 33 8877689391 6882
# Remove duplicates from METs data
mets_minute_cln_v3 <- mets_minute_cln_v2 %>%
distinct()
# Duplicates observed in the data frame below
sleep_daily_cln %>%
get_dupes() %>%
summarise(count(id))
## count(id).x count(id).freq
## 1 4388161847 2
## 2 4702921684 2
## 3 8378563200 2
# Remove duplicates from sleep data
sleep_daily_cln_v2 <- sleep_daily_cln %>%
distinct()
# Final cleaning step: summarize the de-duplicated minute- and second-level data per id and date for the daily insights.
heartrate_seconds_cln_dt <- data.table(heartrate_seconds_cln_v3)
heartrate_seconds_cln_dt_2 <- heartrate_seconds_cln_dt[,list(mean_hrt = mean(value), max_hrt = max(value), min_hrt = min(value)), by = c("id,date")]
mets_minute_cln_dt <- data.table(mets_minute_cln_v3)
mets_minute_cln_dt_2 <- mets_minute_cln_dt[,list(mean_mets = mean(mets), max_mets = max(mets), min_mets = min(mets)), by = c("id,date")]
# Using the summary function you can quickly identify if there are unusual numbers
heartrate_seconds_cln_dt_2 %>%
summary()
## id date mean_hrt max_hrt
## Min. :2.022e+09 Min. :2016-04-12 Min. : 59.38 Min. : 80.0
## 1st Qu.:4.388e+09 1st Qu.:2016-04-19 1st Qu.: 70.49 1st Qu.:125.0
## Median :5.577e+09 Median :2016-04-26 Median : 77.50 Median :135.5
## Mean :5.565e+09 Mean :2016-04-26 Mean : 78.63 Mean :138.7
## 3rd Qu.:6.962e+09 3rd Qu.:2016-05-03 3rd Qu.: 84.93 3rd Qu.:153.0
## Max. :8.878e+09 Max. :2016-05-12 Max. :109.79 Max. :203.0
## min_hrt
## Min. :36.00
## 1st Qu.:48.00
## Median :52.00
## Mean :52.69
## 3rd Qu.:56.00
## Max. :71.00
mets_minute_cln_dt_2 %>%
summary()
## id date mean_mets max_mets
## Min. :1.504e+09 Min. :2016-04-12 Min. :10.00 Min. : 10.0
## 1st Qu.:2.320e+09 1st Qu.:2016-04-19 1st Qu.:13.54 1st Qu.: 58.0
## Median :4.445e+09 Median :2016-04-26 Median :15.73 Median : 74.0
## Mean :4.847e+09 Mean :2016-04-26 Mean :15.53 Mean : 73.6
## 3rd Qu.:6.962e+09 3rd Qu.:2016-05-04 3rd Qu.:17.54 3rd Qu.: 93.0
## Max. :8.878e+09 Max. :2016-05-12 Max. :26.85 Max. :157.0
## min_mets
## Min. : 0.000
## 1st Qu.:10.000
## Median :10.000
## Mean : 9.931
## 3rd Qu.:10.000
## Max. :10.000
activity_daily_cln %>%
summary()
## id date total_steps total_distance
## Min. :1.504e+09 Min. :2016-04-12 Min. : 0 Min. : 0.000
## 1st Qu.:2.320e+09 1st Qu.:2016-04-19 1st Qu.: 3790 1st Qu.: 2.620
## Median :4.445e+09 Median :2016-04-26 Median : 7406 Median : 5.245
## Mean :4.855e+09 Mean :2016-04-26 Mean : 7638 Mean : 5.490
## 3rd Qu.:6.962e+09 3rd Qu.:2016-05-04 3rd Qu.:10727 3rd Qu.: 7.713
## Max. :8.878e+09 Max. :2016-05-12 Max. :36019 Max. :28.030
## tracker_distance logged_activities_distance very_active_distance
## Min. : 0.000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 2.620 1st Qu.:0.0000 1st Qu.: 0.000
## Median : 5.245 Median :0.0000 Median : 0.210
## Mean : 5.475 Mean :0.1082 Mean : 1.503
## 3rd Qu.: 7.710 3rd Qu.:0.0000 3rd Qu.: 2.053
## Max. :28.030 Max. :4.9421 Max. :21.920
## moderately_active_distance light_active_distance sedentary_active_distance
## Min. :0.0000 Min. : 0.000 Min. :0.000000
## 1st Qu.:0.0000 1st Qu.: 1.945 1st Qu.:0.000000
## Median :0.2400 Median : 3.365 Median :0.000000
## Mean :0.5675 Mean : 3.341 Mean :0.001606
## 3rd Qu.:0.8000 3rd Qu.: 4.782 3rd Qu.:0.000000
## Max. :6.4800 Max. :10.710 Max. :0.110000
## very_active_minutes fairly_active_minutes lightly_active_minutes
## Min. : 0.00 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:127.0
## Median : 4.00 Median : 6.00 Median :199.0
## Mean : 21.16 Mean : 13.56 Mean :192.8
## 3rd Qu.: 32.00 3rd Qu.: 19.00 3rd Qu.:264.0
## Max. :210.00 Max. :143.00 Max. :518.0
## sedentary_minutes calories
## Min. : 0.0 Min. : 0
## 1st Qu.: 729.8 1st Qu.:1828
## Median :1057.5 Median :2134
## Mean : 991.2 Mean :2304
## 3rd Qu.:1229.5 3rd Qu.:2793
## Max. :1440.0 Max. :4900
sleep_daily_cln_v2 %>%
summary()
## id date total_sleep_records
## Min. :1.504e+09 Min. :2016-04-12 Min. :1.00
## 1st Qu.:3.977e+09 1st Qu.:2016-04-19 1st Qu.:1.00
## Median :4.703e+09 Median :2016-04-27 Median :1.00
## Mean :4.995e+09 Mean :2016-04-26 Mean :1.12
## 3rd Qu.:6.962e+09 3rd Qu.:2016-05-04 3rd Qu.:1.00
## Max. :8.792e+09 Max. :2016-05-12 Max. :3.00
## total_minutes_asleep total_time_in_bed
## Min. : 58.0 Min. : 61.0
## 1st Qu.:361.0 1st Qu.:403.8
## Median :432.5 Median :463.0
## Mean :419.2 Mean :458.5
## 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :796.0 Max. :961.0
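The duplicate check used during cleaning lists every copy of a repeated row, which is why the sleep table reported a count of 2 for each of three ids while only three rows were dropped (413 rows before, 410 after). A minimal base-R sketch of that behavior on made-up rows, given as an illustration and not as the code used in this analysis:

```r
# Made-up daily sleep rows; the first two rows are exact copies
demo <- data.frame(
  id    = c(1, 1, 1, 2),
  date  = as.Date(c("2016-04-12", "2016-04-12", "2016-04-13", "2016-04-12")),
  total_minutes_asleep = c(327, 327, 384, 412)
)

# Flag every row that has an identical copy elsewhere (both copies are flagged)
is_dupe <- duplicated(demo) | duplicated(demo, fromLast = TRUE)
print(table(demo$id[is_dupe]))

# Keep the first copy of each row, comparable to what distinct() does
demo_unique <- demo[!duplicated(demo), ]
print(nrow(demo_unique))
```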
With the cleaning process completed, the data was ready for analysis.
Descriptive statistics for the daily activity and sleep data:
# Analyze phase of the analysis
# Starting with descriptive statistics for daily activity data
activity_daily_cln_dt <- data.table(activity_daily_cln)
activity_daily_cln_dt_2 <- activity_daily_cln_dt[,list(total_active_minutes = sum(very_active_minutes, fairly_active_minutes, lightly_active_minutes)), by = c("id,date")]
activity_daily_cln_dt_3 <- activity_daily_cln_dt %>%
inner_join(activity_daily_cln_dt_2)
# Derive the weekday from the joined table's own date column so the rows cannot get out of step
activity_daily_cln_dt_3$weekday <- weekdays(activity_daily_cln_dt_3$date)
activity_daily_mets_cln_merge <- activity_daily_cln_dt_3 %>%
inner_join(mets_minute_cln_dt_2)
tibble(activity_daily_mets_cln_merge)
## # A tibble: 934 x 20
## id date total_steps total_distance tracker_distance
## <dbl> <date> <int> <dbl> <dbl>
## 1 1503960366 2016-04-12 13162 8.5 8.5
## 2 1503960366 2016-04-13 10735 6.97 6.97
## 3 1503960366 2016-04-14 10460 6.74 6.74
## 4 1503960366 2016-04-15 9762 6.28 6.28
## 5 1503960366 2016-04-16 12669 8.16 8.16
## 6 1503960366 2016-04-17 9705 6.48 6.48
## 7 1503960366 2016-04-18 13019 8.59 8.59
## 8 1503960366 2016-04-19 15506 9.88 9.88
## 9 1503960366 2016-04-20 10544 6.68 6.68
## 10 1503960366 2016-04-21 9819 6.34 6.34
## # ... with 924 more rows, and 15 more variables:
## # logged_activities_distance <dbl>, very_active_distance <dbl>,
## # moderately_active_distance <dbl>, light_active_distance <dbl>,
## # sedentary_active_distance <dbl>, very_active_minutes <int>,
## # fairly_active_minutes <int>, lightly_active_minutes <int>,
## # sedentary_minutes <int>, calories <int>, total_active_minutes <int>,
## # weekday <chr>, mean_mets <dbl>, max_mets <int>, min_mets <int>
activity_daily_mets_cln_merge %>%
select(total_steps, total_distance, tracker_distance, logged_activities_distance, very_active_distance, moderately_active_distance, light_active_distance, sedentary_active_distance, very_active_minutes, fairly_active_minutes, lightly_active_minutes, sedentary_minutes, calories, total_active_minutes) %>%
summary()
## total_steps total_distance tracker_distance logged_activities_distance
## Min. : 0 Min. : 0.000 Min. : 0.000 Min. :0.0000
## 1st Qu.: 3843 1st Qu.: 2.655 1st Qu.: 2.655 1st Qu.:0.0000
## Median : 7447 Median : 5.275 Median : 5.275 Median :0.0000
## Mean : 7686 Mean : 5.524 Mean : 5.510 Mean :0.1089
## 3rd Qu.:10734 3rd Qu.: 7.720 3rd Qu.: 7.718 3rd Qu.:0.0000
## Max. :36019 Max. :28.030 Max. :28.030 Max. :4.9421
## very_active_distance moderately_active_distance light_active_distance
## Min. : 0.000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.: 1.962
## Median : 0.220 Median :0.2450 Median : 3.385
## Mean : 1.512 Mean :0.5712 Mean : 3.362
## 3rd Qu.: 2.090 3rd Qu.:0.8000 3rd Qu.: 4.790
## Max. :21.920 Max. :6.4800 Max. :10.710
## sedentary_active_distance very_active_minutes fairly_active_minutes
## Min. :0.000000 Min. : 0.0 Min. : 0.00
## 1st Qu.:0.000000 1st Qu.: 0.0 1st Qu.: 0.00
## Median :0.000000 Median : 4.0 Median : 7.00
## Mean :0.001617 Mean : 21.3 Mean : 13.65
## 3rd Qu.:0.000000 3rd Qu.: 32.0 3rd Qu.: 19.00
## Max. :0.110000 Max. :210.0 Max. :143.00
## lightly_active_minutes sedentary_minutes calories total_active_minutes
## Min. : 0.0 Min. : 0.0 Min. : 120 Min. : 0.0
## 1st Qu.:129.0 1st Qu.: 730.0 1st Qu.:1837 1st Qu.:149.2
## Median :199.0 Median :1057.0 Median :2148 Median :248.5
## Mean :194.0 Mean : 991.3 Mean :2318 Mean :229.0
## 3rd Qu.:264.8 3rd Qu.:1226.0 3rd Qu.:2796 3rd Qu.:318.0
## Max. :518.0 Max. :1440.0 Max. :4900 Max. :552.0
# Descriptive statistics for sleep data
tibble(sleep_daily_cln_v2)
## # A tibble: 410 x 5
## id date total_sleep_records total_minutes_asl~ total_time_in_b~
## <dbl> <date> <int> <int> <int>
## 1 1503960366 2016-04-12 1 327 346
## 2 1503960366 2016-04-13 2 384 407
## 3 1503960366 2016-04-15 1 412 442
## 4 1503960366 2016-04-16 2 340 367
## 5 1503960366 2016-04-17 1 700 712
## 6 1503960366 2016-04-19 1 304 320
## 7 1503960366 2016-04-20 1 360 377
## 8 1503960366 2016-04-21 1 325 364
## 9 1503960366 2016-04-23 1 361 384
## 10 1503960366 2016-04-24 1 430 449
## # ... with 400 more rows
sleep_daily_cln_dt <- data.table(sleep_daily_cln_v2)
sleep_daily_cln_dt_2 <- sleep_daily_cln_dt[,list(mean_minutes_asleep = mean(total_minutes_asleep), mean_minutes_in_bed = mean(total_time_in_bed)), by = c("id")]
sleep_daily_cln_dt_3 <- sleep_daily_cln_dt_2[,list(mean_minutes_not_sleeping = mean_minutes_in_bed - mean_minutes_asleep, mean_hour_asleep = mean_minutes_asleep / 60, mean_hour_in_bed = mean_minutes_in_bed / 60), by = c("id")]
sleep_daily_cln_dt_4 <- sleep_daily_cln_dt_2 %>%
inner_join(sleep_daily_cln_dt_3)
# ID 1844505072 is suspicious: a mean of 16 hours in bed (10.9 hours asleep) is unusual. ID 2320127002, with a mean of about 1 hour asleep, also seems unusual. These two anomalies may be due to the users' devices being incorrectly configured.
tibble(sleep_daily_cln_dt_4)
## # A tibble: 24 x 6
## id mean_minutes_asl~ mean_minutes_in~ mean_minutes_not~ mean_hour_asleep
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 360. 383. 22.9 6.00
## 2 1.64e9 294 346 52 4.9
## 3 1.84e9 652 961 309 10.9
## 4 1.93e9 417 438. 20.8 6.95
## 5 2.03e9 506. 538. 31.5 8.44
## 6 2.32e9 61 69 8 1.02
## 7 2.35e9 447. 491. 44.5 7.45
## 8 3.98e9 294. 461. 168. 4.89
## 9 4.02e9 349. 380. 30.4 5.82
## 10 4.32e9 477. 502. 25.3 7.94
## # ... with 14 more rows, and 1 more variable: mean_hour_in_bed <dbl>
sleep_daily_cln_v2 %>%
select(total_sleep_records, total_minutes_asleep, total_time_in_bed) %>%
summary()
## total_sleep_records total_minutes_asleep total_time_in_bed
## Min. :1.00 Min. : 58.0 Min. : 61.0
## 1st Qu.:1.00 1st Qu.:361.0 1st Qu.:403.8
## Median :1.00 Median :432.5 Median :463.0
## Mean :1.12 Mean :419.2 Mean :458.5
## 3rd Qu.:1.00 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :3.00 Max. :796.0 Max. :961.0
sleep_daily_cln_dt_4 %>%
select(mean_minutes_asleep, mean_minutes_in_bed, mean_minutes_not_sleeping, mean_hour_asleep, mean_hour_in_bed) %>%
summary()
## mean_minutes_asleep mean_minutes_in_bed mean_minutes_not_sleeping
## Min. : 61.0 Min. : 69.0 Min. : 3.00
## 1st Qu.:336.3 1st Qu.:377.1 1st Qu.: 18.13
## Median :417.2 Median :446.0 Median : 24.18
## Mean :377.4 Mean :419.9 Mean : 42.48
## 3rd Qu.:449.3 3rd Qu.:487.3 3rd Qu.: 33.93
## Max. :652.0 Max. :961.0 Max. :309.00
## mean_hour_asleep mean_hour_in_bed
## Min. : 1.017 Min. : 1.150
## 1st Qu.: 5.605 1st Qu.: 6.284
## Median : 6.954 Median : 7.434
## Mean : 6.291 Mean : 6.999
## 3rd Qu.: 7.488 3rd Qu.: 8.121
## Max. :10.867 Max. :16.017
# Correlations: sedentary minutes and calories burned are negatively correlated, which is expected; the strong correlation between very active minutes and calories used is an interesting observation
correlation_activity <- activity_daily_mets_cln_merge %>%
summarise(cor(total_active_minutes,calories), cor(very_active_minutes,calories), cor(fairly_active_minutes,calories), cor(lightly_active_minutes,calories), cor(sedentary_minutes,calories), cor(mean_mets,calories), cor(max_mets,calories), cor(min_mets, calories))
weekday_total_act_min <- activity_daily_cln_dt_3[,list(weekday_mean_total_act_min = mean(total_active_minutes), weekday_max_total_act_min = max(total_active_minutes), weekday_min_total_act_min = min(total_active_minutes)), by = c("weekday")] %>%
arrange(weekday_mean_total_act_min)
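The correlation step above compares each activity measure with calories one pair at a time. A minimal base-R sketch of the same comparison over several columns at once; the numbers are made up purely to show the sign of each relationship and are not taken from the dataset:

```r
# Made-up daily records (illustrative values only)
demo <- data.frame(
  very_active_minutes = c(0, 10, 20, 35, 60),
  sedentary_minutes   = c(1300, 1200, 1100, 1000, 800),
  calories            = c(1800, 2000, 2300, 2600, 3200)
)

# Pearson correlation of each activity column with calories
predictors <- setdiff(names(demo), "calories")
cor_with_calories <- sapply(demo[predictors], function(col) cor(col, demo$calories))
print(round(cor_with_calories, 2))
```

A value close to 1 indicates a strong positive relationship and a negative value indicates that calories fall as the measure rises, which is the pattern described for very active and sedentary minutes respectively.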
The observed trends and relationships were as follows:
1. Very active minutes and calories used were strongly positively correlated.
2. Sedentary minutes and calories burned were negatively correlated.
3. The heart rate and sleep data were incomplete, which suggests that users use these functions only occasionally.
4. Saturday was the most active day of the week.
5. The data established that activity tracking is the primary use; therefore both the 'Bellabeat Leaf' and 'Bellabeat Time' would be sufficient.
Visualizing the most active weekdays illustrated the specific days of the week on which activity levels were generally higher. Saturday was the most active day, with Sunday being the least active. Traditionally, people work from Monday to Friday and have more free time on Saturday and Sunday. Most people tend to be more active on Saturdays and rest on Sundays before the next work week starts.
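The weekday ranking can be sketched in base R as follows. This is an illustration on made-up values, not the code behind the figure; it uses the ISO weekday number because `weekdays()` returns names in the system language:

```r
# Made-up daily totals; 2016-04-16 and 2016-04-23 are Saturdays
demo <- data.frame(
  date = as.Date(c("2016-04-16", "2016-04-17", "2016-04-18",
                   "2016-04-23", "2016-04-24")),
  total_active_minutes = c(300, 150, 220, 320, 170)
)

# %u gives the ISO weekday number (1 = Monday ... 7 = Sunday)
demo$weekday_num <- as.integer(format(demo$date, "%u"))

# Mean active minutes per weekday, sorted from least to most active
weekday_means <- aggregate(total_active_minutes ~ weekday_num, data = demo, FUN = mean)
weekday_means <- weekday_means[order(weekday_means$total_active_minutes), ]
print(weekday_means)
```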
There is a strong correlation between very active minutes and calories used in a day, as seen from the graph below:
Total active minutes and calorie usage are also correlated, although it is interesting to note that the correlation is not as strong as with very active minutes.
FitBit makes use of METs (metabolic equivalents of task), a metric of activity intensity. The graph below illustrates that the algorithm awards more METs to more vigorous activities, and thus the correlation between METs and calories is stronger than that between total active minutes and calories:
Total steps show a strong correlation with calories burned during the day, which is intuitive: the more you walk, the more calories you burn.
The group slept less than the 7 to 9 hours recommended by the CDC for adults, with a mean of 6.3 hours of sleep per day.
The CDC recommends 22 minutes per day of moderate activity. The group analyzed had a mean of 13.6 fairly active minutes (moderate activity) per day which, combined with a mean of 21.2 very active minutes, means that the group achieved the goal of 22 minutes set out by the CDC.
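The comparison with the CDC target can be worked through as a short calculation. The sketch assumes that the 22-minute figure comes from spreading the CDC's 150 weekly minutes of moderate activity over seven days; that derivation is my reading and is not stated in the data source. The group means are copied from the summary output of the daily activity table:

```r
# Assumption: daily target = 150 weekly minutes of moderate activity
# spread evenly over 7 days, rounded up to whole minutes
weekly_target_min <- 150
daily_target_min <- ceiling(weekly_target_min / 7)

# Group means copied from the summary() output of the daily activity table
mean_very_active    <- 21.16
mean_fairly_active  <- 13.56
mean_lightly_active <- 192.8

# Moderate-to-vigorous minutes per day and the check against the target
moderate_to_vigorous <- mean_very_active + mean_fairly_active
meets_target <- moderate_to_vigorous >= daily_target_min

# Share of all active minutes that is light intensity
light_share <- mean_lightly_active / (moderate_to_vigorous + mean_lightly_active)

cat("Daily target (min):", daily_target_min, "\n")
cat("Moderate-to-vigorous minutes:", moderate_to_vigorous, "\n")
cat("Target met:", meets_target, "\n")
cat("Light-intensity share:", round(light_share, 2), "\n")
```

The last figure puts a number on the earlier remark that most of the recorded activity is of light intensity.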
The sleep data provides an opportunity for the app developers at Bellabeat: they can add functionality that informs the user when they have had less than the 7 to 9 hours of sleep recommended by the CDC. Additionally, a follow-up survey could be sent to the participants to understand why they did not consistently sleep with their fitness device.
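As a rough illustration of the suggested sleep notification, the sketch below classifies nights against the 7 to 9 hour range. The function name `sleep_feedback`, its arguments and the message texts are hypothetical; the example values are the first-night minutes, the median and a long night from the sleep tables above:

```r
# Hypothetical helper: classify minutes asleep against a recommended range in hours
sleep_feedback <- function(minutes_asleep, min_hours = 7, max_hours = 9) {
  hours <- minutes_asleep / 60
  ifelse(
    hours < min_hours, "below recommended range",
    ifelse(hours > max_hours, "above recommended range", "within recommended range")
  )
}

# 327 min is about 5.5 h, 432.5 min about 7.2 h, 700 min about 11.7 h
feedback <- sleep_feedback(c(327, 432.5, 700))
print(feedback)
```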
Based on the analysis, the fitness devices used during data collection were primarily used for daily activity tracking, and the two products that the Bellabeat marketing team can focus on are the ‘Bellabeat Leaf’ and ‘Bellabeat Time’. In my experience with fitness devices, one of the best features is personalizing your fitness goals with your device. With that in mind, the app can focus on rewarding vigorous activities, especially if the product is being used for weight loss. Similarly, Bellabeat can implement an activity metric, such as METs or activity minutes, that aligns with the CDC recommendations, rewarding the user once they have achieved the minimum of 22 minutes of moderate activity and increasing the reward points as they pass the 22-minute mark.
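One possible shape for such a reward rule is sketched below. Everything in it is a hypothetical illustration: the function name `reward_points`, the point values, and the choice to count one vigorous minute as two moderate minutes are assumptions, not features of any Bellabeat product:

```r
# Hypothetical scoring rule; every constant here is an illustrative assumption
reward_points <- function(moderate_min, vigorous_min, target_min = 22,
                          base_points = 10, points_per_extra_min = 1) {
  # Assumption: each vigorous minute counts as two moderate minutes
  equivalent_min <- moderate_min + 2 * vigorous_min
  if (equivalent_min < target_min) {
    return(0)
  }
  # Base reward for reaching the target plus a bonus for every minute beyond it
  base_points + points_per_extra_min * (equivalent_min - target_min)
}

print(reward_points(10, 5))   # 20 equivalent minutes, target not reached
print(reward_points(22, 0))   # target reached exactly
print(reward_points(12, 10))  # 32 equivalent minutes, 10 beyond the target
```

Weighting vigorous minutes more heavily is one way to steer users toward the higher-intensity activity that showed the strongest link with calories used.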