The dataset, FitBit Fitness Tracker Data, was downloaded from Kaggle (provenance at zenodo.org). It is stored on my hard drive and accessible from RStudio.
The dataset combines long and wide layouts. It is long in that each Id, representing one user, has one row per recorded day, which results in many rows and relatively few columns. It is wide in that the activity and other fitness-related measures each occupy their own column.
Reliability: The data appears to be accurate but is incomplete because it lacks demographic information. It is also subject to sample selection bias from the collection method: only 33 online-savvy users, who responded to a survey distributed via Amazon Mechanical Turk during a limited, dated period (March 12, 2016 - May 12, 2016). Findings may not extend to a female-only customer base. Zenodo seems to be reputable (see source).
Original: Data is original, although the datasets come from a second party (see provenance link above).
Comprehensive: The data is not comprehensive. It contains no demographic information and covers only 31 days.
Current: The data is not current; it is from 2016.
Cited: The data is cited (see provenance link above).
The dataset is licensed as CC0: Public Domain, and a Creative Commons Attribution 4.0 International license is also listed for it. There are no personal identifiers in the data and each participant consented to the submission of personal tracker data. The data is accessible and free to the public.
According to Kaggle users who described the dataset, it is well-documented, well-maintained, clean and original. I explored and cleaned the data to further ensure integrity (see code chunks below).
The data contains both automatically tracked and manually logged data from 33 users. The daily data for each participant helps assess whether use is consistent over time, both within and across users. This allows me to determine trends in FitBit usage. Limitations include the small sample size, the short duration, and the lack of demographic information. Variables are not clearly defined, so assumptions about what the data reflect are based on field names.
I chose to use tidyverse packages, including dplyr and ggplot2, because I want to hone my expertise in R programming.
Regarding data cleaning, I kept backup copies (the original CSV files), checked the number of rows (no duplicates, 33 unique ids at most) and columns, deleted unneeded fields, filtered for unique values and blanks, cleaned field names, changed field data types as necessary, manipulated strings, checked for whitespace, and fixed dates and times using R. The cleaning process is documented in the code chunks and outputs below.
# install.packages("tidyverse")
library(tidyverse)
# install.packages("dplyr")
library(dplyr)
# install.packages("ggplot2")
library(ggplot2)
# install.packages("tidyr")
library(tidyr)
# install.packages("skimr") # helpful for viewing data
library(skimr)
# install.packages("janitor") # helpful for cleaning data
library(janitor)
# install.packages("lubridate")
library(lubridate)
# install.packages("psych") # for generating summary tables
library(psych)
I am using the FitBit Fitness Tracker dataset, including the Daily Activity and Weight Log data. The Daily Activity dataset contains automatically recorded device data, including activity strenuousness (very active, moderate, light) and duration, number of steps, distance traveled, and number of calories burned, from 33 participants over 31 days. Users could also manually log activities on the device, and this data is included in the Daily Activity dataset. The Weight Log dataset contains weight, BMI and percent body fat over the same 31 days, but only for a subset of these users (8 of the 33, as the checks below show). These values are automatically recorded by the device or app (presumably via a smart scale) or manually logged by the user. These two datasets will be merged into one dataset for analyses and visualization.
daily_activity <- read.csv("DailyActivity_merged.csv")
weight_log <- read.csv("weightLogInfo_merged.csv")
skim_without_charts(daily_activity)
| Name | daily_activity |
| Number of rows | 940 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 14 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ActivityDate | 0 | 1 | 8 | 9 | 0 | 31 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| Id | 0 | 1 | 4.855407e+09 | 2.424805e+09 | 1503960366 | 2.320127e+09 | 4.445115e+09 | 6.962181e+09 | 8.877689e+09 |
| TotalSteps | 0 | 1 | 7.637910e+03 | 5.087150e+03 | 0 | 3.789750e+03 | 7.405500e+03 | 1.072700e+04 | 3.601900e+04 |
| TotalDistance | 0 | 1 | 5.490000e+00 | 3.920000e+00 | 0 | 2.620000e+00 | 5.240000e+00 | 7.710000e+00 | 2.803000e+01 |
| TrackerDistance | 0 | 1 | 5.480000e+00 | 3.910000e+00 | 0 | 2.620000e+00 | 5.240000e+00 | 7.710000e+00 | 2.803000e+01 |
| LoggedActivitiesDistance | 0 | 1 | 1.100000e-01 | 6.200000e-01 | 0 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 4.940000e+00 |
| VeryActiveDistance | 0 | 1 | 1.500000e+00 | 2.660000e+00 | 0 | 0.000000e+00 | 2.100000e-01 | 2.050000e+00 | 2.192000e+01 |
| ModeratelyActiveDistance | 0 | 1 | 5.700000e-01 | 8.800000e-01 | 0 | 0.000000e+00 | 2.400000e-01 | 8.000000e-01 | 6.480000e+00 |
| LightActiveDistance | 0 | 1 | 3.340000e+00 | 2.040000e+00 | 0 | 1.950000e+00 | 3.360000e+00 | 4.780000e+00 | 1.071000e+01 |
| SedentaryActiveDistance | 0 | 1 | 0.000000e+00 | 1.000000e-02 | 0 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.100000e-01 |
| VeryActiveMinutes | 0 | 1 | 2.116000e+01 | 3.284000e+01 | 0 | 0.000000e+00 | 4.000000e+00 | 3.200000e+01 | 2.100000e+02 |
| FairlyActiveMinutes | 0 | 1 | 1.356000e+01 | 1.999000e+01 | 0 | 0.000000e+00 | 6.000000e+00 | 1.900000e+01 | 1.430000e+02 |
| LightlyActiveMinutes | 0 | 1 | 1.928100e+02 | 1.091700e+02 | 0 | 1.270000e+02 | 1.990000e+02 | 2.640000e+02 | 5.180000e+02 |
| SedentaryMinutes | 0 | 1 | 9.912100e+02 | 3.012700e+02 | 0 | 7.297500e+02 | 1.057500e+03 | 1.229500e+03 | 1.440000e+03 |
| Calories | 0 | 1 | 2.303610e+03 | 7.181700e+02 | 0 | 1.828500e+03 | 2.134000e+03 | 2.793250e+03 | 4.900000e+03 |
head(daily_activity)
| Id | ActivityDate | TotalSteps | TotalDistance | TrackerDistance | LoggedActivitiesDistance | VeryActiveDistance | ModeratelyActiveDistance | LightActiveDistance | SedentaryActiveDistance | VeryActiveMinutes | FairlyActiveMinutes | LightlyActiveMinutes | SedentaryMinutes | Calories |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1503960366 | 4/12/2016 | 13162 | 8.50 | 8.50 | 0 | 1.88 | 0.55 | 6.06 | 0 | 25 | 13 | 328 | 728 | 1985 |
| 1503960366 | 4/13/2016 | 10735 | 6.97 | 6.97 | 0 | 1.57 | 0.69 | 4.71 | 0 | 21 | 19 | 217 | 776 | 1797 |
| 1503960366 | 4/14/2016 | 10460 | 6.74 | 6.74 | 0 | 2.44 | 0.40 | 3.91 | 0 | 30 | 11 | 181 | 1218 | 1776 |
| 1503960366 | 4/15/2016 | 9762 | 6.28 | 6.28 | 0 | 2.14 | 1.26 | 2.83 | 0 | 29 | 34 | 209 | 726 | 1745 |
| 1503960366 | 4/16/2016 | 12669 | 8.16 | 8.16 | 0 | 2.71 | 0.41 | 5.04 | 0 | 36 | 10 | 221 | 773 | 1863 |
| 1503960366 | 4/17/2016 | 9705 | 6.48 | 6.48 | 0 | 3.19 | 0.78 | 2.51 | 0 | 38 | 20 | 164 | 539 | 1728 |
Notes about “daily activity” data based on outputs:
skim_without_charts(weight_log)
| Name | weight_log |
| Number of rows | 67 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Date | 0 | 1 | 19 | 21 | 0 | 56 | 0 |
| IsManualReport | 0 | 1 | 4 | 5 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| Id | 0 | 1.00 | 7.009282e+09 | 1.950322e+09 | 1.503960e+09 | 6.962181e+09 | 6.962181e+09 | 8.877689e+09 | 8.877689e+09 |
| WeightKg | 0 | 1.00 | 7.204000e+01 | 1.392000e+01 | 5.260000e+01 | 6.140000e+01 | 6.250000e+01 | 8.505000e+01 | 1.335000e+02 |
| WeightPounds | 0 | 1.00 | 1.588100e+02 | 3.070000e+01 | 1.159600e+02 | 1.353600e+02 | 1.377900e+02 | 1.875000e+02 | 2.943200e+02 |
| Fat | 65 | 0.03 | 2.350000e+01 | 2.120000e+00 | 2.200000e+01 | 2.275000e+01 | 2.350000e+01 | 2.425000e+01 | 2.500000e+01 |
| BMI | 0 | 1.00 | 2.519000e+01 | 3.070000e+00 | 2.145000e+01 | 2.396000e+01 | 2.439000e+01 | 2.556000e+01 | 4.754000e+01 |
| LogId | 0 | 1.00 | 1.461772e+12 | 7.829948e+08 | 1.460444e+12 | 1.461079e+12 | 1.461802e+12 | 1.462375e+12 | 1.463098e+12 |
head(weight_log)
| Id | Date | WeightKg | WeightPounds | Fat | BMI | IsManualReport | LogId |
|---|---|---|---|---|---|---|---|
| 1503960366 | 5/2/2016 11:59:59 PM | 52.6 | 115.9631 | 22 | 22.65 | True | 1.462234e+12 |
| 1503960366 | 5/3/2016 11:59:59 PM | 52.6 | 115.9631 | NA | 22.65 | True | 1.462320e+12 |
| 1927972279 | 4/13/2016 1:08:52 AM | 133.5 | 294.3171 | NA | 47.54 | False | 1.460510e+12 |
| 2873212765 | 4/21/2016 11:59:59 PM | 56.7 | 125.0021 | NA | 21.45 | True | 1.461283e+12 |
| 2873212765 | 5/12/2016 11:59:59 PM | 57.3 | 126.3249 | NA | 21.69 | True | 1.463098e+12 |
| 4319703577 | 4/17/2016 11:59:59 PM | 72.4 | 159.6147 | 25 | 27.45 | True | 1.460938e+12 |
Notes about “weight log” data based on outputs:
daily_activity_2 <- daily_activity %>%
mutate(date = mdy(ActivityDate)) %>%
select(-ActivityDate)
str(daily_activity_2$date) # confirm date format
## Date[1:940], format: "2016-04-12" "2016-04-13" "2016-04-14" "2016-04-15" "2016-04-16" ...
weight_log_2 <- weight_log %>%
mutate(date = as_date(Date, format = "%m/%d/%Y %I:%M:%S %p")) %>%
select(-Date)
str(weight_log_2$date) # confirm date format
## Date[1:67], format: "2016-05-02" "2016-05-03" "2016-04-13" "2016-04-21" "2016-05-12" ...
daily_activity_2$Id <- as.character(daily_activity_2$Id)
str(daily_activity_2$Id) # confirm character format
## chr [1:940] "1503960366" "1503960366" "1503960366" "1503960366" ...
weight_log_2$Id <- as.character(weight_log_2$Id)
str(weight_log_2$Id)# confirm character format
## chr [1:67] "1503960366" "1503960366" "1927972279" "2873212765" ...
n_distinct(daily_activity_2$Id)
## [1] 33
Notes based on output: 33 unique participants, as expected
n_distinct(weight_log_2$Id)
## [1] 8
Notes based on output: 8 unique participants. 8 users x 31 possible dates = 248 possible observations, but the table has only 67 rows, so these participants did not log their weight on every date.
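The coverage arithmetic in the note above can be sketched on a toy table. This is only an illustration under stated assumptions: `toy_weight`, its ids, and the 5-day window are made up and are not part of the FitBit data.

```r
# Hypothetical toy weight log: 3 users, 5-day study window (not the real data)
toy_weight <- data.frame(
  id   = c("A", "A", "A", "B", "C", "C"),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-16",
                   "2016-04-14", "2016-04-12", "2016-04-15"))
)
study_days <- 5

# Rows expected if every user logged every day vs. rows actually present
expected_rows <- length(unique(toy_weight$id)) * study_days
actual_rows   <- nrow(toy_weight)
coverage      <- actual_rows / expected_rows

# Entries per user, to see who logs regularly and who logs only once
entries_per_user <- table(toy_weight$id)
```

Applied to the real weight log, the same ratio is 67 / 248, or roughly 27% of the possible user-days.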
sum(duplicated(daily_activity_2))
## [1] 0
sum(duplicated(weight_log_2))
## [1] 0
Notes based on outputs: There are no duplicate rows in either dataset, so none need to be removed.
daily_activity_clean <- clean_names(daily_activity_2)
glimpse(daily_activity_clean)
## Rows: 940
## Columns: 15
## $ id <chr> "1503960366", "1503960366", "1503960366", "…
## $ total_steps <int> 13162, 10735, 10460, 9762, 12669, 9705, 130…
## $ total_distance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9…
## $ tracker_distance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9…
## $ logged_activities_distance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_distance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3…
## $ moderately_active_distance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1…
## $ light_active_distance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5…
## $ sedentary_active_distance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_minutes <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66,…
## $ fairly_active_minutes <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, …
## $ lightly_active_minutes <int> 328, 217, 181, 209, 221, 164, 233, 264, 205…
## $ sedentary_minutes <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 8…
## $ calories <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2…
## $ date <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-0…
weight_log_clean <- clean_names(weight_log_2)
glimpse(weight_log_clean)
## Rows: 67
## Columns: 8
## $ id <chr> "1503960366", "1503960366", "1927972279", "2873212765…
## $ weight_kg <dbl> 52.6, 52.6, 133.5, 56.7, 57.3, 72.4, 72.3, 69.7, 70.3…
## $ weight_pounds <dbl> 115.9631, 115.9631, 294.3171, 125.0021, 126.3249, 159…
## $ fat <int> 22, NA, NA, NA, NA, 25, NA, NA, NA, NA, NA, NA, NA, N…
## $ bmi <dbl> 22.65, 22.65, 47.54, 21.45, 21.69, 27.45, 27.38, 27.2…
## $ is_manual_report <chr> "True", "True", "False", "True", "True", "True", "Tru…
## $ log_id <dbl> 1.462234e+12, 1.462320e+12, 1.460510e+12, 1.461283e+1…
## $ date <date> 2016-05-02, 2016-05-03, 2016-04-13, 2016-04-21, 2016…
daily_activity_clean %>%
select(total_steps,
total_distance,
tracker_distance,
logged_activities_distance,
very_active_distance,
moderately_active_distance,
light_active_distance,
sedentary_active_distance,
calories) %>%
summary()
## total_steps total_distance tracker_distance logged_activities_distance
## Min. : 0 Min. : 0.000 Min. : 0.000 Min. :0.0000
## 1st Qu.: 3790 1st Qu.: 2.620 1st Qu.: 2.620 1st Qu.:0.0000
## Median : 7406 Median : 5.245 Median : 5.245 Median :0.0000
## Mean : 7638 Mean : 5.490 Mean : 5.475 Mean :0.1082
## 3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.: 7.710 3rd Qu.:0.0000
## Max. :36019 Max. :28.030 Max. :28.030 Max. :4.9421
## very_active_distance moderately_active_distance light_active_distance
## Min. : 0.000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.: 1.945
## Median : 0.210 Median :0.2400 Median : 3.365
## Mean : 1.503 Mean :0.5675 Mean : 3.341
## 3rd Qu.: 2.053 3rd Qu.:0.8000 3rd Qu.: 4.782
## Max. :21.920 Max. :6.4800 Max. :10.710
## sedentary_active_distance calories
## Min. :0.000000 Min. : 0
## 1st Qu.:0.000000 1st Qu.:1828
## Median :0.000000 Median :2134
## Mean :0.001606 Mean :2304
## 3rd Qu.:0.000000 3rd Qu.:2793
## Max. :0.110000 Max. :4900
Notes based on output:
weight_log_clean %>%
select(weight_pounds,
fat,
bmi,
is_manual_report) %>%
summary()
## weight_pounds fat bmi is_manual_report
## Min. :116.0 Min. :22.00 Min. :21.45 Length:67
## 1st Qu.:135.4 1st Qu.:22.75 1st Qu.:23.96 Class :character
## Median :137.8 Median :23.50 Median :24.39 Mode :character
## Mean :158.8 Mean :23.50 Mean :25.19
## 3rd Qu.:187.5 3rd Qu.:24.25 3rd Qu.:25.56
## Max. :294.3 Max. :25.00 Max. :47.54
## NA's :65
Notes based on output:
weight_log_clean$fat
## [1] 22 NA NA NA NA 25 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Notes based on output: only two non-NA values for fat percentage. I will not keep this variable or use it in analyses.
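Rather than printing a whole column, the share of missing values can be computed for every column at once. The sketch below uses a toy frame modelled on the first few weight log rows shown earlier; `toy_log` and the 50% cutoff are assumptions chosen for illustration.

```r
# Toy frame modelled on the first weight log rows (illustrative only)
toy_log <- data.frame(
  weight_kg = c(52.6, 52.6, 133.5, 56.7),
  fat       = c(22, NA, NA, NA),
  bmi       = c(22.65, 22.65, 47.54, 21.45)
)

# Proportion of NA values in each column
missing_share <- colMeans(is.na(toy_log))

# Columns where more than half of the values are missing (assumed cutoff)
low_coverage <- names(missing_share)[missing_share > 0.5]
```

Here `low_coverage` flags `fat`, which matches the decision to drop that variable.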
# determine if high "total_steps" values are outliers or possible errors
daily_act_clean_steps_test1 <- daily_activity_clean %>%
  filter(total_steps > 10000) %>% # round cutoff just below the 75th percentile (10727 steps)
  select(id,
         total_steps,
         total_distance,
         tracker_distance) %>%
  slice_max(order_by = total_steps, n = 10) %>% # keep the 10 highest-step days
  arrange(desc(total_steps))
daily_act_clean_steps_test1
| id | total_steps | total_distance | tracker_distance |
|---|---|---|---|
| 1624580081 | 36019 | 28.03 | 28.03 |
| 8877689391 | 29326 | 25.29 | 25.29 |
| 8877689391 | 27745 | 26.72 | 26.72 |
| 8877689391 | 23629 | 20.65 | 20.65 |
| 8877689391 | 23186 | 20.40 | 20.40 |
| 8053475328 | 22988 | 17.95 | 17.95 |
| 4388161847 | 22770 | 17.54 | 17.54 |
| 8053475328 | 22359 | 17.19 | 17.19 |
| 2347167796 | 22244 | 15.08 | 15.08 |
| 8053475328 | 22026 | 17.65 | 17.65 |
daily_act_clean_steps_test2 <- daily_activity_clean %>%
  filter(id == "1624580081") %>% # id is stored as character, so compare against a string
  select(total_steps,
         total_distance,
         tracker_distance) %>%
  slice_max(order_by = total_steps, n = 10) %>%
  arrange(desc(total_steps))
daily_act_clean_steps_test2
| total_steps | total_distance | tracker_distance |
|---|---|---|
| 36019 | 28.03 | 28.03 |
| 10536 | 7.41 | 7.41 |
| 9107 | 5.92 | 5.92 |
| 8538 | 5.55 | 5.55 |
| 8367 | 5.44 | 5.44 |
| 8163 | 5.31 | 5.31 |
| 7155 | 4.93 | 4.93 |
| 7007 | 4.55 | 4.55 |
| 6497 | 4.22 | 4.22 |
| 6474 | 4.30 | 4.30 |
Notes based on outputs:
# determine if "sedentary" values are outliers or possible errors
daily_act_clean_sed_test <- daily_activity_clean %>%
  filter(sedentary_minutes > 1229) %>% # 75th percentile is 1229.5 min
  select(id,
         total_steps,
         total_distance,
         tracker_distance,
         sedentary_active_distance,
         sedentary_minutes) %>%
  slice_max(order_by = total_steps, n = 10) %>% # 10 highest-step days among these
  arrange(desc(sedentary_minutes))
daily_act_clean_sed_test
| id | total_steps | total_distance | tracker_distance | sedentary_active_distance | sedentary_minutes |
|---|---|---|---|---|---|
| 8583815059 | 12015 | 9.37 | 9.37 | 0 | 1440 |
| 4388161847 | 10122 | 7.78 | 7.78 | 0 | 1440 |
| 8583815059 | 12427 | 9.69 | 9.69 | 0 | 1370 |
| 8253242879 | 10232 | 8.18 | 8.18 | 0 | 1286 |
| 4388161847 | 10993 | 8.45 | 8.45 | 0 | 1275 |
| 8053475328 | 10520 | 8.29 | 8.29 | 0 | 1260 |
| 8053475328 | 14549 | 11.11 | 11.11 | 0 | 1255 |
| 8053475328 | 13953 | 11.00 | 11.00 | 0 | 1245 |
| 8253242879 | 10204 | 7.91 | 7.91 | 0 | 1237 |
| 2022484408 | 10100 | 7.09 | 7.09 | 0 | 1237 |
Notes based on output: Several users have sedentary minutes equal or close to 1440 (24 hr). However, for some users these same observations also show a high number of steps, which is inconsistent with a fully sedentary day. Because of this apparent error, sedentary_minutes will not be included in further analyses, and neither will the sedentary active distance field.
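The inconsistency described above can be turned into an explicit flag. The following is a minimal sketch on made-up rows; `toy_activity`, the user ids, and the rule "a full day of sedentary minutes together with any steps" are assumptions for illustration, not a documented FitBit validation rule.

```r
# Hypothetical rows: a day cannot be fully sedentary and also contain steps
toy_activity <- data.frame(
  id                = c("u1", "u2", "u3", "u4"),
  total_steps       = c(12015, 0, 9762, 10122),
  sedentary_minutes = c(1440, 1440, 726, 1440)
)

minutes_per_day <- 24 * 60

# Flag rows with a full day of sedentary time but a non-zero step count
toy_activity$suspect <- toy_activity$sedentary_minutes >= minutes_per_day &
  toy_activity$total_steps > 0

suspect_ids <- toy_activity$id[toy_activity$suspect]
```

A row such as `u2` (1440 sedentary minutes, zero steps) is not flagged, since it is at least internally consistent with a device that was not worn.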
# Append "_km" to every distance column.
# The dplyr:: prefix is needed here: without it, rename failed with
# "error in rename: unused argument", probably because of the R version
# or a conflict with another loaded package.
daily_activity_clean <- daily_activity_clean %>%
  dplyr::rename_with(~ paste0(.x, "_km"),
                     .cols = dplyr::ends_with("_distance"))
daily_activity_clean <- daily_activity_clean %>%
dplyr::rename(calories_burned = calories)
colnames(daily_activity_clean) # confirm changes in field names
## [1] "id" "total_steps"
## [3] "total_distance_km" "tracker_distance_km"
## [5] "logged_activities_distance_km" "very_active_distance_km"
## [7] "moderately_active_distance_km" "light_active_distance_km"
## [9] "sedentary_active_distance_km" "very_active_minutes"
## [11] "fairly_active_minutes" "lightly_active_minutes"
## [13] "sedentary_minutes" "calories_burned"
## [15] "date"
weight_log_clean <- weight_log_clean %>%
dplyr::rename(weight_lb = weight_pounds,
BMI = bmi)
colnames(weight_log_clean) # confirm changes in field names
## [1] "id" "weight_kg" "weight_lb" "fat"
## [5] "BMI" "is_manual_report" "log_id" "date"
Because the daily activity table has far more observations than the weight log table, I used a left join, which keeps every daily activity row. The join columns (id and date) must have identical names and types in both tables.
merged_result <- left_join(daily_activity_clean,
weight_log_clean,
by = c("id", "date"))
glimpse(merged_result)
## Rows: 940
## Columns: 21
## $ id <chr> "1503960366", "1503960366", "1503960366"…
## $ total_steps <int> 13162, 10735, 10460, 9762, 12669, 9705, …
## $ total_distance_km <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59…
## $ tracker_distance_km <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59…
## $ logged_activities_distance_km <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_distance_km <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25…
## $ moderately_active_distance_km <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64…
## $ light_active_distance_km <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71…
## $ sedentary_active_distance_km <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_minutes <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, …
## $ fairly_active_minutes <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 2…
## $ lightly_active_minutes <int> 328, 217, 181, 209, 221, 164, 233, 264, …
## $ sedentary_minutes <int> 728, 776, 1218, 726, 773, 539, 1149, 775…
## $ calories_burned <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921…
## $ date <date> 2016-04-12, 2016-04-13, 2016-04-14, 201…
## $ weight_kg <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ weight_lb <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ fat <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ BMI <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ is_manual_report <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ log_id <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
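A left join like the one above is easy to sanity-check: the row count should not change, and the number of matched rows should equal the number of rows in the smaller table. The sketch below shows this on toy tables; `toy_activity`, `toy_weight`, and their values are hypothetical.

```r
library(dplyr)

# Hypothetical tables sharing the join keys id and date
toy_activity <- data.frame(
  id          = c("u1", "u1", "u2"),
  date        = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  total_steps = c(13162, 10735, 8000)
)
toy_weight <- data.frame(
  id        = "u1",
  date      = as.Date("2016-04-13"),
  weight_lb = 115.9631
)

toy_merged <- left_join(toy_activity, toy_weight, by = c("id", "date"))

# The join should keep every activity row and add no duplicates
rows_preserved <- nrow(toy_merged) == nrow(toy_activity)

# Number of activity rows that found a weight entry
matched_rows <- sum(!is.na(toy_merged$weight_lb))

# Weight entries with no matching activity row (ideally none)
unmatched_weight <- anti_join(toy_weight, toy_activity, by = c("id", "date"))
```

If `rows_preserved` were `FALSE`, the right-hand table would contain more than one entry per id and date, and the join would have duplicated activity rows.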
merged_result <- merged_result %>%
select(-sedentary_active_distance_km,
-very_active_minutes,
-fairly_active_minutes,
-lightly_active_minutes,
-sedentary_minutes,
-weight_kg,
-fat,
-log_id)
glimpse(merged_result)
## Rows: 940
## Columns: 13
## $ id <chr> "1503960366", "1503960366", "1503960366"…
## $ total_steps <int> 13162, 10735, 10460, 9762, 12669, 9705, …
## $ total_distance_km <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59…
## $ tracker_distance_km <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59…
## $ logged_activities_distance_km <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_distance_km <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25…
## $ moderately_active_distance_km <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64…
## $ light_active_distance_km <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71…
## $ calories_burned <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921…
## $ date <date> 2016-04-12, 2016-04-13, 2016-04-14, 201…
## $ weight_lb <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ BMI <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ is_manual_report <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
merged_result <- merged_result %>%
mutate(
is_manual_report = fct_recode(as.factor(is_manual_report),
Manual = "True",
Device = "False")
)
head(merged_result$is_manual_report)
## [1] <NA> <NA> <NA> <NA> <NA> <NA>
## Levels: Device Manual
merged_result <- merged_result %>%
filter(total_distance_km > 0)
describe(merged_result[2:9])
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| total_steps | 1 | 862 | 8329.0394432 | 4739.2469470 | 8053.50 | 8051.5898551 | 4608.662100 | 8.00 | 36019.000000 | 36011.000000 | 0.8164355 | 1.7900671 | 161.4193916 |
| total_distance_km | 2 | 862 | 5.9864501 | 3.7176164 | 5.59 | 5.6758116 | 3.358089 | 0.01 | 28.030001 | 28.020001 | 1.3289480 | 3.9830234 | 0.1266225 |
| tracker_distance_km | 3 | 862 | 5.9708005 | 3.6997561 | 5.59 | 5.6654493 | 3.343263 | 0.01 | 28.030001 | 28.020001 | 1.3422628 | 4.1052666 | 0.1260142 |
| logged_activities_distance_km | 4 | 862 | 0.1179590 | 0.6464734 | 0.00 | 0.0000000 | 0.000000 | 0.00 | 4.942142 | 4.942142 | 5.9904088 | 37.1787329 | 0.0220190 |
| very_active_distance_km | 5 | 862 | 1.6386543 | 2.7363079 | 0.41 | 1.0151884 | 0.607866 | 0.00 | 21.920000 | 21.920000 | 2.8543068 | 10.8089856 | 0.0931990 |
| moderately_active_distance_km | 6 | 862 | 0.6188979 | 0.9053288 | 0.31 | 0.4291304 | 0.459606 | 0.00 | 6.480000 | 6.480000 | 2.6525254 | 9.2645756 | 0.0308356 |
| light_active_distance_km | 7 | 862 | 3.6431206 | 1.8544341 | 3.58 | 3.6111304 | 1.890315 | 0.00 | 10.710000 | 10.710000 | 0.3002367 | 0.1816482 | 0.0631623 |
| calories_burned | 8 | 862 | 2362.4709977 | 702.2695833 | 2220.50 | 2316.0362319 | 714.613200 | 52.00 | 4900.000000 | 4848.000000 | 0.5474199 | 0.2092724 | 23.9193969 |
describe(merged_result[11:12])
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| weight_lb | 1 | 67 | 158.81180 | 30.695415 | 137.7889 | 157.06533 | 21.899432 | 115.9631 | 294.3171 | 178.354 | 1.308951 | 3.299141 | 3.7500419 |
| BMI | 2 | 67 | 25.18522 | 3.066963 | 24.3900 | 24.83964 | 1.363992 | 21.4500 | 47.5400 | 26.090 | 5.734248 | 39.243981 | 0.3746891 |
# device use
percent_count_daily_activity_device <- daily_activity_clean %>%
filter(total_distance_km > 0.00) %>%
group_by(date) %>%
summarize(n = n()) %>%
mutate(count = n, percent = (n/33)*100)
head(percent_count_daily_activity_device)
| date | n | count | percent |
|---|---|---|---|
| 2016-04-12 | 31 | 31 | 93.93939 |
| 2016-04-13 | 31 | 31 | 93.93939 |
| 2016-04-14 | 31 | 31 | 93.93939 |
| 2016-04-15 | 33 | 33 | 100.00000 |
| 2016-04-16 | 31 | 31 | 93.93939 |
| 2016-04-17 | 29 | 29 | 87.87879 |
#manual log use
percent_count_logged_activity <- daily_activity_clean %>%
filter(logged_activities_distance_km > 0.00) %>%
group_by(date) %>%
summarize(n = n()) %>%
mutate(count = n, percent = (n/33)*100)
head(percent_count_logged_activity)
| date | n | count | percent |
|---|---|---|---|
| 2016-04-12 | 2 | 2 | 6.060606 |
| 2016-04-13 | 2 | 2 | 6.060606 |
| 2016-04-14 | 2 | 2 | 6.060606 |
| 2016-04-18 | 2 | 2 | 6.060606 |
| 2016-04-19 | 2 | 2 | 6.060606 |
| 2016-04-20 | 2 | 2 | 6.060606 |
# device use plot
ggplot(percent_count_daily_activity_device,
aes(x = date, y = percent)) +
geom_col(fill = "blue") +
labs(x = "Date",
y = "Percent of Users",
title = "Daily Device Use",
caption = "FitBit Fitness Tracker Data") +
theme_classic()
Daily Device Use. Most of the 33 participants used the device daily, especially at the beginning of the study. The data for this plot were filtered for total distance traveled > 0 km, on the assumption that 0 km means the device was not used on that date. Interestingly, device usage dropped to almost 50% during the last week of the study. We would need to know more about the study design and users to interpret this drop. For example, were the devices brand new when users agreed to contribute their data? If so, the pattern could reflect a novelty effect: frequent use right after purchase that tapers off as the novelty wears off or after the battery dies for the first time.
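The daily percentage can also be computed without hard-coding the number of users and without dropping days on which nobody was active. This is a sketch on toy data; `toy_use` and its values are made up, and treating distance > 0 as "device used" is the same assumption stated in the caption.

```r
library(dplyr)

# Hypothetical usage table: 3 users over 2 days
toy_use <- data.frame(
  id                = c("u1", "u2", "u3", "u1", "u2", "u3"),
  date              = as.Date(c(rep("2016-04-12", 3), rep("2016-04-13", 3))),
  total_distance_km = c(8.5, 0, 3.2, 6.97, 0, 0)
)

# Derive the denominator from the data instead of typing it in
n_users <- n_distinct(toy_use$id)

# Count active users per day; days with zero active users stay in the result
daily_use <- toy_use %>%
  group_by(date) %>%
  summarize(n_active = sum(total_distance_km > 0),
            percent  = 100 * n_active / n_users)
```

Counting inside `summarize()` instead of filtering first means a day with no active users would appear as 0% rather than vanish from the plot.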
# manual log use plot
ggplot(percent_count_logged_activity, aes(x = date, y = percent)) +
geom_col(fill = "purple") +
labs(x = "Date",
y = "Percent of users",
title = "Daily Activity Log Use",
caption = "FitBit Fitness Tracker Data") +
theme_classic() +
ylim(0, 25)
Daily Activity Log Use. In contrast to device use, fewer than 11% of users manually logged activities and no user logged activities daily.
# Convert data to long format
pivot_long_distance <- merged_result %>%
  pivot_longer(cols = total_distance_km:light_active_distance_km,
               names_to = "distance", values_to = "km")
head(pivot_long_distance$distance)
## [1] "total_distance_km" "tracker_distance_km"
## [3] "logged_activities_distance_km" "very_active_distance_km"
## [5] "moderately_active_distance_km" "light_active_distance_km"
head(pivot_long_distance$km)
## [1] 8.50 8.50 0.00 1.88 0.55 6.06
pivot_long_weight <- merged_result %>%
  drop_na() %>%
  pivot_longer(cols = weight_lb:BMI,
               names_to = "variable", values_to = "weight")
head(pivot_long_weight$variable)
## [1] "weight_lb" "BMI" "weight_lb" "BMI" "weight_lb" "BMI"
head(pivot_long_weight$weight)
## [1] 115.9631 22.6500 115.9631 22.6500 294.3171 47.5400
pivot_long_distance_p <- pivot_long_distance %>%
group_by(distance) %>%
summarize(mean_distance = mean(km),
sd_distance = sd(km)) %>%
ggplot(., aes(x = distance,
y = mean_distance,
fill = distance)) +
geom_col() +
geom_errorbar(aes(ymax = mean_distance + sd_distance,
ymin = mean_distance)) +
theme_classic() +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank()) +
labs(x = "Activity Type",
y = "Average + SD km",
title = "Average Distance by Activity Type",
caption = "FitBit Fitness Tracker Data")
pivot_long_distance_p
Average Distance by Activity Type. On average, lighter activities accounted for the majority of total distance traveled by users, followed by very active and moderate activities. Manually logged distance was low compared to device-recorded distances (all others shown). As expected and consistent with appropriate use of the device, average total and tracker distances were nearly identical. These activities were averaged across 33 participants over up to 31 days of device usage.
manual_vs_device_weight_lb <- merged_result %>%
drop_na(weight_lb) %>%
ggplot(aes(x = is_manual_report,
y = weight_lb)) +
geom_boxplot(aes(fill = is_manual_report),
na.rm = TRUE,
show.legend = FALSE) +
theme_classic() +
labs(title = "Weight as Reported by Device vs. Manual Input",
y = "Weight (lb): median and IQR",
caption = "FitBit Fitness Tracker Data") +
theme(axis.title.x = element_blank())
manual_vs_device_weight_lb
Weight as Reported by Device vs. Manual Input. Only 8 participants contributed to the weight log, and for varying numbers of days. The horizontal bar in each box is the median weight for entries that were either manually logged by the participant or recorded by the device, the box spans the interquartile range (IQR), and individual points are outliers (by ggplot2's default, values more than 1.5 x IQR beyond the box). These results suggest that individuals who weigh more tend to have smart scales compared to individuals who weigh less, with the caveat that this is a very small sample.
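The points drawn beyond the whiskers of a ggplot2 box plot follow the 1.5 x IQR rule by default (the `coef` argument of `geom_boxplot()`). The sketch below reproduces that rule by hand on made-up weights; `weights_lb` is hypothetical and is not taken from the weight log.

```r
# Hypothetical weights in pounds (illustrative only)
weights_lb <- c(116, 125, 126, 135, 136, 137, 138, 140, 188, 294)

# Quartiles and interquartile range
q1  <- quantile(weights_lb, 0.25, names = FALSE)
q3  <- quantile(weights_lb, 0.75, names = FALSE)
iqr <- q3 - q1

# Fences at 1.5 x IQR beyond the box, matching geom_boxplot()'s default coef
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values outside the fences are the ones drawn as individual points
outliers <- weights_lb[weights_lb < lower_fence | weights_lb > upper_fence]
```

With so few participants, a single heavy user can produce most of the outlier points, so the box plots are better read as a description of who logs than as a comparison of groups.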
manual_vs_device_bmi <- merged_result %>%
drop_na(BMI) %>%
ggplot(aes(x = is_manual_report, y = BMI)) +
geom_boxplot(aes(fill = is_manual_report),
na.rm = TRUE,
show.legend = FALSE) +
theme_classic() +
labs(title = "BMI as Reported by Device vs. Manual Input",
y = "BMI (kg/m^2): median and IQR",
caption = "FitBit Fitness Tracker Data") +
theme(axis.title.x = element_blank())
manual_vs_device_bmi
Body Mass Index (BMI) as Reported by Device vs. Manual Input. The horizontal bar in each box is the median BMI for entries that were either manually logged by the participant or recorded by the device, the box spans the interquartile range (IQR), and individual points are outliers (by ggplot2's default, values more than 1.5 x IQR beyond the box). These results suggest that individuals who have higher BMIs tend to have smart scales compared to individuals who have lower BMIs, with the caveat that this is a very small sample.
# is the number of total steps related to total distance or calories burned?
total_steps_and_total_distance_p <- merged_result %>%
filter(total_distance_km > 0.00,
total_steps > 0.00) %>%
ggplot(., aes(x = total_steps,
y = total_distance_km)) +
geom_point() +
geom_smooth() +
theme_classic() +
labs(x = "Total Steps",
y = "Total Distance (km)",
title = "Total Steps and Distance Traveled",
caption = "FitBit Fitness Tracker Data")
total_steps_and_total_distance_p
Total Steps and Distance Traveled. The higher the number of total steps, the greater the distance traveled, as recorded by the device. This is a strong positive correlation and a reassuring sign that the device is reliable, since these two variables should be strongly related. The data points reflect each user’s data recorded each day.
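The strength of this relationship can be quantified with a correlation coefficient rather than judged from the plot alone. The sketch below uses only the six rows printed by `head(daily_activity)` earlier, so the result describes one user's first six days and is not the coefficient for the full dataset; `toy_days` is a name introduced here.

```r
# The six daily_activity rows shown earlier (one user, six days)
toy_days <- data.frame(
  total_steps       = c(13162, 10735, 10460, 9762, 12669, 9705),
  total_distance_km = c(8.50, 6.97, 6.74, 6.28, 8.16, 6.48)
)

# Pearson correlation between steps and distance
r_steps_distance <- cor(toy_days$total_steps,
                        toy_days$total_distance_km,
                        method = "pearson")

# Share of variance in distance explained by a linear fit on steps
r_squared <- r_steps_distance^2
```

Running the same call on `merged_result$total_distance_km` and `merged_result$calories_burned` would put a number on how much weaker the distance and calories relationship is.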
total_dist_and_calories_p <- merged_result %>%
filter(total_distance_km > 0.00,
calories_burned > 0.00) %>%
ggplot(., aes(x = total_distance_km,
y = calories_burned)) +
geom_point() +
geom_smooth() +
theme_classic() +
labs(x = "Total Distance (km)",
y = "Calories Burned",
title = "Total Distance and Calories Burned",
caption = "FitBit Fitness Tracker Data")
total_dist_and_calories_p
Total Distance and Calories Burned. The greater the total distance, the more calories burned, as recorded by the device. This is a strong positive correlation and a reassuring sign that the device is working as it should, since these two variables should be related. The correlation is not as strong as the previous one shown (total steps and distance traveled) because the device’s method of deriving number of calories burned is probably more complicated and may be dependent on user characteristics such as gender and weight. Alternatively, the device may not derive calories burned as accurately as it does distance. The data points reflect each user’s data recorded on each day of use.
# is calories burned related to very active, moderately active, or light active distance?
cal_and_very_active_dist_p <- merged_result %>%
filter(very_active_distance_km > 0.00,
calories_burned > 0.00) %>%
ggplot(., aes(x = very_active_distance_km,
y = calories_burned )) +
geom_point() +
geom_smooth() +
theme_classic() +
labs(x = "Very Active Distance (km)",
y = "Calories Burned",
title = "Very Active Distance and Calories Burned",
caption = "FitBit Fitness Tracker Data")
cal_and_very_active_dist_p
Calories Burned and Very Active Distance. Overall, these data indicate that the number of calories burned increases with greater “very active” distance. Even at low levels, however, “very active” activity was associated with a high number of calories burned (~2500). Interestingly, the number of calories burned does not seem to change appreciably until “very active” activity accounts for at least 3 km of distance. Each data point represents one user’s data for one day of use.
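One way to check the apparent 3 km threshold is to bin “very active” distance and compare the typical calories burned in each bin. The sketch below assumes bin edges at 3 km and 6 km, which are my own illustrative choice, and uses a hypothetical data frame `toy_very_active` with made-up values.

```r
# Hypothetical toy data standing in for merged_result (all values are made up).
toy_very_active <- data.frame(
  very_active_distance_km = c(0.5, 1.2, 2.4, 2.9, 3.5, 4.8, 6.1, 7.3),
  calories_burned         = c(2400, 2550, 2500, 2450, 2800, 3000, 3300, 3500)
)

# Assign each day to a distance bin. cut() uses right-closed intervals by
# default, so the bins are (0, 3], (3, 6] and (6, Inf].
toy_very_active$distance_bin <- cut(
  toy_very_active$very_active_distance_km,
  breaks = c(0, 3, 6, Inf),
  labels = c("0-3 km", "3-6 km", "6+ km")
)

# Median calories burned within each bin.
median_calories_by_bin <- tapply(
  toy_very_active$calories_burned,
  toy_very_active$distance_bin,
  median
)
median_calories_by_bin
```

If the medians for the real data stay flat in the first bin and rise only afterward, that would support the threshold reading of the plot.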
cal_and_mod_active_dist_p <- merged_result %>%
  filter(moderately_active_distance_km > 0.00,
         calories_burned > 0.00) %>%
  ggplot(aes(x = moderately_active_distance_km,
             y = calories_burned)) +
  geom_point() +
  geom_smooth() +
  theme_classic() +
  labs(x = "Moderately Active Distance (km)",
       y = "Calories Burned",
       title = "Moderately Active Distance and Calories Burned",
       caption = "FitBit Fitness Tracker Data")
cal_and_mod_active_dist_p
Calories Burned and Moderately Active Distance. These data suggest that “moderately active” distance is not related to the number of calories burned. “Moderate” activity was associated with ~2500 calories burned regardless of how much distance it accounted for. Each data point represents one user’s data for one day of use.
cal_and_light_active_dist_p <- merged_result %>%
  filter(light_active_distance_km > 0.00,
         calories_burned > 0.00) %>%
  ggplot(aes(x = light_active_distance_km,
             y = calories_burned)) +
  geom_point() +
  geom_smooth() +
  theme_classic() +
  labs(x = "Light Active Distance (km)",
       y = "Calories Burned",
       title = "Light Active Distance and Calories Burned",
       caption = "FitBit Fitness Tracker Data")
cal_and_light_active_dist_p
Calories Burned and Light Active Distance. These data indicate that “light active” distance is related to the number of calories burned. At the lower end of “light” activity, fewer than 2000 calories are burned. As “light” activity increases, the number of calories burned increases modestly but steadily. Each data point represents one user’s data for one day of use.
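The three intensity plots can be compared on a common scale by estimating how many calories each extra kilometer is associated with. The sketch below swaps the default smoother used in the plots for a simple linear fit with `lm()`, which is a simplification of my own. The helper `slope_per_km()` and the data frame `toy_intensity` are hypothetical, and the values are made up.

```r
# Hypothetical toy data standing in for merged_result (all values are made up).
toy_intensity <- data.frame(
  light_active_distance_km      = c(1, 2, 3, 4, 5, 6),
  moderately_active_distance_km = c(0.5, 0.0, 0.9, 0.3, 0.8, 0.4),
  calories_burned               = c(1900, 2000, 2100, 2200, 2300, 2400)
)

# Slope of a simple linear fit: estimated change in calories per extra km.
# Rows with zero distance or zero calories are dropped, as in the plots.
slope_per_km <- function(data, distance_col) {
  active <- data[data[[distance_col]] > 0 & data$calories_burned > 0, ]
  fit <- lm(reformulate(distance_col, response = "calories_burned"),
            data = active)
  unname(coef(fit)[2])
}

light_slope    <- slope_per_km(toy_intensity, "light_active_distance_km")
moderate_slope <- slope_per_km(toy_intensity, "moderately_active_distance_km")
c(light = light_slope, moderate = moderate_slope)
```

A slope near zero for one intensity level and a clearly positive slope for another would match the pattern described in the captions, although a linear fit cannot capture threshold effects such as the one seen for “very active” distance.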
Over the course of a month, most users use the device most days, especially during the first 3 weeks.
Only up to 10% of users use the option to manually log activities and no user uses this option daily.
Averaged over the course of a month, users’ total distance is ~6 km. “Light active” distance (average = 3.6 km) accounts for most of the total, followed by “very active” distance (average = 1.6 km) and “moderately active” distance (average = 0.6 km).
Only 8 of the 33 users (24%) use the weight log, in which weight and BMI are recorded, and no user uses it daily. There are only two data points for body fat percentage.
Assuming that device-reported weight and BMI entries in the weight log come from a connected smart scale, it seems that users who weigh more or have higher BMIs are more likely to have smart scales. Users who manually log their weight and BMI weigh less and have lower BMIs. However, these data should be interpreted with caution due to the very small sample size.
A higher number of total steps, over the course of a month, is strongly related to greater total distance (km) across 33 users, as would be expected and in support of the reliability of the device.
As expected, greater total distance is related to more calories burned across 33 users over the course of a month. This result again supports the reliability of the device. The relationship is not as strong as the one between total distance and total steps, suggesting that the way the device derives “calories burned” is more complicated than its handling of total steps.
Changes in “light active” and “very active” distances relate to changes in the number of calories burned over the course of a month. “Moderate” activity tended to account for less total distance than “light” and “very active” activity and did not relate to changes in the number of calories burned.
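The user-share figures in this summary, such as 8 of 33 users (24%) using the weight log, come down to counting distinct users in a log and dividing by the total number of users. A minimal sketch of that arithmetic follows; the helper `share_of_users_pct()` and the IDs are hypothetical and chosen only so that 8 distinct users appear among 33.

```r
# Hypothetical user IDs (made up); the real dataset uses its own Id values.
all_user_ids <- 1:33

# One entry per weight-log record, so IDs repeat when a user logs more than once.
weight_log_ids <- c(1, 1, 4, 7, 7, 7, 12, 15, 20, 26, 31)

# Share of users with at least one record, as a whole-number percentage.
share_of_users_pct <- function(record_ids, user_ids) {
  round(100 * length(unique(record_ids)) / length(unique(user_ids)))
}

weight_log_share_pct <- share_of_users_pct(weight_log_ids, all_user_ids)
weight_log_share_pct  # 8 distinct users out of 33 rounds to 24
```

Counting distinct users rather than rows matters here because a few frequent loggers would otherwise inflate the apparent share.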
The majority of smart device users tend to forgo manual options, including the “logged activities” and weight log (weight, BMI and body fat percentage) options.
Users who weigh more or have higher BMI may be more likely to have smart scales that connect to the device, making manual logging of weight and BMI unnecessary. There were only two data points for body fat percentage, both recorded by the device, suggesting many smart scales do not yet measure it.
As with “very active” activity, increases in “light” activity are associated with increases in the number of calories burned. Although causality cannot be inferred, it may be encouraging to users that activity does not have to be strenuous to be associated with burning calories, although longer distances may be required to match the number of calories burned at shorter distances of “very active” activity.
According to the above analysis, users are unlikely to use manual input features. Bellabeat should therefore focus and invest more in automated recording features in its smart devices. Bellabeat is already doing a good job here with three smart devices, Leaf, Time and Spring, which do not require manual input. However, the app has features that require manual input, including the menstrual cycle and mindfulness features. Less time and expense should go toward these manual input features.
Smart scales enable automated recording of weight and BMI, eliminating the need for manual logging. It may be helpful for marketing to emphasize the ability of the smart device to connect with smart scales and thereby increase use of the device-associated weight log. This pairing would allow users to easily track not only activity and calories burned but also changes in weight and BMI over time. With this extra data, users may more easily develop personalized strategies to reach their goals.
If the findings above are supported by further, larger studies of smart device usage, marketing may focus on the concept that increasing engagement in even light activities is associated with increases in calories burned. This concept may be encouraging to users who do not have the ability, time or equipment to perform more strenuous activities. The device makes it easy for users to monitor strenuousness, distance and number of calories burned, and to change activity levels or duration as necessary to meet their goals.
On the flip side, engaging in even low levels of “very active” activity is associated with burning a high number of calories. Marketing should also emphasize the time-saving aspect of “very active” activity. Again, the device makes it easy for users to monitor the magnitude and duration of activity and adjust as needed to meet their calorie-burning goals.
Next steps. As mentioned by the CCO, further research on fitness-related smart devices should be done using a larger, more current and more comprehensive dataset. The current dataset has limitations due to small sample size, lack of demographic information, and a limited window of one month of data collection in 2016. The co-founders may want to wait for the marketing analytics team to assess whether the current findings hold up in a separate dataset before making costly decisions. It is important that these decisions are based on findings that extend to Bellabeat’s female customers and to extended use of smart devices.