Capstone Project

Case Study for Google Data Analytics Professional Certificate

Case Study Topic: How Can a Wellness Technology Company, Bellabeat, Play it Smart?

Guiding Questions

What are some trends in smart device usage?
How could these trends apply to Bellabeat customers?
How could these trends help influence Bellabeat marketing strategy?

How Can Your Insights Drive Business Decisions?

Trends in smart device usage may reflect habits of Bellabeat customers.
More popular trends may indicate what Bellabeat products will be popular with their customers.
Bellabeat may want to focus on and invest in products that will be popular.

Business Task: Key findings regarding trends in smart device usage will be identified in non-Bellabeat users and presented to key stakeholders including the cofounder/CCO, cofounder, and marketing analytics team. The presentation will include insights and recommendations for the marketing strategy based on the key findings.

Data Preparation

The dataset, FitBit Fitness Tracker Data, was downloaded from Kaggle (provenance at zenodo.org). It is stored on my hard drive and accessible from RStudio.
The dataset is organized in both wide and long formats. Each Id, representing one user, has data for each day recorded, resulting in many rows and fewer columns. The activities and other fitness-related measures are organized in wide format.

Bias and credibility

Reliability: The data appear to be accurate but are incomplete because no demographic data are included. The data have sample selection bias due to the collection method: only 33 users, all online-savvy, responded to a survey distributed via Amazon Mechanical Turk during a limited, dated period (March 12, 2016 - May 12, 2016). Findings may not extend to a female-only customer base. Zenodo seems to be reputable (see source).

Original: Data is original, although datasets come from second party (see provenance link above).

Comprehensive: The data is not comprehensive. There is no demographic information and the sample covers only 31 days.

Current: The data is not current; it is from 2016.

Cited: The data is cited (see provenance link above).

Licensing, privacy, security and accessibility

The dataset is listed under a CC0: Public Domain license and also carries a Creative Commons Attribution 4.0 International license. There are no personal identifiers in the data, and each participant consented to the submission of personal tracker data. The data is free and accessible to the public.

Data Integrity

According to Kaggle users who described the dataset, it is well-documented, well-maintained, clean and original. I explored and cleaned the data to further ensure integrity (see code chunks below).

Dataset Description

The data contains both automatically tracked data and manually logged data from 33 users. The daily data for each participant helps assess whether use is consistent over time, both within and across users. This allows me to determine trends in FitBit usage. Limitations include small sample size, short duration and lack of demographic information. Variables are not clearly defined, so assumptions about what these data reflect are based on field names.

Data Processing and Cleaning

I chose to use tidyverse packages, including dplyr and ggplot2, because I want to hone my expertise in R programming.

Regarding data cleaning, I kept backup copies (the original CSV files), checked the number of rows (no duplicates, 33 unique ids at most) and columns, deleted unneeded fields, filtered for unique values and blanks, cleaned field names, changed field data types as necessary, manipulated strings, checked for whitespace, and fixed dates and times using R. The cleaning process is documented in code chunks and outputs below.

Installing and Loading Packages

# install.packages("tidyverse")
library(tidyverse)

# install.packages("dplyr")
library(dplyr)

# install.packages("ggplot2")
library(ggplot2)

# install.packages("tidyr")
library(tidyr)

# install.packages("skimr") # helpful for viewing data
library(skimr) 

# install.packages("janitor") # helpful for cleaning data
library(janitor) 

# install.packages("lubridate")
library(lubridate)

# install.packages("psych") # for generating summary tables
library(psych)

Importing datasets of interest

I am using the FitBit Fitness Tracker dataset, including the Daily Activity and Weight Log data. The Daily Activity dataset contains automatically recorded device data including activity strenuousness (very active, moderate, light) and duration, number of steps, distance traveled, and number of calories burned for 33 participants over 31 days. Users could also manually log activities on the device, and this data is included in the Daily Activity dataset. The Weight Log dataset contains data including weight, BMI and percent body fat for users drawn from the same 33 participants over the same 31 days. These values are automatically recorded by the device or app (presumably via a smart scale) or manually logged by the user. These two datasets will be merged into one dataset for analyses and visualization.

daily_activity <- read.csv("DailyActivity_merged.csv") 

weight_log <- read.csv("weightLogInfo_merged.csv")

Exploring dataset formats

skim_without_charts(daily_activity)

Data summary
Name	daily_activity
Number of rows	940
Number of columns	15
_______________________
Column type frequency:
character	1
numeric	14
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
ActivityDate	0	1	8	9	0	31	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100
Id	1	4.855407e+09	2.424805e+09	1503960366	2.320127e+09	4.445115e+09	6.962181e+09	8.877689e+09
TotalSteps	1	7.637910e+03	5.087150e+03	0	3.789750e+03	7.405500e+03	1.072700e+04	3.601900e+04
TotalDistance	1	5.490000e+00	3.920000e+00	0	2.620000e+00	5.240000e+00	7.710000e+00	2.803000e+01
TrackerDistance	1	5.480000e+00	3.910000e+00	0	2.620000e+00	5.240000e+00	7.710000e+00	2.803000e+01
LoggedActivitiesDistance	1	1.100000e-01	6.200000e-01	0	0.000000e+00	0.000000e+00	0.000000e+00	4.940000e+00
VeryActiveDistance	1	1.500000e+00	2.660000e+00	0	0.000000e+00	2.100000e-01	2.050000e+00	2.192000e+01
ModeratelyActiveDistance	1	5.700000e-01	8.800000e-01	0	0.000000e+00	2.400000e-01	8.000000e-01	6.480000e+00
LightActiveDistance	1	3.340000e+00	2.040000e+00	0	1.950000e+00	3.360000e+00	4.780000e+00	1.071000e+01
SedentaryActiveDistance	1	0.000000e+00	1.000000e-02	0	0.000000e+00	0.000000e+00	0.000000e+00	1.100000e-01
VeryActiveMinutes	1	2.116000e+01	3.284000e+01	0	0.000000e+00	4.000000e+00	3.200000e+01	2.100000e+02
FairlyActiveMinutes	1	1.356000e+01	1.999000e+01	0	0.000000e+00	6.000000e+00	1.900000e+01	1.430000e+02
LightlyActiveMinutes	1	1.928100e+02	1.091700e+02	0	1.270000e+02	1.990000e+02	2.640000e+02	5.180000e+02
SedentaryMinutes	1	9.912100e+02	3.012700e+02	0	7.297500e+02	1.057500e+03	1.229500e+03	1.440000e+03
Calories	1	2.303610e+03	7.181700e+02	0	1.828500e+03	2.134000e+03	2.793250e+03	4.900000e+03

head(daily_activity)

Id	ActivityDate	TotalSteps	TotalDistance	TrackerDistance	VeryActiveDistance	ModeratelyActiveDistance	LightActiveDistance	VeryActiveMinutes	FairlyActiveMinutes	LightlyActiveMinutes	SedentaryMinutes	Calories
1503960366	4/12/2016	13162	8.50	8.50	1.88	0.55	6.06	25	13	328	728	1985
1503960366	4/13/2016	10735	6.97	6.97	1.57	0.69	4.71	21	19	217	776	1797
1503960366	4/14/2016	10460	6.74	6.74	2.44	0.40	3.91	30	11	181	1218	1776
1503960366	4/15/2016	9762	6.28	6.28	2.14	1.26	2.83	29	34	209	726	1745
1503960366	4/16/2016	12669	8.16	8.16	2.71	0.41	5.04	36	10	221	773	1863
1503960366	4/17/2016	9705	6.48	6.48	3.19	0.78	2.51	38	20	164	539	1728

Notes about “daily activity” data based on outputs:

940 rows, 15 columns
long format: user id and date = one observation (33 user ids x 31 days = 1023 possible observations, but there are only 940 rows, so not all 33 users have data for all 31 dates)
wide format for other variables
no group variables
no cells with missing values, 31 unique dates, no white space
activity date is character type, needs to be changed to date format
id data type is numeric, needs to be changed to character
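
The gap between the 1023 possible observations and the 940 actual rows can be broken down per user. The following is a minimal sketch on a made-up miniature table (`toy_activity`, with hypothetical ids `u1` to `u3`), not the real data; the same `count()` call should work on the daily activity table with its own id column.

```r
library(dplyr)

# Hypothetical miniature of the daily activity table: 3 users, up to 3 days each
toy_activity <- tibble(
  id   = c("u1", "u1", "u1", "u2", "u2", "u3"),
  date = as.Date("2016-04-12") + c(0, 1, 2, 0, 1, 0)
)

# Number of distinct dates in the study window (3 in this toy sample)
expected_days <- n_distinct(toy_activity$date)

# Count recorded days per user and how many days each user is missing
days_per_user <- toy_activity %>%
  count(id, name = "days_recorded") %>%
  mutate(missing_days = expected_days - days_recorded)

days_per_user
```

Users with `missing_days` greater than zero are the ones responsible for the shortfall.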

skim_without_charts(weight_log)

Data summary
Name	weight_log
Number of rows	67
Number of columns	8
_______________________
Column type frequency:
character	2
numeric	6
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
Date	0	1	19	21	0	56	0
IsManualReport	0	1	4	5	0	2	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100
Id	0	1.00	7.009282e+09	1.950322e+09	1.503960e+09	6.962181e+09	6.962181e+09	8.877689e+09	8.877689e+09
WeightKg	0	1.00	7.204000e+01	1.392000e+01	5.260000e+01	6.140000e+01	6.250000e+01	8.505000e+01	1.335000e+02
WeightPounds	0	1.00	1.588100e+02	3.070000e+01	1.159600e+02	1.353600e+02	1.377900e+02	1.875000e+02	2.943200e+02
Fat	65	0.03	2.350000e+01	2.120000e+00	2.200000e+01	2.275000e+01	2.350000e+01	2.425000e+01	2.500000e+01
BMI	0	1.00	2.519000e+01	3.070000e+00	2.145000e+01	2.396000e+01	2.439000e+01	2.556000e+01	4.754000e+01
LogId	0	1.00	1.461772e+12	7.829948e+08	1.460444e+12	1.461079e+12	1.461802e+12	1.462375e+12	1.463098e+12

head(weight_log)

Id	Date	WeightKg	WeightPounds	Fat	BMI	IsManualReport	LogId
1503960366	5/2/2016 11:59:59 PM	52.6	115.9631	22	22.65	True	1.462234e+12
1503960366	5/3/2016 11:59:59 PM	52.6	115.9631	NA	22.65	True	1.462320e+12
1927972279	4/13/2016 1:08:52 AM	133.5	294.3171	NA	47.54	False	1.460510e+12
2873212765	4/21/2016 11:59:59 PM	56.7	125.0021	NA	21.45	True	1.461283e+12
2873212765	5/12/2016 11:59:59 PM	57.3	126.3249	NA	21.69	True	1.463098e+12
4319703577	4/17/2016 11:59:59 PM	72.4	159.6147	25	27.45	True	1.460938e+12

Notes about “weight log” data based on outputs:

67 rows, 8 columns
no group variables
body fat is missing 65 data points; 2 unique “is manual?” (true or false), no whitespace
date is character type (date plus time of day), needs to be changed to date format
NA = missing value

Changing Data Types

daily_activity_2 <- daily_activity %>% 
  mutate(date = mdy(ActivityDate)) %>%
  select(-ActivityDate)

str(daily_activity_2$date) # confirm date format

##  Date[1:940], format: "2016-04-12" "2016-04-13" "2016-04-14" "2016-04-15" "2016-04-16" ...

weight_log_2 <- weight_log %>% 
  mutate(date = as_date(Date, format = "%m/%d/%Y %I:%M:%S %p")) %>% 
  select(-Date)

str(weight_log_2$date) # confirm date format

##  Date[1:67], format: "2016-05-02" "2016-05-03" "2016-04-13" "2016-04-21" "2016-05-12" ...
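
An alternative way to handle the weight log timestamps is to parse the full date-time first and then drop the time of day. This is only a sketch, using two timestamps copied from the `head(weight_log)` output; `raw`, `parsed` and `dates` are illustrative names.

```r
library(lubridate)

# Two timestamps copied from the head(weight_log) output
raw <- c("5/2/2016 11:59:59 PM", "4/13/2016 1:08:52 AM")

parsed <- mdy_hms(raw)     # month/day/year hour:minute:second, UTC by default
dates  <- as_date(parsed)  # keep only the calendar date

dates
```

Keeping `parsed` around would preserve the time of day in case it is ever needed, for example to see when users tend to weigh themselves.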

daily_activity_2$Id <- as.character(daily_activity_2$Id)

str(daily_activity_2$Id) # confirm character format

##  chr [1:940] "1503960366" "1503960366" "1503960366" "1503960366" ...

weight_log_2$Id <- as.character(weight_log_2$Id)

str(weight_log_2$Id) # confirm character format

##  chr [1:67] "1503960366" "1503960366" "1927972279" "2873212765" ...

Checking for Duplicates

n_distinct(daily_activity_2$Id)

## [1] 33

Notes based on output: 33 unique participants, as expected

n_distinct(weight_log_2$Id)

## [1] 8

Notes based on output: 8 unique participants. 8 users x 31 possible dates = 248 observations. However, there are only 67 rows in this table. So there are only 8 unique participants and participants don’t have weight log data for every date.

sum(duplicated(daily_activity_2))

## [1] 0

sum(duplicated(weight_log_2))

## [1] 0

Notes based on outputs: There are no duplicate rows in either dataset and no duplicates need to be removed.
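
`duplicated()` on a whole data frame only catches rows that are identical in every column. Since one observation is defined by user id and date, a stricter check is to look for repeated id/date pairs. The sketch below uses a made-up table (`toy`) in which two rows share a key but differ in step count.

```r
library(dplyr)

# Hypothetical rows: the last two share id and date but differ in steps
toy <- tibble(
  id = c("u1", "u1", "u2", "u2"),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12", "2016-04-12")),
  total_steps = c(5000, 6200, 7100, 7150)
)

# Whole-row check: finds nothing, because no two rows are fully identical
whole_row_dupes <- sum(duplicated(toy))

# Key-level check: more than one row for the same id and date
key_dupes <- toy %>%
  count(id, date) %>%
  filter(n > 1)

key_dupes
```

An empty `key_dupes` on the real tables would confirm that each user has at most one row per date.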

Cleaning Field Names

daily_activity_clean <- clean_names(daily_activity_2)
glimpse(daily_activity_clean)

## Rows: 940
## Columns: 15
## $ id                         <chr> "1503960366", "1503960366", "1503960366", "…
## $ total_steps                <int> 13162, 10735, 10460, 9762, 12669, 9705, 130…
## $ total_distance             <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9…
## $ tracker_distance           <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9…
## $ logged_activities_distance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_distance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3…
## $ moderately_active_distance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1…
## $ light_active_distance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5…
## $ sedentary_active_distance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_minutes        <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66,…
## $ fairly_active_minutes      <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, …
## $ lightly_active_minutes     <int> 328, 217, 181, 209, 221, 164, 233, 264, 205…
## $ sedentary_minutes          <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 8…
## $ calories                   <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2…
## $ date                       <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-0…

weight_log_clean <- clean_names(weight_log_2)
glimpse(weight_log_clean)

## Rows: 67
## Columns: 8
## $ id               <chr> "1503960366", "1503960366", "1927972279", "2873212765…
## $ weight_kg        <dbl> 52.6, 52.6, 133.5, 56.7, 57.3, 72.4, 72.3, 69.7, 70.3…
## $ weight_pounds    <dbl> 115.9631, 115.9631, 294.3171, 125.0021, 126.3249, 159…
## $ fat              <int> 22, NA, NA, NA, NA, 25, NA, NA, NA, NA, NA, NA, NA, N…
## $ bmi              <dbl> 22.65, 22.65, 47.54, 21.45, 21.69, 27.45, 27.38, 27.2…
## $ is_manual_report <chr> "True", "True", "False", "True", "True", "True", "Tru…
## $ log_id           <dbl> 1.462234e+12, 1.462320e+12, 1.460510e+12, 1.461283e+1…
## $ date             <date> 2016-05-02, 2016-05-03, 2016-04-13, 2016-04-21, 2016…

Filtering and Sorting Data

daily_activity_clean %>% 
  select(total_steps, 
         total_distance, 
         tracker_distance, 
         logged_activities_distance, 
         very_active_distance, 
         moderately_active_distance, 
         light_active_distance, 
         sedentary_active_distance, 
         calories) %>% 
  summary()

##   total_steps    total_distance   tracker_distance logged_activities_distance
##  Min.   :    0   Min.   : 0.000   Min.   : 0.000   Min.   :0.0000            
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 2.620   1st Qu.:0.0000            
##  Median : 7406   Median : 5.245   Median : 5.245   Median :0.0000            
##  Mean   : 7638   Mean   : 5.490   Mean   : 5.475   Mean   :0.1082            
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.: 7.710   3rd Qu.:0.0000            
##  Max.   :36019   Max.   :28.030   Max.   :28.030   Max.   :4.9421            
##  very_active_distance moderately_active_distance light_active_distance
##  Min.   : 0.000       Min.   :0.0000             Min.   : 0.000       
##  1st Qu.: 0.000       1st Qu.:0.0000             1st Qu.: 1.945       
##  Median : 0.210       Median :0.2400             Median : 3.365       
##  Mean   : 1.503       Mean   :0.5675             Mean   : 3.341       
##  3rd Qu.: 2.053       3rd Qu.:0.8000             3rd Qu.: 4.782       
##  Max.   :21.920       Max.   :6.4800             Max.   :10.710       
##  sedentary_active_distance    calories   
##  Min.   :0.000000          Min.   :   0  
##  1st Qu.:0.000000          1st Qu.:1828  
##  Median :0.000000          Median :2134  
##  Mean   :0.001606          Mean   :2304  
##  3rd Qu.:0.000000          3rd Qu.:2793  
##  Max.   :0.110000          Max.   :4900

Notes based on output:

A max of 36019 total steps is very high but plausible: it is consistent with the max total/tracker distance of ~28 km in a day.
For sedentary minutes, the max is 1440, or 24 hr. Will check to see if this is an outlier.
Note that logged activity distance is very low (mean = 0.1082 km) because few users logged activities (most values = 0).
All other fields check out (reasonable min, max, agreement between each other)
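
The agreement between the step and distance maxima can be checked with simple arithmetic. This sketch assumes both maxima come from the same user-day, which the summary output alone does not establish, and that distances are in km (an assumption based on field names).

```r
# Maxima taken from the summary output
max_steps <- 36019
max_km    <- 28.03

# Implied average distance covered per step, in metres
stride_m <- max_km * 1000 / max_steps

round(stride_m, 2)
```

A value of roughly three quarters of a metre per step is in the range of an ordinary walking or running stride, so the two maxima do not contradict each other.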

weight_log_clean %>% 
  select(weight_pounds, 
         fat, 
         bmi, 
         is_manual_report) %>% 
  summary()

##  weight_pounds        fat             bmi        is_manual_report  
##  Min.   :116.0   Min.   :22.00   Min.   :21.45   Length:67         
##  1st Qu.:135.4   1st Qu.:22.75   1st Qu.:23.96   Class :character  
##  Median :137.8   Median :23.50   Median :24.39   Mode  :character  
##  Mean   :158.8   Mean   :23.50   Mean   :25.19                     
##  3rd Qu.:187.5   3rd Qu.:24.25   3rd Qu.:25.56                     
##  Max.   :294.3   Max.   :25.00   Max.   :47.54                     
##                  NA's   :65

Notes based on output:

all fields check out (reasonable min, max, agreement between each other)
NA = missing data

weight_log_clean$fat

##  [1] 22 NA NA NA NA 25 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Notes based on output: only two non-NA values for fat percentage. Will not keep this variable or use it in analyses.

# determine if high "total_steps" values are outliers or possible errors

daily_act_clean_steps_test1 <- daily_activity_clean %>% 
  filter(total_steps > 10000) %>%  # round cutoff just below the 75th percentile (10727 steps)
  select(id, 
         total_steps, 
         total_distance, 
         tracker_distance) %>%
  slice_max(order_by = total_steps, n = 10) %>% 
  arrange(desc(total_steps))

daily_act_clean_steps_test1

id	total_steps	total_distance	tracker_distance
1624580081	36019	28.03	28.03
8877689391	29326	25.29	25.29
8877689391	27745	26.72	26.72
8877689391	23629	20.65	20.65
8877689391	23186	20.40	20.40
8053475328	22988	17.95	17.95
4388161847	22770	17.54	17.54
8053475328	22359	17.19	17.19
2347167796	22244	15.08	15.08
8053475328	22026	17.65	17.65

daily_act_clean_steps_test2 <- daily_activity_clean %>% 
  filter(id == "1624580081") %>%  # id was converted to character above
  select(total_steps, 
         total_distance,
         tracker_distance) %>% 
  slice_max(order_by = total_steps, n = 10) %>% 
  arrange(desc(total_steps))

daily_act_clean_steps_test2

total_steps	total_distance	tracker_distance
36019	28.03	28.03
10536	7.41	7.41
9107	5.92	5.92
8538	5.55	5.55
8367	5.44	5.44
8163	5.31	5.31
7155	4.93	4.93
7007	4.55	4.55
6497	4.22	4.22
6474	4.30	4.30

Notes based on outputs:

While 36019 steps is high, other users also had high values (the next highest is 29326 steps, from a different user).
For the user with 36019 steps, the next-highest day was 10536 steps, so this day is unusual for that user as well.
With no specific reason to think these data are invalid, this user’s data will remain in the analysis.

# determine if "sedentary" values are outliers or possible errors

daily_act_clean_sed_test <- daily_activity_clean %>% 
  filter(sedentary_minutes > 1229) %>%  # 1229 min is 75th percentile 
  select(id,
         total_steps,
         total_distance,
         tracker_distance,
         sedentary_active_distance,
         sedentary_minutes) %>% 
  slice_max(order_by = total_steps, n = 10) %>% 
  arrange(desc(sedentary_minutes))

daily_act_clean_sed_test

id	total_steps	total_distance	tracker_distance	sedentary_minutes
8583815059	12015	9.37	9.37	1440
4388161847	10122	7.78	7.78	1440
8583815059	12427	9.69	9.69	1370
8253242879	10232	8.18	8.18	1286
4388161847	10993	8.45	8.45	1275
8053475328	10520	8.29	8.29	1260
8053475328	14549	11.11	11.11	1255
8053475328	13953	11.00	11.00	1245
8253242879	10204	7.91	7.91	1237
2022484408	10100	7.09	7.09	1237

Notes based on output: Several observations have sedentary minutes equal or close to 1440 (24 hr), yet some of these same observations also record a high number of steps and a long distance. Because of this apparent error or inaccuracy, sedentary_minutes will not be included in further analyses. The same applies to sedentary_active_distance (renamed sedentary_active_distance_km below).
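
A related consistency check is to add up the four minute fields for each day. A day has 1440 minutes, so a row where sedentary minutes alone reach 1440 leaves no room for the activity recorded in the other fields. The sketch below uses the minute values of the first three rows of the `head(daily_activity)` output; `mins`, `tracked_minutes` and `full_day` are illustrative names.

```r
library(dplyr)

# Minute fields from the first three rows of head(daily_activity)
mins <- tibble(
  very_active_minutes    = c(25, 21, 30),
  fairly_active_minutes  = c(13, 19, 11),
  lightly_active_minutes = c(328, 217, 181),
  sedentary_minutes      = c(728, 776, 1218)
)

# Total minutes accounted for per day, and whether they cover a full day
mins <- mins %>%
  mutate(tracked_minutes = very_active_minutes + fairly_active_minutes +
           lightly_active_minutes + sedentary_minutes,
         full_day = tracked_minutes == 1440)

mins
```

Rows whose total exceeds 1440 would point to a recording error, while totals below 1440 may simply reflect time the device was not worn.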

Renaming field names for clarity

daily_activity_clean <- daily_activity_clean %>% 
  dplyr::rename_at(vars(-id,
                        -total_steps,
                        -very_active_minutes,
                        -fairly_active_minutes,
                        -lightly_active_minutes,
                        -sedentary_minutes, 
                        -calories, 
                        -date),
                   paste0,
                   "_km") # append "_km" to every distance field
# Note: rename only ran with the dplyr:: prefix ("error in rename: unused
# argument"), possibly due to the R version or a conflict with another package.

daily_activity_clean <- daily_activity_clean %>% 
  dplyr::rename(calories_burned = calories)


colnames(daily_activity_clean) # confirm changes in field names

##  [1] "id"                            "total_steps"                  
##  [3] "total_distance_km"             "tracker_distance_km"          
##  [5] "logged_activities_distance_km" "very_active_distance_km"      
##  [7] "moderately_active_distance_km" "light_active_distance_km"     
##  [9] "sedentary_active_distance_km"  "very_active_minutes"          
## [11] "fairly_active_minutes"         "lightly_active_minutes"       
## [13] "sedentary_minutes"             "calories_burned"              
## [15] "date"

weight_log_clean <- weight_log_clean %>% 
  dplyr::rename(weight_lb = weight_pounds,
                BMI = bmi)

colnames(weight_log_clean) # confirm changes in field names

## [1] "id"               "weight_kg"        "weight_lb"        "fat"             
## [5] "BMI"              "is_manual_report" "log_id"           "date"

Merging Data

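Since there are many more observations in the daily activity table than in the weight log table, I use a left join: every daily activity row is kept, and weight data is added where a row with the same id and date exists. The columns used for merging must have identical names in both tables.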
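
The behavior of a left join on id and date can be sketched on two made-up tables (`activity` and `weight`, with hypothetical ids). The `anti_join()` at the end lists weight rows that have no matching activity row and are therefore lost in the merge.

```r
library(dplyr)

# Hypothetical activity table: three user-days
activity <- tibble(
  id = c("u1", "u1", "u2"),
  date = as.Date(c("2016-05-02", "2016-05-03", "2016-05-02")),
  total_steps = c(9000, 4000, 12000)
)

# Hypothetical weight log: one match (u1) and one user with no activity (u3)
weight <- tibble(
  id = c("u1", "u3"),
  date = as.Date(c("2016-05-02", "2016-05-02")),
  weight_lb = c(116, 150)
)

# Keep every activity row; weight_lb is NA where no weight entry matches
joined <- left_join(activity, weight, by = c("id", "date"))

# Weight rows without a matching activity row are dropped by the left join
unmatched_weight <- anti_join(weight, activity, by = c("id", "date"))

joined
unmatched_weight
```

Comparing the row count before and after the join is a quick way to confirm that no activity rows were duplicated by multiple weight entries on the same day.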

merged_result <- left_join(daily_activity_clean, 
                          weight_log_clean, 
                          by = c("id", "date"))

glimpse(merged_result)

## Rows: 940
## Columns: 21
## $ id                            <chr> "1503960366", "1503960366", "1503960366"…
## $ total_steps                   <int> 13162, 10735, 10460, 9762, 12669, 9705, …
## $ total_distance_km             <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59…
## $ tracker_distance_km           <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59…
## $ logged_activities_distance_km <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_distance_km       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25…
## $ moderately_active_distance_km <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64…
## $ light_active_distance_km      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71…
## $ sedentary_active_distance_km  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_minutes           <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, …
## $ fairly_active_minutes         <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 2…
## $ lightly_active_minutes        <int> 328, 217, 181, 209, 221, 164, 233, 264, …
## $ sedentary_minutes             <int> 728, 776, 1218, 726, 773, 539, 1149, 775…
## $ calories_burned               <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921…
## $ date                          <date> 2016-04-12, 2016-04-13, 2016-04-14, 201…
## $ weight_kg                     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ weight_lb                     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ fat                           <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ BMI                           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ is_manual_report              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ log_id                        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

Removing data fields that will not be included in analyses

merged_result <- merged_result %>% 
  select(-sedentary_active_distance_km, 
         -very_active_minutes,
         -fairly_active_minutes,
         -lightly_active_minutes,
         -sedentary_minutes, 
         -weight_kg,
         -fat,
         -log_id)

glimpse(merged_result)

## Rows: 940
## Columns: 13
## $ id                            <chr> "1503960366", "1503960366", "1503960366"…
## $ total_steps                   <int> 13162, 10735, 10460, 9762, 12669, 9705, …
## $ total_distance_km             <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59…
## $ tracker_distance_km           <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59…
## $ logged_activities_distance_km <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_distance_km       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25…
## $ moderately_active_distance_km <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64…
## $ light_active_distance_km      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71…
## $ calories_burned               <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921…
## $ date                          <date> 2016-04-12, 2016-04-13, 2016-04-14, 201…
## $ weight_lb                     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ BMI                           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ is_manual_report              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

Transforming factor level names for clarity

merged_result <- merged_result %>% 
  mutate(
    is_manual_report = fct_recode(as.factor(is_manual_report),
                                  Manual = "True",
                                  Device = "False")
  )

head(merged_result$is_manual_report)

## [1] <NA> <NA> <NA> <NA> <NA> <NA>
## Levels: Device Manual

Removing zero values: We assume that days with zero total distance traveled are equivalent to days when the device is not worn.

merged_result <- merged_result %>% 
  filter(total_distance_km > 0)

Data Analyses and Visualization

Descriptive Statistics

describe(merged_result[2:9])

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
total_steps	1	862	8329.0394432	4739.2469470	8053.50	8051.5898551	4608.662100	8.00	36019.000000	36011.000000	0.8164355	1.7900671	161.4193916
total_distance_km	2	862	5.9864501	3.7176164	5.59	5.6758116	3.358089	0.01	28.030001	28.020001	1.3289480	3.9830234	0.1266225
tracker_distance_km	3	862	5.9708005	3.6997561	5.59	5.6654493	3.343263	0.01	28.030001	28.020001	1.3422628	4.1052666	0.1260142
logged_activities_distance_km	4	862	0.1179590	0.6464734	0.00	0.0000000	0.000000	0.00	4.942142	4.942142	5.9904088	37.1787329	0.0220190
very_active_distance_km	5	862	1.6386543	2.7363079	0.41	1.0151884	0.607866	0.00	21.920000	21.920000	2.8543068	10.8089856	0.0931990
moderately_active_distance_km	6	862	0.6188979	0.9053288	0.31	0.4291304	0.459606	0.00	6.480000	6.480000	2.6525254	9.2645756	0.0308356
light_active_distance_km	7	862	3.6431206	1.8544341	3.58	3.6111304	1.890315	0.00	10.710000	10.710000	0.3002367	0.1816482	0.0631623
calories_burned	8	862	2362.4709977	702.2695833	2220.50	2316.0362319	714.613200	52.00	4900.000000	4848.000000	0.5474199	0.2092724	23.9193969

describe(merged_result[11:12])

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
weight_lb	1	67	158.81180	30.695415	137.7889	157.06533	21.899432	115.9631	294.3171	178.354	1.308951	3.299141	3.7500419
BMI	2	67	25.18522	3.066963	24.3900	24.83964	1.363992	21.4500	47.5400	26.090	5.734248	39.243981	0.3746891

User Trends in Device vs Manual Use

# device use

percent_count_daily_activity_device <- daily_activity_clean %>% 
  filter(total_distance_km > 0.00) %>% 
  group_by(date) %>% 
  summarize(n = n()) %>% 
  mutate(count = n, percent = (n/33)*100) 

head(percent_count_daily_activity_device)

date	n	count	percent
2016-04-12	31	31	93.93939
2016-04-13	31	31	93.93939
2016-04-14	31	31	93.93939
2016-04-15	33	33	100.00000
2016-04-16	31	31	93.93939
2016-04-17	29	29	87.87879

# manual log use

percent_count_logged_activity <- daily_activity_clean %>% 
  filter(logged_activities_distance_km > 0.00) %>% 
  group_by(date) %>% 
  summarize(n = n()) %>% 
  mutate(count = n, percent = (n/33)*100)

head(percent_count_logged_activity)

date	n	count	percent
2016-04-12	2	2	6.060606
2016-04-13	2	2	6.060606
2016-04-14	2	2	6.060606
2016-04-18	2	2	6.060606
2016-04-19	2	2	6.060606
2016-04-20	2	2	6.060606

# device use plot

ggplot(percent_count_daily_activity_device, 
       aes(x = date, y = percent)) + 
  geom_col(fill = "blue") + 
  labs(x = "Date", 
       y = "Percent of Users", 
       title = "Daily Device Use",
       caption = "FitBit Fitness Tracker Data") + 
  theme_classic()

Daily Device Use. Most of the 33 participants used the device daily, especially at the beginning of the study. The data used for this plot was filtered for total distance traveled > 0 km, on the assumption that 0 km indicates the device was not used on that date. Interestingly, device usage dropped to almost 50% during the last week of the study. We would need to know more about the study design and users to interpret this drop. For example, were devices brand new when users agreed to contribute their data? If so, the data may be biased toward the frequent use typical of a newly purchased device, which tapers off as the novelty wears off or after the battery dies for the first time.
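
The percentage of users active on each date can also be computed with the number of users taken from the data rather than typed in as 33. This is a sketch on a made-up table (`toy`, three hypothetical users over two days), using the same "distance > 0 means worn" assumption as above.

```r
library(dplyr)

# Hypothetical table: 3 users over 2 days; 0 km is treated as "device not worn"
toy <- tibble(
  id = c("u1", "u2", "u3", "u1", "u2", "u3"),
  date = as.Date(c(rep("2016-04-12", 3), rep("2016-04-13", 3))),
  total_distance_km = c(5.2, 0, 3.1, 4.8, 0, 0)
)

# Denominator: all users in the table, counted before filtering
n_users <- n_distinct(toy$id)

daily_use <- toy %>%
  filter(total_distance_km > 0) %>%
  group_by(date) %>%
  summarize(users_active = n_distinct(id), .groups = "drop") %>%
  mutate(percent = users_active / n_users * 100)

daily_use
```

Counting users before the filter matters: a user who never wore the device would otherwise silently drop out of the denominator.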

# manual log use plot

ggplot(percent_count_logged_activity, aes(x = date, y = percent)) + 
  geom_col(fill = "purple") + 
  labs(x = "Date", 
       y = "Percent of users", 
       title = "Daily Activity Log Use",
       caption = "FitBit Fitness Tracker Data") + 
  theme_classic() + 
  ylim(0, 25)

Daily Activity Log Use. In contrast to device use, fewer than 11% of users manually logged activities and no user logged activities daily.

# Convert data to long format

pivot_long_distance <- merged_result %>%
  pivot_longer(cols = total_distance_km:light_active_distance_km,
               names_to = "distance", 
               values_to = "km")

head(pivot_long_distance$distance)

## [1] "total_distance_km"             "tracker_distance_km"          
## [3] "logged_activities_distance_km" "very_active_distance_km"      
## [5] "moderately_active_distance_km" "light_active_distance_km"

head(pivot_long_distance$km)

## [1] 8.50 8.50 0.00 1.88 0.55 6.06

pivot_long_weight <- merged_result %>%
  drop_na() %>%  # drops every row with an NA, leaving only days with weight log data
  pivot_longer(cols = weight_lb:BMI,
               names_to = "variable", 
               values_to = "weight")

head(pivot_long_weight$variable)

## [1] "weight_lb" "BMI"       "weight_lb" "BMI"       "weight_lb" "BMI"

head(pivot_long_weight$weight)

## [1] 115.9631  22.6500 115.9631  22.6500 294.3171  47.5400

pivot_long_distance_p <- pivot_long_distance %>% 
  group_by(distance) %>% 
  summarize(mean_distance = mean(km), 
            sd_distance = sd(km)) %>% 
  ggplot(aes(x = distance, 
             y = mean_distance, 
             fill = distance)) + 
  geom_col() + 
  geom_errorbar(aes(ymax = mean_distance + sd_distance,
                    ymin = mean_distance)) +
  theme_classic() +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank()) + 
  labs(x = "Activity Type", 
       y = "Average + SD km",
       title = "Average Distance by Activity Type",
       caption = "FitBit Fitness Tracker Data") 

pivot_long_distance_p

Average Distance by Activity Type. On average, lighter activities accounted for the majority of total distance traveled by users, followed by very active and moderate activities. Manually logged distance was low compared to device-recorded distances (all others shown). As expected and consistent with appropriate use of the device, average total and tracker distances were nearly identical. These activities were averaged across 33 participants over up to 31 days of device usage.

Weight Log Use Trends

manual_vs_device_weight_lb <- merged_result %>%
  drop_na(weight_lb) %>% 
  ggplot(aes(x = is_manual_report, 
             y = weight_lb)) +
  geom_boxplot(aes(fill = is_manual_report), 
               na.rm = TRUE, 
               show.legend = FALSE) +
  theme_classic() + 
  labs(title = "Weight as Reported by Device vs. Manual Input", 
       y = "Weight (lb), median and IQR", 
       caption = "FitBit Fitness Tracker Data") + 
  theme(axis.title.x = element_blank())  

manual_vs_device_weight_lb

Weight as Reported by Device vs. Manual Input. Only 8 participants contributed to the weight log, and for varying numbers of days. The horizontal bar in each box is the median weight of the sample, split by whether the weight was manually logged by the participant or recorded by the device. Individual points are outliers, which geom_boxplot() by default draws for values more than 1.5 times the interquartile range (IQR) beyond the edges of the box. These results suggest that individuals who weigh more tend to have smart scales compared to individuals who weigh less, with the caveat that this is a very small sample.

manual_vs_device_bmi <- merged_result %>%
  drop_na(BMI) %>% 
  ggplot(aes(x = is_manual_report, y = BMI)) +
  geom_boxplot(aes(fill = is_manual_report), 
               na.rm = TRUE, 
               show.legend = FALSE) +
  theme_classic() + 
  labs(title = "BMI as Reported by Device vs. Manual Input", 
       y = "BMI (kg/m^2): median and IQR", 
       caption = "FitBit Fitness Tracker Data") + 
  theme(axis.title.x = element_blank())  

manual_vs_device_bmi

Body Mass Index (BMI) as Reported by Device vs. Manual Input. The horizontal bar in each box marks the median BMI for entries either logged manually by the participant or recorded by the device, and the box spans the interquartile range (IQR). Any individual points are outliers, which geom_boxplot() draws by default for values more than 1.5 × IQR beyond the edges of the box. These results suggest that individuals who have higher BMIs are more likely to have smart scales than individuals who have lower BMIs, with the caveat that this is a very small sample.
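For reference, the outlier rule that geom_boxplot() applies with its default setting (coef = 1.5) can be reproduced by hand. The sketch below uses base R and a made-up vector, toy_weight_lb, which is an illustrative assumption and not taken from the FitBit data.

```r
# Toy weights in lb (made-up values, NOT from the FitBit dataset).
toy_weight_lb <- c(120, 125, 130, 132, 135, 138, 140, 200)

# First and third quartiles, i.e. the lower and upper edges of the box.
q <- quantile(toy_weight_lb, probs = c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]

# Fences at 1.5 x IQR beyond the box edges (the default coef = 1.5).
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[2]] + 1.5 * iqr

# Values outside the fences would be drawn as individual points.
outliers <- toy_weight_lb[toy_weight_lb < lower_fence |
                            toy_weight_lb > upper_fence]
outliers
```

Applying the same steps to weight_lb or BMI within each is_manual_report group would show which entries the plots above draw as points.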

Exploring Relationships between Variables

# is the number of total steps related to total distance or calories burned?


total_steps_and_total_distance_p <- merged_result %>% 
  filter(total_distance_km > 0.00,
         total_steps > 0.00) %>% 
  ggplot(aes(x = total_steps, 
             y = total_distance_km)) + 
  geom_point() + 
  geom_smooth() + 
  theme_classic() + 
  labs(x = "Total Steps",
       y = "Total Distance (km)",
       title = "Total Steps and Distance Traveled",
       caption = "FitBit Fitness Tracker Data")

  
total_steps_and_total_distance_p

Total Steps and Distance Traveled. The higher the number of total steps, the greater the distance traveled, as recorded by the device. The relationship appears strongly positive, which is a reassuring sign that the device is reliable, since these two variables should be strongly related. Each point is one user’s record for one day.

total_dist_and_calories_p <- merged_result %>% 
  filter(total_distance_km > 0.00,
         calories_burned > 0.00) %>% 
  ggplot(aes(x = total_distance_km,
             y = calories_burned)) +
  geom_point() +
  geom_smooth() + 
  theme_classic() + 
  labs(x = "Total Distance (km)",
       y = "Calories Burned",
       title = "Total Distance and Calories Burned",
       caption = "FitBit Fitness Tracker Data")


total_dist_and_calories_p

Total Distance and Calories Burned. The greater the total distance, the more calories burned, as recorded by the device. The relationship is clearly positive, which is a reassuring sign that the device is working as it should, since these two variables should be related. It appears weaker than the previous one (total steps and distance traveled), probably because the device’s method of deriving calories burned is more complicated and may depend on user characteristics such as gender and weight. Alternatively, the device may not derive calories burned as accurately as it does distance. Each point is one user’s record for one day of use.
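The strength of the two relationships above was judged from the scatter plots. One way to put a number on it is Pearson's correlation coefficient via cor(). The sketch below runs on simulated data: the toy data frame and the coefficients used to generate it are assumptions for illustration only, and only the column names mirror merged_result.

```r
set.seed(42)
n <- 200

# Simulated daily records (NOT the FitBit data); the generating
# coefficients are arbitrary choices for illustration.
toy <- data.frame(total_steps = round(runif(n, 500, 20000)))
toy$total_distance_km <- toy$total_steps * 0.0007 + rnorm(n, sd = 0.3)
toy$calories_burned <- 1800 + 90 * toy$total_distance_km + rnorm(n, sd = 400)

# Pearson correlation for each pair of variables.
r_steps_distance <- cor(toy$total_steps, toy$total_distance_km)
r_distance_calories <- cor(toy$total_distance_km, toy$calories_burned)

round(c(steps_vs_distance = r_steps_distance,
        distance_vs_calories = r_distance_calories), 2)
```

On the real data, the same two calls on the filtered merged_result columns would give the coefficients to report. Because each user contributes many daily rows, the observations are not independent, so any p-value from cor.test() should be read with caution.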

# are calories burned related to very active, moderately active, or light active distance?


cal_and_very_active_dist_p <- merged_result %>% 
  filter(very_active_distance_km > 0.00,
         calories_burned > 0.00) %>%  
  ggplot(aes(x = very_active_distance_km, 
             y = calories_burned)) + 
  geom_point()  + 
  geom_smooth() + 
  theme_classic() + 
  labs(x = "Very Active Distance (km)",
       y = "Calories Burned",
       title = "Very Active Distance and Calories Burned",
       caption = "FitBit Fitness Tracker Data")

  
cal_and_very_active_dist_p

Calories Burned and Very Active Distance. Overall, these data indicate that the number of calories burned increases with higher levels of “very active” activity. Even at lower levels, however, “very active” activity was associated with a high number of calories burned (~2500). Interestingly, the number of calories burned does not seem to change appreciably until “very active” activity accounts for at least 3 km of distance. The data points reflect each user’s data recorded on each day of use.

cal_and_mod_active_dist_p <- merged_result %>% 
  filter(moderately_active_distance_km > 0.00,
         calories_burned > 0.00) %>% 
  ggplot(aes(x = moderately_active_distance_km, 
             y = calories_burned)) + 
  geom_point() + 
  geom_smooth() + 
  theme_classic() + 
  labs(x = "Moderately Active Distance (km)",
       y = "Calories Burned",
       title = "Moderately Active Distance and Calories Burned",
       caption = "FitBit Fitness Tracker Data")


  
cal_and_mod_active_dist_p

Calories Burned and Moderately Active Distance. These data indicate that changes in “moderate” activity levels do not relate to changes in the number of calories burned. “Moderate activity” was associated with ~2500 calories burned regardless of how much distance it accounted for. The data points reflect each user’s data recorded on each day of use.

cal_and_light_active_dist_p <- merged_result %>% 
  filter(light_active_distance_km > 0.00,
         calories_burned > 0.00) %>%  
  ggplot(aes(x = light_active_distance_km, 
             y = calories_burned)) + 
  geom_point() + 
  geom_smooth() + 
  theme_classic() + 
  labs(x = "Light Active Distance (km)",
       y = "Calories Burned",
       title = "Light Active Distance and Calories Burned",
       caption = "FitBit Fitness Tracker Data")


  
cal_and_light_active_dist_p

Calories Burned and Light Active Distance. These data indicate that “light activity” relates to the number of calories burned. At the lower end of “light activity”, fewer than 2000 calories are burned. With increasing “light activity”, the number of calories burned increases modestly but steadily. The data points reflect each user’s data recorded on each day of use.
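The three plots above compare how calories burned change with distance at each activity level. A rough way to make that comparison numeric is to fit a simple linear model per activity level and compare the slopes (calories per km). This swaps a straight line in for the default geom_smooth() curve, so it is a simplification, and the data below are simulated: toy and its generating coefficients are assumptions, with only the column names borrowed from merged_result.

```r
set.seed(1)
n <- 150

# Simulated daily records (NOT the FitBit data).
toy <- data.frame(
  very_active_distance_km       = runif(n, 0.1, 8),
  moderately_active_distance_km = runif(n, 0.1, 3),
  light_active_distance_km      = runif(n, 0.1, 10)
)
toy$calories_burned <- 1900 +
  120 * toy$very_active_distance_km +
  80 * toy$light_active_distance_km +
  rnorm(n, sd = 300)

distance_cols <- c("very_active_distance_km",
                   "moderately_active_distance_km",
                   "light_active_distance_km")

# One simple regression per activity level; keep only the slope.
slopes <- sapply(distance_cols, function(col) {
  fit <- lm(reformulate(col, response = "calories_burned"), data = toy)
  unname(coef(fit)[col])
})

round(slopes, 1)
```

A slope near zero for a given activity level would match the flat pattern described for “moderate” activity, while clearly positive slopes would match the “very active” and “light” patterns. A straight line will not capture threshold effects such as the ~3 km change noted for “very active” distance.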

Final Report

Trends and Relationships in Fitness Device Users

Over the course of a month, most users use the device most days, especially during the first 3 weeks.
Only up to 10% of users use the option to manually log activities and no user uses this option daily.
On average, users’ total distance per day, over the course of a month, is ~6 km. “Light active” distance (average = 3.6 km) accounts for most of total distance, followed by “very active” (average = 1.6 km) and “moderately active” distance (average = 0.6 km).
Only 8 of 33 users (24%) use the weight log, in which weight and BMI were recorded, and no user uses it daily. There are only two data points for body fat percentage.
Assuming that device-reported weight and BMI in the weight log are enabled via connection with a smart scale, it seems that users who weigh more or have higher BMIs are more likely to have smart scales. Users who manually log their weight and BMI weigh less and have lower BMIs. However, these data should be interpreted with caution due to the very small sample size.
A higher number of total steps, over the course of a month, is strongly related to greater total distance (km) across 33 users, as would be expected and in support of the reliability of the device.
As expected, greater total distance is related to more calories burned across 33 users over the course of a month. This result again supports the reliability of the device. The relationship is not as strong as the one between total distance and total steps, suggesting that the device’s derivation of “calories burned” is more complicated than its derivation of distance from steps.
Changes in “light active” and “very active” distances relate to changes in number of calories burned over the course of a month. “Moderate activity” tended to account for less total distance than “light” and “very active” activity and did not relate to changes in number of calories burned.
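The usage shares quoted in this list (for example, 8 of 33 users for the weight log) are proportions of distinct users with at least one entry. A minimal base R sketch of that calculation follows. The toy_log data frame and its id column are made up for illustration and are assumptions, not the actual structure of merged_result.

```r
# Toy weight log (made-up values, NOT from the FitBit dataset):
# one row per user per day, NA where no weight was recorded.
toy_log <- data.frame(
  id        = c(1, 1, 2, 2, 3, 3, 4, 4),
  weight_lb = c(150, NA, NA, NA, 180, 181, NA, NA)
)

# Distinct users overall, and distinct users with at least one entry.
users_total <- length(unique(toy_log$id))
users_with_weight <- length(unique(toy_log$id[!is.na(toy_log$weight_lb)]))

# Share of users who used the weight log at least once, in percent.
pct_weight_log_users <- 100 * users_with_weight / users_total
pct_weight_log_users

# The report's own figure, 8 of 33 users, rounds to 24%.
round(100 * 8 / 33)
```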

Key Findings

The majority of smart device users tend to forgo manual options, including the “logged activities” and weight log (weight, BMI and body fat percentage) options.
Users who weigh more or have higher BMI may be more likely to have smart scales that connect to the device, making manual logging of weight and BMI unnecessary. There were only two data points for body fat percentage, both recorded by the device, suggesting many smart scales do not yet measure it.
Like “very active” activity, increases in “light activity” are associated with increases in number of calories burned. Although causality cannot be inferred, it may be encouraging to users that activity does not have to be strenuous for it to be associated with burning calories, although longer distances may be required to reach the number of calories burned at shorter distances of “very active” activity.

Recommendations for Bellabeat Marketing Strategy Based on Key Findings

According to the above analysis, users are unlikely to use manual input features. Focus and invest more in automated recording features in smart devices. Bellabeat is already doing a good job of focusing on three smart devices including Leaf, Time and Spring, which do not require manual input. However, the app has features that require manual input including menstrual cycle and mindfulness. Focus less time and expense on these manual input features.
Smart scales enable automated recording of weight and BMI, eliminating the need for manual logging. It may be helpful for marketing to emphasize the ability of the smart device to connect with smart scales and thereby increase use of the device-associated weight log. This combination would allow users to easily track not only activity and calories burned but also changes in weight and BMI over time. With this extra data, users may more easily develop personalized strategies to reach their goals.
If the findings above are supported by further, larger studies of smart device usage, marketing may focus on the concept that increasing engagement in even light activities is associated with increases in calories burned. This concept may be encouraging to users who do not have the ability, time or equipment to perform more strenuous activities. The device makes it easy to monitor strenuousness, distance and number of calories burned and change activity levels or duration as necessary to meet their goals.
On the flip side, engaging in even low levels of “very active” activity is associated with burning a high number of calories. Marketing should also focus on emphasizing the time-saving aspect of “very active” activity. Again, the device makes it easy to monitor magnitude and duration of activity and adjust as needed to meet their goals in terms of burning calories.
Next steps. As mentioned by the CCO, further research on fitness-related smart devices should be done using a more recent, larger, and more comprehensive dataset. The current dataset has limitations due to small sample size, lack of demographic information, and a limited window of one month of data collection in 2016. The co-founders may want to wait for the marketing analytics team to assess whether the current findings hold up in such a dataset before making costly decisions. It is important that these decisions are based on findings that extend to Bellabeat’s female customers and to extended use of smart devices.