Case Study for Google Data Analytics Professional Certificate

Case Study Topic: How Can a Wellness Technology Company, Bellabeat, Play it Smart?

Guiding Questions

  1. What are some trends in smart device usage?
  2. How could these trends apply to Bellabeat customers?
  3. How could these trends help influence Bellabeat marketing strategy?

How Can Your Insights Drive Business Decisions?

  1. Trends in smart device usage may reflect habits of Bellabeat customers.
  2. More popular trends may indicate what Bellabeat products will be popular with their customers.
  3. Bellabeat may want to focus on and invest in products that will be popular.

Data Preparation

  • The FitBit Fitness Tracker Data dataset was downloaded from Kaggle (provenance at zenodo.org), stored on my hard drive, and accessed through RStudio.

  • The dataset is organized in both wide and long formats. It is long with respect to user and date: each row is one Id (one user) on one recorded day, which yields many rows and fewer columns. It is wide with respect to measures: the activities and other fitness-related measures each occupy their own column.
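To make the long/wide distinction concrete, here is a minimal base R sketch. The ids, day labels and step counts are made up for illustration and are not taken from the dataset.

```r
# Toy data in long format: one row per user per day (all values are made up).
long <- data.frame(
  id    = c("A", "A", "B", "B"),
  day   = c("d1", "d2", "d1", "d2"),
  steps = c(1000L, 2000L, 3000L, 4000L)
)

# Reshape to wide format: one row per user, one steps column per day.
wide <- reshape(long, idvar = "id", timevar = "day", direction = "wide")

dim(long)    # rows and columns of the long table
dim(wide)    # fewer rows, one column per day
names(wide)
```

The long layout grows by rows as days are added, while the wide layout grows by columns.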

Bias and credibility

Reliability: The data appears to be accurate but is incomplete because it lacks any demographic information. The data has sample selection bias due to the collection method: only 33 users, all online-savvy, responded to a survey distributed via Amazon Mechanical Turk during a limited, dated period (listed as March 12, 2016 - May 12, 2016, although the daily records used here cover 31 dates starting April 12, 2016). As a result, findings may not extend to a female-only customer base. Zenodo seems to be reputable (see source).

Original: Data is original, although datasets come from second party (see provenance link above).

Comprehensive: The data is not comprehensive. It contains no demographic information and covers only 31 days.

Current: The data is not current; it is from 2016.

Cited: The data is cited (see provenance link above).

Licensing, privacy, security and accessibility

The dataset is listed under CC0: Public Domain and also under a Creative Commons Attribution 4.0 International license. There are no personal identifiers in the data, and each participant consented to the submission of personal tracker data. The data is accessible and free to the public.

Data Integrity

According to Kaggle users who described the dataset, it is well-documented, well-maintained, clean and original. I explored and cleaned the data to further ensure integrity (see code chunks below).

Dataset Description

The data contains both automatically tracked data and manually logged data from 33 users. The daily data for each participant helps assess whether use is consistent over time, both within and across users. This allows me to determine trends in FitBit usage. Limitations include small sample size, short duration and lack of demographic information. Variables are not clearly defined, so assumptions about what these data reflect are based on field names.


Data Processing and Cleaning

I chose to use tidyverse packages, including dplyr and ggplot2, because I want to hone my expertise in R programming.

For data cleaning, I kept backup copies (the original csv files), checked the number of rows (no duplicates, at most 33 unique ids) and columns, deleted unneeded fields, filtered for unique values and blanks, cleaned field names, changed field data types as necessary, manipulated strings, checked for whitespace, and fixed dates and times using R. The cleaning process is documented in the code chunks and outputs below.

Installing and Loading Packages

# install.packages("tidyverse")
library(tidyverse)

# install.packages("dplyr")
library(dplyr)

# install.packages("ggplot2")
library(ggplot2)

# install.packages("tidyr")
library(tidyr)

# install.packages("skimr") # helpful for viewing data
library(skimr) 

# install.packages("janitor") # helpful for cleaning data
library(janitor) 

# install.packages("lubridate")
library(lubridate)

# install.packages("psych") # for generating summary tables
library(psych)

Importing datasets of interest

I am using the FitBit Fitness Tracker dataset, including the Daily Activity and Weight Log data. The Daily Activity dataset contains automatically recorded device data, including activity strenuousness (very active, moderate, light) and duration, number of steps, distance traveled, and number of calories burned, for 33 participants over 31 days. Users could also manually log activities on the device, and this data is included in the Daily Activity dataset. The Weight Log dataset contains weight, BMI and percent body fat over the same 31 days, but only for a subset of the users (8 of the 33 have any weight entries). These values are either automatically recorded by the device or app (presumably via a smart scale) or manually logged by the user. These two datasets will be merged into one dataset for analyses and visualization.

daily_activity <- read.csv("DailyActivity_merged.csv") 

weight_log <- read.csv("weightLogInfo_merged.csv") 

Exploring dataset formats

skim_without_charts(daily_activity)
Data summary
Name daily_activity
Number of rows 940
Number of columns 15
_______________________
Column type frequency:
character 1
numeric 14
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ActivityDate 0 1 8 9 0 31 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
Id 0 1 4.855407e+09 2.424805e+09 1503960366 2.320127e+09 4.445115e+09 6.962181e+09 8.877689e+09
TotalSteps 0 1 7.637910e+03 5.087150e+03 0 3.789750e+03 7.405500e+03 1.072700e+04 3.601900e+04
TotalDistance 0 1 5.490000e+00 3.920000e+00 0 2.620000e+00 5.240000e+00 7.710000e+00 2.803000e+01
TrackerDistance 0 1 5.480000e+00 3.910000e+00 0 2.620000e+00 5.240000e+00 7.710000e+00 2.803000e+01
LoggedActivitiesDistance 0 1 1.100000e-01 6.200000e-01 0 0.000000e+00 0.000000e+00 0.000000e+00 4.940000e+00
VeryActiveDistance 0 1 1.500000e+00 2.660000e+00 0 0.000000e+00 2.100000e-01 2.050000e+00 2.192000e+01
ModeratelyActiveDistance 0 1 5.700000e-01 8.800000e-01 0 0.000000e+00 2.400000e-01 8.000000e-01 6.480000e+00
LightActiveDistance 0 1 3.340000e+00 2.040000e+00 0 1.950000e+00 3.360000e+00 4.780000e+00 1.071000e+01
SedentaryActiveDistance 0 1 0.000000e+00 1.000000e-02 0 0.000000e+00 0.000000e+00 0.000000e+00 1.100000e-01
VeryActiveMinutes 0 1 2.116000e+01 3.284000e+01 0 0.000000e+00 4.000000e+00 3.200000e+01 2.100000e+02
FairlyActiveMinutes 0 1 1.356000e+01 1.999000e+01 0 0.000000e+00 6.000000e+00 1.900000e+01 1.430000e+02
LightlyActiveMinutes 0 1 1.928100e+02 1.091700e+02 0 1.270000e+02 1.990000e+02 2.640000e+02 5.180000e+02
SedentaryMinutes 0 1 9.912100e+02 3.012700e+02 0 7.297500e+02 1.057500e+03 1.229500e+03 1.440000e+03
Calories 0 1 2.303610e+03 7.181700e+02 0 1.828500e+03 2.134000e+03 2.793250e+03 4.900000e+03
head(daily_activity)
Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
1503960366 4/12/2016 13162 8.50 8.50 0 1.88 0.55 6.06 0 25 13 328 728 1985
1503960366 4/13/2016 10735 6.97 6.97 0 1.57 0.69 4.71 0 21 19 217 776 1797
1503960366 4/14/2016 10460 6.74 6.74 0 2.44 0.40 3.91 0 30 11 181 1218 1776
1503960366 4/15/2016 9762 6.28 6.28 0 2.14 1.26 2.83 0 29 34 209 726 1745
1503960366 4/16/2016 12669 8.16 8.16 0 2.71 0.41 5.04 0 36 10 221 773 1863
1503960366 4/17/2016 9705 6.48 6.48 0 3.19 0.78 2.51 0 38 20 164 539 1728

Notes about “daily activity” data based on outputs:

  • 940 rows, 15 columns
  • long format: user id and date = one observation (33 user ids x 31 days = 1023 possible observations, but there are only 940 rows, so not all 33 users have data for all 31 dates)
  • wide format for other variables
  • no group variables
  • no cells with missing values, 31 unique dates, no white space
  • activity date is character type, needs to be changed to date format
  • id data type is numeric, needs to be changed to character
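The gap between 1023 possible observations and 940 actual rows can be traced to individual users by counting recorded days per id. The following is a minimal base R sketch on a toy table. The ids and dates are hypothetical, and the same idea would apply to the Id and ActivityDate columns.

```r
# Toy daily-activity table (hypothetical ids and dates).
toy <- data.frame(
  id   = c("u1", "u1", "u1", "u2", "u2"),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-14",
                   "2016-04-12", "2016-04-13"))
)

# Number of recorded days per user.
days_per_user <- table(toy$id)

# Users with fewer recorded days than there are distinct dates in the table.
n_dates <- length(unique(toy$date))
incomplete <- names(days_per_user)[days_per_user < n_dates]

days_per_user
incomplete
```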
skim_without_charts(weight_log)
Data summary
Name weight_log
Number of rows 67
Number of columns 8
_______________________
Column type frequency:
character 2
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Date 0 1 19 21 0 56 0
IsManualReport 0 1 4 5 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
Id 0 1.00 7.009282e+09 1.950322e+09 1.503960e+09 6.962181e+09 6.962181e+09 8.877689e+09 8.877689e+09
WeightKg 0 1.00 7.204000e+01 1.392000e+01 5.260000e+01 6.140000e+01 6.250000e+01 8.505000e+01 1.335000e+02
WeightPounds 0 1.00 1.588100e+02 3.070000e+01 1.159600e+02 1.353600e+02 1.377900e+02 1.875000e+02 2.943200e+02
Fat 65 0.03 2.350000e+01 2.120000e+00 2.200000e+01 2.275000e+01 2.350000e+01 2.425000e+01 2.500000e+01
BMI 0 1.00 2.519000e+01 3.070000e+00 2.145000e+01 2.396000e+01 2.439000e+01 2.556000e+01 4.754000e+01
LogId 0 1.00 1.461772e+12 7.829948e+08 1.460444e+12 1.461079e+12 1.461802e+12 1.462375e+12 1.463098e+12
head(weight_log)
Id Date WeightKg WeightPounds Fat BMI IsManualReport LogId
1503960366 5/2/2016 11:59:59 PM 52.6 115.9631 22 22.65 True 1.462234e+12
1503960366 5/3/2016 11:59:59 PM 52.6 115.9631 NA 22.65 True 1.462320e+12
1927972279 4/13/2016 1:08:52 AM 133.5 294.3171 NA 47.54 False 1.460510e+12
2873212765 4/21/2016 11:59:59 PM 56.7 125.0021 NA 21.45 True 1.461283e+12
2873212765 5/12/2016 11:59:59 PM 57.3 126.3249 NA 21.69 True 1.463098e+12
4319703577 4/17/2016 11:59:59 PM 72.4 159.6147 25 27.45 True 1.460938e+12

Notes about “weight log” data based on outputs:

  • 67 rows, 8 columns
  • no group variables
  • body fat is missing 65 of 67 data points; “is manual report” has 2 unique values (true or false); no whitespace
  • date is character type (a date-time string), needs to be changed to date format
  • id data type is numeric, needs to be changed to character
  • NA = missing value

Changing Data Types

daily_activity_2 <- daily_activity %>% 
  mutate(date = mdy(ActivityDate)) %>%
  select(-ActivityDate)

str(daily_activity_2$date) # confirm date format
##  Date[1:940], format: "2016-04-12" "2016-04-13" "2016-04-14" "2016-04-15" "2016-04-16" ...
weight_log_2 <- weight_log %>% 
  mutate(date = as_date(Date, format = "%m/%d/%Y %I:%M:%S %p")) %>% 
  select(-Date)

str(weight_log_2$date) # confirm date format
##  Date[1:67], format: "2016-05-02" "2016-05-03" "2016-04-13" "2016-04-21" "2016-05-12" ...
daily_activity_2$Id <- as.character(daily_activity_2$Id)

str(daily_activity_2$Id) # confirm character format
##  chr [1:940] "1503960366" "1503960366" "1503960366" "1503960366" ...
weight_log_2$Id <- as.character(weight_log_2$Id)

str(weight_log_2$Id)# confirm character format
##  chr [1:67] "1503960366" "1503960366" "1927972279" "2873212765" ...

Checking for Duplicates

n_distinct(daily_activity_2$Id)
## [1] 33

Notes based on output: 33 unique participants, as expected

n_distinct(weight_log_2$Id)
## [1] 8

Notes based on output: only 8 unique participants have weight log data. 8 users x 31 possible dates = 248 possible observations, but the table has only 67 rows, so these participants do not have weight log data for every date.

sum(duplicated(daily_activity_2))
## [1] 0
sum(duplicated(weight_log_2))
## [1] 0

Notes based on outputs: There are no duplicate rows in either dataset and no duplicates need to be removed.
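A whole-row check can miss repeated id-date pairs whose measurements differ. As a complementary sketch, and not something the original analysis ran, the toy example below counts duplicates on the key columns only. The data frame and its values are made up.

```r
# Toy table in which one id-date pair appears twice with different step counts.
toy <- data.frame(
  id    = c("u1", "u1", "u2", "u2"),
  date  = as.Date(c("2016-04-12", "2016-04-12", "2016-04-12", "2016-04-13")),
  steps = c(5000L, 5200L, 7000L, 8000L)
)

# Whole-row duplicates: none here, because the steps values differ.
full_row_dups <- sum(duplicated(toy))

# Key-level duplicates: the second u1 / 2016-04-12 row repeats an id-date pair.
key_dups <- sum(duplicated(toy[, c("id", "date")]))

full_row_dups
key_dups
```

If the key-level count is zero for both tables, each id-date pair identifies at most one row, which is what the later merge on id and date relies on.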

Cleaning Field Names

daily_activity_clean <- clean_names(daily_activity_2)
glimpse(daily_activity_clean)
## Rows: 940
## Columns: 15
## $ id                         <chr> "1503960366", "1503960366", "1503960366", "…
## $ total_steps                <int> 13162, 10735, 10460, 9762, 12669, 9705, 130…
## $ total_distance             <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9…
## $ tracker_distance           <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9…
## $ logged_activities_distance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_distance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3…
## $ moderately_active_distance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1…
## $ light_active_distance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5…
## $ sedentary_active_distance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_minutes        <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66,…
## $ fairly_active_minutes      <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, …
## $ lightly_active_minutes     <int> 328, 217, 181, 209, 221, 164, 233, 264, 205…
## $ sedentary_minutes          <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 8…
## $ calories                   <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2…
## $ date                       <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-0…
weight_log_clean <- clean_names(weight_log_2)
glimpse(weight_log_clean)
## Rows: 67
## Columns: 8
## $ id               <chr> "1503960366", "1503960366", "1927972279", "2873212765…
## $ weight_kg        <dbl> 52.6, 52.6, 133.5, 56.7, 57.3, 72.4, 72.3, 69.7, 70.3…
## $ weight_pounds    <dbl> 115.9631, 115.9631, 294.3171, 125.0021, 126.3249, 159…
## $ fat              <int> 22, NA, NA, NA, NA, 25, NA, NA, NA, NA, NA, NA, NA, N…
## $ bmi              <dbl> 22.65, 22.65, 47.54, 21.45, 21.69, 27.45, 27.38, 27.2…
## $ is_manual_report <chr> "True", "True", "False", "True", "True", "True", "Tru…
## $ log_id           <dbl> 1.462234e+12, 1.462320e+12, 1.460510e+12, 1.461283e+1…
## $ date             <date> 2016-05-02, 2016-05-03, 2016-04-13, 2016-04-21, 2016…

Filtering and Sorting Data

daily_activity_clean %>% 
  select(total_steps, 
         total_distance, 
         tracker_distance, 
         logged_activities_distance, 
         very_active_distance, 
         moderately_active_distance, 
         light_active_distance, 
         sedentary_active_distance, 
         calories) %>% 
  summary()
##   total_steps    total_distance   tracker_distance logged_activities_distance
##  Min.   :    0   Min.   : 0.000   Min.   : 0.000   Min.   :0.0000            
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 2.620   1st Qu.:0.0000            
##  Median : 7406   Median : 5.245   Median : 5.245   Median :0.0000            
##  Mean   : 7638   Mean   : 5.490   Mean   : 5.475   Mean   :0.1082            
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.: 7.710   3rd Qu.:0.0000            
##  Max.   :36019   Max.   :28.030   Max.   :28.030   Max.   :4.9421            
##  very_active_distance moderately_active_distance light_active_distance
##  Min.   : 0.000       Min.   :0.0000             Min.   : 0.000       
##  1st Qu.: 0.000       1st Qu.:0.0000             1st Qu.: 1.945       
##  Median : 0.210       Median :0.2400             Median : 3.365       
##  Mean   : 1.503       Mean   :0.5675             Mean   : 3.341       
##  3rd Qu.: 2.053       3rd Qu.:0.8000             3rd Qu.: 4.782       
##  Max.   :21.920       Max.   :6.4800             Max.   :10.710       
##  sedentary_active_distance    calories   
##  Min.   :0.000000          Min.   :   0  
##  1st Qu.:0.000000          1st Qu.:1828  
##  Median :0.000000          Median :2134  
##  Mean   :0.001606          Mean   :2304  
##  3rd Qu.:0.000000          3rd Qu.:2793  
##  Max.   :0.110000          Max.   :4900

Notes based on output:

  • A max of 36019 total steps seems very high but possible. It coincides with the max total/tracker distance of ~28 km in a day, so the step and distance fields agree with each other.
  • For sedentary minutes, the max is 1440, or 24 hr. Will check to see if this is an outlier.
  • Note that logged activity distance is very low (mean = 0.1082 km) due to low use by users (most values = 0)
  • All other fields check out (reasonable min, max, agreement between each other)
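The agreement between steps and distance can be checked with simple arithmetic: dividing distance by steps gives the implied distance per step. The sketch below uses the maximum values from the summary above and the first row of the head() output. It assumes, as the rest of this report does, that the distance fields are in kilometres.

```r
# Maximum day: values taken from the summary output above.
max_steps   <- 36019
max_dist_km <- 28.03

# First row of head(daily_activity), for comparison.
first_steps   <- 13162
first_dist_km <- 8.50

# Implied distance per step, in metres (assumes distances are in km).
max_step_m   <- max_dist_km * 1000 / max_steps
first_step_m <- first_dist_km * 1000 / first_steps

round(c(max_day = max_step_m, first_row = first_step_m), 2)
```

Both results are well under one metre per step, which is plausible for walking or running, so the maximum looks internally consistent rather than like a data-entry error.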
weight_log_clean %>% 
  select(weight_pounds, 
         fat, 
         bmi, 
         is_manual_report) %>% 
  summary()
##  weight_pounds        fat             bmi        is_manual_report  
##  Min.   :116.0   Min.   :22.00   Min.   :21.45   Length:67         
##  1st Qu.:135.4   1st Qu.:22.75   1st Qu.:23.96   Class :character  
##  Median :137.8   Median :23.50   Median :24.39   Mode  :character  
##  Mean   :158.8   Mean   :23.50   Mean   :25.19                     
##  3rd Qu.:187.5   3rd Qu.:24.25   3rd Qu.:25.56                     
##  Max.   :294.3   Max.   :25.00   Max.   :47.54                     
##                  NA's   :65

Notes based on output:

  • all fields check out (reasonable min, max, agreement between each other)
  • NA = missing data
weight_log_clean$fat
##  [1] 22 NA NA NA NA 25 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Notes based on output: only two non-NA values for fat percentage. Will not keep this variable or use it in analyses.

# determine if high "total_steps" values are outliers or possible errors

daily_act_clean_steps_test1 <- daily_activity_clean %>% 
  filter(total_steps > 10000) %>%  # roughly the top quartile (75th percentile is 10727 steps)
  select(id, 
         total_steps, 
         total_distance, 
         tracker_distance) %>%
  slice_max(order_by = total_steps, n = 10) %>% 
  arrange(desc(total_steps))

daily_act_clean_steps_test1
id total_steps total_distance tracker_distance
1624580081 36019 28.03 28.03
8877689391 29326 25.29 25.29
8877689391 27745 26.72 26.72
8877689391 23629 20.65 20.65
8877689391 23186 20.40 20.40
8053475328 22988 17.95 17.95
4388161847 22770 17.54 17.54
8053475328 22359 17.19 17.19
2347167796 22244 15.08 15.08
8053475328 22026 17.65 17.65
daily_act_clean_steps_test2 <- daily_activity_clean %>% 
  filter(id == "1624580081") %>%  # id is a character field after the type change above
  select(total_steps, 
         total_distance,
         tracker_distance) %>% 
  slice_max(order_by = total_steps, n = 10) %>% 
  arrange(desc(total_steps))

daily_act_clean_steps_test2
total_steps total_distance tracker_distance
36019 28.03 28.03
10536 7.41 7.41
9107 5.92 5.92
8538 5.55 5.55
8367 5.44 5.44
8163 5.31 5.31
7155 4.93 4.93
7007 4.55 4.55
6497 4.22 4.22
6474 4.30 4.30

Notes based on outputs:

  • While 36019 steps is high, other users also had high values (the next highest is 29326 steps, from a different user).
  • For the user with 36019 steps, the next-highest day was 10536 steps, so this value is unusual for that user as well as for the dataset.
  • It agrees with the recorded distance, however, and with no reason to think these data are invalid, this user’s data will remain in the analysis.
# determine if "sedentary" values are outliers or possible errors

daily_act_clean_sed_test <- daily_activity_clean %>% 
  filter(sedentary_minutes > 1229) %>%  # 1229 min is 75th percentile 
  select(id,
         total_steps,
         total_distance,
         tracker_distance,
         sedentary_active_distance,
         sedentary_minutes) %>% 
  slice_max(order_by = total_steps, n = 10) %>%  # 10 highest-step days among the high-sedentary rows
  arrange(desc(sedentary_minutes))

daily_act_clean_sed_test
id total_steps total_distance tracker_distance sedentary_active_distance sedentary_minutes
8583815059 12015 9.37 9.37 0 1440
4388161847 10122 7.78 7.78 0 1440
8583815059 12427 9.69 9.69 0 1370
8253242879 10232 8.18 8.18 0 1286
4388161847 10993 8.45 8.45 0 1275
8053475328 10520 8.29 8.29 0 1260
8053475328 14549 11.11 11.11 0 1255
8053475328 13953 11.00 11.00 0 1245
8253242879 10204 7.91 7.91 0 1237
2022484408 10100 7.09 7.09 0 1237

Notes based on output: There are several observations where sedentary minutes equal or approach 1440 (24 hr), yet some of these same observations also have a high number of steps. Because this looks like a recording error or inaccuracy, sedentary_minutes will not be included in further analyses, and neither will sedentary_active_distance (renamed sedentary_active_distance_km below).
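One way to make this inconsistency explicit is to flag days that are logged as fully sedentary but still report steps, and to total the four minute fields for each day. This is a sketch on made-up rows that reuses the cleaned field names. The derived columns tracked_minutes and suspect are hypothetical helpers, not part of the dataset.

```r
# Toy rows using the cleaned field names (all values are made up).
toy <- data.frame(
  total_steps            = c(12015L, 9000L, 0L),
  very_active_minutes    = c(0L, 30L, 0L),
  fairly_active_minutes  = c(0L, 20L, 0L),
  lightly_active_minutes = c(0L, 200L, 0L),
  sedentary_minutes      = c(1440L, 700L, 1440L)
)

# Total minutes accounted for across the four intensity fields (a day has 1440).
toy$tracked_minutes <- with(toy, very_active_minutes + fairly_active_minutes +
                                 lightly_active_minutes + sedentary_minutes)

# Flag days logged as fully sedentary that nevertheless report steps.
toy$suspect <- toy$sedentary_minutes == 1440 & toy$total_steps > 0

toy[, c("total_steps", "sedentary_minutes", "tracked_minutes", "suspect")]
```

In this toy table only the first row is flagged. The third row is also fully sedentary but reports no steps, which is consistent with the device not being worn.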

Renaming field names for clarity

daily_activity_clean <- daily_activity_clean %>% 
  # append "_km" to every distance field (all of them end in "_distance")
  dplyr::rename_with(~ paste0(.x, "_km"), .cols = ends_with("_distance"))
# note: the dplyr:: prefix is needed for renaming here; without it, rename failed
# with "unused argument", probably because of a conflict with another package

daily_activity_clean <- daily_activity_clean %>% 
  dplyr::rename(calories_burned = calories)


colnames(daily_activity_clean) # confirm changes in field names
##  [1] "id"                            "total_steps"                  
##  [3] "total_distance_km"             "tracker_distance_km"          
##  [5] "logged_activities_distance_km" "very_active_distance_km"      
##  [7] "moderately_active_distance_km" "light_active_distance_km"     
##  [9] "sedentary_active_distance_km"  "very_active_minutes"          
## [11] "fairly_active_minutes"         "lightly_active_minutes"       
## [13] "sedentary_minutes"             "calories_burned"              
## [15] "date"
weight_log_clean <- weight_log_clean %>% 
  dplyr::rename(weight_lb = weight_pounds,
                BMI = bmi)

colnames(weight_log_clean) # confirm changes in field names
## [1] "id"               "weight_kg"        "weight_lb"        "fat"             
## [5] "BMI"              "is_manual_report" "log_id"           "date"

Merging Data

Because the daily activity table has far more observations than the weight log table, I use a left join so that every activity row is kept. The merge columns (id and date) must have identical names in both tables.

merged_result <- left_join(daily_activity_clean, 
                          weight_log_clean, 
                          by = c("id", "date"))

glimpse(merged_result)
## Rows: 940
## Columns: 21
## $ id                            <chr> "1503960366", "1503960366", "1503960366"…
## $ total_steps                   <int> 13162, 10735, 10460, 9762, 12669, 9705, …
## $ total_distance_km             <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59…
## $ tracker_distance_km           <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59…
## $ logged_activities_distance_km <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_distance_km       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25…
## $ moderately_active_distance_km <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64…
## $ light_active_distance_km      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71…
## $ sedentary_active_distance_km  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_minutes           <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, …
## $ fairly_active_minutes         <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 2…
## $ lightly_active_minutes        <int> 328, 217, 181, 209, 221, 164, 233, 264, …
## $ sedentary_minutes             <int> 728, 776, 1218, 726, 773, 539, 1149, 775…
## $ calories_burned               <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921…
## $ date                          <date> 2016-04-12, 2016-04-13, 2016-04-14, 201…
## $ weight_kg                     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ weight_lb                     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ fat                           <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ BMI                           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ is_manual_report              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ log_id                        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
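A left join keeps the row count of the left table only if each id-date pair appears at most once in the right table. The sketch below shows that check with base R's merge() on toy tables. The table names and values are hypothetical, and merge(..., all.x = TRUE) stands in for left_join().

```r
# Toy activity and weight tables (hypothetical values).
activity <- data.frame(
  id          = c("u1", "u1", "u2"),
  date        = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  total_steps = c(8000L, 9500L, 4000L)
)
weight <- data.frame(
  id        = "u1",
  date      = as.Date("2016-04-13"),
  weight_lb = 150
)

# The right-hand table must not repeat an id-date pair, or rows would multiply.
stopifnot(!any(duplicated(weight[, c("id", "date")])))

# Base R equivalent of a left join on id and date.
merged <- merge(activity, weight, by = c("id", "date"), all.x = TRUE)

nrow(merged) == nrow(activity)   # row count preserved
sum(!is.na(merged$weight_lb))    # number of activity rows with a weight match
```

In the glimpse output above the merged table still has 940 rows, which is consistent with no id-date pair being repeated in the weight log.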

Removing data fields that will not be included in analyses

merged_result <- merged_result %>% 
  select(-sedentary_active_distance_km, 
         -very_active_minutes,
         -fairly_active_minutes,
         -lightly_active_minutes,
         -sedentary_minutes, 
         -weight_kg,
         -fat,
         -log_id)

glimpse(merged_result)
## Rows: 940
## Columns: 13
## $ id                            <chr> "1503960366", "1503960366", "1503960366"…
## $ total_steps                   <int> 13162, 10735, 10460, 9762, 12669, 9705, …
## $ total_distance_km             <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59…
## $ tracker_distance_km           <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59…
## $ logged_activities_distance_km <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_distance_km       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25…
## $ moderately_active_distance_km <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64…
## $ light_active_distance_km      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71…
## $ calories_burned               <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921…
## $ date                          <date> 2016-04-12, 2016-04-13, 2016-04-14, 201…
## $ weight_lb                     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ BMI                           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ is_manual_report              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

Transforming factor level names for clarity

merged_result <- merged_result %>% 
  mutate(
    is_manual_report = fct_recode(as.factor(is_manual_report),
                                  Manual = "True",
                                  Device = "False")
  )

head(merged_result$is_manual_report)
## [1] <NA> <NA> <NA> <NA> <NA> <NA>
## Levels: Device Manual

Removing zero values: We assume that days with zero total distance traveled are equivalent to days when the device is not worn.

merged_result <- merged_result %>% 
  filter(total_distance_km > 0)
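
It helps to record how many rows a filter like this removes. In this report the row count goes from 940 (glimpse output above) to 862 (the n in the descriptive statistics below), so 78 rows were dropped. A minimal base R sketch of that bookkeeping on made-up data:

```r
# Toy table with two zero-distance days (values are made up).
toy <- data.frame(
  id                = c("u1", "u1", "u2", "u2"),
  total_distance_km = c(5.2, 0, 3.1, 0)
)

n_before <- nrow(toy)

# Keep only days with some recorded distance.
kept <- toy[toy$total_distance_km > 0, ]

n_removed <- n_before - nrow(kept)
n_removed
```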

Data Analyses and Visualization

Descriptive Statistics

describe(merged_result[2:9])
vars n mean sd median trimmed mad min max range skew kurtosis se
total_steps 1 862 8329.0394432 4739.2469470 8053.50 8051.5898551 4608.662100 8.00 36019.000000 36011.000000 0.8164355 1.7900671 161.4193916
total_distance_km 2 862 5.9864501 3.7176164 5.59 5.6758116 3.358089 0.01 28.030001 28.020001 1.3289480 3.9830234 0.1266225
tracker_distance_km 3 862 5.9708005 3.6997561 5.59 5.6654493 3.343263 0.01 28.030001 28.020001 1.3422628 4.1052666 0.1260142
logged_activities_distance_km 4 862 0.1179590 0.6464734 0.00 0.0000000 0.000000 0.00 4.942142 4.942142 5.9904088 37.1787329 0.0220190
very_active_distance_km 5 862 1.6386543 2.7363079 0.41 1.0151884 0.607866 0.00 21.920000 21.920000 2.8543068 10.8089856 0.0931990
moderately_active_distance_km 6 862 0.6188979 0.9053288 0.31 0.4291304 0.459606 0.00 6.480000 6.480000 2.6525254 9.2645756 0.0308356
light_active_distance_km 7 862 3.6431206 1.8544341 3.58 3.6111304 1.890315 0.00 10.710000 10.710000 0.3002367 0.1816482 0.0631623
calories_burned 8 862 2362.4709977 702.2695833 2220.50 2316.0362319 714.613200 52.00 4900.000000 4848.000000 0.5474199 0.2092724 23.9193969
describe(merged_result[11:12])
vars n mean sd median trimmed mad min max range skew kurtosis se
weight_lb 1 67 158.81180 30.695415 137.7889 157.06533 21.899432 115.9631 294.3171 178.354 1.308951 3.299141 3.7500419
BMI 2 67 25.18522 3.066963 24.3900 24.83964 1.363992 21.4500 47.5400 26.090 5.734248 39.243981 0.3746891

Exploring Relationships between Variables

# is the number of total steps related to total distance or calories burned?


total_steps_and_total_distance_p <- merged_result %>% 
  filter(total_distance_km > 0.00,
         total_steps > 0.00) %>% 
  ggplot(., aes(x = total_steps, 
                          y = total_distance_km)) + 
  geom_point() + 
  geom_smooth() + 
  theme_classic() + 
  labs(x = "Total Steps",
       y = "Total Distance (km)",
       title = "Total Steps and Distance Traveled",
       caption = "FitBit Fitness Tracker Data")

  
total_steps_and_total_distance_p

Total Steps and Distance Traveled. The higher the number of total steps, the greater the distance traveled, as recorded by the device. The relationship is strongly positive, which is a reassuring sign that the device is reliable, since these two variables should be closely related. Each data point is one user’s record for one day.
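
The strength of a relationship like this can be quantified with a correlation coefficient rather than judged from the plot alone. The sketch below uses made-up step and distance values. On the real data the same call could be applied to total_steps and total_distance_km, and no value from the real data is claimed here.

```r
# Toy values: distance rises almost linearly with steps (made up).
toy <- data.frame(
  total_steps       = c(2000, 5000, 8000, 11000, 14000),
  total_distance_km = c(1.4, 3.6, 5.5, 7.9, 10.1)
)

# Pearson measures linear association; Spearman measures monotonic association.
r_pearson  <- cor(toy$total_steps, toy$total_distance_km, method = "pearson")
r_spearman <- cor(toy$total_steps, toy$total_distance_km, method = "spearman")

round(c(pearson = r_pearson, spearman = r_spearman), 3)
```

Because the observations are repeated days from the same users, they are not independent, so any p-value from a test such as cor.test() would need cautious interpretation.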

total_dist_and_calories_p <- merged_result %>% 
  filter(total_distance_km > 0.00,
         calories_burned > 0.00) %>% 
  ggplot(., aes(x = total_distance_km,
                          y = calories_burned)) +
  geom_point() +
  geom_smooth() + 
  theme_classic() + 
  labs(x = "Total Distance (km)",
       y = "Calories Burned",
       title = "Total Distance and Calories Burned",
       caption = "FitBit Fitness Tracker Data")


total_dist_and_calories_p

Total Distance and Calories Burned. The greater the total distance, the more calories burned, as recorded by the device. This is a strong positive correlation and a reassuring sign that the device is working as it should, since these two variables should be related. The correlation is not as strong as the previous one shown (total steps and distance traveled) because the device’s method of deriving number of calories burned is probably more complicated and may be dependent on user characteristics such as gender and weight. Alternatively, the device may not derive calories burned as accurately as it does distance. The data points reflect each user’s data recorded on each day of use.

# are calories burned related to very active, moderately active or light active distance?


cal_and_very_active_dist_p <- merged_result %>% 
  filter(very_active_distance_km > 0.00,
         calories_burned > 0.00) %>%  
  ggplot(., aes(x = very_active_distance_km, 
                          y = calories_burned )) + 
  geom_point()  + 
  geom_smooth() + 
  theme_classic() + 
  labs(x = "Very Active Distance (km)",
       y = "Calories Burned",
       title = "Very Active Distance and Calories Burned",
       caption = "FitBit Fitness Tracker Data")

  
cal_and_very_active_dist_p

Calories Burned and Very Active Distance. Overall, these data indicate that the number of calories burned increases with higher levels of “very active” activity. Even at lower levels, however, “very active” activity was associated with a high number of calories burned (~2500). Interestingly, the change in number of calories burned does not seem to be appreciable until “very active” activity accounts for at least 3 km of distance. The data points reflect each user’s data recorded on each day of use.
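
The visual impression of a threshold near 3 km could be checked by comparing calories below and above that cut-off. The sketch below does this on made-up values. The 3 km cut-off comes from the paragraph above, and the band column is a hypothetical helper.

```r
# Toy values (made up): calories stay flat below 3 km, then rise.
toy <- data.frame(
  very_active_distance_km = c(0.5, 1.0, 2.5, 3.5, 5.0, 8.0),
  calories_burned         = c(2400, 2500, 2450, 2900, 3200, 3600)
)

# Split days at the 3 km cut-off suggested by the plot.
toy$band <- ifelse(toy$very_active_distance_km >= 3, "3 km or more", "under 3 km")

# Median calories burned within each band.
medians <- tapply(toy$calories_burned, toy$band, median)
medians
```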

cal_and_mod_active_dist_p <- merged_result %>% 
  filter(moderately_active_distance_km > 0.00,
         calories_burned > 0.00) %>% 
  ggplot(., aes(x = moderately_active_distance_km, 
                          y = calories_burned)) + 
  geom_point() + 
  geom_smooth() + 
  theme_classic() + 
  labs(x = "Moderately Active Distance (km)",
       y = "Calories Burned",
       title = "Moderately Active Distance and Calories Burned",
       caption = "FitBit Fitness Tracker Data")


  
cal_and_mod_active_dist_p

Calories Burned and Moderately Active Distance. These data indicate that changes in “moderate” activity levels do not relate to changes in the number of calories burned. “Moderate activity” was associated with ~2500 calories burned regardless of how much distance it accounted for. The data points reflect each user’s data recorded on each day of use.

cal_and_light_active_dist_p <- merged_result %>% 
  filter(light_active_distance_km > 0.00,
         calories_burned > 0.00) %>%  
  ggplot(., aes(x = light_active_distance_km, 
                          y =  calories_burned)) + 
  geom_point() + 
  geom_smooth() + 
  theme_classic() + 
  labs(x = "Light Active Distance (km)",
       y = "Calories Burned",
       title = "Light Active Distance and Calories Burned",
       caption = "FitBit Fitness Tracker Data")


  
cal_and_light_active_dist_p

Calories Burned and Light Active Distance. These data indicate that “light activity” relates to the number of calories burned. At the lower end of “light activity”, fewer than 2000 calories are burned. With increasing “light activity”, the number of calories burned modestly but steadily increases. The data points reflect each user’s data recorded on each day of use.
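
To compare the three intensity levels side by side, the correlation with calories burned can be computed for each distance field in one step. This sketch uses made-up values with the cleaned field names, and it does not reproduce results from the real data.

```r
# Toy values using the cleaned field names (all numbers are made up).
toy <- data.frame(
  calories_burned               = c(1800, 2000, 2200, 2400, 2600),
  light_active_distance_km      = c(1, 2, 3, 4, 5),
  moderately_active_distance_km = c(0.5, 0.2, 0.6, 0.3, 0.5),
  very_active_distance_km       = c(0, 0, 0.5, 2, 4)
)

dist_cols <- c("light_active_distance_km",
               "moderately_active_distance_km",
               "very_active_distance_km")

# Pearson correlation of each distance field with calories burned.
cors <- sapply(toy[dist_cols], function(x) cor(x, toy$calories_burned))
round(cors, 2)
```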


Final Report

Key Findings

  • The majority of smart device users tend to forgo manual options, including the “logged activities” and weight log (weight, BMI and body fat percentage) features.

  • Users who weigh more or have higher BMI may be more likely to have smart scales that connect to the device, making manual logging of weight and BMI unnecessary. There were only two data points for body fat percentage, both in manually logged entries, suggesting that body fat is rarely tracked and that many smart scales may not yet measure it.

  • Like “very active” activity, increases in “light activity” are associated with increases in number of calories burned. Although causality cannot be inferred, it may be encouraging to users that activity does not have to be strenuous for it to be associated with burning calories, although longer distances may be required to reach the number of calories burned at shorter distances of “very active” activity.
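
The first finding can be backed with simple proportions: the share of days with any logged activity distance and the share of days with a weight entry. In this report 67 of the 862 retained days have weight data, roughly 8%. The sketch below shows the calculation on made-up rows that reuse the merged field names.

```r
# Toy merged rows (made up); NA means no weight entry for that day.
toy <- data.frame(
  logged_activities_distance_km = c(0, 0, 0, 2.1, 0),
  is_manual_report              = factor(c(NA, "Manual", NA, "Device", NA))
)

# Share of days with any manually logged activity distance.
share_logged <- mean(toy$logged_activities_distance_km > 0)

# Share of days with any weight entry, manual or device.
share_weight <- mean(!is.na(toy$is_manual_report))

# Breakdown of weight entries by source, including days without an entry.
table(toy$is_manual_report, useNA = "ifany")

c(logged = share_logged, weight = share_weight)
```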

Recommendations for Bellabeat Marketing Strategy Based on Key Findings

  • According to the above analysis, users are unlikely to use manual input features. Focus and invest more in automated recording features in smart devices. Bellabeat is already doing a good job of focusing on three smart devices including Leaf, Time and Spring, which do not require manual input. However, the app has features that require manual input including menstrual cycle and mindfulness. Focus less time and expense on these manual input features.

  • Smart scales enable automated recording of weight and BMI, eliminating the need for manual logging. It may be helpful for marketing to emphasize the ability of the smart device to connect with smart scales and thereby increase use of the device-associated weight log. This pairing would allow users to easily track not only activity and calories burned but also changes in weight and BMI over time. With this extra data, users may more easily develop personalized strategies to reach their goals.

  • If the findings above are supported by further, larger studies of smart device usage, marketing may focus on the concept that increasing engagement in even light activities is associated with increases in calories burned. This concept may be encouraging to users who do not have the ability, time or equipment to perform more strenuous activities. The device makes it easy to monitor strenuousness, distance and number of calories burned and change activity levels or duration as necessary to meet their goals.

  • On the flip side, engaging in even low levels of “very active” activity is associated with burning a high number of calories. Marketing should also focus on emphasizing the time-saving aspect of “very active” activity. Again, the device makes it easy to monitor magnitude and duration of activity and adjust as needed to meet their goals in terms of burning calories.

  • Next steps. As mentioned by the CCO, further research on fitness-related smart devices should be done using a more current, larger, and more comprehensive dataset. The current dataset has limitations due to small sample size, lack of demographic information, and a limited window of one month of data collection in 2016. The co-founders may want to wait for the marketing analytics team to assess whether the current findings hold up in a separate dataset before making costly decisions. It is important that these decisions are based on findings that extend to Bellabeat’s female customers and to extended use of smart devices.