BellaBeat_Case_Study_Google_DA_Certificate

Mohamed Fergany Omran

3/19/2022

# Before we begin, load all the required libraries
library(rmarkdown)
library(tidyverse)
library(lubridate)
library(janitor)
library(here)
library(skimr)
library(scales)

Introduction

The chief executive officer of Bellabeat has asked you to focus on one Bellabeat product and analyze smart device usage data to gain insight into how consumers use non-Bellabeat smart devices. She then wants you to select one Bellabeat product to apply these insights to in your presentation.

Based on the information above, our business task is now clear: use usage data from a non-Bellabeat yet similar product. In our case we will use data from consumers of Fitbit, an American consumer electronics and fitness company owned by Google. It’s considered one of the largest companies in that market. According to Wikipedia, Fitbit has more than 29 million active users in its community and has sold more than 120 million devices. Its products are sold in 39,000 retail stores and in over 100 countries, and its revenue was $1.13 billion in 2020. Thus, choosing data from such a company to inform the marketing strategy for Bellabeat products is a sound decision.

Ask

Our main business task is to use these data to gain insights about:

  1. Why consumers selected that company to get their smart devices from

  2. What information those devices provide to them

  3. How can Bellabeat use these data to improve their marketing strategy or even improve their product features

Prepare

For this project we shall use the Fitbit dataset from Kaggle. You may want to look at the metadata for easier interpretation, and at how the tracker actually works. According to the description:

This dataset was generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016 - 05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.

It consists of 18 csv files with data such as dates, times, duration of workouts, amounts of burned calories, number of sleeping hours and body weight info. As mentioned previously, this dataset can be downloaded from Kaggle and I saved a backup copy on my local device. Some data files are stored in wide format and some are stored in the long one. Some files are even stored in both formats.

It’s really important to note here that the CEO, who is our main stakeholder in this case, says that this data set might have some limitations and encourages us to consider adding other data to help address those limitations as we begin to work more with this data.

  1. Regarding the reliability of the source, it’s important to mention that Kaggle doesn’t impose strict conditions to ensure that uploaded datasets meet specific criteria. Personally, I’ve uploaded multiple datasets to Kaggle without having to answer any questions about their original source. The dataset uploader, however, states that the source is the Zenodo website, which in turn states that the owner of the dataset is RTI, an independent, nonprofit institute that provides research, development, and technical services to government and commercial clients worldwide. Hence we may say that the source of this dataset is reliable and the dataset is cited
  2. We didn’t collect this dataset ourselves, so it’s second-party data
  3. This dataset was collected and uploaded about six years ago. Later in this analysis we shall explore it and decide whether it contains all the information we need or whether we have to look for other sources
  4. Although the Fitbit tracker is sold and used worldwide, this data was collected via Amazon Mechanical Turk, where only participants from specific countries are eligible to take the survey, and this raises concerns about sampling bias. Other than that, I can’t see any other type of bias in the data
  5. The dataset owners made it an open access type. It’s available for download, copy and use by anyone for free
  6. Regarding ownership, according to the dataset owners, thirty eligible Fitbit users consented to the submission of personal tracker data. However, the description provides no additional information about transaction transparency, so we don’t know whether the participants were aware of exactly how their data would be processed
  7. IDs were used instead of participants’ names. Also, no personally identifiable information (location, contact info, medical records, etc.) is found in the files
  8. In addition to the sampling bias, the sample is quite small compared to the number of customers. These two are the main problems with this dataset that have appeared so far

Process

We shall use R in both RStudio and Kaggle for this case study because I see R as a more powerful tool than spreadsheets. Later we may repeat the case study with SQL and compare the two tools. The dataset has 18 csv files, so let’s explore and clean the dataset step by step.

1) Daily Activity

# Step_1
# First modify the dataframe columns names to a more standard format
dailyActivity_merged = read_csv(
  "case_study/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv"
)

dailyActivity_merged = clean_names(dailyActivity_merged)

# Running the below code, we can see we have 33 id/person
dailyActivity_merged %>%
  count(id)
## # A tibble: 33 × 2
##            id     n
##         <dbl> <int>
##  1 1503960366    31
##  2 1624580081    31
##  3 1644430081    30
##  4 1844505072    31
##  5 1927972279    31
##  6 2022484408    31
##  7 2026352035    31
##  8 2320127002    31
##  9 2347167796    18
## 10 2873212765    31
## # … with 23 more rows

As you can see, there are 33 distinct IDs, although, as mentioned earlier, we should only have thirty. Could some users have more than one device? As shown below, on some days all 33 IDs have records on the same day, so we shall treat them as 33 participants until we find strong evidence otherwise. Another important note: although the description says the study ran from 3/12/2016 to 5/12/2016 (62 days), the first day in the dataset is 4/12/2016, so the analysis covers only 31 days.

dailyActivity_merged %>%
  group_by(activity_date) %>%
  summarize(no_of_participants = n_distinct(id))
## # A tibble: 31 × 2
##    activity_date no_of_participants
##    <chr>                      <int>
##  1 4/12/2016                     33
##  2 4/13/2016                     33
##  3 4/14/2016                     33
##  4 4/15/2016                     33
##  5 4/16/2016                     32
##  6 4/17/2016                     32
##  7 4/18/2016                     32
##  8 4/19/2016                     32
##  9 4/20/2016                     32
## 10 4/21/2016                     32
## # … with 21 more rows
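The day counts quoted above can be double-checked with a little date arithmetic. The following is only a minimal sketch on a made-up data frame (`toy_daily` and its values are hypothetical and merely stand in for the real file); it assumes the dates are in month/day/year order, as in the output above.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data standing in for dailyActivity_merged (not the real file)
toy_daily <- data.frame(
  id = c(1, 1, 2, 2, 3),
  activity_date = c("4/12/2016", "4/13/2016", "4/12/2016", "5/12/2016", "4/12/2016")
)

toy_summary <- toy_daily %>%
  mutate(activity_date = mdy(activity_date)) %>%
  summarize(
    n_participants = n_distinct(id),
    first_day = min(activity_date),
    last_day = max(activity_date),
    # Inclusive day count, hence the + 1
    n_days = as.integer(last_day - first_day) + 1L
  )

# Length of the window stated in the dataset description (3/12/2016 to 5/12/2016)
described_days <- as.integer(mdy("5/12/2016") - mdy("3/12/2016")) + 1L
```

Running the same `summarize()` call on the real dataframe would report the observed window next to the number of distinct IDs, which makes the gap between the described 62 days and the observed 31 days easy to see.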
# Step_2
# Remove any duplicate entries if exist
dailyActivity_merged = unique(dailyActivity_merged)
# Step_3
# Check the data types of each column and modify if necessary
str(dailyActivity_merged)
## tibble [940 × 15] (S3: tbl_df/tbl/data.frame)
##  $ id                        : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ activity_date             : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ total_steps               : num [1:940] 13162 10735 10460 9762 12669 ...
##  $ total_distance            : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ tracker_distance          : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ logged_activities_distance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ very_active_distance      : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
##  $ moderately_active_distance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
##  $ light_active_distance     : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
##  $ sedentary_active_distance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ very_active_minutes       : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
##  $ fairly_active_minutes     : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
##  $ lightly_active_minutes    : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
##  $ sedentary_minutes         : num [1:940] 728 776 1218 726 773 ...
##  $ calories                  : num [1:940] 1985 1797 1776 1745 1863 ...
dailyActivity_merged$activity_date = mdy(dailyActivity_merged$activity_date)
# Step_4
# Take a quick overview of the dataframe and check for any missing values
skim_without_charts(dailyActivity_merged)
Data summary
Name dailyActivity_merged
Number of rows 940
Number of columns 15
_______________________
Column type frequency:
Date 1
numeric 14
________________________
Group variables None

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
activity_date 0 1 2016-04-12 2016-05-12 2016-04-26 31

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
id 0 1 4.855407e+09 2.424805e+09 1503960366 2.320127e+09 4.445115e+09 6.962181e+09 8.877689e+09
total_steps 0 1 7.637910e+03 5.087150e+03 0 3.789750e+03 7.405500e+03 1.072700e+04 3.601900e+04
total_distance 0 1 5.490000e+00 3.920000e+00 0 2.620000e+00 5.240000e+00 7.710000e+00 2.803000e+01
tracker_distance 0 1 5.480000e+00 3.910000e+00 0 2.620000e+00 5.240000e+00 7.710000e+00 2.803000e+01
logged_activities_distance 0 1 1.100000e-01 6.200000e-01 0 0.000000e+00 0.000000e+00 0.000000e+00 4.940000e+00
very_active_distance 0 1 1.500000e+00 2.660000e+00 0 0.000000e+00 2.100000e-01 2.050000e+00 2.192000e+01
moderately_active_distance 0 1 5.700000e-01 8.800000e-01 0 0.000000e+00 2.400000e-01 8.000000e-01 6.480000e+00
light_active_distance 0 1 3.340000e+00 2.040000e+00 0 1.950000e+00 3.360000e+00 4.780000e+00 1.071000e+01
sedentary_active_distance 0 1 0.000000e+00 1.000000e-02 0 0.000000e+00 0.000000e+00 0.000000e+00 1.100000e-01
very_active_minutes 0 1 2.116000e+01 3.284000e+01 0 0.000000e+00 4.000000e+00 3.200000e+01 2.100000e+02
fairly_active_minutes 0 1 1.356000e+01 1.999000e+01 0 0.000000e+00 6.000000e+00 1.900000e+01 1.430000e+02
lightly_active_minutes 0 1 1.928100e+02 1.091700e+02 0 1.270000e+02 1.990000e+02 2.640000e+02 5.180000e+02
sedentary_minutes 0 1 9.912100e+02 3.012700e+02 0 7.297500e+02 1.057500e+03 1.229500e+03 1.440000e+03
calories 0 1 2.303610e+03 7.181700e+02 0 1.828500e+03 2.134000e+03 2.793250e+03 4.900000e+03

Even though there are no null values, we may still have rows in which every value equals zero, so let’s check.

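# Step_5
# Check for any rows where every column has zero value
# (id and activity_date are never zero, so this check cannot match any row;
# Step_6 repeats it on the measurement columns only)
filter(dailyActivity_merged, if_all(everything(), ~ . == 0))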
## # A tibble: 0 × 15
## # … with 15 variables: id <dbl>, activity_date <date>, total_steps <dbl>,
## #   total_distance <dbl>, tracker_distance <dbl>,
## #   logged_activities_distance <dbl>, very_active_distance <dbl>,
## #   moderately_active_distance <dbl>, light_active_distance <dbl>,
## #   sedentary_active_distance <dbl>, very_active_minutes <dbl>,
## #   fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
## #   sedentary_minutes <dbl>, calories <dbl>
# Step_6
# Check for days where there are no input data
input_data = dplyr::select(dailyActivity_merged, !c("id", "activity_date"))
filter(input_data, if_all(everything(), ~ . == 0))
## # A tibble: 0 × 13
## # … with 13 variables: total_steps <dbl>, total_distance <dbl>,
## #   tracker_distance <dbl>, logged_activities_distance <dbl>,
## #   very_active_distance <dbl>, moderately_active_distance <dbl>,
## #   light_active_distance <dbl>, sedentary_active_distance <dbl>,
## #   very_active_minutes <dbl>, fairly_active_minutes <dbl>,
## #   lightly_active_minutes <dbl>, sedentary_minutes <dbl>, calories <dbl>
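The zero-value checks depend on which columns are tested and on whether every column or just one of them has to be zero. As a hedged illustration on made-up data (the `toy` data frame is hypothetical), the sketch below contrasts `if_all()` and `if_any()` while leaving the `id` column out of the test.

```r
library(dplyr)

# Hypothetical toy data; only the column names mirror the real dataframe
toy <- data.frame(
  id = c(1, 2, 3),
  total_steps = c(0, 0, 500),
  calories = c(0, 1800, 2100)
)

# Rows where every measurement column is zero (id is excluded on purpose)
all_zero <- toy %>% filter(if_all(-id, ~ .x == 0))

# Rows where at least one measurement column is zero
any_zero <- toy %>% filter(if_any(-id, ~ .x == 0))
```

Here `all_zero` keeps only the first row, while `any_zero` keeps the first two, so the two helpers answer different questions.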
# Step_7
# Check for IDs that are missing or entered incorrectly (more or fewer than
# 10 digits)
min(dailyActivity_merged$id)
## [1] 1503960366
max(dailyActivity_merged$id)
## [1] 8877689391
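The min/max check works because every whole number between the two bounds has ten digits. If the offending values themselves are needed, a possible alternative is to count digits per ID. This is a sketch in base R; the third ID in `ids` is a made-up malformed value added for illustration.

```r
# Two IDs taken from the output above plus one hypothetical 9-digit value
ids <- c(1503960366, 8877689391, 150396036)

# sprintf("%.0f", ...) avoids scientific notation before counting characters
n_digits <- nchar(sprintf("%.0f", ids))

# IDs whose length differs from the expected 10 digits
bad_ids <- ids[n_digits != 10]
```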

The following code returns the rows where the tracker distance equals the total distance, both equal the sum of the four active-distance columns, and no value was entered for the logged activities. Only 306 of the 940 rows meet all three conditions. However, we shall not treat the remaining rows as a cross-field validation failure, and we leave the dataframe as it is for now.

dailyActivity_merged %>%
  filter(
   tracker_distance == rowSums(dailyActivity_merged[, 7:10]) &
     total_distance == tracker_distance & logged_activities_distance == 0
  )
## # A tibble: 306 × 15
##            id activity_date total_steps total_distance tracker_distance
##         <dbl> <date>              <dbl>          <dbl>            <dbl>
##  1 1503960366 2016-05-07          11992           7.71             7.71
##  2 1503960366 2016-05-12              0           0                0   
##  3 1624580081 2016-04-12           8163           5.31             5.31
##  4 1624580081 2016-04-13           7007           4.55             4.55
##  5 1624580081 2016-04-16           5370           3.49             3.49
##  6 1624580081 2016-04-19           2916           1.90             1.90
##  7 1624580081 2016-04-20           4974           3.23             3.23
##  8 1624580081 2016-04-29           2390           1.55             1.55
##  9 1624580081 2016-05-07           2104           1.37             1.37
## 10 1624580081 2016-05-09           1732           1.13             1.13
## # … with 296 more rows, and 10 more variables:
## #   logged_activities_distance <dbl>, very_active_distance <dbl>,
## #   moderately_active_distance <dbl>, light_active_distance <dbl>,
## #   sedentary_active_distance <dbl>, very_active_minutes <dbl>,
## #   fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
## #   sedentary_minutes <dbl>, calories <dbl>
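Comparing decimal columns with `==` can miss rows that differ only by floating-point rounding. As a hedged aside, the sketch below uses `dplyr::near()`, which compares within a small tolerance, on a made-up data frame (`toy` and its values are hypothetical, and only two of the distance columns are included to keep it short).

```r
library(dplyr)

# Hypothetical toy data; 0.1 + 0.2 is not exactly 0.3 in floating point
toy <- data.frame(
  total_distance = c(0.3, 5.0),
  very_active_distance = c(0.1, 2.0),
  light_active_distance = c(0.2, 2.5)
)

toy <- toy %>%
  mutate(
    parts_sum = very_active_distance + light_active_distance,
    exact_match = total_distance == parts_sum,     # strict equality
    near_match = near(total_distance, parts_sum)   # equality within tolerance
  )
```

In the first row the strict comparison is `FALSE` while `near()` is `TRUE`; in the second row the values genuinely differ, so both are `FALSE`. Whether this would change the count reported above is not something this sketch establishes.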

The daily calories, daily intensities, and daily steps dataframes simply replicate data already in the daily activity dataframe, so they are not considered in this analysis. In general, out of the 18 dataframes we’re going to use only six, as the other 12 contain duplicate or unnecessary information. That said, let’s proceed to clean the remaining five.

2) Burned Calories per Hour

# Step_1
# First modify the dataframe columns names to a more standard format
hourlyCalories_merged = read_csv(
  "case_study/Fitabase Data 4.12.16-5.12.16/hourlyCalories_merged.csv"
)
hourlyCalories_merged = clean_names(hourlyCalories_merged)

# Step_2
# Remove any duplicate entries if exist
hourlyCalories_merged = unique(hourlyCalories_merged)

# Step_3
# Check the data types of each column and modify if necessary
str(hourlyCalories_merged)
## tibble [22,099 × 3] (S3: tbl_df/tbl/data.frame)
##  $ id           : num [1:22099] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ activity_hour: chr [1:22099] "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
##  $ calories     : num [1:22099] 81 61 59 47 48 48 48 47 68 141 ...
hourlyCalories_merged$activity_hour = mdy_hms(hourlyCalories_merged$activity_hour)

# Step_4
# Take a quick overview of the dataframe and check for any missing values
# This is also useful to make sure all dates have been converted correctly
# Otherwise we'll end up having null values
skim_without_charts(hourlyCalories_merged)
Data summary
Name hourlyCalories_merged
Number of rows 22099
Number of columns 3
_______________________
Column type frequency:
numeric 2
POSIXct 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
id 0 1 4.848235e+09 2.4225e+09 1503960366 2320127002 4445114986 6962181067 8877689391
calories 0 1 9.739000e+01 6.0700e+01 42 63 83 108 948

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
activity_hour 0 1 2016-04-12 2016-05-12 15:00:00 2016-04-26 06:00:00 736
# Step_5
# Check for any zero values
filter(hourlyCalories_merged, if_any(everything(), ~ . == 0))
## # A tibble: 0 × 3
## # … with 3 variables: id <dbl>, activity_hour <dttm>, calories <dbl>
# Step_6
# Check for IDs that are missing or entered incorrectly (more or fewer than
# 10 digits)
min(hourlyCalories_merged$id)
## [1] 1503960366
max(hourlyCalories_merged$id)
## [1] 8877689391
# Step_7
# Check for any wrong calories values
hourlyCalories_merged %>%
  filter(calories < 0)
## # A tibble: 0 × 3
## # … with 3 variables: id <dbl>, activity_hour <dttm>, calories <dbl>
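The comments in Step_4 rely on the fact that a timestamp which cannot be parsed becomes `NA`. A minimal sketch of that behaviour follows; the vector `raw` is made up, and its last element is a deliberately invalid string.

```r
library(lubridate)

# Two timestamps in the same format as the file, plus one hypothetical bad value
raw <- c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM", "not a date")

# mdy_hms() warns about strings it cannot parse and returns NA for them
parsed <- suppressWarnings(mdy_hms(raw))

# Number of values that failed to convert
n_failed <- sum(is.na(parsed))
```

A non-zero `n_failed` (or a non-zero `n_missing` in the skim summary) after conversion would point to timestamps in an unexpected format.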

3) Intensities Values per Hour

# Step_1
# First modify the dataframe columns names to a more standard format
hourlyIntensities_merged = read_csv(
  "case_study/Fitabase Data 4.12.16-5.12.16/hourlyIntensities_merged.csv"
)
hourlyIntensities_merged = clean_names(hourlyIntensities_merged)

# Step_2
# Remove any duplicate entries if exist
hourlyIntensities_merged = unique(hourlyIntensities_merged)

# Step_3
# Check the data types of each column and modify if necessary
str(hourlyIntensities_merged)
## tibble [22,099 × 4] (S3: tbl_df/tbl/data.frame)
##  $ id               : num [1:22099] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ activity_hour    : chr [1:22099] "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
##  $ total_intensity  : num [1:22099] 20 8 7 0 0 0 0 0 13 30 ...
##  $ average_intensity: num [1:22099] 0.333 0.133 0.117 0 0 ...
hourlyIntensities_merged$activity_hour = mdy_hms(hourlyIntensities_merged$activity_hour)

# Step_4
# Take a quick overview of the dataframe and check for any missing values
# This is also useful to make sure all dates have been converted correctly
# Otherwise we'll end up having null values
skim_without_charts(hourlyIntensities_merged)
Data summary
Name hourlyIntensities_merged
Number of rows 22099
Number of columns 4
_______________________
Column type frequency:
numeric 3
POSIXct 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
id 0 1 4.848235e+09 2.4225e+09 1503960366 2320127002 4.445115e+09 6.962181e+09 8877689391
total_intensity 0 1 1.204000e+01 2.1130e+01 0 0 3.000000e+00 1.600000e+01 180
average_intensity 0 1 2.000000e-01 3.5000e-01 0 0 5.000000e-02 2.700000e-01 3

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
activity_hour 0 1 2016-04-12 2016-05-12 15:00:00 2016-04-26 06:00:00 736
# Step_5
# Check for any row that has all zero values
filter(hourlyIntensities_merged, if_all(everything(), ~ . == 0))
## # A tibble: 0 × 4
## # … with 4 variables: id <dbl>, activity_hour <dttm>, total_intensity <dbl>,
## #   average_intensity <dbl>
# Step_6
# Check for IDs that are missing or entered incorrectly (more or fewer than
# 10 digits)
min(hourlyIntensities_merged$id)
## [1] 1503960366
max(hourlyIntensities_merged$id)
## [1] 8877689391
# Step_7
# Check for any wrong intensities values
hourlyIntensities_merged %>%
  filter(total_intensity < 0 | average_intensity < 0)
## # A tibble: 0 × 4
## # … with 4 variables: id <dbl>, activity_hour <dttm>, total_intensity <dbl>,
## #   average_intensity <dbl>

4) Steps per Hour

# Step_1
# First modify the dataframe columns names to a more standard format
hourlySteps_merged = read_csv(
  "case_study/Fitabase Data 4.12.16-5.12.16/hourlySteps_merged.csv"
)
hourlySteps_merged = clean_names(hourlySteps_merged)

# Step_2
# Remove any duplicate entries if exist
hourlySteps_merged = unique(hourlySteps_merged)

# Step_3
# Check the data types of each column and modify if necessary
str(hourlySteps_merged)
## tibble [22,099 × 3] (S3: tbl_df/tbl/data.frame)
##  $ id           : num [1:22099] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ activity_hour: chr [1:22099] "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
##  $ step_total   : num [1:22099] 373 160 151 0 0 ...
hourlySteps_merged$activity_hour = mdy_hms(hourlySteps_merged$activity_hour)

# Step_4
# Take a quick overview of the dataframe and check for any missing values
# This is also useful to make sure all dates have been converted correctly
# Otherwise we'll end up having null values
skim_without_charts(hourlySteps_merged)
Data summary
Name hourlySteps_merged
Number of rows 22099
Number of columns 3
_______________________
Column type frequency:
numeric 2
POSIXct 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
id 0 1 4.848235e+09 2.4225e+09 1503960366 2320127002 4445114986 6962181067 8877689391
step_total 0 1 3.201700e+02 6.9038e+02 0 0 40 357 10554

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
activity_hour 0 1 2016-04-12 2016-05-12 15:00:00 2016-04-26 06:00:00 736
# Step_5
# Check for any row that has all zero values
filter(hourlySteps_merged, if_all(everything(), ~ . == 0))
## # A tibble: 0 × 3
## # … with 3 variables: id <dbl>, activity_hour <dttm>, step_total <dbl>
# Step_6
# Check for IDs that are missing or entered incorrectly (more or fewer than
# 10 digits)
min(hourlySteps_merged$id)
## [1] 1503960366
max(hourlySteps_merged$id)
## [1] 8877689391
# Step_7
# Check for any wrong steps values
hourlySteps_merged %>%
  filter(step_total < 0)
## # A tibble: 0 × 3
## # … with 3 variables: id <dbl>, activity_hour <dttm>, step_total <dbl>

5) Sleep Periods

# Step_1
# First modify the dataframe columns names to a more standard format
sleepDay_merged = read_csv(
  "case_study/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv"
)
sleepDay_merged = clean_names(sleepDay_merged)

# Step_2
# Remove any duplicate entries if exist
sleepDay_merged = unique(sleepDay_merged)

# Step_3
# Check the data types of each column and modify if necessary
str(sleepDay_merged)
## tibble [410 × 5] (S3: tbl_df/tbl/data.frame)
##  $ id                  : num [1:410] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ sleep_day           : chr [1:410] "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
##  $ total_sleep_records : num [1:410] 1 2 1 2 1 1 1 1 1 1 ...
##  $ total_minutes_asleep: num [1:410] 327 384 412 340 700 304 360 325 361 430 ...
##  $ total_time_in_bed   : num [1:410] 346 407 442 367 712 320 377 364 384 449 ...
sleepDay_merged$sleep_day = mdy_hms(sleepDay_merged$sleep_day)

# Step_4
# Take a quick overview of the dataframe and check for any missing values
# This is also useful to make sure all dates have been converted correctly
# Otherwise we'll end up having null values
skim_without_charts(sleepDay_merged)
Data summary
Name sleepDay_merged
Number of rows 410
Number of columns 5
_______________________
Column type frequency:
numeric 4
POSIXct 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
id 0 1 4.994963e+09 2.060863e+09 1503960366 3.977334e+09 4702921684.0 6962181067 8792009665
total_sleep_records 0 1 1.120000e+00 3.500000e-01 1 1.000000e+00 1.0 1 3
total_minutes_asleep 0 1 4.191700e+02 1.186400e+02 58 3.610000e+02 432.5 490 796
total_time_in_bed 0 1 4.584800e+02 1.274600e+02 61 4.037500e+02 463.0 526 961

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
sleep_day 0 1 2016-04-12 2016-05-12 2016-04-27 31
# Step_5
# Check for any row that has all zero values
filter(sleepDay_merged, if_all(everything(), ~ . == 0))
## # A tibble: 0 × 5
## # … with 5 variables: id <dbl>, sleep_day <dttm>, total_sleep_records <dbl>,
## #   total_minutes_asleep <dbl>, total_time_in_bed <dbl>
# Step_6
# Check for IDs that are missing or entered incorrectly (more or fewer than
# 10 digits)
min(sleepDay_merged$id)
## [1] 1503960366
max(sleepDay_merged$id)
## [1] 8792009665
# Step_7
# Check for any wrong values
sleepDay_merged %>%
  filter(
    total_sleep_records < 0 | total_minutes_asleep < 0 | total_time_in_bed < 0
  )
## # A tibble: 0 × 5
## # … with 5 variables: id <dbl>, sleep_day <dttm>, total_sleep_records <dbl>,
## #   total_minutes_asleep <dbl>, total_time_in_bed <dbl>

6) Body Mass Info

# Step_1
# First modify the dataframe columns names to a more standard format
weightLogInfo_merged = read_csv(
  "case_study/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv"
)
weightLogInfo_merged = clean_names(weightLogInfo_merged)

# Step_2
# Remove any duplicate entries if exist
weightLogInfo_merged = unique(weightLogInfo_merged)

# Step_3
# Check the data types of each column and modify if necessary
str(weightLogInfo_merged)
## tibble [67 × 8] (S3: tbl_df/tbl/data.frame)
##  $ id              : num [1:67] 1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
##  $ date            : chr [1:67] "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
##  $ weight_kg       : num [1:67] 52.6 52.6 133.5 56.7 57.3 ...
##  $ weight_pounds   : num [1:67] 116 116 294 125 126 ...
##  $ fat             : num [1:67] 22 NA NA NA NA 25 NA NA NA NA ...
##  $ bmi             : num [1:67] 22.6 22.6 47.5 21.5 21.7 ...
##  $ is_manual_report: logi [1:67] TRUE TRUE FALSE TRUE TRUE TRUE ...
##  $ log_id          : num [1:67] 1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
weightLogInfo_merged$date = mdy_hms(weightLogInfo_merged$date)

# Step_4
# Take a quick overview of the dataframe and check for any missing values
# This is also useful to make sure all dates have been converted correctly 
# Otherwise we'll end up having null values
skim_without_charts(weightLogInfo_merged)
Data summary
Name weightLogInfo_merged
Number of rows 67
Number of columns 8
_______________________
Column type frequency:
logical 1
numeric 6
POSIXct 1
________________________
Group variables None

Variable type: logical

skim_variable n_missing complete_rate mean count
is_manual_report 0 1 0.61 TRU: 41, FAL: 26

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
id 0 1.00 7.009282e+09 1.950322e+09 1.503960e+09 6.962181e+09 6.962181e+09 8.877689e+09 8.877689e+09
weight_kg 0 1.00 7.204000e+01 1.392000e+01 5.260000e+01 6.140000e+01 6.250000e+01 8.505000e+01 1.335000e+02
weight_pounds 0 1.00 1.588100e+02 3.070000e+01 1.159600e+02 1.353600e+02 1.377900e+02 1.875000e+02 2.943200e+02
fat 65 0.03 2.350000e+01 2.120000e+00 2.200000e+01 2.275000e+01 2.350000e+01 2.425000e+01 2.500000e+01
bmi 0 1.00 2.519000e+01 3.070000e+00 2.145000e+01 2.396000e+01 2.439000e+01 2.556000e+01 4.754000e+01
log_id 0 1.00 1.461772e+12 7.829948e+08 1.460444e+12 1.461079e+12 1.461802e+12 1.462375e+12 1.463098e+12

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2016-04-12 06:47:11 2016-05-12 23:59:59 2016-04-27 23:59:59 56
# Step_5
# Check for any row that has all zero values
filter(weightLogInfo_merged, if_all(everything(), ~ . == 0))
## # A tibble: 0 × 8
## # … with 8 variables: id <dbl>, date <dttm>, weight_kg <dbl>,
## #   weight_pounds <dbl>, fat <dbl>, bmi <dbl>, is_manual_report <lgl>,
## #   log_id <dbl>
# Step_6
# Check for IDs that are missing or entered incorrectly (more or fewer than
# 10 digits)
min(weightLogInfo_merged$id)
## [1] 1503960366
max(weightLogInfo_merged$id)
## [1] 8877689391
min(weightLogInfo_merged$log_id)
## [1] 1.460444e+12
max(weightLogInfo_merged$log_id)
## [1] 1.463098e+12
# Step_7
# Check for any wrong values
# We know from step_4 that the fat column is mostly null, so rows are not
# dropped on it; filter() already discards rows where the condition is NA.
# is_manual_report is a logical flag, and FALSE is a valid value
weightLogInfo_merged %>%
  filter(weight_kg < 0 | weight_pounds < 0 | fat < 0 | bmi < 0)
## # A tibble: 0 × 8
## # … with 8 variables: id <dbl>, date <dttm>, weight_kg <dbl>,
## #   weight_pounds <dbl>, fat <dbl>, bmi <dbl>, is_manual_report <lgl>,
## #   log_id <dbl>
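Because the `fat` column is almost entirely missing, it helps to see how `filter()` treats `NA` in a condition. The sketch below uses a made-up data frame (`toy_weight` is hypothetical, including its negative weight) and is meant only as an illustration of that behaviour.

```r
library(dplyr)

# Hypothetical toy data: one invalid weight, and fat missing in two rows
toy_weight <- data.frame(
  weight_kg = c(52.6, -1, 60),
  fat = c(22, NA, NA)
)

# filter() keeps rows where the condition is TRUE; rows where it is NA are dropped.
# TRUE | NA is TRUE, so the invalid weight is still caught despite the missing fat
flagged <- toy_weight %>%
  filter(weight_kg < 0 | fat < 0)

# Removing rows with missing fat first hides the invalid weight entirely
flagged_after_drop <- toy_weight %>%
  filter(!is.na(fat)) %>%
  filter(weight_kg < 0 | fat < 0)
```

`flagged` contains the row with the negative weight, whereas `flagged_after_drop` is empty, which is why dropping rows on a sparse column before a validity check can mask problems.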

Analyze & Share

# Step_1: merge the hourly steps, intensities and calories dataframes
merged_1 = merge(
  hourlySteps_merged,
  hourlyIntensities_merged,
  by = c("id", "activity_hour")
)

hourly_sic_merged = merge(
  merged_1,
  hourlyCalories_merged,
  by = c("id", "activity_hour")
)

head(hourly_sic_merged)
##           id       activity_hour step_total total_intensity average_intensity
## 1 1503960366 2016-04-12 00:00:00        373              20          0.333333
## 2 1503960366 2016-04-12 01:00:00        160               8          0.133333
## 3 1503960366 2016-04-12 02:00:00        151               7          0.116667
## 4 1503960366 2016-04-12 03:00:00          0               0          0.000000
## 5 1503960366 2016-04-12 04:00:00          0               0          0.000000
## 6 1503960366 2016-04-12 05:00:00          0               0          0.000000
##   calories
## 1       81
## 2       61
## 3       59
## 4       47
## 5       48
## 6       48
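A join on `id` and `activity_hour` only preserves the row count when that pair identifies each row uniquely. As a hedged sketch, the same kind of merge can be written with `dplyr::inner_join()` and followed by a duplicate-key check; the two small data frames are made up, and `activity_hour` is a plain number here for brevity.

```r
library(dplyr)

# Hypothetical toy inputs sharing the keys id and activity_hour
steps <- data.frame(
  id = c(1, 1, 2),
  activity_hour = c(0, 1, 0),
  step_total = c(373, 160, 0)
)
calories <- data.frame(
  id = c(1, 1, 2),
  activity_hour = c(0, 1, 0),
  calories = c(81, 61, 48)
)

# Keep only the id/hour combinations present in both inputs
joined <- inner_join(steps, calories, by = c("id", "activity_hour"))

# Count key combinations that occur more than once after the join
n_dupe_keys <- joined %>%
  count(id, activity_hour) %>%
  filter(n > 1) %>%
  nrow()
```

A value of zero for `n_dupe_keys` suggests the join did not multiply any rows.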
# To see the total number of submitted hours per participant
hour_frequency = tabyl(hourly_sic_merged, id) %>%
  arrange(-n) %>%
  dplyr::rename(no_of_submitted_hours = n) %>%
  dplyr::select(-percent)

hour_frequency
##          id no_of_submitted_hours
##  1624580081                   736
##  1927972279                   736
##  2022484408                   736
##  2026352035                   736
##  2873212765                   736
##  4558609924                   736
##  2320127002                   735
##  4388161847                   735
##  4445114986                   735
##  8053475328                   735
##  8378563200                   735
##  8877689391                   735
##  7086361926                   733
##  4020332650                   732
##  6962181067                   732
##  1844505072                   731
##  4702921684                   731
##  5553957443                   730
##  4319703577                   724
##  8583815059                   718
##  1503960366                   717
##  1644430081                   708
##  5577150313                   708
##  3977333714                   696
##  8792009665                   672
##  6290855005                   665
##  6117666160                   660
##  6775888955                   610
##  7007744171                   601
##  3372868164                   472
##  8253242879                   431
##  2347167796                   414
##  4057192912                    88
daily_frequency = tabyl(dailyActivity_merged, id) %>%
  arrange(-n) %>%  
  dplyr::rename(no_of_days = n) %>%
  dplyr::select(-percent)

daily_frequency
##          id no_of_days
##  1503960366         31
##  1624580081         31
##  1844505072         31
##  1927972279         31
##  2022484408         31
##  2026352035         31
##  2320127002         31
##  2873212765         31
##  4020332650         31
##  4319703577         31
##  4388161847         31
##  4445114986         31
##  4558609924         31
##  4702921684         31
##  5553957443         31
##  6962181067         31
##  7086361926         31
##  8053475328         31
##  8378563200         31
##  8583815059         31
##  8877689391         31
##  1644430081         30
##  3977333714         30
##  5577150313         30
##  6290855005         29
##  8792009665         29
##  6117666160         28
##  6775888955         26
##  7007744171         26
##  3372868164         20
##  8253242879         19
##  2347167796         18
##  4057192912          4
merged_frequencies = merge(hour_frequency, daily_frequency, by = "id") 

merged_frequencies = merged_frequencies %>%
  arrange(-no_of_submitted_hours)

merged_frequencies
##            id no_of_submitted_hours no_of_days
## 1  1624580081                   736         31
## 2  1927972279                   736         31
## 3  2022484408                   736         31
## 4  2026352035                   736         31
## 5  2873212765                   736         31
## 6  4558609924                   736         31
## 7  2320127002                   735         31
## 8  4388161847                   735         31
## 9  4445114986                   735         31
## 10 8053475328                   735         31
## 11 8378563200                   735         31
## 12 8877689391                   735         31
## 13 7086361926                   733         31
## 14 4020332650                   732         31
## 15 6962181067                   732         31
## 16 1844505072                   731         31
## 17 4702921684                   731         31
## 18 5553957443                   730         31
## 19 4319703577                   724         31
## 20 8583815059                   718         31
## 21 1503960366                   717         31
## 22 1644430081                   708         30
## 23 5577150313                   708         30
## 24 3977333714                   696         30
## 25 8792009665                   672         29
## 26 6290855005                   665         29
## 27 6117666160                   660         28
## 28 6775888955                   610         26
## 29 7007744171                   601         26
## 30 3372868164                   472         20
## 31 8253242879                   431         19
## 32 2347167796                   414         18
## 33 4057192912                    88          4
summarized_hourly_data = hourly_sic_merged %>%
  group_by(id) %>%
  summarize(
    avg_calories_per_hour = mean(calories),
    avg_intensities_per_hour = mean(total_intensity),
    avg_steps_per_hour = mean(step_total)
  )

summarized_hourly_data
## # A tibble: 33 × 4
##            id avg_calories_per_hour avg_intensities_per_hour avg_steps_per_hour
##         <dbl>                 <dbl>                    <dbl>              <dbl>
##  1 1503960366                  78.5                    16.2               522. 
##  2 1624580081                  62.5                     8.04              242. 
##  3 1644430081                 119.                     10.5               308. 
##  4 1844505072                  66.6                     5.02              109. 
##  5 1927972279                  91.5                     1.86               38.6
##  6 2022484408                 105.                     17.0               478. 
##  7 2026352035                  64.9                    10.8               234. 
##  8 2320127002                  72.6                     8.74              199. 
##  9 2347167796                  88.7                    14.5               414. 
## 10 2873212765                  80.2                    15.1               318. 
## # … with 23 more rows
weight_summary = weightLogInfo_merged %>%
  group_by(id) %>%
  summarize(
    avg_weight_kg = mean(weight_kg),
    avg_BMI = mean(bmi)
  )

# Join on the participant id explicitly instead of relying on the implicit key
joined_df = full_join(merged_frequencies, weight_summary, by = "id")

summary_of_daily_data = dailyActivity_merged %>%
  group_by(id) %>%
  summarise(
    avg_steps_per_day = mean(total_steps),
    avg_distance_per_day = mean(total_distance),
    avg_calories_per_day = mean(calories),
  )

summary_data_1 = full_join(joined_df, summary_of_daily_data, by = "id")
summary_data = full_join(summary_data_1, summarized_hourly_data, by = "id")

summary_data
##            id no_of_submitted_hours no_of_days avg_weight_kg  avg_BMI
## 1  1624580081                   736         31            NA       NA
## 2  1927972279                   736         31     133.50000 47.54000
## 3  2022484408                   736         31            NA       NA
## 4  2026352035                   736         31            NA       NA
## 5  2873212765                   736         31      57.00000 21.57000
## 6  4558609924                   736         31      69.64000 27.21400
## 7  2320127002                   735         31            NA       NA
## 8  4388161847                   735         31            NA       NA
## 9  4445114986                   735         31            NA       NA
## 10 8053475328                   735         31            NA       NA
## 11 8378563200                   735         31            NA       NA
## 12 8877689391                   735         31      85.14583 25.48708
## 13 7086361926                   733         31            NA       NA
## 14 4020332650                   732         31            NA       NA
## 15 6962181067                   732         31      61.55333 24.02800
## 16 1844505072                   731         31            NA       NA
## 17 4702921684                   731         31            NA       NA
## 18 5553957443                   730         31            NA       NA
## 19 4319703577                   724         31      72.35000 27.41500
## 20 8583815059                   718         31            NA       NA
## 21 1503960366                   717         31      52.60000 22.65000
## 22 1644430081                   708         30            NA       NA
## 23 5577150313                   708         30      90.70000 28.00000
## 24 3977333714                   696         30            NA       NA
## 25 8792009665                   672         29            NA       NA
## 26 6290855005                   665         29            NA       NA
## 27 6117666160                   660         28            NA       NA
## 28 6775888955                   610         26            NA       NA
## 29 7007744171                   601         26            NA       NA
## 30 3372868164                   472         20            NA       NA
## 31 8253242879                   431         19            NA       NA
## 32 2347167796                   414         18            NA       NA
## 33 4057192912                    88          4            NA       NA
##    avg_steps_per_day avg_distance_per_day avg_calories_per_day
## 1           5743.903            3.9148387             1483.355
## 2            916.129            0.6345161             2172.806
## 3          11370.645            8.0841935             2509.968
## 4           5566.871            3.4548387             1540.645
## 5           7555.774            5.1016129             1916.968
## 6           7685.129            5.0806452             2033.258
## 7           4716.871            3.1877419             1724.161
## 8          10813.935            8.3932259             3093.871
## 9           4796.548            3.2458064             2186.194
## 10         14763.290           11.4751612             2945.806
## 11          8717.710            6.9135485             3436.581
## 12         16040.032           13.2129031             3420.258
## 13          9371.774            6.3880645             2566.355
## 14          2267.226            1.6261290             2385.806
## 15          9794.806            6.5858065             1982.032
## 16          2580.065            1.7061290             1573.484
## 17          8572.065            6.9551613             2965.548
## 18          8612.581            5.6396774             1875.677
## 19          7268.839            4.8922580             2037.677
## 20          7198.516            5.6154838             2732.032
## 21         12116.742            7.8096774             1816.419
## 22          7282.967            5.2953334             2811.300
## 23          8304.433            6.2133333             3359.633
## 24         10984.567            7.5169999             1513.667
## 25          1853.724            1.1865517             1962.310
## 26          5649.552            4.2724138             2599.621
## 27          7046.714            5.3421429             2261.143
## 28          2519.692            1.8134615             2131.769
## 29         11323.423            8.0153846             2544.000
## 30          6861.650            4.7070000             1933.100
## 31          6482.158            4.6673685             1788.000
## 32          9519.667            6.3555555             2043.444
## 33          3838.000            2.8625000             1973.750
##    avg_calories_per_hour avg_intensities_per_hour avg_steps_per_hour
## 1               62.47283                 8.039402          241.50815
## 2               91.50408                 1.857337           38.58696
## 3              105.47962                17.031250          477.86957
## 4               64.91168                10.812500          233.78804
## 5               80.24321                15.101902          318.11141
## 6               85.66033                14.449728          323.24592
## 7               72.55510                 8.742857          198.68707
## 8              128.11837                14.311565          435.05442
## 9               92.13741                 9.793197          202.10612
## 10             124.23129                17.949660          622.39864
## 11             144.79864                14.858503          366.94558
## 12             143.87211                19.081633          674.31701
## 13             108.24147                13.563438          392.22783
## 14             101.00137                 4.357923           93.44126
## 15              83.96311                14.875683          414.78279
## 16              66.59508                 5.021888          109.35978
## 17             125.70588                12.931601          363.28181
## 18              79.63562                12.843836          365.62466
## 19              85.50276                11.310773          289.31492
## 20             110.58774                 9.130919          245.40390
## 21              78.50349                16.170153          522.37936
## 22             118.82062                10.519774          307.80650
## 23             142.39548                19.895480          351.01554
## 24              65.08908                15.228448          471.61494
## 25              84.43452                 4.434524           79.98512
## 26             113.34887                10.600000          246.13233
## 27              91.75455                12.540909          281.32273
## 28              91.06721                 4.373770          107.16557
## 29             110.03161                17.580699          489.86522
## 30              81.75636                15.379237          290.37500
## 31              78.45012                 9.106729          285.00696
## 32              88.71739                14.521739          413.85749
## 33              89.19318                 4.897727          173.92045
# Now let's find out if there's a relation between the weekday and workouts 
daily_activity_wday = dailyActivity_merged %>%
  mutate("week_day" = weekdays(activity_date), .after = activity_date)

ggplot(
  daily_activity_wday, 
  aes(week_day, fill = week_day)
) + 
  geom_bar(show.legend = FALSE) + 
  labs(
    title = "Participation vs. Weekday",
    subtitle = "The relation between the weekday and the frequency of workouts",
    caption = "Data collected by the Research Triangle Institute"
  ) + 
  theme(
    axis.text = element_text(size = 10, angle = 45),
    axis.text.x = element_text(vjust = 0.7),
    axis.title = element_text(size = 15)
  ) +
  stat_count(
    geom = 'text', 
    color = 'black', 
    aes(label = after_stat(count)), 
    size = 5, 
    position = position_stack(vjust = 0.5)
  ) 

As we can see, there’s a relation (though not a very strong one) between the weekday and the frequency of participation. Now let’s see whether there’s also a relation between the weekday and the total distance covered, since distance reflects the amount of physical activity. After that, we’ll check whether there’s a relation between the time of day and workouts.

daily_activity_wday %>%
  group_by(week_day) %>%
  summarise(total_distance = sum(total_distance)) %>%
  ggplot(aes(week_day, total_distance, fill = week_day)) + 
  geom_col(show.legend = FALSE) + 
  geom_text(
    aes(label = round(total_distance, 0)), 
    position=position_stack(vjust = 0.5)
  ) + 
  theme(
    axis.text = element_text(size = 10, angle = 45),
    axis.text.x = element_text(vjust = 0.7),
    axis.title = element_text(size = 15)
  ) + 
  labs(
    title = "Total Distance vs. Weekday",
    subtitle = 
      "The relation between the weekday and the amount of total distance",
    caption = "Data collected by the Research Triangle Institute",
    x = "Weekday",
    y = "Total Distance (Km)"
  ) 

Finding No.1

As we can see from the above chart, participants tend to exercise less at the beginning and end of the working week (Monday and Friday) and also on weekends. It’s really interesting that, despite having more spare time on weekends, they work out more on working days than on non-working ones. On Monday, after returning to work, there’s a modest start, which is understandable from personal experience, as the first working day is a transitional phase between two different daily routines. What really surprised me, however, is Friday. Is it because the participants, after a long week, preferred to spend their Friday nights on something other than workouts? Let’s check the charts below to find out more.
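
One caveat about the chart above: a summed distance per weekday depends on how many records each weekday has, and week_day is stored as text, so ggplot sorts the bars alphabetically. The sketch below is only an illustration on made-up numbers, not part of the original analysis. It assumes English weekday names and a Monday-first order, and the names toy_daily and week_levels are hypothetical.

```r
library(dplyr)

# Toy stand-in for daily_activity_wday (values are made up for illustration)
toy_daily = data.frame(
  week_day = c("Monday", "Monday", "Tuesday", "Tuesday", "Tuesday", "Saturday"),
  total_distance = c(4, 6, 5, 7, 9, 3)
)

# Calendar order for the weekday factor (Monday first is an assumption)
week_levels = c(
  "Monday", "Tuesday", "Wednesday", "Thursday",
  "Friday", "Saturday", "Sunday"
)

distance_by_weekday = toy_daily %>%
  mutate(week_day = factor(week_day, levels = week_levels)) %>%
  group_by(week_day) %>%
  summarise(
    n_records = n(),                      # how many daily records feed each bar
    avg_distance = mean(total_distance),  # not affected by the record count
    sum_distance = sum(total_distance)    # what the chart above displays
  )

distance_by_weekday
```

Comparing avg_distance with sum_distance side by side shows whether a low bar reflects shorter distances or simply fewer records on that weekday.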

hourly_sic_merged = hourly_sic_merged %>%
  mutate(hour = hour(activity_hour), .after = activity_hour)

average_values = hourly_sic_merged %>%
  group_by(hour) %>%
  summarize(
    avg_amt_of_calories = mean(calories),
    avg_total_steps = mean(step_total),   
    avg_intensity_per_hour = mean(total_intensity)
  )

ggplot(average_values, aes(hour, avg_amt_of_calories, fill = hour)) + 
  geom_col(show.legend = FALSE) +
  geom_text(
    aes(label = round(avg_amt_of_calories,0)), 
    size = 3.5, 
    color = 'white',
    angle = 90,
    position = position_stack(vjust = 0.5) 
  ) + 
  scale_x_continuous(n.breaks = 23) + 
  labs(
    title = "Average amount of burned calories per hour",
    subtitle = 
      "The relation between the time of the day and the average amount of burned calories",
    caption = "Data collected by the Research Triangle Institute",
    x = "Hour",
    y = "Average Burned Calories"
  ) 

ggplot(average_values, aes(hour, avg_intensity_per_hour, fill = hour)) +
  geom_col(show.legend = FALSE) +
  scale_x_continuous(n.breaks = 23) +
  geom_label(
    aes(label = round(avg_intensity_per_hour, 0)),
    fill = "white",
    size = 3 
  ) +
  labs(
    title = "Average amount of intensities per hour",
    subtitle =
      "The relation between the time of the day and the average intensity per hour",
    caption = "Data collected by the Research Triangle Institute",
    x = "Hour",
    y = "Average Intensity per Hour"
  )

Finding No.2

By looking at the above charts, we can clearly see the relation between the time of day and the degree of physical activity. Knowing that the usual working hours are from 8 AM to 5 PM, we may focus on encouraging customers to work out during two periods, 6:00 AM to 8:00 AM and 8:00 PM to 11:00 PM, while maintaining a minimum of seven hours of sleep per day.
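
To put numbers on the time windows mentioned above, the hours can be grouped into windows and the average intensity compared across them. This is a minimal sketch on made-up values. The window boundaries follow the text, while toy_hourly, time_window and the window labels are hypothetical names introduced for illustration.

```r
library(dplyr)

# Toy stand-in for hourly_sic_merged (values are made up for illustration)
toy_hourly = data.frame(
  hour = c(6, 7, 9, 12, 18, 20, 22),
  total_intensity = c(4, 8, 12, 20, 24, 10, 2)
)

window_summary = toy_hourly %>%
  mutate(
    # Assign each hour to one of the windows discussed in the text
    time_window = case_when(
      hour >= 6 & hour < 8 ~ "before work (6-8)",
      hour >= 8 & hour < 17 ~ "working hours (8-17)",
      hour >= 20 & hour < 23 ~ "evening (20-23)",
      TRUE ~ "other"
    )
  ) %>%
  group_by(time_window) %>%
  summarise(avg_intensity = mean(total_intensity))

window_summary
```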

sleepDay_merged = sleepDay_merged %>%
  mutate(week_day = weekdays(sleep_day), .after = sleep_day)
head(sleepDay_merged)
## # A tibble: 6 × 6
##           id sleep_day           week_day  total_sleep_records total_minutes_as…
##        <dbl> <dttm>              <chr>                   <dbl>             <dbl>
## 1 1503960366 2016-04-12 00:00:00 Tuesday                     1               327
## 2 1503960366 2016-04-13 00:00:00 Wednesday                   2               384
## 3 1503960366 2016-04-15 00:00:00 Friday                      1               412
## 4 1503960366 2016-04-16 00:00:00 Saturday                    2               340
## 5 1503960366 2016-04-17 00:00:00 Sunday                      1               700
## 6 1503960366 2016-04-19 00:00:00 Tuesday                     1               304
## # … with 1 more variable: total_time_in_bed <dbl>
sleepDay_merged %>%
  group_by(week_day) %>%
  summarise(average_sleeping_hours = mean(total_minutes_asleep)/60) %>%
  ggplot(aes(week_day, average_sleeping_hours, fill = week_day)) + 
  geom_col(show.legend = FALSE) + 
  geom_text(
    aes(label = round(average_sleeping_hours, 1)), 
    position=position_stack(vjust = 0.5)
  ) + 
  theme(
    axis.text = element_text(size = 10, angle = 45),
    axis.text.x = element_text(vjust = 0.7),
    axis.title = element_text(size = 15)
  ) + 
  labs(
    title = "Average Sleeping Hours vs. Weekday",
    subtitle = 
      "The relation between the weekday and the average amount of sleeping hours",
    caption = "Data collected by the Research Triangle Institute",
    x = "Weekday",
    y = "Average Hours Asleep"
  ) 

sleepDay_merged %>%
  group_by(week_day) %>%
  summarise(average_time_in_bed = mean(total_time_in_bed)/60) %>%
  ggplot(aes(week_day, average_time_in_bed, fill = week_day)) + 
  geom_col(show.legend = FALSE) + 
  geom_text(
    aes(label = round(average_time_in_bed, 1)), 
    position=position_stack(vjust = 0.5)
  ) + 
  theme(
    axis.text = element_text(size = 10, angle = 45),
    axis.text.x = element_text(vjust = 0.7),
    axis.title = element_text(size = 15)
  ) + 
  labs(
    title = "Average Time in Bed vs. Weekday",
    subtitle = "The relation between the weekday and the average time in bed",
    caption = "Data collected by the Research Triangle Institute",
    x = "Weekday",
    y = "Average Hours in Bed"
  ) 

ggplot(
  sleepDay_merged, 
  aes(x = factor(total_sleep_records), fill = total_sleep_records)
) + 
  geom_bar()  +
  labs(
    title = "Count of Sleep Records",
    subtitle = "Comparison between the Sleep Records",
    caption = "Data collected by the Research Triangle Institute",
    x = "Total Sleep Records",
    y = "Count"
  ) 

Finding No.3

The recommended amount of sleep for adults is 7 to 9 hours per day, so we shouldn’t fall below the minimum of seven hours if we want to maintain a healthy daily routine. Judging by the count of sleep records in the above graphs, the majority don’t have the ability and/or the willingness to take a nap longer than 60 minutes during the day. Although they consistently spend more than 7 hours in bed each day, their actual sleeping time is often less than 7 hours.
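
The gap between time in bed and time asleep can also be checked per record rather than per weekday average. The sketch below uses made-up values and the column names shown in the sleepDay_merged output. The 7-hour threshold comes from the text, and toy_sleep, below_seven_hours and sleep_efficiency are hypothetical names.

```r
library(dplyr)

# Toy stand-in for sleepDay_merged (values are made up for illustration)
toy_sleep = data.frame(
  id = c(1, 1, 2, 2),
  total_minutes_asleep = c(327, 450, 412, 380),
  total_time_in_bed = c(346, 480, 442, 400)
)

sleep_check = toy_sleep %>%
  mutate(
    hours_asleep = total_minutes_asleep / 60,
    below_seven_hours = hours_asleep < 7,                       # under the recommended minimum
    sleep_efficiency = total_minutes_asleep / total_time_in_bed # share of bed time spent asleep
  )

# Share of sleep records with less than 7 hours asleep
share_below_seven = mean(sleep_check$below_seven_hours)
share_below_seven
```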

# Not all 33 participants shared their weight or sleep data.
# Share of participants with no sleep records at all:
percent(
  (n_distinct(dailyActivity_merged$id) - n_distinct(sleepDay_merged$id)) / 33
)
## [1] "27%"

Finding No.4

Another important observation from this data frame is that more than a quarter of the participants haven’t shared any data about their sleep. That percentage needs to be taken into consideration. We need to find out why they’re not interested in using the Fitbit tracker to record their sleep quality (Awake/Light/REM/Deep) while in bed.
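
To follow up with the participants behind that percentage, it helps to list their ids instead of only counting them. The sketch below shows one way to do that with an anti-join, on made-up ids. The names toy_activity, toy_sleep_ids and ids_without_sleep are hypothetical, and the denominator is computed from the data rather than typed in.

```r
library(dplyr)

# Toy stand-ins for dailyActivity_merged and sleepDay_merged (ids are made up)
toy_activity = data.frame(id = c(101, 101, 102, 103, 104))
toy_sleep_ids = data.frame(id = c(101, 103, 103))

# Participants who logged activity but never logged sleep
ids_without_sleep = toy_activity %>%
  distinct(id) %>%
  anti_join(toy_sleep_ids, by = "id")

# Share of all participants without any sleep record
share_without_sleep = nrow(ids_without_sleep) / n_distinct(toy_activity$id)

ids_without_sleep
share_without_sleep
```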

n_distinct(weightLogInfo_merged$id)
## [1] 8
weightLogInfo_merged %>%
  drop_na(fat) %>%
  summarise(count = n_distinct(id))
## # A tibble: 1 × 1
##   count
##   <int>
## 1     2

Finding No.5

We can see that only 8 out of 33 participants have shared their weight and BMI, either through a connected smart scale or manually. We can also see that almost none of the participants (only 2) have shared their body fat percentage.
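
Beyond counting who logged weight at all, it may be worth checking how often each of those participants logged it. The sketch below runs on made-up values with the id, weight_kg and fat columns used above. The names toy_weight, n_weight_logs and n_fat_logs are hypothetical.

```r
library(dplyr)

# Toy stand-in for weightLogInfo_merged (values are made up for illustration)
toy_weight = data.frame(
  id = c(1, 1, 2, 3, 3, 3),
  weight_kg = c(52.6, 52.6, 133.5, 57.0, 57.2, 56.8),
  fat = c(22, NA, NA, NA, NA, NA)
)

weight_logging = toy_weight %>%
  group_by(id) %>%
  summarise(
    n_weight_logs = n(),          # number of weight entries per participant
    n_fat_logs = sum(!is.na(fat)) # entries that also include body fat
  )

weight_logging
```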

Act

The five recommendations below are based on the five findings from our dataset, in the same order. Before proceeding, allow me to remind you of what I said earlier about the limitations of this dataset and why we should collect more data by conducting another survey that is more global, aimed specifically at women, and has more participants. The CEO has also asked us to do so.

  1. It’s finally the weekend and every one of us wants to hang out with friends, binge-watch some shows, go on a quick trip, and the list goes on. As I see it, the challenge here is not just to send a reminder via the app so that users don’t miss training on those days, but rather to prepare fun collections of different workouts, mainly bodyweight ones, that can be done anywhere. That way, no matter where you’re planning to go this weekend, you don’t need to look for a gym or a good running track, and you don’t even have to carry exercise equipment with you. I think what we need to bring to the table is a diverse list of fun bodyweight programs that we keep updating, so that users look forward to exploring them each new weekend. We may also offer group workout programs to encourage friends to do them while hanging out.

  2. Many fitness experts strongly advise working out in the early morning to get the best results for the body. You may want to read this article by TIME, for instance. We may develop a collection of morning workouts. To keep our users motivated, we may send daily notifications, such as links to articles and studies, with more information on the benefits of waking up early and exercising before going to work. This will also help users not to miss workouts on Fridays.

  3. Sending users quick info and charts about their sleep quality last night, and how it will affect the day ahead, can be a good start. Creative reminders set at user-defined times each day could also tell the user to go to bed and turn off any distractions.

  4. & 5. We need to conduct another survey of Fitbit users, asking:

    a. Whether they use their tracker while they’re asleep

    b. Whether the sleep-related information they receive from the tracker is helpful

    c. Whether the tracker itself is comfortable to wear while asleep and, if not, why it’s uncomfortable

    d. Whether they record their weight, BMI and body fat info

    e. Whether the weight-related information they receive from the tracker is helpful

    f. How long, on average, it takes to record this weight-related info via the app

    p.s. I know the following four questions are outside the scope of this case study, but since we’re conducting a survey anyway, let’s gather this info for future consideration.

    g. If they’re planning to get a new smart scale, what physical and technical features they’re looking for and what their desired price range is

    h. Why they chose that brand for their current tracker

    i. If they’re planning to get a new fitness tracker, what physical and technical features they’re looking for and what their desired price range is

    j. Whether they would use their trackers to look for dietary plans, and why

    k. Finally, we may request feedback and suggestions. For instance, what they expect when they look for information about health status, sleep quality, etc.