Bellabeat is a high-tech company that manufactures health-focused smart products, including Bellabeat application, Leaf, Time and Spring. Their ecosystem of products and services focused on women’s health by collecting and providing user’s health data related to their activity, sleep, stress.
The company expect to become a larger player in the global smart device market. Urška Sršen Cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company.
This is a case study of Google Data Analytics Certification program, then I will follow the Data Analysis Processes which include 6 phases : Ask, Prepare, Process, Analyze, Share and Act.
I will use R programming to analyse and document for this case study.
Unlock new growth opportunity by analyzing the trend of smart device usage and suggesting data-driven marketing stategy
Top high-level recommendations for marketing team.
FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius) contains personal fitness tracker from thirty Fitbit users.
Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.
ROCCC Analysis
ROCCC (Reliable, Original, Comprehensive, Current & Cited) will be used to assess the credibility and integrity of the data
There are 18 files in this dataset. But as the focus is to identify the pattern of activity in smart devices usages, I would suggest the following files:
R Packages and libraries are installed and loaded as below:
library(tidyverse)
library(skimr)
library(janitor)
library(ggplot2)
library(lubridate)
library(here)
Import selected datasets that will be used in analysis:
file | colnames | # obs | # variables |
|---|---|---|---|
dailyActivity | Id, ActivityDate, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDistance, VeryActiveDistance, ModeratelyActiveDistance, LightActiveDistance, SedentaryActiveDistance, VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, SedentaryMinutes, Calories | 940 | 15 |
dailyCalories | Id, ActivityDay, Calories | 940 | 3 |
dailyIntensities | Id, ActivityDay, SedentaryMinutes, LightlyActiveMinutes, FairlyActiveMinutes, VeryActiveMinutes, SedentaryActiveDistance, LightActiveDistance, ModeratelyActiveDistance, VeryActiveDistance | 940 | 10 |
dailySteps | Id, ActivityDay, StepTotal | 940 | 3 |
dailySleep | Id, SleepDay, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed | 413 | 5 |
weightLogInfo | Id, Date, WeightKg, WeightPounds, Fat, BMI, IsManualReport, LogId | 67 | 8 |
Then, quick check of data to ensure if cleaning data is needed:
head(dailyActivity)
## # A tibble: 6 × 15
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 13162 8.5 8.5
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## # ℹ 10 more variables: LoggedActivitiesDistance <dbl>,
## # VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## # LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## # VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## # LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>
str(dailyActivity)
## spc_tbl_ [940 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ TotalSteps : num [1:940] 13162 10735 10460 9762 12669 ...
## $ TotalDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : num [1:940] 728 776 1218 726 773 ...
## $ Calories : num [1:940] 1985 1797 1776 1745 1863 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. ActivityDate = col_character(),
## .. TotalSteps = col_double(),
## .. TotalDistance = col_double(),
## .. TrackerDistance = col_double(),
## .. LoggedActivitiesDistance = col_double(),
## .. VeryActiveDistance = col_double(),
## .. ModeratelyActiveDistance = col_double(),
## .. LightActiveDistance = col_double(),
## .. SedentaryActiveDistance = col_double(),
## .. VeryActiveMinutes = col_double(),
## .. FairlyActiveMinutes = col_double(),
## .. LightlyActiveMinutes = col_double(),
## .. SedentaryMinutes = col_double(),
## .. Calories = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(dailyCalories)
## spc_tbl_ [940 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDay: chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ Calories : num [1:940] 1985 1797 1776 1745 1863 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. ActivityDay = col_character(),
## .. Calories = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(dailyIntensities)
## spc_tbl_ [940 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDay : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ SedentaryMinutes : num [1:940] 728 776 1218 726 773 ...
## $ LightlyActiveMinutes : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
## $ FairlyActiveMinutes : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
## $ VeryActiveMinutes : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
## $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ LightActiveDistance : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
## $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
## $ VeryActiveDistance : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. ActivityDay = col_character(),
## .. SedentaryMinutes = col_double(),
## .. LightlyActiveMinutes = col_double(),
## .. FairlyActiveMinutes = col_double(),
## .. VeryActiveMinutes = col_double(),
## .. SedentaryActiveDistance = col_double(),
## .. LightActiveDistance = col_double(),
## .. ModeratelyActiveDistance = col_double(),
## .. VeryActiveDistance = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(dailySteps)
## spc_tbl_ [940 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDay: chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ StepTotal : num [1:940] 13162 10735 10460 9762 12669 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. ActivityDay = col_character(),
## .. StepTotal = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(dailySleep)
## spc_tbl_ [413 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:413] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ SleepDay : chr [1:413] "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
## $ TotalSleepRecords : num [1:413] 1 2 1 2 1 1 1 1 1 1 ...
## $ TotalMinutesAsleep: num [1:413] 327 384 412 340 700 304 360 325 361 430 ...
## $ TotalTimeInBed : num [1:413] 346 407 442 367 712 320 377 364 384 449 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. SleepDay = col_character(),
## .. TotalSleepRecords = col_double(),
## .. TotalMinutesAsleep = col_double(),
## .. TotalTimeInBed = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(weightLogInfo)
## spc_tbl_ [67 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:67] 1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
## $ Date : chr [1:67] "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
## $ WeightKg : num [1:67] 52.6 52.6 133.5 56.7 57.3 ...
## $ WeightPounds : num [1:67] 116 116 294 125 126 ...
## $ Fat : num [1:67] 22 NA NA NA NA 25 NA NA NA NA ...
## $ BMI : num [1:67] 22.6 22.6 47.5 21.5 21.7 ...
## $ IsManualReport: logi [1:67] TRUE TRUE FALSE TRUE TRUE TRUE ...
## $ LogId : num [1:67] 1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. Date = col_character(),
## .. WeightKg = col_double(),
## .. WeightPounds = col_double(),
## .. Fat = col_double(),
## .. BMI = col_double(),
## .. IsManualReport = col_logical(),
## .. LogId = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
n_distinct(dailyActivity$Id)
## [1] 33
n_distinct(dailyCalories$Id)
## [1] 33
n_distinct(dailyIntensities$Id)
## [1] 33
n_distinct(dailySteps$Id)
## [1] 33
n_distinct(dailySleep$Id)
## [1] 24
n_distinct(weightLogInfo$Id)
## [1] 8
With expected 30 user participating in the survey, but the daily dataset have 33 unique Id, this means some users might have created more Id. While the dailySleep and weightLogIn dataset have a missing of 6 and 22 participant’s information.
We will check if any duplicates:
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 3
## [1] 0
Knowing duplicates in dailySleep, we can remove them or retrieve only distinct and non_NA rows as follows:
dailyActivity <- dailyActivity %>% distinct() %>% drop_na()
dailyCalories <- dailyCalories %>% distinct() %>% drop_na()
dailyIntensities <- dailyIntensities %>% distinct() %>% drop_na()
dailySteps <- dailySteps %>% distinct() %>% drop_na()
dailySleep <- dailySleep %>% distinct() %>% drop_na()
With weightLogInfo, we will not remove NAs, because this table are input manually by user, and some user may skip inputting their weight information into application.
To ensure column names are consistently using same syntax format, we should clean and format them using lowercase:
dailyActivity <- clean_names(dailyActivity)
dailyCalories <- clean_names(dailyCalories)
dailyIntensities <- clean_names(dailyIntensities)
dailySteps <- clean_names(dailySteps)
dailySleep <- clean_names(dailySleep)
weightLogInfo <- clean_names(weightLogInfo)
With date/time columns, we should convert them as_datetime format and separate into date and time columns if any. We will also add weekday for further analysis.
dailyActivity <- dailyActivity %>%
rename(date=activity_date) %>%
mutate(date = as_date(date, format = "%m/%d/%Y")) %>%
mutate(weekday = weekdays(date)) %>%
mutate(weekday = ordered(weekday, levels=c("Monday", "Tuesday", "Wednesday", "Thursday","Friday", "Saturday", "Sunday")))
dailyCalories <- dailyCalories %>%
rename(date = activity_day) %>%
mutate(date = as_date(date, format = "%m/%d/%Y"))
dailyIntensities <- dailyIntensities %>%
rename(date = activity_day) %>%
mutate(date = as_date(date, format = "%m/%d/%Y"))
dailySteps <- dailySteps %>%
rename(date = activity_day) %>%
mutate(date = as_date(date, format = "%m/%d/%Y"))
dailySleep <- dailySleep %>%
rename(date = sleep_day) %>%
mutate(date = as_date(date, format = "%m/%d/%Y %I:%M:%S %p"))
weightLogInfo <- weightLogInfo %>%
mutate(date=as.POSIXct(date, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())) %>%
separate(date,into=c("date","time"),sep=" ")%>%
mutate(date=as_date(date))
We will join two tables: dailyActivity and dailySleep to analyse the impact of the daily activities to the sleeps.
daily_Activity_Sleep <- merge(dailyActivity,dailySleep,by=c("id","date"), all = TRUE)
# or:
# daily_Activity_Sleep <- dailyActivity %>%
# full_join(dailySleep)
After analyzing the Fitbit trackers data, it’s came up with the following data-driven recommendations for Bellabeat market strategy:
Bellabeat can include function in their application to alert user who tend to have higher non-active minutes. They also can add function to alert if user haven’t worn their tracker devices in period time.
Taking 7,500 steps per day can help to reduce risk for all-cause of mortality as CDC’s recommendation; Bellabeat can add function to promote that benefit and track whether user to met this daily goal.
Bellabeat can include timely notification in Leaf/Time to motivate user to move around regularly to reduce their sedentary minutes.
Bellabeat can add a function to remind user inputting their weight information, then use the relation between sedentary minutes/active minutes and weight to promote an active lifestyle.
Bellabeat can enhance their sleep tracking function to promote the sleep/non-active relation. Tracking and alert user sleep less than 8 hours a day. User can set up the desired time to sleep and Bellabeat application can notify user. Articles or postcast related to sleeping technique or some relaxing music to help user a better sleep are also helpful.
THANK YOU.