Introduction

This is a Capstone Project for the Google Data Analytics Professional Certification.

Bellabeat is a high-tech company that manufactures health-focused smart products that help women easily track their overall health and wellness, and get connected to their body and mind throughout different stages in life.

I will be using the 6 phases of the analysis process (Ask, Prepare, Process, Analyse, Share and Act) to help guide my analysis of the datasets.

Phase 1: Ask

1. Identify the business task:

  1. Analyse smart device usage data
  2. Provide high-quality recommendations for Bellabeat’s marketing strategy

Questions to guide analysis:

  1. What are some of the trends in smart device usage?
  2. How can you apply these trends to Bellabeat customers?
  3. How can these trends help influence Bellabeat’s marketing strategy?

2. Consider key stakeholders

Primary Stakeholder(s):

  • Urska Srsen - Chief Executive Officer (CEO) and co-founder of Bellabeat
  • Sando Mur - Mathematician and co-founder of Bellabeat

Secondary Stakeholder:

  • Bellabeat Marketing Analytics Team

Phases 2 & 3: Prepare & Process

1. Identify the data source:

Dataset: FitBit Fitness Tracker Data (CC0: Public Domain, made available on Kaggle by Mobius). This Kaggle dataset contains personal fitness tracker data from thirty FitBit users. Thirty eligible FitBit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ behaviours and habits. These data were generated by respondents to a distributed survey via Amazon Mechanical Turk between April 2016 and May 2016.

2. Determine the credibility of the data:

I will use the “ROCCC” system to determine the credibility and integrity of the data.

Reliability: This data is not reliable. There is no information about the margin of error, and the small sample size (30 participants) limits the depth of analysis that can be done.

Originality: This is not an original, first-party dataset; it was collected from third-party respondents via Amazon Mechanical Turk.

Comprehensiveness: This data is not comprehensive. There is no information about the participants, such as gender, age, or health status. This could mean the sample was not randomised, and if the data is biased, the insights from the analysis will not generalise to all types of people.

Current: This data was collected in 2016, so it is now outdated and may not represent current trends in smart device usage.

Cited: As stated above, the data was collected via Amazon Mechanical Turk, and we have no information on whether this is a credible source.

The data integrity and credibility are clearly insufficient to provide reliable and comprehensive insights to Bellabeat. Therefore, the following analysis can only provide initial hints and directions, which should be verified through an analysis of a larger and more reliable dataset.

3. Sort and filter the data

For this analysis, I will focus on the daily data, as my aim is to detect high-level trends in smart device usage. I will use the ‘dailyActivity_merged’ and ‘sleepDay_merged’ datasets, as they are likely to give some interesting insights into user behaviour.

NOTE: All my analysis will be completed in RStudio Cloud.

I will start by installing and loading the required R packages for my analysis.

install.packages("tidyverse")
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
install.packages("dplyr")
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
install.packages("readr")
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
install.packages("magrittr")
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
install.packages("tidyr")
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
install.packages("devtools")
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
install.packages("lubridate")
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
install.packages("janitor")
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
install.packages("ggplot2")
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.0     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)
library(readr)
library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
library(tidyr)
library(devtools)
## Loading required package: usethis
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(ggplot2)

Next, I will import the necessary files into R.

daily_activity <- read.csv("dailyActivity_merged.csv")
sleep_day <- read.csv("sleepDay_merged.csv")

Cleaning the datasets

Let’s have a closer look at each dataset.

head(daily_activity)
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366    4/12/2016      13162          8.50            8.50
## 2 1503960366    4/13/2016      10735          6.97            6.97
## 3 1503960366    4/14/2016      10460          6.74            6.74
## 4 1503960366    4/15/2016       9762          6.28            6.28
## 5 1503960366    4/16/2016      12669          8.16            8.16
## 6 1503960366    4/17/2016       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728
head(sleep_day)
##           Id              SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304
##   TotalTimeInBed
## 1            346
## 2            407
## 3            442
## 4            367
## 5            712
## 6            320

From the above tables, I can see that the date and time are combined in a single column in the ‘sleep_day’ dataset. I will use the ‘separate()’ function to split them into two columns.

sleep_day_new <- sleep_day %>% 
  separate(SleepDay, c("Date", "Time"), " ")
## Warning: Expected 2 pieces. Additional pieces discarded in 413 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
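The warning appears because values such as “4/12/2016 12:00:00 AM” split into three pieces on spaces, so the trailing “AM” marker is discarded. Since the timestamps shown above are all midnight, only the date carries information, but the warning could be avoided with the extra = "merge" argument. A minimal sketch, not used in the analysis below:

# Sketch: keep the date in one column and merge the remaining pieces
# ("12:00:00 AM") into a single Time column, silencing the warning above
sleep_day_alt <- sleep_day %>% 
  separate(SleepDay, into = c("Date", "Time"), sep = " ", extra = "merge")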

I will now identify all the columns in each dataframe.

colnames(daily_activity)
##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"
colnames(sleep_day_new)
## [1] "Id"                 "Date"               "Time"              
## [4] "TotalSleepRecords"  "TotalMinutesAsleep" "TotalTimeInBed"

I will now find out how many distinct users there are in each dataframe.

n_distinct(daily_activity$Id)
## [1] 33
n_distinct(sleep_day_new$Id)
## [1] 24

From this, we can see that there are:

  • 33 unique participants in the daily_activity dataframe
  • 24 unique participants in the sleep_day_new dataframe
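The gap between 33 and 24 participants means several users never logged any sleep data. A quick sketch (output not shown) of which Ids appear in the activity data but not in the sleep data:

# Sketch: participants with activity records but no sleep records
setdiff(unique(daily_activity$Id), unique(sleep_day_new$Id))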

I will now check to see how many observations there are in each dataframe.

nrow(daily_activity)
## [1] 940
nrow(sleep_day_new)
## [1] 413

From this, we can see that there are:

  • 940 rows in the daily_activity dataframe
  • 413 rows in the sleep_day_new dataframe

Next, I will check to see if there are any duplicate rows in each dataframe.

nrow(daily_activity[duplicated(daily_activity),])
## [1] 0
nrow(sleep_day_new[duplicated(sleep_day_new),])
## [1] 3

We can see that 3 duplicate rows were found in the sleep_day_new dataframe and will have to be removed.

nrow(sleep_day_new)
## [1] 413
sleep_day_new <- unique(sleep_day_new)
nrow(sleep_day_new)
## [1] 410
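As a cross-check, the janitor package (already loaded) provides get_dupes(), which lists duplicated rows together with a dupe_count column; run on the raw sleep_day dataframe it should surface the same three duplicated records. A sketch:

# Sketch: inspect duplicate rows (all columns considered) before dropping them
get_dupes(sleep_day)

The dplyr function distinct() would also achieve the same result as unique() above.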

While exploring the datasets, I also found a lot of cells with “0” values, so I will omit these rows to prevent skewed results.
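Before dropping them, it is worth counting how many rows are affected; days with zero steps and zero distance almost certainly mean the tracker was not worn. A sketch (to be run before the filter below; output not shown):

# Sketch: count day-level rows that record no steps or no distance
daily_activity %>% 
  filter(TotalSteps == 0 | TotalDistance == 0) %>% 
  nrow()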

# drop days with no recorded steps or distance
daily_activity <- daily_activity %>% 
  filter(TotalSteps != 0, TotalDistance != 0)

Phase 4: Analyse

I will now generate a statistical summary of the key variables in each dataframe.

1. daily_activity dataframe

daily_activity %>% 
  select(TotalSteps,
         TotalDistance,
         VeryActiveMinutes,
         FairlyActiveMinutes,
         LightlyActiveMinutes,
         SedentaryMinutes,
         Calories) %>% 
  summary()
##    TotalSteps    TotalDistance    VeryActiveMinutes FairlyActiveMinutes
##  Min.   :    8   Min.   : 0.010   Min.   :  0.00    Min.   :  0.00     
##  1st Qu.: 4927   1st Qu.: 3.373   1st Qu.:  0.00    1st Qu.:  0.00     
##  Median : 8054   Median : 5.590   Median :  7.00    Median :  8.00     
##  Mean   : 8329   Mean   : 5.986   Mean   : 23.04    Mean   : 14.79     
##  3rd Qu.:11096   3rd Qu.: 7.905   3rd Qu.: 35.00    3rd Qu.: 21.00     
##  Max.   :36019   Max.   :28.030   Max.   :210.00    Max.   :143.00     
##  LightlyActiveMinutes SedentaryMinutes    Calories   
##  Min.   :  0.0        Min.   :   0.0   Min.   :  52  
##  1st Qu.:147.0        1st Qu.: 721.2   1st Qu.:1857  
##  Median :208.5        Median :1020.5   Median :2220  
##  Mean   :210.3        Mean   : 955.2   Mean   :2362  
##  3rd Qu.:272.0        3rd Qu.:1189.0   3rd Qu.:2832  
##  Max.   :518.0        Max.   :1440.0   Max.   :4900

Observations:

  • Average sedentary time was 955.2 minutes, or roughly 16 hours, per day.
  • Average very active and fairly active minutes were 23.04 and 14.79 minutes respectively.
  • Average lightly active minutes were 210.3, or about 3.5 hours.
  • The average number of calories burned per day was around 2,362 kcal.

Deductions:

  • Participants were largely inactive throughout the day.
  • Participants spent a low amount of time exercising.
  • Participants are unlikely to take part in vigorous activities.
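As a follow-up to these deductions, a per-participant summary could separate habitually active users from largely sedentary ones (a sketch; output not shown):

# Sketch: average daily steps and sedentary minutes per participant
daily_activity %>% 
  group_by(Id) %>% 
  summarise(avg_steps = mean(TotalSteps),
            avg_sedentary_minutes = mean(SedentaryMinutes)) %>% 
  arrange(desc(avg_steps))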

2. sleep_day_new dataframe

sleep_day_new %>% 
  select(TotalSleepRecords,
         TotalMinutesAsleep,
         TotalTimeInBed) %>% 
  summary()
##  TotalSleepRecords TotalMinutesAsleep TotalTimeInBed 
##  Min.   :1.00      Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:1.00      1st Qu.:361.0      1st Qu.:403.8  
##  Median :1.00      Median :432.5      Median :463.0  
##  Mean   :1.12      Mean   :419.2      Mean   :458.5  
##  3rd Qu.:1.00      3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :3.00      Max.   :796.0      Max.   :961.0

Observations:

  • Participants spent, on average, 458.5 minutes (about 7.6 hours) in bed.
  • Average sleeping time was 419.2 minutes, or about 7 hours.
  • Participants logged one sleep record per day on average.

Deductions:

  • Participants had an adequate amount of sleep.
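To put this deduction on firmer ground, the share of recorded nights with at least 7 hours of sleep could be computed (a sketch; the 7-hour threshold is the commonly cited adult guideline, and output is not shown here):

# Sketch: proportion of sleep records with at least 7 hours (420 minutes) asleep
sleep_day_new %>% 
  summarise(share_7h_plus = mean(TotalMinutesAsleep >= 420))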

Phase 5: Share

Visualisations

Here, I will create visualisations to find relationships between the variables.

We will first look at the relationship between the total number of steps and calories burned.

Fig 1: TotalSteps vs. Calories

ggplot(data=daily_activity) +
  geom_point(mapping=aes(x=TotalSteps, y=Calories), color="red") +
  geom_smooth(mapping=aes(x=TotalSteps, y=Calories)) +
  labs(title="The Relationship Between Total Steps and Calories", x="Total Steps", y="Calories Burned (kcal)")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

This graph shows a positive correlation between the total amount of steps and the calories burned - the larger the total amount of steps, the more calories burned.
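To back up the visual impression with a number, the Pearson correlation between the two variables could be computed; the same check applies to the plots that follow. A sketch (output not shown):

# Sketch: Pearson correlation between daily steps and calories burned
cor(daily_activity$TotalSteps, daily_activity$Calories)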

Let’s check the relationship between TotalSteps and Sedentary Minutes.

Fig 2: TotalSteps vs. Sedentary Minutes

ggplot(data=daily_activity) +
  geom_point(mapping=aes(x=TotalSteps, y=SedentaryMinutes), color="orange") +
  geom_smooth(mapping=aes(x=TotalSteps, y=SedentaryMinutes)) +
  labs(title="The Relationship Between Total Steps and Sedentary Minutes", x="Total Steps", y="Sedentary Minutes")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

This graph shows a negative correlation between total steps and sedentary minutes - the lower the total steps, the higher the sedentary minutes.

Let’s see what the relationships are like between the active minutes (very, fairly, and lightly) and the number of calories burned.

Fig 3: Active minutes vs. calories burned

ggplot(data=daily_activity) +
  geom_point(mapping=aes(x=VeryActiveMinutes, y=Calories), color="orange") +
  geom_smooth(mapping=aes(x=VeryActiveMinutes, y=Calories)) +
  labs(title="The Relationship Between Very Active Minutes and Calories Burned", x="Very Active Minutes", y="Calories Burned (kcal)")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data=daily_activity) +
  geom_point(mapping=aes(x=FairlyActiveMinutes, y=Calories), color="orange") +
  geom_smooth(mapping=aes(x=FairlyActiveMinutes, y=Calories)) +
  labs(title="The Relationship Between Fairly Minutes and Calories Burned", x="Failry Active Minutes", y="Calories Burned (kcal)")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data=daily_activity) +
  geom_point(mapping=aes(x=LightlyActiveMinutes, y=Calories), color="orange") +
  geom_smooth(mapping=aes(x=LightlyActiveMinutes, y=Calories)) +
  labs(title="The Relationship Between Lightly Active Minutes and Calories Burned", x="Lightly Active Minutes", y="Calories Burned (kcal)")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

From these graphs, we can clearly see positive relationships between both very active minutes and lightly active minutes and the calories burned. However, there seems to be a negative relationship between fairly active minutes and the number of calories burned. We can also see that, in this sample, more calories were burned by people who did lighter activities than by those who were very and/or fairly active.
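As an alternative to three separate charts, the activity-minute columns could be reshaped to long format and drawn as a single faceted plot, which makes the three relationships easier to compare side by side. A sketch:

# Sketch: reshape the three activity-minute columns and facet by intensity
daily_activity %>% 
  pivot_longer(cols = c(VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes),
               names_to = "Intensity", values_to = "Minutes") %>% 
  ggplot(aes(x = Minutes, y = Calories)) +
  geom_point(color = "orange") +
  geom_smooth() +
  facet_wrap(~Intensity, scales = "free_x") +
  labs(title = "Active Minutes vs. Calories Burned by Intensity",
       x = "Active Minutes", y = "Calories Burned (kcal)")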

Next, we will look at the relationship between the total minutes asleep and total minutes in bed.

Fig 4: Total minutes asleep vs. total time in bed

ggplot(data=sleep_day_new, mapping=aes(x=TotalMinutesAsleep, y=TotalTimeInBed)) +
  geom_point(color="purple") +
  geom_smooth() +
  labs(title="The Relationship Between the Total Minutes Asleep and the Total Time in Bed", x="Total Minutes Asleep", y="Total Time in Bed (min)")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

As you can see from the graph above, there is a positive correlation between the total minutes asleep and the total time spent in bed. Using this data, Bellabeat could add an app feature that notifies customers when it is time to go to bed so that they get an adequate amount of sleep.
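One derived metric that could support such a bedtime-reminder feature is sleep efficiency, the share of time in bed actually spent asleep; the column name sleep_efficiency below is my own. A sketch (output not shown):

# Sketch: sleep efficiency = minutes asleep / minutes in bed
sleep_day_new %>% 
  mutate(sleep_efficiency = TotalMinutesAsleep / TotalTimeInBed) %>% 
  summarise(avg_sleep_efficiency = mean(sleep_efficiency))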

Phase 6: Act

Recommendations for Bellabeat Marketing Strategy:

  1. Based on the activity levels and amount of calories burned, users appear to burn more calories with more exercise. Therefore, Bellabeat should encourage users to exercise more through reminders. They could also offer app incentives, such as giving users app credits for every 1,000 steps, which can then be redeemed for prizes or vouchers.

  2. The data also shows many people lead either a lightly active or sedentary lifestyle, which may be due to the nature of their work or the lack of time to exercise. Bellabeat could have a section on their app for short workout videos or short exercises (for example, 10 minute videos) that their customers can follow along to if they don’t necessarily want to exercise alone.

  3. To encourage better sleeping habits, Bellabeat could incorporate app reminders that notify users of the best time to go to sleep and wake up, so that they get an adequate amount of sleep and feel refreshed in the morning. The app could also automatically turn on ‘do not disturb’ and ‘night’ modes on the customer’s phone so that the user is not disturbed by messages or phone calls from family and friends.

Recommendations based on the limitations of the dataset:

  1. Use a larger sample size to improve the statistical significance of the analysis.

  2. Collect tracking data over a longer period, ideally six months to a year, to account for behavioural changes across seasons.

  3. Obtain current data to better reflect current consumer behaviours and/or trends in smart device usage.

  4. Collect data from internal sources (if possible) and/or from primary/secondary data sources to increase the credibility and reliability of the datasets.