This is a Capstone Project for the Google Data Analytics Professional Certification.
Bellabeat is a high-tech company that manufactures health-focused smart products that help women easily track their overall health and wellness, and get connected to their body and mind throughout different stages in life.
I will be using the 6 phases of the analysis process (Ask, Prepare, Process, Analyse, Share and Act) to help guide my analysis of the datasets.
1. Identify the business task:
Questions to guide analysis:
2. Consider key stakeholders
Primary Stakeholder(s):
Secondary Stakeholder:
1. Identify the data source:
Dataset: FitBit Fitness Tracker Data (CC0: Public Domain, made available on Kaggle by Mobius). This Kaggle dataset contains personal fitness tracker data from thirty eligible FitBit users who consented to the submission of their data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ behaviours and habits. The data was generated by respondents to a survey distributed via Amazon Mechanical Turk between April and May 2016.
2. Determine the credibility of the data:
I will use the “ROCCC” system to determine the credibility and integrity of the data.
Reliability: This data is not reliable. There is no information about the margin of error, and the small sample size (30 participants) limits how much analysis can be done.
Originality: This is not an original dataset; it was collected by a third party through a survey distributed via Amazon Mechanical Turk.
Comprehensiveness: This data is not comprehensive. There is no information about the participants, such as gender, age, or health status, so we cannot tell whether the sample is representative. If the data is biased, the insights from the analysis will not apply fairly to all types of people.
Current: This data was collected in 2016, so it is outdated and may not represent current trends in smart device usage.
Cited: As stated before, the data was collected through a survey distributed via Amazon Mechanical Turk, and we have no information on whether this is a credible source.
The data’s integrity and credibility are clearly insufficient to provide reliable and comprehensive insights to Bellabeat. Therefore, the following analysis can only provide initial hints and directions, which should be verified through an analysis of a larger and much more reliable dataset.
3. Sort and filter the data
For this analysis, I will focus on the daily data, as my goal is to detect high-level trends in smart device usage. I will use the ‘dailyActivity_merged’ and ‘sleepDay_merged’ datasets, as they are the most likely to give interesting insights into user behaviour.
NOTE: All my analysis will be completed in RStudio Cloud.
I will start by installing and loading the required R packages for my analysis.
install.packages("tidyverse")
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
install.packages("dplyr")
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
install.packages("readr")
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
install.packages("magrittr")
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
install.packages("tidyr")
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
install.packages("devtools")
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
install.packages("lubridate")
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
install.packages("janitor")
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
install.packages("ggplot2")
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.0 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(readr)
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
library(tidyr)
library(devtools)
## Loading required package: usethis
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(ggplot2)
Next, I will import the necessary files into R.
daily_activity <- read.csv("dailyActivity_merged.csv")
sleep_day <- read.csv("sleepDay_merged.csv")
Let’s have a closer look at each dataset.
head(daily_activity)
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 4/12/2016 13162 8.50 8.50
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 1.88 0.55
## 2 0 1.57 0.69
## 3 0 2.44 0.40
## 4 0 2.14 1.26
## 5 0 2.71 0.41
## 6 0 3.19 0.78
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 6.06 0 25
## 2 4.71 0 21
## 3 3.91 0 30
## 4 2.83 0 29
## 5 5.04 0 36
## 6 2.51 0 38
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 13 328 728 1985
## 2 19 217 776 1797
## 3 11 181 1218 1776
## 4 34 209 726 1745
## 5 10 221 773 1863
## 6 20 164 539 1728
head(sleep_day)
## Id SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM 1 327
## 2 1503960366 4/13/2016 12:00:00 AM 2 384
## 3 1503960366 4/15/2016 12:00:00 AM 1 412
## 4 1503960366 4/16/2016 12:00:00 AM 2 340
## 5 1503960366 4/17/2016 12:00:00 AM 1 700
## 6 1503960366 4/19/2016 12:00:00 AM 1 304
## TotalTimeInBed
## 1 346
## 2 407
## 3 442
## 4 367
## 5 712
## 6 320
From the above tables, I can see that the date and time are combined in a single column in the ‘sleep_day’ dataset. I will use the ‘separate()’ function to split them into two columns.
sleep_day_new <- sleep_day %>%
separate(SleepDay, c("Date", "Time"), " ")
## Warning: Expected 2 pieces. Additional pieces discarded in 413 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
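The warning above is expected: each ‘SleepDay’ value actually contains three space-separated pieces (the date, the time, and an AM/PM marker), and the third piece is simply discarded. If I wanted to avoid the warning altogether, an alternative would be to parse the timestamp with lubridate (loaded above) and keep only the date. This is a sketch only and is not used in the rest of this analysis; the name ‘sleep_day_alt’ is purely for illustration.
sleep_day_alt <- sleep_day %>%
  # parse strings like "4/12/2016 12:00:00 AM" and keep only the date part
  mutate(Date = as_date(mdy_hms(SleepDay)))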
I will now identify all the columns in each dataframe.
colnames(daily_activity)
## [1] "Id" "ActivityDate"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories"
colnames(sleep_day_new)
## [1] "Id" "Date" "Time"
## [4] "TotalSleepRecords" "TotalMinutesAsleep" "TotalTimeInBed"
I will now find out how many distinct users there are in each dataframe.
n_distinct(daily_activity$Id)
## [1] 33
n_distinct(sleep_day_new$Id)
## [1] 24
From this, we can see that there are 33 distinct users in the daily_activity dataframe and 24 distinct users in the sleep_day_new dataframe.
I will now check to see how many observations there are in each dataframe.
nrow(daily_activity)
## [1] 940
nrow(sleep_day_new)
## [1] 413
From this, we can see that there are 940 observations in the daily_activity dataframe and 413 observations in the sleep_day_new dataframe.
Next, I will check to see if there are any duplicate rows in each dataframe.
nrow(daily_activity[duplicated(daily_activity),])
## [1] 0
nrow(sleep_day_new[duplicated(sleep_day_new),])
## [1] 3
We can see that 3 duplicate rows were found in the sleep_day_new dataframe and will have to be removed.
nrow(sleep_day_new)
## [1] 413
sleep_day_new <- unique(sleep_day_new)
nrow(sleep_day_new)
## [1] 410
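Equivalently, dplyr’s distinct() (attached above as part of the tidyverse) removes duplicate rows; a one-line alternative sketch:
sleep_day_new <- distinct(sleep_day_new)  # keeps only unique rows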
While exploring the datasets, I also found a lot of cells with “0” values, so I will remove these rows to prevent skewed results.
daily_activity <- daily_activity %>% filter(TotalSteps !=0)
daily_activity <- daily_activity %>% filter(TotalDistance !=0)
I will now look at a statistical summary of the variables in each dataframe.
1. daily_activity dataframe
daily_activity %>%
select(TotalSteps,
TotalDistance,
VeryActiveMinutes,
FairlyActiveMinutes,
LightlyActiveMinutes,
SedentaryMinutes,
Calories) %>%
summary()
## TotalSteps TotalDistance VeryActiveMinutes FairlyActiveMinutes
## Min. : 8 Min. : 0.010 Min. : 0.00 Min. : 0.00
## 1st Qu.: 4927 1st Qu.: 3.373 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 8054 Median : 5.590 Median : 7.00 Median : 8.00
## Mean : 8329 Mean : 5.986 Mean : 23.04 Mean : 14.79
## 3rd Qu.:11096 3rd Qu.: 7.905 3rd Qu.: 35.00 3rd Qu.: 21.00
## Max. :36019 Max. :28.030 Max. :210.00 Max. :143.00
## LightlyActiveMinutes SedentaryMinutes Calories
## Min. : 0.0 Min. : 0.0 Min. : 52
## 1st Qu.:147.0 1st Qu.: 721.2 1st Qu.:1857
## Median :208.5 Median :1020.5 Median :2220
## Mean :210.3 Mean : 955.2 Mean :2362
## 3rd Qu.:272.0 3rd Qu.:1189.0 3rd Qu.:2832
## Max. :518.0 Max. :1440.0 Max. :4900
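To put these minute figures in context, the mean minutes in each activity category can be expressed as a share of the average tracked day. This is a quick sketch using the cleaned daily_activity data; the column names ‘Category’, ‘MeanMinutes’ and ‘Share’ are introduced here purely for illustration.
daily_activity %>%
  # average minutes per day in each activity category
  summarise(across(c(VeryActiveMinutes, FairlyActiveMinutes,
                     LightlyActiveMinutes, SedentaryMinutes), mean)) %>%
  pivot_longer(everything(), names_to = "Category", values_to = "MeanMinutes") %>%
  # share of the average tracked day spent in each category
  mutate(Share = MeanMinutes / sum(MeanMinutes))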
Observations:
Deductions:
2. sleep_day_new dataframe
sleep_day_new %>%
select(TotalSleepRecords,
TotalMinutesAsleep,
TotalTimeInBed) %>%
summary()
## TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## Min. :1.00 Min. : 58.0 Min. : 61.0
## 1st Qu.:1.00 1st Qu.:361.0 1st Qu.:403.8
## Median :1.00 Median :432.5 Median :463.0
## Mean :1.12 Mean :419.2 Mean :458.5
## 3rd Qu.:1.00 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :3.00 Max. :796.0 Max. :961.0
Observations:
Deductions:
Recommendations for Bellabeat Marketing Strategy:
Based on activity levels and calories burned, users appear to burn more calories the more they exercise. Therefore, Bellabeat should encourage users to exercise more through reminders. They could also offer in-app incentives, such as giving users app credits for every 1,000 steps, which can then be redeemed for prizes or vouchers. A quick sanity check of this relationship is sketched below.
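A scatter plot of total steps against calories burned, with a fitted trend line, is one way to check this relationship. This is a sketch only, and any trend it suggests should be verified on a larger dataset.
ggplot(daily_activity, aes(x = TotalSteps, y = Calories)) +
  geom_point(alpha = 0.5) +      # one point per user-day
  geom_smooth(method = "lm") +   # linear trend line
  labs(title = "Total Steps vs Calories Burned",
       x = "Total Steps", y = "Calories")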
The data also shows that many people lead either a lightly active or sedentary lifestyle, which may be due to the nature of their work or a lack of time to exercise. Bellabeat could add a section to their app with short workout videos or exercises (for example, 10-minute videos) that customers can follow along with if they don’t want to exercise alone.
To encourage better sleeping habits, Bellabeat could incorporate app reminders that notify users of the best times to go to sleep and wake up in order to get an adequate amount of sleep and feel refreshed in the morning. The app could also automatically turn on ‘do not disturb’ and ‘night mode’ on customers’ phones so that users are not disturbed by messages or calls from family and friends.
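As a rough check behind this recommendation, the average time users spend in bed without being asleep can be estimated from sleep_day_new. This is a sketch; the column name ‘MinutesAwakeInBed’ is introduced here purely for illustration.
sleep_day_new %>%
  # time spent in bed but not asleep, per sleep record
  mutate(MinutesAwakeInBed = TotalTimeInBed - TotalMinutesAsleep) %>%
  summarise(AvgMinutesAsleep     = mean(TotalMinutesAsleep),
            AvgMinutesAwakeInBed = mean(MinutesAwakeInBed))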
Recommendations based on the limitations of the dataset:
Use a larger sample size in order to improve the statistical significance of the analysis.
Collect tracking data over a longer period, ideally 6 months to a year, to account for behavioural changes across seasons.
Obtain current data in order to better reflect current consumer behaviours and/or trends in smart device usage.
Collect data from internal sources (if possible) and/or from primary/secondary data sources to increase the credibility and reliability of the datasets.