About

Part of the Google Data Analytics Professional Certificate course (Capstone Project).

Click Here to Download FitBit Fitness Tracker Data

See my FitBit Fitness Tracker Data Analysis on Kaggle: Click Here

This is a fictional data analytics case study whose purpose is to gain insights that improve business decisions for Bellabeat, a company focused on building products designed to gather information about a woman’s overall activity, sleep schedules, stress and reproductive health.

This analysis follows the same structure that was learned throughout this course, which is the following:

Ask
Prepare
Process
Analyze
Share
Act

For this fictional case study, I have joined Bellabeat’s marketing analytics team, which is responsible for collecting, analyzing and reporting data that helps guide Bellabeat’s marketing strategy. This report is therefore written from the point of view of a (fictional) team.

Ask

In the Ask phase, we (the fictional team) should focus on the problem we are trying to solve and on how our insights can drive business decisions.

Business Task

Bellabeat wants to gather insights regarding the following business tasks:

What are some trends in smart device usage?
How could these trends apply to Bellabeat customers?
How could these trends help influence Bellabeat marketing strategy?

Key Stakeholders

Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team

How our insights will help

By analysing the activity of a particular set of customers (who gave consent to share their activity data), we can see which features are used most or have the greatest impact on a healthy lifestyle for those customers. With these insights, Bellabeat’s team can focus on improving or spotlighting these features in marketing.

Prepare

For this step, we’ll take a closer look at the dataset that will be used for this case study and that will help us answer the business tasks above. We’ll give a brief overview of particular aspects of this dataset: how the data is stored, how it’s organized, bias/credibility issues (ROCCC), licensing, privacy, security, accessibility, data integrity, how the data helps answer our questions, and whether there are any problems with the data. Finally, this phase also includes selecting or filtering the available data, and any sorting if necessary.

Data overview

The data FitBit Fitness Tracker Data is available in Kaggle. As it is described in the dataset’s page: “These datasets were generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Individual reports can be parsed by export session ID (column A) or timestamp (column B). Variation between output represents use of different types of Fitbit trackers and individual tracking behaviors / preferences.”

Storage: The data is currently stored on Kaggle’s platform and will also be downloaded to a local file system for analysis and cleanup.
Organized: The dataset consists of 18 CSV files:
- dailyActivity_merged.csv
- dailyCalories_merged.csv
- dailyIntensities_merged.csv
- dailySteps_merged.csv
- heartrate_seconds_merged.csv
- hourlyCalories_merged.csv
- hourlyIntensities_merged.csv
- hourlySteps_merged.csv
- minuteCaloriesNarrow_merged.csv
- minuteCaloriesWide_merged.csv
- minuteIntensitiesNarrow_merged.csv
- minuteIntensitiesWide_merged.csv
- minuteMETsNarrow_merged.csv
- minuteSleep_merged.csv
- minuteStepsNarrow_merged.csv
- minuteStepsWide_merged.csv
- sleepDay_merged.csv
- weightLogInfo_merged.csv
Bias/Credibility (ROCCC):
- Reliable: Is the data accurate, complete, and unbiased?
  - We can’t be entirely sure that it is.
- Original: Can we locate the original data source?
  - Yes, as described in the dataset’s metadata.
- Comprehensive: Does it have the information needed to find the solution?
  - Yes; by looking at the device data, we can see the usage trends, which is what we are trying to discover.
- Current: Is it current?
  - No, all the data is from 2016.
- Cited: Is the source cited?
  - Yes, as described in the dataset’s metadata.
Licence, privacy, security, accessibility, data integrity: According to the dataset’s description, thirty eligible Fitbit users consented to the submission of personal tracker data. According to the metadata, the data was made public under the following license: CC0: Public Domain.
Problems with the data: The data is not current (from 2016), it has a relatively small sample size (30 customers), and we cannot be sure that it is actually representative of women or free of other bias.

Data Selection

The data is distributed across multiple files, and the observations (rows) in each file are recorded at different time intervals. Some are coarse, such as dailyActivity_merged.csv, where each observation covers one day, while others are fine-grained, such as heartrate_seconds_merged.csv, where each observation is measured by the second. To get a general view of the usage trends of Bellabeat’s customers, we will focus on the files with the coarsest intervals, which are the following:

dailyActivity_merged.csv
dailyCalories_merged.csv
dailyIntensities_merged.csv
dailySteps_merged.csv
sleepDay_merged.csv
weightLogInfo_merged.csv

However, looking into these datasets, we can see that dailyActivity_merged.csv already contains the data in the dailyCalories, dailyIntensities and dailySteps files. Also, a quick look at the weightLogInfo_merged.csv file shows that it only contains weight information for 9 customers, so this file won’t be used either. We end up with the following files:

dailyActivity_merged.csv
sleepDay_merged.csv

Process and Analysis

We will combine both steps (Process and Analysis) into this one section.

Process: In the Process phase, the key tasks include checking the data for errors, choosing appropriate tools to do so, transforming the data so we can work with it effectively, and documenting the entire cleaning process. The tool we chose to clean the data was R.
Analysis: The Analysis phase consists of aggregating the data so it’s useful and accessible, organizing and formatting the data, performing any necessary calculations, and identifying trends and relationships.

As described in the previous phase, the files left for us to work with are dailyActivity_merged.csv and sleepDay_merged.csv. There may be a correlation between customers’ sleep behaviours and their overall activity, so the next step is to merge the two files so we have a single data frame to work with.

library(tidyverse)
library(lubridate)
library(skimr)

Loading Files:

dailyActivities <- read_csv("data/dailyActivity_merged.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   Id = col_double(),
##   ActivityDate = col_character(),
##   TotalSteps = col_double(),
##   TotalDistance = col_double(),
##   TrackerDistance = col_double(),
##   LoggedActivitiesDistance = col_double(),
##   VeryActiveDistance = col_double(),
##   ModeratelyActiveDistance = col_double(),
##   LightActiveDistance = col_double(),
##   SedentaryActiveDistance = col_double(),
##   VeryActiveMinutes = col_double(),
##   FairlyActiveMinutes = col_double(),
##   LightlyActiveMinutes = col_double(),
##   SedentaryMinutes = col_double(),
##   Calories = col_double()
## )

sleepActivities  <- read_csv("data/sleepDay_merged.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   Id = col_double(),
##   SleepDay = col_character(),
##   TotalSleepRecords = col_double(),
##   TotalMinutesAsleep = col_double(),
##   TotalTimeInBed = col_double()
## )

head(dailyActivities)

## # A tibble: 6 x 15
##        Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie~
##     <dbl> <chr>             <dbl>         <dbl>           <dbl>            <dbl>
## 1  1.50e9 4/12/2016         13162          8.5             8.5                 0
## 2  1.50e9 4/13/2016         10735          6.97            6.97                0
## 3  1.50e9 4/14/2016         10460          6.74            6.74                0
## 4  1.50e9 4/15/2016          9762          6.28            6.28                0
## 5  1.50e9 4/16/2016         12669          8.16            8.16                0
## 6  1.50e9 4/17/2016          9705          6.48            6.48                0
## # ... with 9 more variables: VeryActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>

head(sleepActivities)

## # A tibble: 6 x 5
##          Id SleepDay           TotalSleepRecor~ TotalMinutesAsle~ TotalTimeInBed
##       <dbl> <chr>                         <dbl>             <dbl>          <dbl>
## 1    1.50e9 4/12/2016 12:00:0~                1               327            346
## 2    1.50e9 4/13/2016 12:00:0~                2               384            407
## 3    1.50e9 4/15/2016 12:00:0~                1               412            442
## 4    1.50e9 4/16/2016 12:00:0~                2               340            367
## 5    1.50e9 4/17/2016 12:00:0~                1               700            712
## 6    1.50e9 4/19/2016 12:00:0~                1               304            320

As we can see, we have the Id and ActivityDate columns in the dailyActivities data frame, and the Id and SleepDay columns in the sleepActivities data frame. These columns will be used for the merge, but first we need to make a few changes. R has read both date-related columns as character vectors (chr). Also, ActivityDate represents a date, while SleepDay is a date-time. The time component is unnecessary, however, as every row shows the exact same time (12 AM). So the next steps will be to:

Change the format of ActivityDate and SleepDay to a date-related format usable by R (using the lubridate package).
Rename both date columns to have the same name (this is unnecessary for the merge, since you can specify the columns, but it is clearer to do so).
Merge both data frames.
Take a quick look at the final merge result.

Step 1: Changing the format of columns ActivityDate and SleepDay

Using the lubridate library, we can use the mdy function to parse the dates (mdy because the current date format is month/day/year), and mdy_hms for SleepDay, which also carries a time.

dailyActivities <- mutate(dailyActivities, ActivityDate = mdy(ActivityDate))
sleepActivities <- mutate(sleepActivities, SleepDay = mdy_hms(SleepDay))

Both columns are now proper date types in the default format (year-month-day), which is much more readable. Let’s quickly recheck:

head(dailyActivities)

## # A tibble: 6 x 15
##        Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie~
##     <dbl> <date>            <dbl>         <dbl>           <dbl>            <dbl>
## 1  1.50e9 2016-04-12        13162          8.5             8.5                 0
## 2  1.50e9 2016-04-13        10735          6.97            6.97                0
## 3  1.50e9 2016-04-14        10460          6.74            6.74                0
## 4  1.50e9 2016-04-15         9762          6.28            6.28                0
## 5  1.50e9 2016-04-16        12669          8.16            8.16                0
## 6  1.50e9 2016-04-17         9705          6.48            6.48                0
## # ... with 9 more variables: VeryActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>

head(sleepActivities)

## # A tibble: 6 x 5
##         Id SleepDay            TotalSleepRecords TotalMinutesAsl~ TotalTimeInBed
##      <dbl> <dttm>                          <dbl>            <dbl>          <dbl>
## 1   1.50e9 2016-04-12 00:00:00                 1              327            346
## 2   1.50e9 2016-04-13 00:00:00                 2              384            407
## 3   1.50e9 2016-04-15 00:00:00                 1              412            442
## 4   1.50e9 2016-04-16 00:00:00                 2              340            367
## 5   1.50e9 2016-04-17 00:00:00                 1              700            712
## 6   1.50e9 2016-04-19 00:00:00                 1              304            320
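A minimal sketch of what the two lubridate parsers return for a single value, using strings in the same format as the raw columns. The variable names are illustrative only, and the last step (as_date) is an optional extra that is not part of the pipeline above: it shows one way to drop the constant midnight time if a plain date is preferred.

```r
library(lubridate)

# Sample strings in the same format as ActivityDate and SleepDay
activity_raw <- "4/12/2016"
sleep_raw <- "4/12/2016 12:00:00 AM"

activity_date <- mdy(activity_raw)   # parsed as a Date
sleep_dttm <- mdy_hms(sleep_raw)     # parsed as a POSIXct date-time (UTC by default)

# Optional: drop the time component to get a plain Date
sleep_date <- as_date(sleep_dttm)

print(activity_date)
print(sleep_dttm)
print(sleep_date)
```

Once the time is dropped, both values refer to the same calendar day, which is what the merge by Id and Date relies on.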

Step 2: Renaming columns ActivityDate and SleepDay to simply Date

dailyActivities <- rename(dailyActivities, Date = ActivityDate)
sleepActivities <- sleepActivities %>% rename(Date = SleepDay)
glimpse(dailyActivities)

## Rows: 940
## Columns: 15
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036~
## $ Date                     <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-04-~
## $ TotalSteps               <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019~
## $ TotalDistance            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5~
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3~
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0~
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveMinutes        <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4~
## $ FairlyActiveMinutes      <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21~
## $ LightlyActiveMinutes     <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, ~
## $ SedentaryMinutes         <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818~
## $ Calories                 <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203~

glimpse(sleepActivities)

## Rows: 413
## Columns: 5
## $ Id                 <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150~
## $ Date               <dttm> 2016-04-12, 2016-04-13, 2016-04-15, 2016-04-16, 20~
## $ TotalSleepRecords  <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ TotalMinutesAsleep <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2~
## $ TotalTimeInBed     <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3~

Step 3: Merging both data frames.

Before merging the data frames, let’s check whether all the customers in one data frame are also present in the other. We can do this simply by counting the number of distinct Ids in both data frames.

n_distinct(dailyActivities$Id)

## [1] 33

n_distinct(sleepActivities$Id)

## [1] 24

Apparently not all customers have sleep data: the dailyActivities data frame has 33 customers, while sleepActivities has only 24. To handle this, we decided to use the dplyr function inner_join. This function merges both data frames by Id and Date and only keeps observations (rows) that are present in both data frames.

mergeActivity <- inner_join(dailyActivities, sleepActivities, by=c("Id", "Date"))
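To make the effect of the inner join concrete, here is a small sketch on toy data. The values and the customers 1 to 3 are hypothetical; only the column names follow the data frames above. It also uses dplyr's anti_join, which is not part of the original analysis, to list the activity rows that the inner join drops.

```r
library(dplyr)

# Toy data (hypothetical values): customer 3 has no sleep records at all,
# and customer 1 has no sleep record for 2016-04-13
activity <- tibble(
  Id = c(1, 1, 2, 3),
  Date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12", "2016-04-12")),
  TotalSteps = c(13162, 10735, 8000, 5000)
)
sleep <- tibble(
  Id = c(1, 2),
  Date = as.Date(c("2016-04-12", "2016-04-12")),
  TotalMinutesAsleep = c(327, 400)
)

# Rows present in both data frames (matched on Id and Date)
merged <- inner_join(activity, sleep, by = c("Id", "Date"))

# Activity rows with no matching sleep record, i.e. what the inner join drops
dropped <- anti_join(activity, sleep, by = c("Id", "Date"))

print(merged)
print(dropped)
```

Running the same anti_join on dailyActivities and sleepActivities would show which customers and days are lost in mergeActivity.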

Step 4: Looking into the final merge result

Let’s now get an overview of the final data frame we will work with, using the skimr library. But first, let’s add one more column to our new data frame to see who’s been lazy or has insomnia! It may correlate with overall workout activity.

mergeActivity <- mutate(mergeActivity, TimeAwakeInBed =  TotalTimeInBed - TotalMinutesAsleep)
n_distinct(mergeActivity$Id)

## [1] 24

glimpse(mergeActivity)

## Rows: 413
## Columns: 19
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036~
## $ Date                     <dttm> 2016-04-12, 2016-04-13, 2016-04-15, 2016-04-~
## $ TotalSteps               <dbl> 13162, 10735, 9762, 12669, 9705, 15506, 10544~
## $ TotalDistance            <dbl> 8.50, 6.97, 6.28, 8.16, 6.48, 9.88, 6.68, 6.3~
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.28, 8.16, 6.48, 9.88, 6.68, 6.3~
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.14, 2.71, 3.19, 3.53, 1.96, 1.3~
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 1.26, 0.41, 0.78, 1.32, 0.48, 0.3~
## $ LightActiveDistance      <dbl> 6.06, 4.71, 2.83, 5.04, 2.51, 5.03, 4.24, 4.6~
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveMinutes        <dbl> 25, 21, 29, 36, 38, 50, 28, 19, 41, 39, 73, 3~
## $ FairlyActiveMinutes      <dbl> 13, 19, 34, 10, 20, 31, 12, 8, 21, 5, 14, 23,~
## $ LightlyActiveMinutes     <dbl> 328, 217, 209, 221, 164, 264, 205, 211, 262, ~
## $ SedentaryMinutes         <dbl> 728, 776, 726, 773, 539, 775, 818, 838, 732, ~
## $ Calories                 <dbl> 1985, 1797, 1745, 1863, 1728, 2035, 1786, 177~
## $ TotalSleepRecords        <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ TotalMinutesAsleep       <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, ~
## $ TotalTimeInBed           <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, ~
## $ TimeAwakeInBed           <dbl> 19, 23, 30, 27, 12, 16, 17, 39, 23, 19, 46, 2~

skim(mergeActivity)

Data summary
Name	mergeActivity
Number of rows	413
Number of columns	19
_______________________
Column type frequency:
numeric	18
POSIXct	1
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Id	1	5.000979e+09	2.06036e+09	1.50396e+09	3.977334e+09	4.702922e+09	6.962181e+09	8.79201e+09	▆▆▇▅▃
TotalSteps	1	8.541140e+03	4.15693e+03	1.70000e+01	5.206000e+03	8.925000e+03	1.139300e+04	2.27700e+04	▅▆▇▂▁
TotalDistance	1	6.040000e+00	3.05000e+00	1.00000e-02	3.600000e+00	6.290000e+00	8.030000e+00	1.75400e+01	▅▇▇▁▁
TrackerDistance	1	6.030000e+00	3.05000e+00	1.00000e-02	3.600000e+00	6.290000e+00	8.020000e+00	1.75400e+01	▅▇▇▁▁
LoggedActivitiesDistance	1	1.100000e-01	5.10000e-01	0.00000e+00	0.000000e+00	0.000000e+00	0.000000e+00	4.08000e+00	▇▁▁▁▁
VeryActiveDistance	1	1.450000e+00	1.99000e+00	0.00000e+00	0.000000e+00	5.700000e-01	2.370000e+00	1.25400e+01	▇▂▁▁▁
ModeratelyActiveDistance	1	7.500000e-01	1.00000e+00	0.00000e+00	0.000000e+00	4.200000e-01	1.040000e+00	6.48000e+00	▇▂▁▁▁
LightActiveDistance	1	3.810000e+00	1.73000e+00	1.00000e-02	2.540000e+00	3.680000e+00	4.930000e+00	9.48000e+00	▂▇▆▂▁
SedentaryActiveDistance	1	0.000000e+00	1.00000e-02	0.00000e+00	0.000000e+00	0.000000e+00	0.000000e+00	1.10000e-01	▇▁▁▁▁
VeryActiveMinutes	1	2.519000e+01	3.63900e+01	0.00000e+00	0.000000e+00	9.000000e+00	3.800000e+01	2.10000e+02	▇▂▁▁▁
FairlyActiveMinutes	1	1.804000e+01	2.24000e+01	0.00000e+00	0.000000e+00	1.100000e+01	2.700000e+01	1.43000e+02	▇▂▁▁▁
LightlyActiveMinutes	1	2.168500e+02	8.71600e+01	2.00000e+00	1.580000e+02	2.080000e+02	2.630000e+02	5.18000e+02	▂▇▇▂▁
SedentaryMinutes	1	7.121700e+02	1.65960e+02	0.00000e+00	6.310000e+02	7.170000e+02	7.830000e+02	1.26500e+03	▁▁▇▃▁
Calories	1	2.397570e+03	7.62890e+02	2.57000e+02	1.850000e+03	2.220000e+03	2.926000e+03	4.90000e+03	▁▇▇▃▁
TotalSleepRecords	1	1.120000e+00	3.50000e-01	1.00000e+00	1.000000e+00	1.000000e+00	1.000000e+00	3.00000e+00	▇▁▁▁▁
TotalMinutesAsleep	1	4.194700e+02	1.18340e+02	5.80000e+01	3.610000e+02	4.330000e+02	4.900000e+02	7.96000e+02	▁▂▇▃▁
TotalTimeInBed	1	4.586400e+02	1.27100e+02	6.10000e+01	4.030000e+02	4.630000e+02	5.260000e+02	9.61000e+02	▁▃▇▁▁
TimeAwakeInBed	1	3.917000e+01	4.65700e+01	0.00000e+00	1.700000e+01	2.500000e+01	4.000000e+01	3.71000e+02	▇▁▁▁▁

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
Date	0	1	2016-04-12	2016-05-12	2016-04-27	31

Let’s check whether there are any duplicates in the final data frame:

sum(duplicated(mergeActivity))

## [1] 3

# Removing duplicates from the data frame
mergeActivity <- distinct(mergeActivity)
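The duplicate check and removal above can be tried on a toy data frame first. This is a sketch with made-up values; duplicated() flags rows that are identical to an earlier row in every column, and distinct() keeps the first copy of each.

```r
library(dplyr)

# Toy data (hypothetical values): the second row repeats the first exactly
toy <- tibble(
  Id = c(1, 1, 2, 2),
  Date = as.Date(c("2016-04-12", "2016-04-12", "2016-04-12", "2016-04-13")),
  TotalMinutesAsleep = c(327, 327, 400, 410)
)

n_before <- nrow(toy)
n_dupes <- sum(duplicated(toy))  # rows identical to an earlier row
toy_clean <- distinct(toy)       # keeps one copy of each distinct row

print(n_dupes)
print(nrow(toy_clean))
```

A useful sanity check after cleaning is that the row count fell by exactly the number of duplicates found.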

Exploratory Visualizations

Let’s now do some obvious exploratory visualizations in R between variables that we expect to be correlated.

Relationship between TotalSteps and TotalDistance:

ggplot(data = mergeActivity, aes(x=TotalSteps, y=TotalDistance)) +
    geom_point(color = "red") + labs(title = "Relationship between TotalSteps and TotalDistance")

Relationship between TotalSteps and Calories

ggplot(data = mergeActivity, aes(x=TotalSteps, y=Calories)) +
    geom_point(color = "red") + labs(title = "Relationship between TotalSteps and Calories")
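The scatter plots can be complemented with a correlation coefficient. The sketch below uses only the first six rows shown in the head(dailyActivities) output, so it illustrates the call rather than giving a result for the whole dataset; the variable names are illustrative.

```r
# First six rows of dailyActivities, copied from the head() output above
total_steps <- c(13162, 10735, 10460, 9762, 12669, 9705)
total_distance <- c(8.50, 6.97, 6.74, 6.28, 8.16, 6.48)

# Pearson correlation coefficient (ranges from -1 to 1)
r_steps_distance <- cor(total_steps, total_distance)
print(r_steps_distance)
```

On the full data, the equivalent call would be cor(mergeActivity$TotalSteps, mergeActivity$TotalDistance).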

Deeper Analysis

Let’s now look at the actual behaviours of our costumers. We can analyse their sleep behaviours and if this has any impact with the overall activity. What type of activity they do the most, and in what time of the day or day of the week it’s more likely for the costumers to do activity.

Possible analysis:

Sleep behaviours.
Most frequent types of activity.
Day of the week where there is more activity.
Relationship between sleep and activity.

1. Sleep behaviours

ggplot(mergeActivity, aes(x=TotalMinutesAsleep)) + 
    geom_histogram(aes(y = after_stat(density)), binwidth = 30, alpha = 0.6, color = "red") + # after_stat() replaces the deprecated ..density.. notation
    geom_density(alpha = 0.2, fill="blue") +
    labs(title = "Total Minutes Asleep")

From this histogram, we can see that most customers sleep around 420 minutes, which equals 7 hours a day.

ggplot(mergeActivity, aes(x=TimeAwakeInBed)) + 
    geom_histogram(aes(y = after_stat(density)), binwidth = 30, alpha = 0.6, color = "red") + # after_stat() replaces the deprecated ..density.. notation
    geom_density(fill= "blue", alpha = 0.2) +
    geom_vline(aes(xintercept = mean(TimeAwakeInBed)), color="green", linetype = "dashed") +
    geom_vline(aes(xintercept = max(TimeAwakeInBed)), color="green", linetype = "dashed") +
    annotate(geom = "text", x = 39 + 45, y = 0.02, label = "Mean = 39", size = 5) +
    annotate(geom = "text", x = 371 - 30, y = 0.02, label = "Max = 371", size = 5 ) +
    labs(x = "Time Awake in Bed", y = "Density")

Using the same type of visualization on our calculated field TimeAwakeInBed, we can see whether customers are spending too much time awake in bed, which could be a signal of bad sleeping habits. The histogram shows that customers spend 39 minutes awake in bed on average, which is not particularly bad but could be improved (the median in the skim summary above is lower, at 25 minutes, so the mean is pulled up by a few long records). There is also a small number of records with between 100 and 300 minutes awake, and a maximum of 371 minutes! We will leave further comments on this subject for the Share phase.
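Because this distribution is strongly right-skewed, the mean and the median tell different stories (the skim summary reports a mean of about 39 and a median, p50, of 25). The sketch below illustrates the effect on a small vector: the first eleven values are copied from the glimpse(mergeActivity) output, and the reported maximum of 371 is appended by hand as an assumption to mimic one extreme record.

```r
# First values of TimeAwakeInBed from the glimpse() output above,
# plus the reported maximum (371) appended to mimic one extreme record
time_awake <- c(19, 23, 30, 27, 12, 16, 17, 39, 23, 19, 46, 371)

mean_awake <- mean(time_awake)      # sensitive to the extreme value
median_awake <- median(time_awake)  # robust to it

print(mean_awake)
print(median_awake)
```

For skewed fields like this one, reporting the median alongside the mean gives a fairer picture of a typical night.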

2. Most frequent types of activity

For this analysis, we’ll construct a data frame that holds, for each date, the mean of each activity type, and then plot these measurements as lines with different colors.

activitySummary <- mergeActivity %>% group_by(Date) %>%
    summarise(Mean_VeryActiveMinutes = mean(VeryActiveMinutes),
              Mean_FairlyActiveMinutes = mean(FairlyActiveMinutes),
              Mean_LightlyActiveMinutes = mean(LightlyActiveMinutes),
              Mean_SedentaryMinutes = mean(SedentaryMinutes)) %>%
    arrange(Date)
cols <- c("Very" = "violetred3", "Fairly" = "black", "Lightly" = "orange", "Sedentary" = "red")
ggplot(data = activitySummary, aes(x=Date)) +
    geom_line(aes(y=Mean_VeryActiveMinutes, color = "Very"), size = 1) +
    geom_line(aes(y=Mean_FairlyActiveMinutes, color = "Fairly"), size = 1) +
    geom_line(aes(y=Mean_LightlyActiveMinutes, color = "Lightly"), size = 1) +
    geom_line(aes(y=Mean_SedentaryMinutes, color = "Sedentary"), size = 1) +
    labs(title = "Most Frequent Types of Activity", y = "Minutes (mean)", color = "Activity") +
    scale_color_manual(values = cols)

From the chart above, we can see that most of the activity is Sedentary, followed by Lightly active, with Very active and Fairly active accounting for the least. Nowadays, most people spend a lot of their day sitting at a desk for work, which may explain why most of the activity is Sedentary, with the Lightly active minutes coming from everyday chores.
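The ranking can also be expressed as shares. This sketch takes the four mean values from the skim summary of mergeActivity and converts them to percentages of the total minutes across the four activity types; the variable names are illustrative.

```r
# Mean minutes per activity type, taken from the skim() summary above
mean_minutes <- c(Very = 25.19, Fairly = 18.04, Lightly = 216.85, Sedentary = 712.17)

# Share of minutes per activity type, in percent
share_pct <- round(100 * mean_minutes / sum(mean_minutes), 1)
print(share_pct)
```

With these means, Sedentary time accounts for roughly three quarters of the minutes recorded across the four activity types.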

3. Days of the week with most activity

To analyse activity per weekday, we can do something similar to what we did for the previous data frame (activitySummary), but this time grouping by weekday using the weekdays function.

activityByweekDay <- mergeActivity %>% group_by(weekday = weekdays(Date, abbreviate = TRUE)) %>%
    summarise(Very = mean(VeryActiveMinutes),
              Fairly = mean(FairlyActiveMinutes),
              Lightly = mean(LightlyActiveMinutes),
              Sedentary = mean(SedentaryMinutes)) %>% 
    pivot_longer( cols = Very:Sedentary, names_to = "Activity", values_to = "Mean")
activityByweekDay$weekday <- factor(activityByweekDay$weekday, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
head(activityByweekDay,5)

## # A tibble: 5 x 3
##   weekday Activity   Mean
##   <fct>   <chr>     <dbl>
## 1 Fri     Very       21.2
## 2 Fri     Fairly     14.6
## 3 Fri     Lightly   223. 
## 4 Fri     Sedentary 743. 
## 5 Mon     Very       30.7

# Plotting
ggplot(activityByweekDay, aes(x = weekday, y = Mean, fill = Activity)) +
    geom_bar(position = "dodge", stat = "identity") + # stat = "identity" plots the Mean values as-is; geom_col() does this without the stat argument
    labs(title = "Activity per weekday")

There is not much we can take from this chart, since activity is relatively evenly distributed across the week; Sedentary activity is simply more frequent on Fridays and Tuesdays, and Lightly active minutes on Saturdays and Fridays. Fairly and Very are a bit hard to distinguish, so we’ll filter for those and plot again:

activityByweekDay %>%
    filter(Activity == "Very" | Activity == "Fairly") %>%
    ggplot(aes(x= weekday, y = Mean, fill = Activity)) +
    geom_bar(position = "dodge", stat = "identity") + # stat = "identity" plots the Mean values as-is; geom_col() does this without the stat argument
    labs(title = "Activity per weekday (Fairly and Very)")

Fairly activity is more frequent on Mondays and Tuesdays, while Very activity is more frequent on Saturdays and Tuesdays.
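One caveat about the weekday grouping: weekdays() returns names in the session's locale, so the factor levels "Mon" to "Sun" only match when R runs in an English locale. A locale-independent alternative, sketched here in base R as an assumption rather than the method used above, derives the ISO weekday number first and labels it afterwards.

```r
# Example dates from the study period
dates <- as.Date(c("2016-04-12", "2016-04-16", "2016-04-17", "2016-04-18"))

# ISO weekday number: 1 = Monday ... 7 = Sunday, independent of locale
iso_day <- as.integer(format(dates, "%u"))

# Attach fixed English labels in Monday-first order
weekday <- factor(
  iso_day,
  levels = 1:7,
  labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
)

print(weekday)
```

The resulting factor already carries the Monday-first level order, so no separate re-levelling step is needed before plotting.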

4. Relationship between sleep and activity

For this analysis, we’ll try to see if there’s any correlation between the sleep and activity.

ggplot(data = mergeActivity) + 
    geom_point(mapping = aes(x = TotalMinutesAsleep, y = TotalSteps, size = Calories, color = Calories)) + # size and color must be mapped inside aes() to vary with Calories
    geom_smooth(formula = y ~ x, method=lm, mapping = aes(x = TotalMinutesAsleep, y = TotalSteps), color = "red") +
    labs(title = "Sleep and Activity", x = "Total Minutes Asleep", y = "Total Steps")

Apparently, our customers tend to be more active on days when they slept less than usual, rather than more. As noted in the previous analysis of sleep behaviours, most customers get a relatively good amount of sleep (420 minutes = 7 hours).
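The direction of the trend line can be read from the slope of a simple linear regression, which is the model geom_smooth(method = lm) fits. The sketch below uses only the first seven merged rows visible in the glimpse(mergeActivity) output, so it illustrates the calculation and should not be read as a result for the whole dataset.

```r
# First seven merged rows, copied from the glimpse(mergeActivity) output above
sleep_steps <- data.frame(
  TotalMinutesAsleep = c(327, 384, 412, 340, 700, 304, 360),
  TotalSteps = c(13162, 10735, 9762, 12669, 9705, 15506, 10544)
)

# Simple linear regression of steps on minutes asleep
fit <- lm(TotalSteps ~ TotalMinutesAsleep, data = sleep_steps)

# Slope: change in steps per additional minute asleep
slope <- unname(coef(fit)["TotalMinutesAsleep"])
print(slope)
```

A negative slope means fewer steps on days with more sleep; fitting the same formula with data = mergeActivity gives the slope of the red line in the chart.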

Share

In the Share phase, the key tasks include determining the best way to share the findings of our analysis, creating effective visualizations, presenting the findings and ensuring the work is accessible. This step essentially consists of telling a compelling story about our data that hopefully answers the initial business tasks. Since the visualizations were already presented during the analysis phase, we’ll now present the final recommendations and a summary of our findings.

Let’s first recall the three business tasks:

What are some trends in smart device usage?
How could these trends apply to Bellabeat customers?
How could these trends help influence Bellabeat marketing strategy?

The insights gathered from our analysis:

Missing records: In the Prepare and Analysis phases, we noticed that there were 33 distinct customers in the dailyActivity_merged.csv file, 24 in the sleepDay_merged.csv file and only 9 in the weightLogInfo_merged.csv file. This can be an indicator that customers are not recording sleep and weight data. We therefore proceeded only with the dailyActivity and sleepDay files, which, when merged, left us with data for 24 customers. The lack of sleep data could mean that customers simply don’t wear the device while sleeping. Setting daily reminders or rewards for recording weight data, and making the device more comfortable to wear during sleep, could boost these recordings, and this data could then be used to improve customers’ overall life and activity.
Sleep behaviours: As seen in the Analysis phase, most of our customers get a relatively good amount of sleep, with an average of about 420 minutes (7 hours) per day. However, they also spend a fair amount of time awake in bed, 39 minutes on average, and a smaller number of records show between 100 and 300 minutes. An average of 39 minutes awake in bed might not be a serious concern, since it’s not a very high number and some people may simply like to read in bed, but it could indicate that some customers have trouble falling asleep or have poor sleeping habits. Additionally, the relationship between the amount of sleep and the activity on a particular day suggests that our customers tend to be more active when they slept less than usual, rather than more. This is not a very strong insight, but what we can take from the sleep analysis as a whole is that dedicated sleep features integrated into the app could help improve our customers’ sleeping habits, and so improve their overall activity and life.
Activity frequency: We also noticed that, of the four types of activity (Sedentary, Lightly, Fairly and Very active), the most frequent were Sedentary and Lightly active. Most people spend their working day sitting in a chair (Sedentary) and do some small chores inside or outside the house (Lightly active), which could be why these activities are more frequent. Although Fairly and Very active minutes are much less frequent than Sedentary and Lightly active ones, the filtered graph (containing only Fairly and Very) shows that the average per weekday is not that low, considering that these activities might include intensive exercise; spending 25 minutes a day on average on that type of exercise is not particularly bad, but it could be improved. Letting customers set daily goals, combined with motivational reminders or notifications, might be a good way to boost the frequency of these higher-intensity activities.

The first two business questions are answered by these three insights, leaving us with the third: How could these trends help influence Bellabeat marketing strategy? In short and generally speaking, a typical customer from this analysis is someone who:

Doesn’t regularly record weight or sleep data (weight may be logged manually, and sleep automatically).
Has a relatively good amount of sleep (7h), but may spend some time awake in bed (about 40m).
Does intensive activities, but not for too long.

Based on these criteria, the development team should first work on improving these points, and the marketing team could then focus on targeting people with similar activity patterns in order to improve their overall quality of life.

Act

In this phase, we take into consideration all the insights gathered and make a business decision based on our findings. The recommendations were presented in the previous phase, and it is now up to the stakeholders to make the desired decision. In this section we can also discuss possible datasets that could be used to expand this analysis in the future, for example:

This dataset contains more than 4 years of steps and sleep data (also using a band).
This dataset contains plenty of records regarding sleep activity.

FitBit Fitness Tracker Data

Yeasin

7/8/2021