About

Part of the Google Data Analytics Professional Certificate course (Capstone Project).

Click Here to Download FitBit Fitness Tracker Data

See my FitBit Fitness Tracker Data analysis on Kaggle: Click Here

This is a fictional data analytics case study whose purpose is to gain insights that improve business decisions for Bellabeat, a company focused on building products designed to gather information about a woman’s overall activity, sleep schedules, stress and reproductive health.

This analysis follows the structure taught throughout the course: Ask, Prepare, Process, Analyze, Share, and Act.

For this fictional case study, I have joined Bellabeat’s marketing analytics team, responsible for collecting, analyzing and reporting data that helps guide Bellabeat’s marketing strategy. This report is therefore written from the point of view of a (fictional) team.

Ask

In the Ask phase, we (the fictional team) should focus on the problem we are trying to solve and on how our insights can drive business decisions.

Business Task

Bellabeat wants to gather insights regarding the following business tasks:

  1. What are some trends in smart device usage?
  2. How could these trends apply to Bellabeat customers?
  3. How could these trends help influence Bellabeat marketing strategy?

Key Stakeholders

  • Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
  • Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team

How our insights will help

By analysing the activity of a particular set of customers (who gave consent to share their activity data), we can see which types of features are used the most or have the greatest impact on a healthy lifestyle for those customers. With these insights, Bellabeat’s team can focus more on improving or spotlighting these features in marketing.

Prepare

For this step, we’ll take a closer look at the dataset that will be used for this case study and that will help us answer the previous business tasks. We’ll give a brief overview of particular aspects of this dataset, such as how the data is stored, how it’s organized, bias/credibility issues (ROCCC), licensing, privacy, security, accessibility, data integrity, how this data helps answer our questions, and whether there are any problems with the data. Finally, this phase also includes selecting or filtering the available data, and any sorting if necessary.

Data overview

The FitBit Fitness Tracker Data dataset is available on Kaggle. As described on the dataset’s page: “These datasets were generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Individual reports can be parsed by export session ID (column A) or timestamp (column B). Variation between output represents use of different types of Fitbit trackers and individual tracking behaviors / preferences.”

  • Storage: The data is currently stored on Kaggle’s platform and will also be downloaded to a local file system for analysis and clean-up.

  • Organized: The dataset consists of 18 CSV files:

    • dailyActivity_merged.csv
    • dailyCalories_merged.csv
    • dailyIntensities_merged.csv
    • dailySteps_merged.csv
    • heartrate_seconds_merged.csv
    • hourlyCalories_merged.csv
    • hourlyIntensities_merged.csv
    • hourlySteps_merged.csv
    • minuteCaloriesNarrow_merged.csv
    • minuteCaloriesWide_merged.csv
    • minuteIntensitiesNarrow_merged.csv
    • minuteIntensitiesWide_merged.csv
    • minuteMETsNarrow_merged.csv
    • minuteSleep_merged.csv
    • minuteStepsNarrow_merged.csv
    • minuteStepsWide_merged.csv
    • sleepDay_merged.csv
    • weightLogInfo_merged.csv
  • Bias/Credibility (ROCCC)

    • Reliable: is the data accurate, complete, and unbiased?
      • We can’t be entirely sure that it is.
    • Original: can we locate the original data source?
      • Yes, as described in the dataset’s metadata.
    • Comprehensive: does it have the necessary and important information to find the solution?
      • Yes. By looking into the device data, we can see the usage trends, which is what we are trying to discover.
    • Current: is it current?
      • No, all the data is from 2016.
    • Cited:
      • Yes, as described in the dataset’s metadata.
  • Licence, privacy, security, accessibility, data integrity: According to the description of the dataset, thirty eligible Fitbit users consented to the submission of personal tracker data. Looking into the metadata, the data was made public under the following license: CC0: Public Domain.

  • Problems with the data: The data is not current (from 2016), has a relatively small sample size (30 customers), and we cannot be sure whether it is actually representative of women or whether it’s biased in some way.

Data Selection

The data is distributed across multiple files, and the observations (rows) in each file are recorded at different time intervals. Some use coarse intervals, such as dailyActivity_merged.csv, where each observation covers one day, while others use fine intervals, such as heartrate_seconds_merged.csv, where each observation is measured by the second. To get a general view of the usage trends of the customers, we will focus on the files with daily observations, which are the following:

  • dailyActivity_merged.csv

  • dailyCalories_merged.csv

  • dailyIntensities_merged.csv

  • dailySteps_merged.csv

  • sleepDay_merged.csv

  • weightLogInfo_merged.csv

However, looking into these datasets, we can see that dailyActivity_merged.csv already contains all the columns of the dailyCalories, dailyIntensities and dailySteps files. Also, a quick look at weightLogInfo_merged.csv shows that it only contains weight information for 9 customers, so this file won’t be used either. We end up with the following files:

  • dailyActivity_merged.csv

  • sleepDay_merged.csv
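
The claim that one file already contains the columns of another can be checked rather than eyeballed. Below is a minimal sketch on two small data frames that stand in for the daily files. The frames and the helper names (missing_cols, is_subset, same_values) are hypothetical, and the sketch assumes the shared columns carry the same names in both files, which may not hold for the real CSVs (equivalent columns may need renaming first).

```r
# Hypothetical stand-ins for two of the daily files; values are illustrative
daily_activity <- data.frame(
  Id = c(1503960366, 1503960366),
  ActivityDate = c("4/12/2016", "4/13/2016"),
  TotalSteps = c(13162, 10735),
  Calories = c(1985, 1797)
)
daily_steps <- data.frame(
  Id = c(1503960366, 1503960366),
  ActivityDate = c("4/12/2016", "4/13/2016"),
  TotalSteps = c(13162, 10735)
)

# Columns of the smaller file that are missing from the larger one
missing_cols <- setdiff(names(daily_steps), names(daily_activity))
is_subset <- length(missing_cols) == 0

# Same values as well as same names?
same_values <- isTRUE(all.equal(daily_activity[names(daily_steps)], daily_steps))

is_subset
same_values
```

If both checks pass, the smaller file adds no information and can be dropped safely.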

Process and Analysis

We will combine both steps (Process and Analysis) into this one section.

As described in the previous phase, the files left for us to work with are dailyActivity_merged.csv and sleepDay_merged.csv. There may be a correlation between the sleep behaviours and the overall activity of the customers, so the next step will be to merge the two files so we have a single data frame to work with.

library(tidyverse)
library(lubridate)
library(skimr)

Loading Files:

dailyActivities <- read_csv("data/dailyActivity_merged.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   Id = col_double(),
##   ActivityDate = col_character(),
##   TotalSteps = col_double(),
##   TotalDistance = col_double(),
##   TrackerDistance = col_double(),
##   LoggedActivitiesDistance = col_double(),
##   VeryActiveDistance = col_double(),
##   ModeratelyActiveDistance = col_double(),
##   LightActiveDistance = col_double(),
##   SedentaryActiveDistance = col_double(),
##   VeryActiveMinutes = col_double(),
##   FairlyActiveMinutes = col_double(),
##   LightlyActiveMinutes = col_double(),
##   SedentaryMinutes = col_double(),
##   Calories = col_double()
## )
sleepActivities  <- read_csv("data/sleepDay_merged.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   Id = col_double(),
##   SleepDay = col_character(),
##   TotalSleepRecords = col_double(),
##   TotalMinutesAsleep = col_double(),
##   TotalTimeInBed = col_double()
## )
head(dailyActivities)
## # A tibble: 6 x 15
##        Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie~
##     <dbl> <chr>             <dbl>         <dbl>           <dbl>            <dbl>
## 1  1.50e9 4/12/2016         13162          8.5             8.5                 0
## 2  1.50e9 4/13/2016         10735          6.97            6.97                0
## 3  1.50e9 4/14/2016         10460          6.74            6.74                0
## 4  1.50e9 4/15/2016          9762          6.28            6.28                0
## 5  1.50e9 4/16/2016         12669          8.16            8.16                0
## 6  1.50e9 4/17/2016          9705          6.48            6.48                0
## # ... with 9 more variables: VeryActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>
head(sleepActivities)
## # A tibble: 6 x 5
##          Id SleepDay           TotalSleepRecor~ TotalMinutesAsle~ TotalTimeInBed
##       <dbl> <chr>                         <dbl>             <dbl>          <dbl>
## 1    1.50e9 4/12/2016 12:00:0~                1               327            346
## 2    1.50e9 4/13/2016 12:00:0~                2               384            407
## 3    1.50e9 4/15/2016 12:00:0~                1               412            442
## 4    1.50e9 4/16/2016 12:00:0~                2               340            367
## 5    1.50e9 4/17/2016 12:00:0~                1               700            712
## 6    1.50e9 4/19/2016 12:00:0~                1               304            320

As we can see, we have the Id and ActivityDate columns in the dailyActivities data frame, and the Id and SleepDay columns in the sleepActivities data frame. These columns will be used for the merge, but first we need to make a few changes. read_csv has parsed both date-related columns as character (chr). Also, ActivityDate represents a date, while SleepDay is a date-time. The time component is unnecessary, however, as every row shows the exact same time (12 AM). So the next steps will be to:

  1. Change the format of ActivityDate and SleepDay to a date-related format usable by R (using the lubridate package).
  2. Rename both date columns to have the same name (this is not required for the merge, since you can specify the columns, but it makes the code clearer).
  3. Merge both data frames.
  4. Take a quick look at the final merge result.

Step 1: Changing the format of columns ActivityDate and SleepDay

Using the lubridate library, we can use the mdy function to parse ActivityDate (mdy because the current date format is month/day/year) and mdy_hms to parse SleepDay, which also carries a time.

dailyActivities <- mutate(dailyActivities, ActivityDate = mdy(ActivityDate))
sleepActivities <- mutate(sleepActivities, SleepDay = mdy_hms(SleepDay))

Both columns are now parsed: ActivityDate is a date and SleepDay is a date-time (dttm), and both use the default, much more readable format (year-month-day). Let’s quickly recheck:

head(dailyActivities)
## # A tibble: 6 x 15
##        Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie~
##     <dbl> <date>            <dbl>         <dbl>           <dbl>            <dbl>
## 1  1.50e9 2016-04-12        13162          8.5             8.5                 0
## 2  1.50e9 2016-04-13        10735          6.97            6.97                0
## 3  1.50e9 2016-04-14        10460          6.74            6.74                0
## 4  1.50e9 2016-04-15         9762          6.28            6.28                0
## 5  1.50e9 2016-04-16        12669          8.16            8.16                0
## 6  1.50e9 2016-04-17         9705          6.48            6.48                0
## # ... with 9 more variables: VeryActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>
head(sleepActivities)
## # A tibble: 6 x 5
##         Id SleepDay            TotalSleepRecords TotalMinutesAsl~ TotalTimeInBed
##      <dbl> <dttm>                          <dbl>            <dbl>          <dbl>
## 1   1.50e9 2016-04-12 00:00:00                 1              327            346
## 2   1.50e9 2016-04-13 00:00:00                 2              384            407
## 3   1.50e9 2016-04-15 00:00:00                 1              412            442
## 4   1.50e9 2016-04-16 00:00:00                 2              340            367
## 5   1.50e9 2016-04-17 00:00:00                 1              700            712
## 6   1.50e9 2016-04-19 00:00:00                 1              304            320
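
Since the time component of SleepDay carries no information, one option is to drop it right after parsing, so that both join keys end up with the same type. The following is a minimal sketch on two example strings in the same formats as the files. It is an alternative to the code above, not what the analysis ran, and the variable names are illustrative only.

```r
library(lubridate)

# Example strings in the two formats seen in the files
activity_raw <- "4/12/2016"
sleep_raw    <- "4/12/2016 12:00:00 AM"

activity_date <- mdy(activity_raw)            # parsed as Date
sleep_date    <- as_date(mdy_hms(sleep_raw))  # parsed as date-time, then time dropped

class(sleep_date)
activity_date == sleep_date
```

With both columns stored as Date, the join keys compare directly and the merged column shows as a date rather than a date-time.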

Step 2: Renaming both columns ActivityDate and SleepDay to simply Date

dailyActivities <- rename(dailyActivities, Date = ActivityDate)
sleepActivities <- sleepActivities %>% rename(Date = SleepDay)
glimpse(dailyActivities)
## Rows: 940
## Columns: 15
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036~
## $ Date                     <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-04-~
## $ TotalSteps               <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019~
## $ TotalDistance            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5~
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3~
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0~
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveMinutes        <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4~
## $ FairlyActiveMinutes      <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21~
## $ LightlyActiveMinutes     <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, ~
## $ SedentaryMinutes         <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818~
## $ Calories                 <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203~
glimpse(sleepActivities)
## Rows: 413
## Columns: 5
## $ Id                 <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150~
## $ Date               <dttm> 2016-04-12, 2016-04-13, 2016-04-15, 2016-04-16, 20~
## $ TotalSleepRecords  <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ TotalMinutesAsleep <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2~
## $ TotalTimeInBed     <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3~

Step 3: Merging both data frames.

Before merging the data frames, let’s check whether all the customers in one data frame are also present in the other. A quick first check is to count the number of distinct Ids in both data frames.

n_distinct(dailyActivities$Id)
## [1] 33
n_distinct(sleepActivities$Id)
## [1] 24

Apparently not all customers have sleep data: the dailyActivities data frame has 33 customers, while sleepActivities has only 24. To handle this, we’ve decided to use dplyr’s inner_join function. This function merges both data frames by Id and Date and only adds an observation (row) to the result if it is present in both data frames.

mergeActivity <- inner_join(dailyActivities, sleepActivities, by=c("Id", "Date"))
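
To make the effect of the inner join concrete, here is a small sketch on toy data (all values hypothetical). It uses dplyr’s anti_join to list the activity rows that an inner join drops, and setdiff to list the users who have no sleep records at all.

```r
library(dplyr)

# Hypothetical toy data: user 3 has activity records but no sleep records
activity <- data.frame(
  Id = c(1, 1, 2, 3),
  Date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12", "2016-04-12")),
  TotalSteps = c(13162, 10735, 8000, 5000)
)
sleep <- data.frame(
  Id = c(1, 2),
  Date = as.Date(c("2016-04-12", "2016-04-12")),
  TotalMinutesAsleep = c(327, 400)
)

# Rows present in both data frames
merged <- inner_join(activity, sleep, by = c("Id", "Date"))

# Activity rows that the inner join drops because they have no sleep match
dropped <- anti_join(activity, sleep, by = c("Id", "Date"))

# Users with no sleep data at all
no_sleep_ids <- setdiff(unique(activity$Id), unique(sleep$Id))

nrow(merged)
nrow(dropped)
no_sleep_ids
```

Inspecting the dropped rows is a cheap way to confirm that the join removed only what was expected.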

Step 4: Looking into the final merge result

Let’s now get an overview of the final data frame that we will work with, using the skimr library. But first, let’s add one more column to our new data frame to see who’s been lazy or has insomnia! It may correlate with the overall workout activity.

mergeActivity <- mutate(mergeActivity, TimeAwakeInBed =  TotalTimeInBed - TotalMinutesAsleep)
n_distinct(mergeActivity$Id)
## [1] 24
glimpse(mergeActivity)
## Rows: 413
## Columns: 19
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036~
## $ Date                     <dttm> 2016-04-12, 2016-04-13, 2016-04-15, 2016-04-~
## $ TotalSteps               <dbl> 13162, 10735, 9762, 12669, 9705, 15506, 10544~
## $ TotalDistance            <dbl> 8.50, 6.97, 6.28, 8.16, 6.48, 9.88, 6.68, 6.3~
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.28, 8.16, 6.48, 9.88, 6.68, 6.3~
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.14, 2.71, 3.19, 3.53, 1.96, 1.3~
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 1.26, 0.41, 0.78, 1.32, 0.48, 0.3~
## $ LightActiveDistance      <dbl> 6.06, 4.71, 2.83, 5.04, 2.51, 5.03, 4.24, 4.6~
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveMinutes        <dbl> 25, 21, 29, 36, 38, 50, 28, 19, 41, 39, 73, 3~
## $ FairlyActiveMinutes      <dbl> 13, 19, 34, 10, 20, 31, 12, 8, 21, 5, 14, 23,~
## $ LightlyActiveMinutes     <dbl> 328, 217, 209, 221, 164, 264, 205, 211, 262, ~
## $ SedentaryMinutes         <dbl> 728, 776, 726, 773, 539, 775, 818, 838, 732, ~
## $ Calories                 <dbl> 1985, 1797, 1745, 1863, 1728, 2035, 1786, 177~
## $ TotalSleepRecords        <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ TotalMinutesAsleep       <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, ~
## $ TotalTimeInBed           <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, ~
## $ TimeAwakeInBed           <dbl> 19, 23, 30, 27, 12, 16, 17, 39, 23, 19, 46, 2~
skim(mergeActivity)
Data summary
Name mergeActivity
Number of rows 413
Number of columns 19
_______________________
Column type frequency:
numeric 18
POSIXct 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Id 0 1 5.000979e+09 2.06036e+09 1.50396e+09 3.977334e+09 4.702922e+09 6.962181e+09 8.79201e+09 ▆▆▇▅▃
TotalSteps 0 1 8.541140e+03 4.15693e+03 1.70000e+01 5.206000e+03 8.925000e+03 1.139300e+04 2.27700e+04 ▅▆▇▂▁
TotalDistance 0 1 6.040000e+00 3.05000e+00 1.00000e-02 3.600000e+00 6.290000e+00 8.030000e+00 1.75400e+01 ▅▇▇▁▁
TrackerDistance 0 1 6.030000e+00 3.05000e+00 1.00000e-02 3.600000e+00 6.290000e+00 8.020000e+00 1.75400e+01 ▅▇▇▁▁
LoggedActivitiesDistance 0 1 1.100000e-01 5.10000e-01 0.00000e+00 0.000000e+00 0.000000e+00 0.000000e+00 4.08000e+00 ▇▁▁▁▁
VeryActiveDistance 0 1 1.450000e+00 1.99000e+00 0.00000e+00 0.000000e+00 5.700000e-01 2.370000e+00 1.25400e+01 ▇▂▁▁▁
ModeratelyActiveDistance 0 1 7.500000e-01 1.00000e+00 0.00000e+00 0.000000e+00 4.200000e-01 1.040000e+00 6.48000e+00 ▇▂▁▁▁
LightActiveDistance 0 1 3.810000e+00 1.73000e+00 1.00000e-02 2.540000e+00 3.680000e+00 4.930000e+00 9.48000e+00 ▂▇▆▂▁
SedentaryActiveDistance 0 1 0.000000e+00 1.00000e-02 0.00000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.10000e-01 ▇▁▁▁▁
VeryActiveMinutes 0 1 2.519000e+01 3.63900e+01 0.00000e+00 0.000000e+00 9.000000e+00 3.800000e+01 2.10000e+02 ▇▂▁▁▁
FairlyActiveMinutes 0 1 1.804000e+01 2.24000e+01 0.00000e+00 0.000000e+00 1.100000e+01 2.700000e+01 1.43000e+02 ▇▂▁▁▁
LightlyActiveMinutes 0 1 2.168500e+02 8.71600e+01 2.00000e+00 1.580000e+02 2.080000e+02 2.630000e+02 5.18000e+02 ▂▇▇▂▁
SedentaryMinutes 0 1 7.121700e+02 1.65960e+02 0.00000e+00 6.310000e+02 7.170000e+02 7.830000e+02 1.26500e+03 ▁▁▇▃▁
Calories 0 1 2.397570e+03 7.62890e+02 2.57000e+02 1.850000e+03 2.220000e+03 2.926000e+03 4.90000e+03 ▁▇▇▃▁
TotalSleepRecords 0 1 1.120000e+00 3.50000e-01 1.00000e+00 1.000000e+00 1.000000e+00 1.000000e+00 3.00000e+00 ▇▁▁▁▁
TotalMinutesAsleep 0 1 4.194700e+02 1.18340e+02 5.80000e+01 3.610000e+02 4.330000e+02 4.900000e+02 7.96000e+02 ▁▂▇▃▁
TotalTimeInBed 0 1 4.586400e+02 1.27100e+02 6.10000e+01 4.030000e+02 4.630000e+02 5.260000e+02 9.61000e+02 ▁▃▇▁▁
TimeAwakeInBed 0 1 3.917000e+01 4.65700e+01 0.00000e+00 1.700000e+01 2.500000e+01 4.000000e+01 3.71000e+02 ▇▁▁▁▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
Date 0 1 2016-04-12 2016-05-12 2016-04-27 31

Let’s check whether there are any duplicates in the final data frame:

sum(duplicated(mergeActivity))
## [1] 3
# Removing duplicates from the data frame
mergeActivity <- distinct(mergeActivity)
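
Duplicates in a merged data frame usually originate in one of the inputs, so it can help to check each input on its join key before merging. A minimal sketch on toy data (hypothetical values and names):

```r
library(dplyr)

# Hypothetical toy data: the second and third rows are identical
sleep <- data.frame(
  Id = c(1, 2, 2, 3),
  Date = as.Date(c("2016-04-12", "2016-04-12", "2016-04-12", "2016-04-12")),
  TotalMinutesAsleep = c(327, 400, 400, 410)
)

n_dup   <- sum(duplicated(sleep))  # rows that repeat an earlier row
deduped <- distinct(sleep)         # keeps the first occurrence of each row

# Join-key check: more than one row for the same user and day
dup_keys <- sleep %>%
  count(Id, Date) %>%
  filter(n > 1)

n_dup
nrow(deduped)
dup_keys
```

The key-level check also catches rows that share Id and Date but differ in other columns, which distinct() on the full row would keep.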

Exploratory Visualizations

Let’s start with some simple exploratory visualizations in R of variables that we expect to be correlated.

Relationship between TotalSteps and TotalDistance:

ggplot(data = mergeActivity, aes(x=TotalSteps, y=TotalDistance)) +
    geom_point(color = "red") + labs(title = "Relationship between TotalSteps and TotalDistance")

Relationship between TotalSteps and Calories

ggplot(data = mergeActivity, aes(x=TotalSteps, y=Calories)) +
    geom_point(color = "red") + labs(title = "Relationship between TotalSteps and Calories")
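
A scatter plot shows the shape of a relationship, and a correlation coefficient puts a number on it. As a small illustration, the sketch below applies base R’s cor() to the six rows printed by head(dailyActivities) earlier. Six rows from a single user are far too few to draw conclusions, so this only shows the call.

```r
# First six daily records from the head(dailyActivities) output
total_steps    <- c(13162, 10735, 10460, 9762, 12669, 9705)
total_distance <- c(8.50, 6.97, 6.74, 6.28, 8.16, 6.48)

# Pearson correlation coefficient, between -1 and 1
r <- cor(total_steps, total_distance)
r
```

On the full data frame the equivalent call would be cor(mergeActivity$TotalSteps, mergeActivity$TotalDistance).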

Deeper Analysis

Let’s now look at the actual behaviours of our costumers. We can analyse their sleep behaviours and if this has any impact with the overall activity. What type of activity they do the most, and in what time of the day or day of the week it’s more likely for the costumers to do activity.

Possible analysis:

  1. Sleep behaviours.
  2. Most frequent types of activity.
  3. Day of the week where there is more activity.
  4. Relationship between sleep and activity.

1. Sleep behaviours

# after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0)
ggplot(mergeActivity, aes(x=TotalMinutesAsleep)) + 
    geom_histogram(aes(y = after_stat(density)), binwidth = 30, alpha = 0.6, color = "red") +
    geom_density(alpha = 0.2, fill="blue") +
    labs(title = "Total Minutes Asleep")

From this histogram, we can see that most of the customers sleep around 420 minutes, which equals 7 hours a day.

ggplot(mergeActivity, aes(x=TimeAwakeInBed)) + 
    geom_histogram(aes(y = after_stat(density)), binwidth = 30, alpha = 0.6, color = "red") +
    geom_density(fill= "blue", alpha = 0.2) +
    # dashed reference lines at the mean and the maximum
    geom_vline(aes(xintercept = mean(TimeAwakeInBed)), color="green", linetype = "dashed") +
    geom_vline(aes(xintercept = max(TimeAwakeInBed)), color="green", linetype = "dashed") +
    annotate(geom = "text", x = 39 + 45, y = 0.02, label = "Mean = 39", size = 5) +
    annotate(geom = "text", x = 371 - 30, y = 0.02, label = "Max = 371", size = 5 ) +
    labs(x = "Time Awake in Bed", y = "Density")

Using the same type of visualization and our calculated field TimeAwakeInBed, we can see whether the customers are spending too much time awake in bed, which could be a sign of bad sleeping habits. The mean time awake in bed is about 39 minutes, which is not particularly bad but could be improved. The distribution is strongly right-skewed, though: the median is only 25 minutes, while a small number of records fall between 100 and 300 minutes, and the maximum is 371 minutes! We will leave further comments on this subject for the Share phase.
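
Because a few very long values pull the mean upward, it is worth comparing the mean with the median for a skewed variable like this one. The sketch below uses a toy vector: nine of the TimeAwakeInBed values shown in the glimpse() output plus the reported maximum of 371. The combination is hypothetical and only illustrates the effect.

```r
# Nine values from the glimpse() output plus the reported maximum (toy combination)
time_awake <- c(19, 23, 30, 27, 12, 16, 17, 39, 23, 371)

mean_awake   <- mean(time_awake)    # pulled up by the single extreme value
median_awake <- median(time_awake)  # robust to the extreme value

mean_awake
median_awake
```

A single extreme record more than doubles the mean here, while the median stays with the typical values.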

2. Most frequent types of activity

For this analysis, we’ll construct a data frame that holds, for each date, the mean of each activity type, and then plot these measurements as lines with different colors.

activitySummary <- mergeActivity %>% group_by(Date) %>%
    summarise(Mean_VeryActiveMinutes = mean(VeryActiveMinutes),
              Mean_FairlyActiveMinutes = mean(FairlyActiveMinutes),
              Mean_LightlyActiveMinutes = mean(LightlyActiveMinutes),
              Mean_SedentaryMinutes = mean(SedentaryMinutes)) %>%
    arrange(Date)
cols <- c("Very" = "violetred3", "Fairly" = "black", "Lightly" = "orange", "Sedentary" = "red")
ggplot(data = activitySummary, aes(x=Date)) +
    geom_line(aes(y=Mean_VeryActiveMinutes, color = "Very"), size = 1) +
    geom_line(aes(y=Mean_FairlyActiveMinutes, color = "Fairly"), size = 1) +
    geom_line(aes(y=Mean_LightlyActiveMinutes, color = "Lightly"), size = 1) +
    geom_line(aes(y=Mean_SedentaryMinutes, color = "Sedentary"), size = 1) +
    labs(title = "Most Frequent Types of Activity", y = "Minutes (mean)", color = "Activity") +
    scale_color_manual(values = cols)

From the chart above, we can see that most of the time is spent in Sedentary activity, followed by Lightly active, while Very active and Fairly active account for the least. Nowadays, most people spend a lot of their day working while sitting in a chair, which may explain why most of the activity is Sedentary, with the Lightly active minutes possibly coming from everyday chores.
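
The same chart can also be built from a long-format data frame, which avoids one geom_line() call per activity type. This is an alternative sketch on toy data (hypothetical values), assuming dplyr 1.0 or later for across(). The names toy, activity_long and p are illustrative.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical toy data: two dates with two records each
toy <- data.frame(
  Date = as.Date(c("2016-04-12", "2016-04-12", "2016-04-13", "2016-04-13")),
  VeryActiveMinutes = c(25, 35, 21, 19),
  FairlyActiveMinutes = c(13, 17, 19, 21),
  LightlyActiveMinutes = c(328, 300, 217, 223),
  SedentaryMinutes = c(728, 700, 776, 784)
)

# Mean of every *Minutes column per date, then one row per (Date, Activity)
activity_long <- toy %>%
  group_by(Date) %>%
  summarise(across(ends_with("Minutes"), mean), .groups = "drop") %>%
  pivot_longer(cols = -Date, names_to = "Activity", values_to = "Mean")

# A single geom_line() now draws one line per activity type
p <- ggplot(activity_long, aes(x = Date, y = Mean, color = Activity)) +
  geom_line() +
  labs(y = "Minutes (mean)")

activity_long
```

The long format is also the shape that the weekday analysis below produces with pivot_longer.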

3. Days of the week with most activity

To analyse activity per weekday, we can do something similar to what we did for the previous data frame (activitySummary), but this time grouping by weekday using the weekdays function.

activityByweekDay <- mergeActivity %>% group_by(weekday = weekdays(Date, abbreviate = TRUE)) %>%
    summarise(Very = mean(VeryActiveMinutes),
              Fairly = mean(FairlyActiveMinutes),
              Lightly = mean(LightlyActiveMinutes),
              Sedentary = mean(SedentaryMinutes)) %>% 
    pivot_longer( cols = Very:Sedentary, names_to = "Activity", values_to = "Mean")
activityByweekDay$weekday <- factor(activityByweekDay$weekday, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
head(activityByweekDay,5)
## # A tibble: 5 x 3
##   weekday Activity   Mean
##   <fct>   <chr>     <dbl>
## 1 Fri     Very       21.2
## 2 Fri     Fairly     14.6
## 3 Fri     Lightly   223. 
## 4 Fri     Sedentary 743. 
## 5 Mon     Very       30.7
#plotting
ggplot(activityByweekDay, aes(x= weekday, y = Mean, fill = Activity)) +
    geom_bar(position = "dodge", stat = "identity") + # geom_col() is equivalent and needs no stat argument
    labs(title = "Activity per weekday")

There is not much we can take from this chart since the activity is relatively evenly distributed across the week, other than that Sedentary activity is more frequent on Fridays and Tuesdays, and Lightly activity on Saturdays and Fridays. Fairly and Very are a bit hard to distinguish, so we’ll filter for those and plot again:

activityByweekDay %>%
    filter(Activity == "Very" | Activity == "Fairly") %>%
    ggplot(aes(x= weekday, y = Mean, fill = Activity)) +
    geom_bar(position = "dodge", stat = "identity") + # geom_col() is equivalent and needs no stat argument
    labs(title = "Activity per weekday Of Fairly & Very")

Fairly activity is more frequent on Mondays and Tuesdays, while Very activity is more frequent on Saturdays and Tuesdays.
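
One caveat about the weekday grouping: weekdays() returns labels in the session’s locale, so the hard-coded English levels ("Mon", "Tue", ...) only match on an English locale. A possible alternative, sketched below on three example dates, is lubridate’s wday(), which can return an ordered factor whose levels already run from Monday to Sunday. The variable names are illustrative.

```r
library(lubridate)

# A Tuesday, a Saturday and a Sunday from the study period
dates <- as.Date(c("2016-04-12", "2016-04-16", "2016-04-17"))

# Numeric weekday with Monday = 1, independent of the locale
weekday_num <- wday(dates, week_start = 1)

# Ordered factor; the levels run Monday to Sunday without a manual factor() call
# (the label text itself still follows the locale)
weekday_fct <- wday(dates, label = TRUE, week_start = 1)

weekday_num
weekday_fct
```

Grouping by weekday_fct would keep the bars in calendar order regardless of the machine the notebook runs on.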

4. Relationship between sleep and activity

For this analysis, we’ll try to see if there’s any correlation between sleep and activity.

ggplot(data = mergeActivity) + 
    geom_point(mapping = aes(x = TotalMinutesAsleep, y = TotalSteps, size = Calories, color = Calories)) + # size and color are mapped inside aes() so they vary with Calories
    geom_smooth(formula = y ~ x, method=lm, mapping = aes(x = TotalMinutesAsleep, y = TotalSteps), color = "red") +
    labs(title = "Sleep and Activity", x = "Total Minutes Asleep", y = "Total Steps")

The fitted line suggests that our customers tend to be more active on days when they slept less than usual, rather than more. As noted in the previous analysis of sleep behaviours, most of the customers get a relatively good amount of sleep (420 minutes = 7 hours).
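
The direction of the fitted line can be read off numerically with cor() and lm(). The sketch below uses the first seven rows of mergeActivity shown in the glimpse() output. They all belong to a single user, so the result only illustrates the calls and says nothing about the strength of the relationship in the full data.

```r
# First seven rows of mergeActivity from the glimpse() output (one user only)
total_minutes_asleep <- c(327, 384, 412, 340, 700, 304, 360)
total_steps          <- c(13162, 10735, 9762, 12669, 9705, 15506, 10544)

# Sign and size of the linear association
r <- cor(total_minutes_asleep, total_steps)

# Same linear model that geom_smooth(method = lm) fits
fit   <- lm(total_steps ~ total_minutes_asleep)
slope <- unname(coef(fit)["total_minutes_asleep"])  # steps per extra minute asleep

r
slope
```

A negative slope matches a downward-sloping line in the plot. A correlation alone does not show that sleeping less causes more activity.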

Share

In the Share phase, the key tasks include determining the best way to share our findings, creating effective visualizations, presenting the findings and ensuring the work is accessible. This step essentially consists of telling a compelling story about our data that hopefully answers the initial business tasks. Since the visualizations were already presented during the analysis phase, we’ll now present the final recommendations and a summary of our findings.

Let’s first recall the three business tasks:

  1. What are some trends in smart device usage?
  2. How could these trends apply to Bellabeat customers?
  3. How could these trends help influence Bellabeat marketing strategy?

The insights gathered from our analysis:

The first two business questions are answered by these three insights, leaving us with the third and final one: How could these trends help influence Bellabeat marketing strategy? In short and generally speaking, a typical customer from this analysis is someone who:

Act

In this phase, we take into consideration all the insights gathered and make a business decision based on our findings. The recommendations were presented in the previous phase, and it is now up to the stakeholders to make the desired decision. In this section we can also discuss possible data sets that could be used to expand this analysis for future reference, for example: