Data Analysis

Phase 1 - ASK

1. Identify the business task

○ Analyse smart product usage data.

○ Provide high quality recommendation for Bellabeat’s marketing strategy.

2. Guiding questions

○ What are some trends in smart device usage?

○ How could these trends apply to Bellabeat customers?

○ How could these trends help influence Bellabeat’s marketing strategy?

3. Key stakeholders

○ Urška Sršen: Bellabeat’s co-founder and Chief Creative Officer.

○ Sando Mur: Mathematician and Bellabeat’s cofounder; a key member of the Bellabeat executive team

○ Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy.

Phase 2 - PREPARE

1. Identify the data source

Data is available through the public domain on Kaggle (Dataset made available through Mobius): Fitbit Fitness Tracker Data

2. How the data is organized

The data is collected from 30 fitbit users. Thirty eligible Bellabeat users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits. All the separate 18 files are in csv format.

3. Determine the credibility and integrity of data

○ Reliable - LOW - Lack of information on the margin of error and the usage of a small sample size will affect the amount of analysis that can be done.

○ Originality - LOW - These datasets were generated by respondents to a distributed survey via Amazon Mechanical Turk between (Third-party provider).

○ Comprehensive - LOW - There is no information about the participants, such as gender, age, location, health state, etc. This could mean that data is not randomized and could pose bias in the analysis.

○ Current - LOW - The data set is collected in 2016 which is already outdated and not relevant to the current trend in smart device usage.

○ Cited - LOW - the data set was generated by the third-party provider Amazon Mechanical Turk which operates as the crowdsourcing website for businesses to hire remote personnel to do the survey collection. This is not clear whether they provide a credible source.

With the low integrity and credibility assessment on the data set, it is clearly not a good quality data source to provide reliable insights to Bellabeat. Therefore, I would recommend treating the following analysis based on this data set as the starting point. Further recommendations should be verified and produced through an analysis of a larger and more reliable dataset.

Phase 3 - PROCESS

Installing and loading packages

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.5     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.0.2     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(janitor)

## Warning: package 'janitor' was built under R version 4.1.2

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(ggplot2)
library(tidyr)
library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(dplyr)

Loading dataset

The files were imported into R Studio and the data frames were created with simplified names.

daily_activity <- read_csv("dailyActivity_merged.csv")

## Rows: 940 Columns: 15

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Daily_Calories <- read_csv("dailyCalories_merged.csv")

## Rows: 940 Columns: 3

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, Calories

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Daily_Intensities <- read_csv("dailyIntensities_merged.csv")

## Rows: 940 Columns: 10

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): ActivityDay
## dbl (9): Id, SedentaryMinutes, LightlyActiveMinutes, FairlyActiveMinutes, Ve...

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Daily_Steps <- read_csv("dailySteps_merged.csv")

## Rows: 940 Columns: 3

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, StepTotal

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Heart_rate_seconds <- read_csv("heartrate_seconds_merged.csv")

## Rows: 2483658 Columns: 3

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): Time
## dbl (2): Id, Value

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Minute_METs <- read_csv("minuteMETsNarrow_merged.csv")

## Rows: 1325580 Columns: 3

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): ActivityMinute
## dbl (2): Id, METs

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Sleep_Day <- read_csv("sleepDay_merged.csv")

## Rows: 413 Columns: 5

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Weight_log_info <- read_csv("weightLogInfo_merged.csv")

## Rows: 67 Columns: 8

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): Date
## dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
## lgl (1): IsManualReport

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Viewing the dataset

head(daily_activity)

## # A tibble: 6 x 15
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie~
##        <dbl> <chr>             <dbl>         <dbl>           <dbl>            <dbl>
## 1 1503960366 4/12/2016         13162          8.5             8.5                 0
## 2 1503960366 4/13/2016         10735          6.97            6.97                0
## 3 1503960366 4/14/2016         10460          6.74            6.74                0
## 4 1503960366 4/15/2016          9762          6.28            6.28                0
## 5 1503960366 4/16/2016         12669          8.16            8.16                0
## 6 1503960366 4/17/2016          9705          6.48            6.48                0
## # ... with 9 more variables: VeryActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>

glimpse(daily_activity)

## Rows: 940
## Columns: 15
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036~
## $ ActivityDate             <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/~
## $ TotalSteps               <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019~
## $ TotalDistance            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5~
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3~
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0~
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveMinutes        <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4~
## $ FairlyActiveMinutes      <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21~
## $ LightlyActiveMinutes     <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, ~
## $ SedentaryMinutes         <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818~
## $ Calories                 <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203~

colnames(daily_activity)

##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"

head(Daily_Calories)

## # A tibble: 6 x 3
##           Id ActivityDay Calories
##        <dbl> <chr>          <dbl>
## 1 1503960366 4/12/2016       1985
## 2 1503960366 4/13/2016       1797
## 3 1503960366 4/14/2016       1776
## 4 1503960366 4/15/2016       1745
## 5 1503960366 4/16/2016       1863
## 6 1503960366 4/17/2016       1728

glimpse(Daily_Calories)

## Rows: 940
## Columns: 3
## $ Id          <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960366~
## $ ActivityDay <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/2016", "4/16/~
## $ Calories    <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2035, 1786, 1775~

colnames(Daily_Calories)

## [1] "Id"          "ActivityDay" "Calories"

head(Daily_Intensities)

## # A tibble: 6 x 10
##           Id ActivityDay SedentaryMinutes LightlyActiveMinutes FairlyActiveMinu~
##        <dbl> <chr>                  <dbl>                <dbl>             <dbl>
## 1 1503960366 4/12/2016                728                  328                13
## 2 1503960366 4/13/2016                776                  217                19
## 3 1503960366 4/14/2016               1218                  181                11
## 4 1503960366 4/15/2016                726                  209                34
## 5 1503960366 4/16/2016                773                  221                10
## 6 1503960366 4/17/2016                539                  164                20
## # ... with 5 more variables: VeryActiveMinutes <dbl>,
## #   SedentaryActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, VeryActiveDistance <dbl>

glimpse(Daily_Intensities)

## Rows: 940
## Columns: 10
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036~
## $ ActivityDay              <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/~
## $ SedentaryMinutes         <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818~
## $ LightlyActiveMinutes     <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, ~
## $ FairlyActiveMinutes      <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21~
## $ VeryActiveMinutes        <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4~
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0~
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3~
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5~

colnames(Daily_Intensities)

##  [1] "Id"                       "ActivityDay"             
##  [3] "SedentaryMinutes"         "LightlyActiveMinutes"    
##  [5] "FairlyActiveMinutes"      "VeryActiveMinutes"       
##  [7] "SedentaryActiveDistance"  "LightActiveDistance"     
##  [9] "ModeratelyActiveDistance" "VeryActiveDistance"

head(Daily_Steps)

## # A tibble: 6 x 3
##           Id ActivityDay StepTotal
##        <dbl> <chr>           <dbl>
## 1 1503960366 4/12/2016       13162
## 2 1503960366 4/13/2016       10735
## 3 1503960366 4/14/2016       10460
## 4 1503960366 4/15/2016        9762
## 5 1503960366 4/16/2016       12669
## 6 1503960366 4/17/2016        9705

glimpse(Daily_Steps)

## Rows: 940
## Columns: 3
## $ Id          <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960366~
## $ ActivityDay <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/2016", "4/16/~
## $ StepTotal   <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019, 15506, 1054~

colnames(Daily_Steps)

## [1] "Id"          "ActivityDay" "StepTotal"

head(Heart_rate_seconds)

## # A tibble: 6 x 3
##           Id Time                 Value
##        <dbl> <chr>                <dbl>
## 1 2022484408 4/12/2016 7:21:00 AM    97
## 2 2022484408 4/12/2016 7:21:05 AM   102
## 3 2022484408 4/12/2016 7:21:10 AM   105
## 4 2022484408 4/12/2016 7:21:20 AM   103
## 5 2022484408 4/12/2016 7:21:25 AM   101
## 6 2022484408 4/12/2016 7:22:05 AM    95

glimpse(Heart_rate_seconds)

## Rows: 2,483,658
## Columns: 3
## $ Id    <dbl> 2022484408, 2022484408, 2022484408, 2022484408, 2022484408, 2022~
## $ Time  <chr> "4/12/2016 7:21:00 AM", "4/12/2016 7:21:05 AM", "4/12/2016 7:21:~
## $ Value <dbl> 97, 102, 105, 103, 101, 95, 91, 93, 94, 93, 92, 89, 83, 61, 60, ~

colnames(Heart_rate_seconds)

## [1] "Id"    "Time"  "Value"

head(Minute_METs)

## # A tibble: 6 x 3
##           Id ActivityMinute         METs
##        <dbl> <chr>                 <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM    10
## 2 1503960366 4/12/2016 12:01:00 AM    10
## 3 1503960366 4/12/2016 12:02:00 AM    10
## 4 1503960366 4/12/2016 12:03:00 AM    10
## 5 1503960366 4/12/2016 12:04:00 AM    10
## 6 1503960366 4/12/2016 12:05:00 AM    12

glimpse(Minute_METs)

## Rows: 1,325,580
## Columns: 3
## $ Id             <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960~
## $ ActivityMinute <chr> "4/12/2016 12:00:00 AM", "4/12/2016 12:01:00 AM", "4/12~
## $ METs           <dbl> 10, 10, 10, 10, 10, 12, 12, 12, 12, 12, 12, 12, 10, 10,~

colnames(Minute_METs)

## [1] "Id"             "ActivityMinute" "METs"

head(Sleep_Day)

## # A tibble: 6 x 5
##           Id SleepDay              TotalSleepRecor~ TotalMinutesAsl~ TotalTimeInBed
##        <dbl> <chr>                            <dbl>            <dbl>          <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM                1              327            346
## 2 1503960366 4/13/2016 12:00:00 AM                2              384            407
## 3 1503960366 4/15/2016 12:00:00 AM                1              412            442
## 4 1503960366 4/16/2016 12:00:00 AM                2              340            367
## 5 1503960366 4/17/2016 12:00:00 AM                1              700            712
## 6 1503960366 4/19/2016 12:00:00 AM                1              304            320

glimpse(Sleep_Day)

## Rows: 413
## Columns: 5
## $ Id                 <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150~
## $ SleepDay           <chr> "4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM", "~
## $ TotalSleepRecords  <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ TotalMinutesAsleep <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2~
## $ TotalTimeInBed     <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3~

colnames(Sleep_Day)

## [1] "Id"                 "SleepDay"           "TotalSleepRecords" 
## [4] "TotalMinutesAsleep" "TotalTimeInBed"

head(Weight_log_info)

## # A tibble: 6 x 8
##           Id Date      WeightKg WeightPounds   Fat   BMI IsManualReport    LogId
##        <dbl> <chr>        <dbl>        <dbl> <dbl> <dbl> <lgl>             <dbl>
## 1 1503960366 5/2/2016~     52.6         116.    22  22.6 TRUE            1.46e12
## 2 1503960366 5/3/2016~     52.6         116.    NA  22.6 TRUE            1.46e12
## 3 1927972279 4/13/201~    134.          294.    NA  47.5 FALSE           1.46e12
## 4 2873212765 4/21/201~     56.7         125.    NA  21.5 TRUE            1.46e12
## 5 2873212765 5/12/201~     57.3         126.    NA  21.7 TRUE            1.46e12
## 6 4319703577 4/17/201~     72.4         160.    25  27.5 TRUE            1.46e12

glimpse(Weight_log_info)

## Rows: 67
## Columns: 8
## $ Id             <dbl> 1503960366, 1503960366, 1927972279, 2873212765, 2873212~
## $ Date           <chr> "5/2/2016 11:59:59 PM", "5/3/2016 11:59:59 PM", "4/13/2~
## $ WeightKg       <dbl> 52.6, 52.6, 133.5, 56.7, 57.3, 72.4, 72.3, 69.7, 70.3, ~
## $ WeightPounds   <dbl> 115.9631, 115.9631, 294.3171, 125.0021, 126.3249, 159.6~
## $ Fat            <dbl> 22, NA, NA, NA, NA, 25, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ BMI            <dbl> 22.65, 22.65, 47.54, 21.45, 21.69, 27.45, 27.38, 27.25,~
## $ IsManualReport <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, ~
## $ LogId          <dbl> 1.462234e+12, 1.462320e+12, 1.460510e+12, 1.461283e+12,~

colnames(Weight_log_info)

## [1] "Id"             "Date"           "WeightKg"       "WeightPounds"  
## [5] "Fat"            "BMI"            "IsManualReport" "LogId"

Cleaning the dataset

Daily_Activity

From a quick overview of column names, the daily_activity data frame appears to incorporate data from Daily_Calories, Daily_Intensities, Daily_Steps data frames. All 4 data frames contain ‘Id’ columns which we can use to compare and determine whether the same number of observations match those Ids.

The sqldf package is loaded to determine whether the values of Daily_Calories, Daily_Intensities, and Daily_Steps are contained in daily_activity using SQL syntax. However, in order to compare those data frames, it is required that the number of columns in the data frames must be the same. Therefore, a temporary data frame with the important columns would be created first.

library(sqldf)

## Warning: package 'sqldf' was built under R version 4.1.2

## Loading required package: gsubfn

## Warning: package 'gsubfn' was built under R version 4.1.2

## Loading required package: proto

## Warning: package 'proto' was built under R version 4.1.2

## Loading required package: RSQLite

## Warning: package 'RSQLite' was built under R version 4.1.2

daily_activity_2 <- daily_activity %>% 
  select(Id,ActivityDate,Calories) 

head(daily_activity_2)

## # A tibble: 6 x 3
##           Id ActivityDate Calories
##        <dbl> <chr>           <dbl>
## 1 1503960366 4/12/2016        1985
## 2 1503960366 4/13/2016        1797
## 3 1503960366 4/14/2016        1776
## 4 1503960366 4/15/2016        1745
## 5 1503960366 4/16/2016        1863
## 6 1503960366 4/17/2016        1728

Check_Activity_Calories <- sqldf('SELECT * FROM daily_activity_2 INTERSECT SELECT * FROM Daily_Calories')

head(Check_Activity_Calories)

##           Id ActivityDate Calories
## 1 1503960366    4/12/2016     1985
## 2 1503960366    4/13/2016     1797
## 3 1503960366    4/14/2016     1776
## 4 1503960366    4/15/2016     1745
## 5 1503960366    4/16/2016     1863
## 6 1503960366    4/17/2016     1728

nrow(Check_Activity_Calories)

## [1] 940

daily_activity_3 <- daily_activity %>% 
  select(Id, ActivityDate,SedentaryMinutes,LightlyActiveMinutes,FairlyActiveMinutes,VeryActiveMinutes,SedentaryActiveDistance,LightActiveDistance,ModeratelyActiveDistance,VeryActiveDistance)

head(daily_activity_3)

## # A tibble: 6 x 10
##           Id ActivityDate SedentaryMinutes LightlyActiveMinutes FairlyActiveMin~
##        <dbl> <chr>                   <dbl>                <dbl>            <dbl>
## 1 1503960366 4/12/2016                 728                  328               13
## 2 1503960366 4/13/2016                 776                  217               19
## 3 1503960366 4/14/2016                1218                  181               11
## 4 1503960366 4/15/2016                 726                  209               34
## 5 1503960366 4/16/2016                 773                  221               10
## 6 1503960366 4/17/2016                 539                  164               20
## # ... with 5 more variables: VeryActiveMinutes <dbl>,
## #   SedentaryActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, VeryActiveDistance <dbl>

Check_Activity_Intensities <- sqldf('SELECT * FROM daily_activity_3 INTERSECT SELECT * FROM Daily_Intensities')

head(Check_Activity_Intensities)

##           Id ActivityDate SedentaryMinutes LightlyActiveMinutes
## 1 1503960366    4/12/2016              728                  328
## 2 1503960366    4/13/2016              776                  217
## 3 1503960366    4/14/2016             1218                  181
## 4 1503960366    4/15/2016              726                  209
## 5 1503960366    4/16/2016              773                  221
## 6 1503960366    4/17/2016              539                  164
##   FairlyActiveMinutes VeryActiveMinutes SedentaryActiveDistance
## 1                  13                25                       0
## 2                  19                21                       0
## 3                  11                30                       0
## 4                  34                29                       0
## 5                  10                36                       0
## 6                  20                38                       0
##   LightActiveDistance ModeratelyActiveDistance VeryActiveDistance
## 1                6.06                     0.55               1.88
## 2                4.71                     0.69               1.57
## 3                3.91                     0.40               2.44
## 4                2.83                     1.26               2.14
## 5                5.04                     0.41               2.71
## 6                2.51                     0.78               3.19

nrow(Check_Activity_Intensities)

## [1] 940

daily_activity_4 <- daily_activity %>% 
  select(Id,ActivityDate,TotalSteps)

head(daily_activity_4)

## # A tibble: 6 x 3
##           Id ActivityDate TotalSteps
##        <dbl> <chr>             <dbl>
## 1 1503960366 4/12/2016         13162
## 2 1503960366 4/13/2016         10735
## 3 1503960366 4/14/2016         10460
## 4 1503960366 4/15/2016          9762
## 5 1503960366 4/16/2016         12669
## 6 1503960366 4/17/2016          9705

Check_Activity_Steps <- sqldf('SELECT * FROM daily_activity_4 INTERSECT SELECT * FROM Daily_Steps')

head(Check_Activity_Steps)

##           Id ActivityDate TotalSteps
## 1 1503960366    4/12/2016      13162
## 2 1503960366    4/13/2016      10735
## 3 1503960366    4/14/2016      10460
## 4 1503960366    4/15/2016       9762
## 5 1503960366    4/16/2016      12669
## 6 1503960366    4/17/2016       9705

nrow(Check_Activity_Steps)

## [1] 940

Note

The outputs (observations & variables) of all temporary data frame created match the outputs (observations & variables) of original daily_activity data frame (using head() function to show).
The outputs of all temporary data frame created match the outputs of SQL data frames. The observation of all SQL data frames is equal to 940, which can be concluded that the values of Daily_Calories, Daily_Intensities, and Daily_Steps are contained in daily_activity. Therefore, we will use daily_activity for analysis later in place for those three mentioned data frames.

nrow(daily_activity[duplicated(daily_activity),])

## [1] 0

0 duplicate row was found.

colSums(daily_activity != 0)

##                       Id             ActivityDate               TotalSteps 
##                      940                      940                      863 
##            TotalDistance          TrackerDistance LoggedActivitiesDistance 
##                      862                      862                       32 
##       VeryActiveDistance ModeratelyActiveDistance      LightActiveDistance 
##                      527                      554                      855 
##  SedentaryActiveDistance        VeryActiveMinutes      FairlyActiveMinutes 
##                       82                      531                      556 
##     LightlyActiveMinutes         SedentaryMinutes                 Calories 
##                      856                      939                      936

The daily_activity data frame appears to contain a lot of cell equal to ‘0’. In order to prevent skewed results, the filter() function will be used to omit those zero cells.

daily_activity_new <- daily_activity %>% 
  filter(TotalSteps !=0)
daily_activity_new <- daily_activity %>%
  filter(TotalDistance != 0)

Sleep_Day

Sleep_Day data frame has the column SleepDay with combined data of Date and Time. The function ‘separate()’ will be used to divide the data into 2 columns to use later in analysis.

Sleep_Day_New <- Sleep_Day %>%
  separate(SleepDay,c("Date","Time")," ")

## Warning: Expected 2 pieces. Additional pieces discarded in 413 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].

head(Sleep_Day_New)

## # A tibble: 6 x 6
##           Id Date      Time     TotalSleepRecor~ TotalMinutesAsl~ TotalTimeInBed
##        <dbl> <chr>     <chr>               <dbl>            <dbl>          <dbl>
## 1 1503960366 4/12/2016 12:00:00                1              327            346
## 2 1503960366 4/13/2016 12:00:00                2              384            407
## 3 1503960366 4/15/2016 12:00:00                1              412            442
## 4 1503960366 4/16/2016 12:00:00                2              340            367
## 5 1503960366 4/17/2016 12:00:00                1              700            712
## 6 1503960366 4/19/2016 12:00:00                1              304            320

nrow(Sleep_Day_New[duplicated(Sleep_Day_New),])

## [1] 3

3 duplicate rows were found and should be removed.

Sleep_Day_New  <- unique(Sleep_Day_New)
nrow(Sleep_Day_New)

## [1] 410

Phase 4 - ANALYZE

Summarizing the data

n_distinct(daily_activity_new$Id)

## [1] 33

n_distinct(Heart_rate_seconds$Id)

## [1] 14

n_distinct(Minute_METs$Id)

## [1] 33

n_distinct(Sleep_Day_New$Id)

## [1] 24

n_distinct(Weight_log_info$Id)

## [1] 8

nrow(daily_activity_new)

## [1] 862

nrow(Heart_rate_seconds)

## [1] 2483658

nrow(Minute_METs)

## [1] 1325580

nrow(Sleep_Day_New)

## [1] 410

nrow(Weight_log_info)

## [1] 67

The Heart_rate_seconds and the Weight_log_info data frames appears to show a low number of distinct participants compared to other data frames (using n_distinct() function). Therefore, it would not be sensible to use these two data frames to draw conclusions and recommendations for this project.

Key Statistics

daily_activity_new %>% 
  select(TotalSteps,TotalDistance,SedentaryMinutes,LightlyActiveMinutes,FairlyActiveMinutes,VeryActiveMinutes,SedentaryActiveDistance,LightActiveDistance,ModeratelyActiveDistance,VeryActiveDistance,Calories) %>% 
  summary()

##    TotalSteps    TotalDistance    SedentaryMinutes LightlyActiveMinutes
##  Min.   :    8   Min.   : 0.010   Min.   :   0.0   Min.   :  0.0       
##  1st Qu.: 4927   1st Qu.: 3.373   1st Qu.: 721.2   1st Qu.:147.0       
##  Median : 8054   Median : 5.590   Median :1020.5   Median :208.5       
##  Mean   : 8329   Mean   : 5.986   Mean   : 955.2   Mean   :210.3       
##  3rd Qu.:11096   3rd Qu.: 7.905   3rd Qu.:1189.0   3rd Qu.:272.0       
##  Max.   :36019   Max.   :28.030   Max.   :1440.0   Max.   :518.0       
##  FairlyActiveMinutes VeryActiveMinutes SedentaryActiveDistance
##  Min.   :  0.00      Min.   :  0.00    Min.   :0.000000       
##  1st Qu.:  0.00      1st Qu.:  0.00    1st Qu.:0.000000       
##  Median :  8.00      Median :  7.00    Median :0.000000       
##  Mean   : 14.79      Mean   : 23.04    Mean   :0.001752       
##  3rd Qu.: 21.00      3rd Qu.: 35.00    3rd Qu.:0.000000       
##  Max.   :143.00      Max.   :210.00    Max.   :0.110000       
##  LightActiveDistance ModeratelyActiveDistance VeryActiveDistance    Calories   
##  Min.   : 0.000      Min.   :0.0000           Min.   : 0.000     Min.   :  52  
##  1st Qu.: 2.350      1st Qu.:0.0000           1st Qu.: 0.000     1st Qu.:1857  
##  Median : 3.580      Median :0.3100           Median : 0.410     Median :2220  
##  Mean   : 3.643      Mean   :0.6189           Mean   : 1.639     Mean   :2362  
##  3rd Qu.: 4.897      3rd Qu.:0.8675           3rd Qu.: 2.277     3rd Qu.:2832  
##  Max.   :10.710      Max.   :6.4800           Max.   :21.920     Max.   :4900

Note

Total Steps: The average daily steps recorded from the participants are 8,329 steps. It is believed through decades that the daily 10,000-step meter is recommended for health and longevity. Even though the participants do not maintain the optimal step count, new studies showed that walking at least 7,000 steps daily still yield health benefits. Therefore, Bellabeat users are doing well in this area. They can easily achieve the 10,000 daily steps goal if they can connect with a walking partner or walking group on BellaBeat app.
Sedentary Minutes: The average sedentary time is 955.2 minutes or 16 hours. The BellaBeat users have spent a significant time for little or no exercise,probably involving lots of lying down and sitting which poses more health risks, such as metabolic syndrome.
Very Active Minutes (full-time athlete) and Fairly Active Minutes (active lifestyle): Average very active minutes and fairly active minutes were at 23.04 minutes and 14.79 minutes respectively. In term of weekly time (5 days/week), average very active minutes and fairly active minutes were 115.2 minutes (~1.92 hours) and 73.95 minutes(~1.2 hours). According to CDC, adults should do 75 minutes (1 hour and 15 minutes) to 150 minutes (2 hours and 30 minutes) a week of vigorous-intensity physical activity or at least 150 minutes (2 hours and 30 minutes) to 300 minutes (5 hours) a week of moderate-intensity. The participants have been doing relatively well in this area.
Calories: The average amount of daily calories burned is 2,362 kcal. Studies suggested that individuals need to burn 3500 calories a day to lose one pound a day, though this 3,500-calorie-burning workout in one day may not realistic for some people. It is said that the average person burns around 1800 calories a day doing absolutely nothing. The BellaBeat users are burning calories more than normal norms, but they can shed more pounds if they can achieve the daily calories burnt target.

Heart_rate_seconds %>% 
  select(Value) %>% 
  summary()

##      Value       
##  Min.   : 36.00  
##  1st Qu.: 63.00  
##  Median : 73.00  
##  Mean   : 77.33  
##  3rd Qu.: 88.00  
##  Max.   :203.00

Note

According to Mayoclinic, a normal resting heart rate for adults ranges from 60 to 100 beats per minute. Given the low participant in the data frame, the average heart rate is recorded at 77 beats per minute (bpm) from the BellaBeat users which is normal in recommended range. Heart rate will vary depending on age, emotions, gender, fitness activity, stress level, medications, body mass, body position. It is said that the more fit users are, the lower heart rate is.

Minute_METs %>% 
  select(METs) %>% 
  summary()

##       METs       
##  Min.   :  0.00  
##  1st Qu.: 10.00  
##  Median : 10.00  
##  Mean   : 14.69  
##  3rd Qu.: 11.00  
##  Max.   :157.00

METs is known as metabolic equivalent to task, the ratio of the metabolic rate during exercise to the metabolic rate at rest. A MET score of 1 represents the amount of energy used when a person is at rest. Light activity requires less than 3 METs, moderate activities require 3 - 6 METs whereas vigorous activitues require more than 6 METs. The average METs recorded from the BellaBeat participant throughout the day is 14.69 METs which is unbelievably high. Given that the participants are low active (high sedentary minutes), this METs figure is conflicting with other results and raises concerns for its data accuracy.

Sleep_Day_New %>% 
  select(TotalSleepRecords,TotalMinutesAsleep,TotalTimeInBed) %>% 
  summary()

##  TotalSleepRecords TotalMinutesAsleep TotalTimeInBed 
##  Min.   :1.00      Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:1.00      1st Qu.:361.0      1st Qu.:403.8  
##  Median :1.00      Median :432.5      Median :463.0  
##  Mean   :1.12      Mean   :419.2      Mean   :458.5  
##  3rd Qu.:1.00      3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :3.00      Max.   :796.0      Max.   :961.0

The average total minutes being asleep is recorded as 419.2 minutes equivalent to 7 hours. The average total time BellaBeat users spending in bed is 458.5 minutes, equivalent to 7.64 hours. That said, 38 minutes is the average time BellaBeat users spent restless or awake in bed. Adults are recommended to aim for 7 - 9 hours of sleep to function at their best. BellaBeat users are doing well in this area. However, regularly waking up for at least 20 to 30 minutes at night or taking more than 30 minutes to drift off to sleep might be a sign insomnia. Further data should be collected to understand more about sleep phases and other factors contributing to a good sleep.

Weight_log_info %>% 
  select(WeightKg,WeightPounds,BMI) %>% 
  summary()

##     WeightKg       WeightPounds        BMI       
##  Min.   : 52.60   Min.   :116.0   Min.   :21.45  
##  1st Qu.: 61.40   1st Qu.:135.4   1st Qu.:23.96  
##  Median : 62.50   Median :137.8   Median :24.39  
##  Mean   : 72.04   Mean   :158.8   Mean   :25.19  
##  3rd Qu.: 85.05   3rd Qu.:187.5   3rd Qu.:25.56  
##  Max.   :133.50   Max.   :294.3   Max.   :47.54

According to CDC, healthy weight has BMI values in the range of 18.5 - 24.9. Overweight has BMI values in the range of 25.0 - 29.9. Despite the low number of participants in the data frame, the average BMI is recorded at 25.19 which is an overweight value. Even though BMI gives a good estimate of total body fat for most people, it does not apply for everybody. For example, very muscular people can have a high BMI because of their muscle mass, even though they’re not necessarily overweight. A person with a high BMI should be evaluated by a health care provider, who might use other factors such as skin fold thickness (a measure of body fat), waist size, evaluations of diet and family health problems, and other factors to find out if a person’s weight might pose a health risk.

Phase 5 - SHARE

Total Steps and Total Calories Burned

sp <- ggplot(data=daily_activity_new, aes(x=TotalSteps, y = Calories,color = VeryActiveMinutes))

sp + geom_point() + geom_smooth()+ theme_minimal() + scale_color_gradient(low="steelblue", high="orange")+ labs(title = "The relationship between Total Steps and Total Calories Burnt",x = "Total Steps", y = "Total Calories Burned (kcal)")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Figure 1: The relationship between Total Steps and Total Calories Burnt

Figure 1 shows a positive relationship between Total Steps and Total Calories Burnt. The more steps the BellaBeat users took, the higher amount of calories burnt during this process.

Very Active Minutes and Total Calories Burned

sp1 <- ggplot(data=daily_activity_new, aes(x=VeryActiveMinutes, y = Calories))

sp1 + geom_point(color = "darksalmon" ) + geom_smooth()+ theme_minimal() + labs(title = "The relationship between Very Active Minutes and Total Calories Burned",x = "Very Active Minutes", y = "Total Calories Burned (kcal)")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Figure 2: The relationship between Very Active Minutes and Total Calories Burnt

Figure 2 shows a positive relationship between Very Active Minutes and Total Calories Burnt. The more active the users are, the more calories the users burn.

Fairly Active Minutes and Total Calories Burned

sp2 <- ggplot(data = daily_activity_new, aes(x = FairlyActiveMinutes, y = Calories))
sp2 + geom_point(color = "darkorchid4") + geom_smooth() + theme_minimal() + labs(title = "The relationship between Fairly Active Minutes and Total Calories Burned", x = "Fairly Active Minutes", y = "Total Calories Burned (kcal)")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Figure 3: The relationship between Fairly Active Minutes and Total Calories Burnt

Figure 3 shows a negative relationship between Fairly Active Minutes and Total Calories Burnt. It is quite different from what I expected from the level of being fairly active.

Light Active Minutes and Total Calories Burned

sp3 <- ggplot(data = daily_activity_new, aes(x = LightlyActiveMinutes, y = Calories))
sp3 + geom_point(color = "orange" ) + geom_smooth()+ theme_minimal() + labs(title = "The relationship between Lightly Active Minutes and Total Calories Burned",x = "Lighty Active Minutes", y = "Total Calories Burned (kcal)")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Figure 4: The relationship between Lightly Active Minutes and Total Calories Burnt

Figure 4 shows a positive relationship between Lightly Active Minutes and Total Calories Burnt. It raises questions about whether it is sensible to believe that light exercise can result in high calories burnt.

Total distance

sp4 <- ggplot(data = daily_activity_new, aes(x = TotalDistance, y = Calories))

sp4 + geom_point(color = "pink" ) + geom_smooth()+ theme_minimal() + labs(title = "The relationship between Total Distance and Total Calories Burned",x = "Total Distance", y = "Total Calories Burned (kcal)")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Figure 5: The relationship between Total Distance and Total Calories Burnt

Figure 5 shows a positive relationship between Total Distance and Total Calories Burnt. The greater distance the Bellabeat users have moved, the higher amount of Calories they burn.

Total Minutes Asleep and Total Time in Bed

sp5 <- ggplot(data = Sleep_Day_New, aes(x = TotalMinutesAsleep, y = TotalTimeInBed))

sp5 + geom_point() + geom_smooth()+ theme_minimal() + labs(title = "The relationship between Total Minutes Asleep and Total Time in Bed",x = "Total Minutes Asleep", y = "Total Time in Bed")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Figure 6: The relationship between Total Minutes Asleep and Total Time in Bed

Figure 6 depicts a positive relationship between Total Minutes Asleep and Total Time In Bed. It shows that the participants have relatively good sleep quality when they go to bed without any sign of sleep troubles.

Intensity type

mean_sedentary <- mean(daily_activity_new$SedentaryMinutes)
mean_lightly <- mean(daily_activity_new$LightlyActiveMinutes)
mean_fairly <- mean(daily_activity_new$FairlyActiveMinutes)
mean_very <- mean(daily_activity_new$VeryActiveMinutes)

mean_intensity = c(mean_sedentary, mean_lightly, 
                   mean_fairly, mean_very)
intensity_labels = c("Sedentary", "Lightly Active", "Fairly Active", 
                     "Very Active")
intensitydf <- data.frame(intensity_labels, mean_intensity)

ggplot(data = intensitydf, aes(x = mean_intensity, y = intensity_labels)) +
  geom_col(fill = 'darkblue') + theme_minimal() +
  labs(title = "Average Intensity Minutes by Intensity Type",
       x = "Average Intensity [min]", y = "Intensity Type")

Figure 7: The relationship between Average Intensity Minutes and Intensity Type

Figure 7 shows the comparison across Intensity Type. The sedentary activity is undeniably the highest among other categories. It raises concerns about the low physical lifestyle that BellaBeat users are living.

Phase 6 - ACT

Since its launch, BellaBeat empowers women by providing insights on their activity, sleep, stress, hydration, and reproductive health. Based on an analysis of BellaBeat users’ experiences with features and products, some recommendations for Bellabeat’s future growth will be suggested.

Bellabeat app:

Revamp the app to allow social networking where women can connect and socialize with friends and like-minded people. They can post/ suggest their favorite workout, fitness tips, meal plans, ask questions or even plan their workout schedule together. It will motivate users to keep to their goals with a circle of positive encouragement from fellow sisters.
Allow nutritionist coach and wellness coach to open live group discussion for fitness advice and also possible paid 1-on-1 session.
Create weekly challenge or workout plans and point reward to encourage use. The weekly challenge can be completed by self or with other people to get higher rewards. The point reward can later be transferred to discounts/coupon to use at fitness stores or grocery stores. It will benefit BellaBeat as potential new local businesses approach BellaBeat to pay for advertising to display on the app.
Alert notifications to encourage users to achieve 10,000 steps goal.
Notifications to alert users if they have been idle or low active for an extended period of time throughout the day. These notifications can be sent through BellaBeat Leaf.
Encourage users to input weights and height to track BMI and suggest workout routines for different weight categories.
Give virtual badges of completion for any workout routines or challenges they have completed on their profiles to encourage users to invest more time and effort.
Create bedtime schedules to help you meet the sleep goals,ideally increase more than 7 hours.

Bellabeat membership:

Offer 30-day trial subscription-based membership.
Referrals: when users invite friends to use membership with their invite code, they can get point rewards and their friends can get discount on the membership.
Bundle discount: when users register for membership, they can have discount on the line of smart devices.insta

BellaBeat

Chi Tran

12/28/2021

Introduction

Data Analysis

Phase 1 - ASK

1. Identify the business task

2. Guiding questions

3. Key stakeholders

Phase 2 - PREPARE

1. Identify the data source

2. How the data is organized

3. Determine the credibility and integrity of data

Phase 3 - PROCESS

Installing and loading packages

Loading dataset

Viewing the dataset

Cleaning the dataset

Note

Phase 4 - ANALYZE

Note

Note

Phase 6 - ACT