How Can a Wellness Company Play it Smart?

The Scenario

Bellabeat is a high-tech manufacturer of health-focused products for women. Urška Sršen and Sando Mur founded Bellabeat in 2013. Urška Sršen believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. I have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights I discover will then help guide marketing strategy for the company. I will present my analysis to the Bellabeat executive team along with my high-level recommendations for Bellabeat’s marketing strategy.

Step 1. Ask

Step 2. Prepare

Step 3. Process

daily_activity <- read.csv("image/dailyActivity_merged.csv")
daily_calories <- read.csv("image/dailyCalories_merged.csv")
sleep_day <- read.csv("image/sleepDay_merged.csv")
daily_intensities <- read.csv("image/dailyIntensities_merged.csv")
weight_log <- read.csv("image/weightLogInfo_merged.csv")

Installing and loading the tidyverse, skimr, janitor, lubridate, sqldf, and dplyr packages:

install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
install.packages("skimr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
install.packages("janitor")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
install.packages("lubridate")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
install.packages("sqldf")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(skimr)
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Warning in fun(libname, pkgname): couldn't connect to display ":0"
## Loading required package: RSQLite
library(dplyr)

daily_activity

head(daily_activity)
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366    4/12/2016      13162          8.50            8.50
## 2 1503960366    4/13/2016      10735          6.97            6.97
## 3 1503960366    4/14/2016      10460          6.74            6.74
## 4 1503960366    4/15/2016       9762          6.28            6.28
## 5 1503960366    4/16/2016      12669          8.16            8.16
## 6 1503960366    4/17/2016       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728
colnames(daily_activity)
##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"
str(daily_activity)
## 'data.frame':    940 obs. of  15 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ TotalSteps              : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
##  $ TotalDistance           : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ Calories                : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
  • The daily_activity file holds most of the tracker-recorded data, such as calories, intensities, and steps.
  • The output above shows 940 observations of 15 variables; 33 different people logged their daily activity, calorie expenditure, and steps over 31 days.

daily_calories

str(daily_calories)
## 'data.frame':    940 obs. of  3 variables:
##  $ Id         : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDay: chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ Calories   : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
  • The daily_calories file contains information that is already included in the daily_activity file, with the same number of observations (940).

daily_intensities

str(daily_intensities)
## 'data.frame':    940 obs. of  10 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDay             : chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...
  • The daily_intensities file contains information that is already included in the daily_activity file, with the same number of observations (940).

sleep_day

str(sleep_day)
## 'data.frame':    413 obs. of  5 variables:
##  $ Id                : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ SleepDay          : chr  "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
##  $ TotalSleepRecords : int  1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: int  327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : int  346 407 442 367 712 320 377 364 384 449 ...

weight_log

str(weight_log)
## 'data.frame':    67 obs. of  8 variables:
##  $ Id            : num  1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
##  $ Date          : chr  "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
##  $ WeightKg      : num  52.6 52.6 133.5 56.7 57.3 ...
##  $ WeightPounds  : num  116 116 294 125 126 ...
##  $ Fat           : int  22 NA NA NA NA 25 NA NA NA NA ...
##  $ BMI           : num  22.6 22.6 47.5 21.5 21.7 ...
##  $ IsManualReport: chr  "True" "True" "False" "True" ...
##  $ LogId         : num  1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
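  • The weight_log file has fewer observations (67). The Fat field contains many NA values, and IsManualReport is a Boolean-like field stored as the character strings "True" and "False".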
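
As a minimal sketch of how these two fields could be handled, the snippet below converts the "True"/"False" strings to a logical vector and counts the missing Fat entries. The toy data frame `weight_toy` is a stand-in built from the first six rows shown in the output of this report, not the full table.

```r
# Toy rows mirroring the column types shown by str(weight_log)
weight_toy <- data.frame(
  Fat = c(22L, NA, NA, NA, NA, 25L),
  IsManualReport = c("True", "True", "False", "True", "True", "True"),
  stringsAsFactors = FALSE
)

# Convert the "True"/"False" strings to a real logical vector
weight_toy$IsManualReport <- weight_toy$IsManualReport == "True"

# Count the missing body-fat entries
n_missing_fat <- sum(is.na(weight_toy$Fat))

str(weight_toy)
n_missing_fat
```

The same two lines should work on the real weight_log table, assuming the column names are unchanged at that point.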

Observation

  • All the files have Id as a common field.
  • The daily_activity table appears to hold the same observations and values as the calories and intensities tables, so we should confirm that the values actually match for each Id, the key shared by all tables. First we build a temporary table from daily_activity with the same columns as daily_calories and compare the two:
daily_activity_df <- daily_activity %>%
  select(Id, ActivityDate, Calories)

head(daily_activity_df)
##           Id ActivityDate Calories
## 1 1503960366    4/12/2016     1985
## 2 1503960366    4/13/2016     1797
## 3 1503960366    4/14/2016     1776
## 4 1503960366    4/15/2016     1745
## 5 1503960366    4/16/2016     1863
## 6 1503960366    4/17/2016     1728
check1 <- sqldf('SELECT * FROM daily_activity_df INTERSECT SELECT * FROM daily_calories')
head(check1)
##           Id ActivityDate Calories
## 1 1503960366    4/12/2016     1985
## 2 1503960366    4/13/2016     1797
## 3 1503960366    4/14/2016     1776
## 4 1503960366    4/15/2016     1745
## 5 1503960366    4/16/2016     1863
## 6 1503960366    4/17/2016     1728
str(check1)
## 'data.frame':    940 obs. of  3 variables:
##  $ Id          : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate: chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ Calories    : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...

The intersection in check1 contains all 940 rows, so the daily_calories table matches the corresponding columns of daily_activity. We do the same with daily_intensities.

daily_activity_df2 <- daily_activity %>%
  select(Id, ActivityDate, SedentaryMinutes, LightlyActiveMinutes, FairlyActiveMinutes, VeryActiveMinutes, SedentaryActiveDistance, LightActiveDistance, ModeratelyActiveDistance, VeryActiveDistance)

head(daily_activity_df2)
##           Id ActivityDate SedentaryMinutes LightlyActiveMinutes
## 1 1503960366    4/12/2016              728                  328
## 2 1503960366    4/13/2016              776                  217
## 3 1503960366    4/14/2016             1218                  181
## 4 1503960366    4/15/2016              726                  209
## 5 1503960366    4/16/2016              773                  221
## 6 1503960366    4/17/2016              539                  164
##   FairlyActiveMinutes VeryActiveMinutes SedentaryActiveDistance
## 1                  13                25                       0
## 2                  19                21                       0
## 3                  11                30                       0
## 4                  34                29                       0
## 5                  10                36                       0
## 6                  20                38                       0
##   LightActiveDistance ModeratelyActiveDistance VeryActiveDistance
## 1                6.06                     0.55               1.88
## 2                4.71                     0.69               1.57
## 3                3.91                     0.40               2.44
## 4                2.83                     1.26               2.14
## 5                5.04                     0.41               2.71
## 6                2.51                     0.78               3.19
check_2 <- sqldf('SELECT * FROM daily_activity_df2 INTERSECT SELECT * FROM daily_intensities')
head(check_2)
##           Id ActivityDate SedentaryMinutes LightlyActiveMinutes
## 1 1503960366    4/12/2016              728                  328
## 2 1503960366    4/13/2016              776                  217
## 3 1503960366    4/14/2016             1218                  181
## 4 1503960366    4/15/2016              726                  209
## 5 1503960366    4/16/2016              773                  221
## 6 1503960366    4/17/2016              539                  164
##   FairlyActiveMinutes VeryActiveMinutes SedentaryActiveDistance
## 1                  13                25                       0
## 2                  19                21                       0
## 3                  11                30                       0
## 4                  34                29                       0
## 5                  10                36                       0
## 6                  20                38                       0
##   LightActiveDistance ModeratelyActiveDistance VeryActiveDistance
## 1                6.06                     0.55               1.88
## 2                4.71                     0.69               1.57
## 3                3.91                     0.40               2.44
## 4                2.83                     1.26               2.14
## 5                5.04                     0.41               2.71
## 6                2.51                     0.78               3.19
str(check_2)
## 'data.frame':    940 obs. of  10 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...

The intersection in check_2 also contains all 940 rows, so the daily_intensities table matches the corresponding columns of daily_activity. From here on we can work with daily_activity, sleep_day, and weight_log.
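
The INTERSECT checks above count the rows that match. A complementary check is to list the rows that do not match and confirm there are none. The sketch below does this in base R on toy stand-ins; `activity_toy`, `calories_toy`, and the helper `row_key()` are hypothetical names introduced for illustration.

```r
# Toy stand-ins for daily_activity_df and daily_calories
activity_toy <- data.frame(
  Id = c(1503960366, 1503960366, 1503960366),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/14/2016"),
  Calories = c(1985L, 1797L, 1776L)
)
calories_toy <- data.frame(
  Id = c(1503960366, 1503960366, 1503960366),
  ActivityDay = c("4/12/2016", "4/13/2016", "4/14/2016"),
  Calories = c(1985L, 1797L, 1776L)
)

# Align the column names so rows can be compared field by field
names(calories_toy) <- names(activity_toy)

# Hypothetical helper: build one comparison key per row
row_key <- function(df) do.call(paste, c(unname(as.list(df)), sep = "|"))

# Rows of the first table that have no exact match in the second
unmatched <- activity_toy[!row_key(activity_toy) %in% row_key(calories_toy), ]
nrow(unmatched)
```

An empty `unmatched` table means every row of the first table also appears in the second. Any rows that do show up point directly at the records that differ.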

  • Next we count the distinct Id values in all three tables:
n_distinct(daily_activity$Id)
## [1] 33
n_distinct(sleep_day$Id)
## [1] 24
n_distinct(weight_log$Id)
## [1] 8

Then we check the number of observations in each table:

nrow(daily_activity)
## [1] 940
nrow(sleep_day)
## [1] 413
nrow(weight_log)
## [1] 67
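
To put these counts side by side, the short sketch below computes the share of users who appear in each table and the average number of records per user. The numbers are copied from the n_distinct() and nrow() output above; the vector names are illustrative.

```r
# Counts taken from the n_distinct() and nrow() output above
users   <- c(daily_activity = 33, sleep_day = 24, weight_log = 8)
records <- c(daily_activity = 940, sleep_day = 413, weight_log = 67)

# Share of all tracked users who appear in each table, in percent
user_share <- round(100 * users / users[["daily_activity"]], 1)

# Average number of records per user in each table
records_per_user <- round(records / users, 1)

user_share
records_per_user
```

Roughly 73% of users logged sleep and roughly 24% logged weight, so any conclusion drawn from sleep_day or weight_log rests on a smaller group than the activity data.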
  • Looking at the dailyActivity_merged.csv file with a filter, I noticed that many records are incomplete, with pieces of information missing altogether.

  • After filtering, I found that on some days no steps, no activity, and even no calories were recorded, yet sedentary time was logged as 1440 minutes (a full 24 hours). According to the Fitbit Community forum, sedentary time is only calculated on days when the tracker is worn. Sedentary and active time are determined by movement, and you need to be inactive for 10 consecutive minutes before the period counts as stationary. Some Fitbit trackers have a setting that logs a day on which the device is not worn as 100% (1440 minutes) sedentary time. Since this is a case study and I have no contact with stakeholders, I decided to keep these records and move on with the data.
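
A minimal sketch of how such days could be flagged, assuming the rule "zero steps and 1440 sedentary minutes" is a reasonable proxy for a tracker that was not worn. The toy data frame and the column `likely_not_worn` are illustrative and not part of the original analysis.

```r
# Toy rows; the 1440-minute day mimics the pattern described above
activity_toy <- data.frame(
  TotalSteps = c(13162L, 0L, 10735L),
  SedentaryMinutes = c(728L, 1440L, 776L),
  Calories = c(1985L, 0L, 1797L)
)

# Flag days that look like the tracker was not worn:
# no steps and the full 24 hours (1440 minutes) logged as sedentary
activity_toy$likely_not_worn <- activity_toy$TotalSteps == 0 &
  activity_toy$SedentaryMinutes == 1440

# Count the flagged days before deciding whether to keep or drop them
sum(activity_toy$likely_not_worn)

# One option: keep only the days with recorded wear
worn_days <- activity_toy[!activity_toy$likely_not_worn, ]
```

Counting the flagged days first makes it possible to judge how much they could pull down averages such as mean daily steps.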

  • The ActivityDate column in daily_activity is stored as character, so we convert it to the Date format.

daily_activity$ActivityDate <- as.Date(daily_activity$ActivityDate, "%m/%d/%Y")
head(daily_activity)
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366   2016-04-12      13162          8.50            8.50
## 2 1503960366   2016-04-13      10735          6.97            6.97
## 3 1503960366   2016-04-14      10460          6.74            6.74
## 4 1503960366   2016-04-15       9762          6.28            6.28
## 5 1503960366   2016-04-16      12669          8.16            8.16
## 6 1503960366   2016-04-17       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728
  • We do the same with the SleepDay column in sleep_day and Date column in weight_log:
sleep_day$SleepDay <- as.Date(sleep_day$SleepDay, "%m/%d/%Y")
head(sleep_day)
##           Id   SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1 1503960366 2016-04-12                 1                327            346
## 2 1503960366 2016-04-13                 2                384            407
## 3 1503960366 2016-04-15                 1                412            442
## 4 1503960366 2016-04-16                 2                340            367
## 5 1503960366 2016-04-17                 1                700            712
## 6 1503960366 2016-04-19                 1                304            320
weight_log$Date <- as.Date(weight_log$Date, "%m/%d/%Y")
head(weight_log)
##           Id       Date WeightKg WeightPounds Fat   BMI IsManualReport
## 1 1503960366 2016-05-02     52.6     115.9631  22 22.65           True
## 2 1503960366 2016-05-03     52.6     115.9631  NA 22.65           True
## 3 1927972279 2016-04-13    133.5     294.3171  NA 47.54          False
## 4 2873212765 2016-04-21     56.7     125.0021  NA 21.45           True
## 5 2873212765 2016-05-12     57.3     126.3249  NA 21.69           True
## 6 4319703577 2016-04-17     72.4     159.6147  25 27.45           True
##          LogId
## 1 1.462234e+12
## 2 1.462320e+12
## 3 1.460510e+12
## 4 1.461283e+12
## 5 1.463098e+12
## 6 1.460938e+12
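
The conversions above rely on as.Date() reading only the part of the string covered by the format and ignoring the trailing time of day. The sketch below shows this on two strings copied from the str() output, plus a date-time parse for cases where the time matters. The `%p` directive depends on the session locale, so the last line is an assumption rather than a guaranteed result.

```r
# Example strings copied from the str() output earlier in this report
sleep_raw  <- "4/12/2016 12:00:00 AM"
weight_raw <- "5/2/2016 11:59:59 PM"

# as.Date() parses the leading "%m/%d/%Y" part and ignores the trailing time
sleep_date  <- as.Date(sleep_raw, format = "%m/%d/%Y")
weight_date <- as.Date(weight_raw, format = "%m/%d/%Y")

format(sleep_date)
format(weight_date)

# If the time of day matters, a date-time parse keeps it.
# Assumption: the session locale understands "AM"/"PM" for %p.
weight_time <- as.POSIXct(weight_raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")
```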
  • Renaming the data columns:
daily_activity <- daily_activity %>% 
  rename(activity_date = ActivityDate,
         total_steps = TotalSteps,
         total_distance = TotalDistance,
         tracker_distance = TrackerDistance,
         logged_activities_distance = LoggedActivitiesDistance,
         very_active_distance = VeryActiveDistance,
         moderately_active_distance = ModeratelyActiveDistance,
         light_active_distance = LightActiveDistance,
         sedentary_active_distance = SedentaryActiveDistance,
         very_active_minutes = VeryActiveMinutes,
         fairly_active_minutes = FairlyActiveMinutes,
         lightly_active_minutes = LightlyActiveMinutes,
         sedentary_minutes = SedentaryMinutes        
         )
head(daily_activity)
##           Id activity_date total_steps total_distance tracker_distance
## 1 1503960366    2016-04-12       13162           8.50             8.50
## 2 1503960366    2016-04-13       10735           6.97             6.97
## 3 1503960366    2016-04-14       10460           6.74             6.74
## 4 1503960366    2016-04-15        9762           6.28             6.28
## 5 1503960366    2016-04-16       12669           8.16             8.16
## 6 1503960366    2016-04-17        9705           6.48             6.48
##   logged_activities_distance very_active_distance moderately_active_distance
## 1                          0                 1.88                       0.55
## 2                          0                 1.57                       0.69
## 3                          0                 2.44                       0.40
## 4                          0                 2.14                       1.26
## 5                          0                 2.71                       0.41
## 6                          0                 3.19                       0.78
##   light_active_distance sedentary_active_distance very_active_minutes
## 1                  6.06                         0                  25
## 2                  4.71                         0                  21
## 3                  3.91                         0                  30
## 4                  2.83                         0                  29
## 5                  5.04                         0                  36
## 6                  2.51                         0                  38
##   fairly_active_minutes lightly_active_minutes sedentary_minutes Calories
## 1                    13                    328               728     1985
## 2                    19                    217               776     1797
## 3                    11                    181              1218     1776
## 4                    34                    209               726     1745
## 5                    10                    221               773     1863
## 6                    20                    164               539     1728
sleep_day <- sleep_day %>% 
  rename(total_sleep_records = TotalSleepRecords,
         total_minutes_asleep = TotalMinutesAsleep,
         total_time_in_bed = TotalTimeInBed
         )
head(sleep_day)
##           Id   SleepDay total_sleep_records total_minutes_asleep
## 1 1503960366 2016-04-12                   1                  327
## 2 1503960366 2016-04-13                   2                  384
## 3 1503960366 2016-04-15                   1                  412
## 4 1503960366 2016-04-16                   2                  340
## 5 1503960366 2016-04-17                   1                  700
## 6 1503960366 2016-04-19                   1                  304
##   total_time_in_bed
## 1               346
## 2               407
## 3               442
## 4               367
## 5               712
## 6               320
weight_log <- weight_log %>% 
  rename(weight_kg = WeightKg,
         weight_pounds = WeightPounds,
         is_manual_report = IsManualReport,
         log_id = LogId
         )
head(weight_log)
##           Id       Date weight_kg weight_pounds Fat   BMI is_manual_report
## 1 1503960366 2016-05-02      52.6      115.9631  22 22.65             True
## 2 1503960366 2016-05-03      52.6      115.9631  NA 22.65             True
## 3 1927972279 2016-04-13     133.5      294.3171  NA 47.54            False
## 4 2873212765 2016-04-21      56.7      125.0021  NA 21.45             True
## 5 2873212765 2016-05-12      57.3      126.3249  NA 21.69             True
## 6 4319703577 2016-04-17      72.4      159.6147  25 27.45             True
##         log_id
## 1 1.462234e+12
## 2 1.462320e+12
## 3 1.460510e+12
## 4 1.461283e+12
## 5 1.463098e+12
## 6 1.460938e+12
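
The manual rename() calls above follow one rule: insert an underscore at each lowercase-to-uppercase boundary and convert to lowercase. As a sketch, that rule can be written as a small helper. `to_snake_case()` is a hypothetical function, not part of any package used here. Unlike the manual renames, it would also lowercase Id, Calories, Fat, and BMI.

```r
# Hypothetical helper: convert CamelCase column names to snake_case
to_snake_case <- function(x) {
  # Insert "_" between a lowercase letter and the uppercase letter after it,
  # then lowercase everything; perl = TRUE keeps [a-z] and [A-Z] ASCII-only
  tolower(gsub("([a-z])([A-Z])", "\\1_\\2", x, perl = TRUE))
}

to_snake_case(c("TotalSteps", "LoggedActivitiesDistance", "IsManualReport", "LogId"))
```

With a data frame `df`, the helper could be applied as `names(df) <- to_snake_case(names(df))`. The janitor package loaded earlier also provides clean_names() for the same purpose.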
  • Organizing and grouping the data:
## Sort by total steps and group by Id
daily_activity <- daily_activity %>% 
  arrange(-total_steps) %>% 
  group_by(Id) %>% 
  drop_na()
head(daily_activity)
## # A tibble: 6 × 15
## # Groups:   Id [3]
##           Id activity_date total_steps total_distance tracker_distance
##        <dbl> <date>              <int>          <dbl>            <dbl>
## 1 1624580081 2016-05-01          36019           28.0             28.0
## 2 8877689391 2016-04-16          29326           25.3             25.3
## 3 8877689391 2016-04-30          27745           26.7             26.7
## 4 8877689391 2016-04-27          23629           20.6             20.6
## 5 8877689391 2016-04-12          23186           20.4             20.4
## 6 8053475328 2016-04-24          22988           18.0             18.0
## # … with 10 more variables: logged_activities_distance <dbl>,
## #   very_active_distance <dbl>, moderately_active_distance <dbl>,
## #   light_active_distance <dbl>, sedentary_active_distance <dbl>,
## #   very_active_minutes <int>, fairly_active_minutes <int>,
## #   lightly_active_minutes <int>, sedentary_minutes <int>, Calories <int>
  • Checking for duplicates:
daily_activity
## # A tibble: 940 × 15
## # Groups:   Id [33]
##            Id activity_date total_steps total_distance tracker_distance
##         <dbl> <date>              <int>          <dbl>            <dbl>
##  1 1624580081 2016-05-01          36019           28.0             28.0
##  2 8877689391 2016-04-16          29326           25.3             25.3
##  3 8877689391 2016-04-30          27745           26.7             26.7
##  4 8877689391 2016-04-27          23629           20.6             20.6
##  5 8877689391 2016-04-12          23186           20.4             20.4
##  6 8053475328 2016-04-24          22988           18.0             18.0
##  7 4388161847 2016-05-07          22770           17.5             17.5
##  8 8053475328 2016-04-23          22359           17.2             17.2
##  9 2347167796 2016-04-16          22244           15.1             15.1
## 10 8053475328 2016-05-08          22026           17.6             17.6
## # … with 930 more rows, and 10 more variables:
## #   logged_activities_distance <dbl>, very_active_distance <dbl>,
## #   moderately_active_distance <dbl>, light_active_distance <dbl>,
## #   sedentary_active_distance <dbl>, very_active_minutes <int>,
## #   fairly_active_minutes <int>, lightly_active_minutes <int>,
## #   sedentary_minutes <int>, Calories <int>
sum(duplicated(daily_activity))
## [1] 0
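
The same check could be applied to sleep_day and weight_log, which are not tested above. The sketch below shows the pattern on a toy table with one deliberately repeated row; the data are illustrative and say nothing about whether the real tables contain duplicates.

```r
# Toy sleep records with one repeated row (values are illustrative)
sleep_toy <- data.frame(
  Id = c(1503960366, 1503960366, 1503960366),
  SleepDay = as.Date(c("2016-04-12", "2016-04-13", "2016-04-13")),
  total_minutes_asleep = c(327L, 384L, 384L)
)

# Count fully duplicated rows, as done for daily_activity above
n_dupes <- sum(duplicated(sleep_toy))

# Keep the first occurrence of each row
sleep_clean <- sleep_toy[!duplicated(sleep_toy), ]

n_dupes
nrow(sleep_clean)
```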

Step 4. Analyze

## Quick summary statistics
daily_activity %>%  
  select(total_steps,
         total_distance,
         sedentary_minutes,
         very_active_minutes) %>%
  summary()
## Adding missing grouping variables: `Id`
##        Id             total_steps    total_distance   sedentary_minutes
##  Min.   :1.504e+09   Min.   :    0   Min.   : 0.000   Min.   :   0.0   
##  1st Qu.:2.320e+09   1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 729.8   
##  Median :4.445e+09   Median : 7406   Median : 5.245   Median :1057.5   
##  Mean   :4.855e+09   Mean   : 7638   Mean   : 5.490   Mean   : 991.2   
##  3rd Qu.:6.962e+09   3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:1229.5   
##  Max.   :8.878e+09   Max.   :36019   Max.   :28.030   Max.   :1440.0   
##  very_active_minutes
##  Min.   :  0.00     
##  1st Qu.:  0.00     
##  Median :  4.00     
##  Mean   : 21.16     
##  3rd Qu.: 32.00     
##  Max.   :210.00
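
The message "Adding missing grouping variables: `Id`" appears because daily_activity is still grouped by Id from the earlier group_by() call, so select() keeps Id and summary() reports statistics for it as if it were a measurement. A minimal sketch of one way to avoid this, using a toy grouped table with illustrative values:

```r
library(dplyr)

# Toy grouped table (values are illustrative)
activity_toy <- data.frame(
  Id = c(1, 1, 2),
  total_steps = c(13162L, 10735L, 9762L),
  sedentary_minutes = c(728L, 776L, 726L)
) %>%
  group_by(Id)

# Dropping the grouping first keeps Id out of the selection
activity_summary_input <- activity_toy %>%
  ungroup() %>%
  select(total_steps, sedentary_minutes)

summary(activity_summary_input)
```

The statistics for the selected columns stay the same either way; ungroup() only removes the Id column from the summary.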

For daily sleep:

sleep_day %>%  
  select(total_sleep_records,
  total_minutes_asleep,
  total_time_in_bed) %>%
  summary()
##  total_sleep_records total_minutes_asleep total_time_in_bed
##  Min.   :1.000       Min.   : 58.0        Min.   : 61.0    
##  1st Qu.:1.000       1st Qu.:361.0        1st Qu.:403.0    
##  Median :1.000       Median :433.0        Median :463.0    
##  Mean   :1.119       Mean   :419.5        Mean   :458.6    
##  3rd Qu.:1.000       3rd Qu.:490.0        3rd Qu.:526.0    
##  Max.   :3.000       Max.   :796.0        Max.   :961.0

For weight log:

weight_log %>%  
  select(weight_kg,
  weight_pounds,
  BMI) %>%
  summary()
##    weight_kg      weight_pounds        BMI       
##  Min.   : 52.60   Min.   :116.0   Min.   :21.45  
##  1st Qu.: 61.40   1st Qu.:135.4   1st Qu.:23.96  
##  Median : 62.50   Median :137.8   Median :24.39  
##  Mean   : 72.04   Mean   :158.8   Mean   :25.19  
##  3rd Qu.: 85.05   3rd Qu.:187.5   3rd Qu.:25.56  
##  Max.   :133.50   Max.   :294.3   Max.   :47.54

Step 5. Data Visualization and Share

ggplot(data=daily_activity, aes(x=total_steps, y=sedentary_minutes, color = Calories)) + geom_point()

ggplot(data=daily_activity, aes(x=total_steps, y=Calories)) + geom_point() + geom_smooth() +
  labs(title = "Daily Activity: Calories vs. Total Steps")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

  • Calories burned tend to increase with total steps, which suggests that physical activity such as walking contributes to burning calories.
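
The visual impression can be backed by a number. The sketch below computes a Pearson correlation and the slope of a simple linear fit. It uses only the first six rows of daily_activity, copied from the head() output earlier, so the result describes one user's first week and not the whole dataset.

```r
# First six rows of daily_activity, copied from the head() output earlier
total_steps <- c(13162, 10735, 10460, 9762, 12669, 9705)
calories    <- c(1985, 1797, 1776, 1745, 1863, 1728)

# Pearson correlation: sign and strength of the linear relationship
r <- cor(total_steps, calories)

# Slope of a simple linear fit: extra calories per additional step
fit <- lm(calories ~ total_steps)
slope <- coef(fit)[["total_steps"]]

r
slope
```

On the full table, the equivalent call would be `cor(daily_activity$total_steps, daily_activity$Calories)`.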

ggplot(data=sleep_day, aes(x=total_minutes_asleep, y=total_time_in_bed)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

  • Time in bed rises closely with time asleep. This is consistent with the devices using periods of inactivity to estimate when users fall asleep at night and when they stir in the morning.
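
The gap between the two measures can also be expressed directly. As an illustrative sketch, the snippet below derives minutes awake in bed and the share of time in bed spent asleep. It uses the first six rows of sleep_day, copied from the head() output earlier. Both derived variables are additions for illustration and do not appear in the original analysis.

```r
# First six rows of sleep_day, copied from the head() output earlier
total_minutes_asleep <- c(327, 384, 412, 340, 700, 304)
total_time_in_bed    <- c(346, 407, 442, 367, 712, 320)

# Minutes spent in bed but not asleep
minutes_awake_in_bed <- total_time_in_bed - total_minutes_asleep

# Share of time in bed that was spent asleep
sleep_efficiency <- total_minutes_asleep / total_time_in_bed

minutes_awake_in_bed
round(sleep_efficiency, 2)
```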

sleep_activity_combined <- merge(sleep_day, daily_activity, by="Id")
head(sleep_activity_combined)
##           Id   SleepDay total_sleep_records total_minutes_asleep
## 1 1503960366 2016-04-12                   1                  327
## 2 1503960366 2016-04-12                   1                  327
## 3 1503960366 2016-04-12                   1                  327
## 4 1503960366 2016-04-12                   1                  327
## 5 1503960366 2016-04-12                   1                  327
## 6 1503960366 2016-04-12                   1                  327
##   total_time_in_bed activity_date total_steps total_distance tracker_distance
## 1               346    2016-04-20       10544           6.68             6.68
## 2               346    2016-05-01       10602           6.81             6.81
## 3               346    2016-04-19       15506           9.88             9.88
## 4               346    2016-04-15        9762           6.28             6.28
## 5               346    2016-05-02       14727           9.71             9.71
## 6               346    2016-05-10       12207           7.77             7.77
##   logged_activities_distance very_active_distance moderately_active_distance
## 1                          0                 1.96                       0.48
## 2                          0                 2.29                       1.60
## 3                          0                 3.53                       1.32
## 4                          0                 2.14                       1.26
## 5                          0                 3.21                       0.57
## 6                          0                 3.35                       1.16
##   light_active_distance sedentary_active_distance very_active_minutes
## 1                  4.24                         0                  28
## 2                  2.92                         0                  33
## 3                  5.03                         0                  50
## 4                  2.83                         0                  29
## 5                  5.92                         0                  41
## 6                  3.26                         0                  46
##   fairly_active_minutes lightly_active_minutes sedentary_minutes Calories
## 1                    12                    205               818     1786
## 2                    35                    246               730     1820
## 3                    31                    264               775     2035
## 4                    34                    209               726     1745
## 5                    15                    277               798     2004
## 6                    31                    214               746     1859
sedentary.lm <- lm(sedentary_minutes ~ total_time_in_bed, data = sleep_activity_combined)
sedentary.lm
## 
## Call:
## lm(formula = sedentary_minutes ~ total_time_in_bed, data = sleep_activity_combined)
## 
## Coefficients:
##       (Intercept)  total_time_in_bed  
##          921.9598            -0.2678
ggplot(data = sleep_activity_combined, aes(x=very_active_minutes, y=Calories)) + geom_point() + stat_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'

  • The fitted line suggests a positive relationship between very active minutes and calories burned.
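
One caveat about the merge used for sleep_activity_combined: the head() output pairs SleepDay 2016-04-12 with several different activity_date values. This suggests that merging on Id alone combines every sleep record of a user with every activity record of that user. The sketch below shows the difference between that merge and one that also matches on the date. The tables and their values are toy stand-ins, and whether a same-day match is the right pairing for this analysis is an assumption.

```r
# Toy stand-ins for sleep_day and daily_activity (values are illustrative)
sleep_toy <- data.frame(
  Id = c(1, 1, 2),
  SleepDay = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  total_time_in_bed = c(346L, 407L, 500L)
)
activity_toy <- data.frame(
  Id = c(1, 1, 1, 2),
  activity_date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-14", "2016-04-12")),
  sedentary_minutes = c(728L, 776L, 1218L, 800L)
)

# Merge on Id only: every sleep row meets every activity row of the same user
by_id_only <- merge(sleep_toy, activity_toy, by = "Id")

# Merge on Id and date: one row per user and day present in both tables
by_id_and_day <- merge(sleep_toy, activity_toy,
                       by.x = c("Id", "SleepDay"),
                       by.y = c("Id", "activity_date"))

nrow(by_id_only)
nrow(by_id_and_day)
```

In the toy data the Id-only merge yields 7 rows from 3 sleep records, while the Id-and-date merge yields 3. A regression or plot built on the first kind of table counts each record many times, so it may be worth refitting on a same-day merge before relying on the coefficients.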

Step 6. Act and Recommendation

Final recommendations for Bellabeat:

1. For a high-quality analysis, the data needs to be more accurate and complete, with a larger sample size and a longer time frame.

2. Membership as a motivator is critical for Bellabeat to make sure that users participate in activities and record their data.

3. The device could include a function that alerts users who accumulate a high number of sedentary minutes, so that they are notified and start moving.

4. Sleep tracking is another feature that Bellabeat can build into its devices to estimate each user's individual sleep need, which helps people determine whether they have had enough sleep.

5. A motivational marketing strategy built around body positivity, in the media and on the Bellabeat website, might empower customers to enter their weight into the app.