Introduction

Prepare and Process

First, let’s load the packages needed to import, clean, and organize the data sets.

library(readr)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(rmarkdown)
library(tidyr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble  3.1.6     ✓ stringr 1.4.0
## ✓ purrr   0.3.4     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x lubridate::as.difftime() masks base::as.difftime()
## x lubridate::date()        masks base::date()
## x dplyr::filter()          masks stats::filter()
## x lubridate::intersect()   masks base::intersect()
## x dplyr::lag()             masks stats::lag()
## x lubridate::setdiff()     masks base::setdiff()
## x lubridate::union()       masks base::union()

Now, let’s import the dataset provided.

The data was downloaded from Kaggle and contains personal fitness tracker data from 30 eligible Fitbit users who consented to submitting it. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.

Daily_Activity <- read_csv("Bellabeat/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Sleep <- read_csv("Bellabeat/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Once the data is imported, we can check whether it is credible. The ROCCC method will be used to make sure all the bases are covered.

  • Reliable - No, the data is not reliable due to its small sample size.
  • Original - No, Bellabeat did not collect this data.
  • Comprehensive - No, there is no information about the participants (users’ weight, age, etc.).
  • Current - No, the data is from 2016, making it 6 years old at the time of this analysis.
  • Cited - Amazon Mechanical Turk is cited, but it is not a reliable source.

Let’s figure out how many unique users tracked their information in each of the data sets.

n_distinct(Daily_Activity$Id)
## [1] 33
n_distinct(Sleep$Id)
## [1] 24

More users logged their daily activity than their sleep.
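
To see which users account for the gap, the two Id columns can be compared directly. The sketch below is only an illustration on made-up values: `activity_toy` and `sleep_toy` are hypothetical stand-ins for `Daily_Activity` and `Sleep`, so the Ids shown are not real.

```r
library(dplyr)

# Hypothetical stand-ins for Daily_Activity and Sleep (made-up Ids)
activity_toy <- tibble(Id = c(1, 1, 2, 3, 3),
                       TotalSteps = c(100, 200, 300, 400, 500))
sleep_toy    <- tibble(Id = c(1, 3),
                       TotalMinutesAsleep = c(400, 420))

# Ids that appear in the activity data but never in the sleep data
missing_sleep <- setdiff(unique(activity_toy$Id), unique(sleep_toy$Id))
missing_sleep
```

On the real data, the same call with `Daily_Activity$Id` and `Sleep$Id` would list the users who never logged sleep.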

To ensure the data is clean, I checked for duplicates.

sum(duplicated(Daily_Activity))
## [1] 0
sum(duplicated(Sleep))
## [1] 3

Since the “Sleep” data set has 3 duplicates, let’s remove them (along with any rows that contain missing values) and check for duplicates again.

Sleep <- Sleep %>% 
  drop_na() %>% 
  unique()

sum(duplicated(Sleep))
## [1] 0
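
As a small illustration of the duplicate check and removal, the sketch below runs the same steps on a made-up table (`sleep_toy` is hypothetical). It uses `distinct()` from dplyr, which for whole-row duplicates keeps the same rows as `unique()`.

```r
library(dplyr)

# Hypothetical sleep log with one exact duplicate row (rows 1 and 2)
sleep_toy <- tibble(
  Id = c(1, 1, 2, 2),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/12/2016 12:00:00 AM",
               "4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM"),
  TotalMinutesAsleep = c(300, 300, 380, 410)
)

n_dupes     <- sum(duplicated(sleep_toy))  # number of repeated rows
sleep_dedup <- distinct(sleep_toy)         # keep the first copy of each row

n_dupes
nrow(sleep_dedup)
```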

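The date columns are stored as text in different formats (ActivityDate holds a date, SleepDay holds a date and a time), so I will parse both into a common Date column to make analysis and plotting easier.

Daily_Activity$Date <- mdy(Daily_Activity$ActivityDate)
# as_date() drops the time of day so both Date columns share the Date class
Sleep$Date <- as_date(mdy_hms(Sleep$SleepDay))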
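
The two lubridate parsers return different classes, which matters when the columns are later used as join keys. A minimal sketch on single values (the strings follow the formats seen in the two files; the variable names are illustrative):

```r
library(lubridate)

activity_date <- mdy("4/12/2016")                  # parsed as a Date
sleep_stamp   <- mdy_hms("4/12/2016 12:00:00 AM")  # parsed as POSIXct (UTC)
sleep_date    <- as_date(sleep_stamp)              # converted to a Date

class(activity_date)
class(sleep_stamp)
activity_date == sleep_date
```

Converting the time stamp with `as_date()` puts both keys on the same class, so equality comparisons and joins do not rely on implicit coercion.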

Analysis

Now that the data is clean, it’s ready for analysis.

Let’s take a look at a summary of the activity data.

Daily_Activity %>%  
select(Calories, TotalSteps)%>% 
summary()
##     Calories      TotalSteps   
##  Min.   :   0   Min.   :    0  
##  1st Qu.:1828   1st Qu.: 3790  
##  Median :2134   Median : 7406  
##  Mean   :2304   Mean   : 7638  
##  3rd Qu.:2793   3rd Qu.:10727  
##  Max.   :4900   Max.   :36019

Observations:

  • The median number of daily steps (7,406) is below the commonly recommended 10,000 steps.
  • The median daily calorie burn is approx. 2,100 calories.

Next, the same summary for the sleep data.

Sleep %>% 
select(TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed) %>% 
summary()
##  TotalSleepRecords TotalMinutesAsleep TotalTimeInBed 
##  Min.   :1.00      Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:1.00      1st Qu.:361.0      1st Qu.:403.8  
##  Median :1.00      Median :432.5      Median :463.0  
##  Mean   :1.12      Mean   :419.2      Mean   :458.5  
##  3rd Qu.:1.00      3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :3.00      Max.   :796.0      Max.   :961.0

Observations:

  • Most days have a single sleep record, which suggests that few naps were logged.
  • On average, participants sleep about 7 hours a night (a mean of 419 minutes).
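
The conversion behind these figures is simple arithmetic on the means printed in the summary above. The sketch below only re-uses those two printed values; the variable names are illustrative, and the last line is a ratio of means rather than a per-night average.

```r
# Means copied from the summary() output above
mean_asleep_min <- 419.2   # TotalMinutesAsleep
mean_in_bed_min <- 458.5   # TotalTimeInBed

mean_asleep_hr <- mean_asleep_min / 60               # just under 7 hours
awake_in_bed   <- mean_in_bed_min - mean_asleep_min  # about 39 minutes
share_asleep   <- mean_asleep_min / mean_in_bed_min  # about 0.91

round(c(hours_asleep = mean_asleep_hr,
        minutes_awake_in_bed = awake_in_bed,
        share_of_bed_time_asleep = share_asleep), 2)
```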

Merging the two data sets will make creating graphs easier.

Activity_and_Sleep <- merge(Daily_Activity, Sleep, by = c("Id", "Date"), all = TRUE)
head(Activity_and_Sleep)
##           Id       Date ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 2016-04-12    4/12/2016      13162          8.50            8.50
## 2 1503960366 2016-04-13    4/13/2016      10735          6.97            6.97
## 3 1503960366 2016-04-14    4/14/2016      10460          6.74            6.74
## 4 1503960366 2016-04-15    4/15/2016       9762          6.28            6.28
## 5 1503960366 2016-04-16    4/16/2016      12669          8.16            8.16
## 6 1503960366 2016-04-17    4/17/2016       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728
##                SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1 4/12/2016 12:00:00 AM                 1                327            346
## 2 4/13/2016 12:00:00 AM                 2                384            407
## 3                  <NA>                NA                 NA             NA
## 4 4/15/2016 12:00:00 AM                 1                412            442
## 5 4/16/2016 12:00:00 AM                 2                340            367
## 6 4/17/2016 12:00:00 AM                 1                700            712
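
With `all = TRUE`, `merge()` performs a full outer join: days that appear in only one table are kept and the missing columns are filled with `NA`, as in row 3 above. The sketch below shows this on two made-up tables (`activity_toy` and `sleep_toy` are hypothetical) and compares it with dplyr’s `full_join()` and with an inner join.

```r
library(dplyr)

# Hypothetical stand-ins for the two cleaned tables (made-up values)
activity_toy <- data.frame(
  Id = c(1, 1, 2),
  Date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  TotalSteps = c(9000, 7000, 5000)
)
sleep_toy <- data.frame(
  Id = c(1, 3),
  Date = as.Date(c("2016-04-12", "2016-04-12")),
  TotalMinutesAsleep = c(400, 420)
)

# Full outer join: keeps every Id/Date pair from either table
merged_base  <- merge(activity_toy, sleep_toy, by = c("Id", "Date"), all = TRUE)
merged_dplyr <- full_join(activity_toy, sleep_toy, by = c("Id", "Date"))

# Inner join: keeps only days that have both activity and sleep records
merged_inner <- merge(activity_toy, sleep_toy, by = c("Id", "Date"))

merged_base
```

A full join keeps every activity day for the activity plots, at the cost of `NA` values wherever sleep was not logged.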

Relationship between calories burned and steps

ggplot(data = Activity_and_Sleep) +
  geom_point(mapping = aes(x = TotalSteps, y = Calories), color = "sienna1") +
  geom_smooth(mapping = aes(x = TotalSteps, y = Calories)) +
  labs(title = "Calories burned vs Total Steps", x = "Total Steps", y = "Calories")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

As expected, more steps tend to go along with more calories burned.
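
The strength of a relationship like this can be summarized with a correlation coefficient. The sketch below is a minimal illustration on made-up values (`steps_toy` and `calories_toy` are hypothetical and not taken from the data set), so the number it prints says nothing about the Fitbit data.

```r
# Made-up illustrative values, not taken from the Fitbit data
steps_toy    <- c(1000, 5000, 10000, 15000)
calories_toy <- c(1500, 1900, 2400, 2800)

# Pearson correlation: values near 1 indicate a strong positive linear relationship
r_steps_calories <- cor(steps_toy, calories_toy)
r_steps_calories
```

On the merged table, the equivalent call would be `cor(Activity_and_Sleep$TotalSteps, Activity_and_Sleep$Calories, use = "complete.obs")`, where `use = "complete.obs"` skips any rows with missing values.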

Relationship between high activity and minutes slept

ggplot(data = Activity_and_Sleep) +
  geom_point(mapping = aes(x = TotalMinutesAsleep, y = VeryActiveMinutes), color = "deeppink") +
  geom_smooth(mapping = aes(x = TotalMinutesAsleep, y = VeryActiveMinutes)) +
  labs(title = "High Activity Minutes vs Minutes Slept", x = "Total Minutes Asleep", y = "Very Active Minutes")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 530 rows containing non-finite values (stat_smooth).
## Warning: Removed 530 rows containing missing values (geom_point).

We can see that days with many “high activity” minutes show about the same amount of sleep as days with few “high activity” minutes, so the plot does not show a clear relationship between the two.
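
The warnings about 530 removed rows come from days without a sleep record: they are consistent with 940 activity rows minus the 410 sleep rows left after removing duplicates. One way to make that explicit is to count and filter the missing values before plotting or computing a correlation. The sketch below does so on a made-up table (`merged_toy` and its values are hypothetical).

```r
library(dplyr)

# Hypothetical merged table: NA marks days without a sleep record
merged_toy <- tibble(
  TotalMinutesAsleep = c(300, NA, 420, NA, 480),
  VeryActiveMinutes  = c(20, 35, 10, 40, 25)
)

n_missing  <- sum(is.na(merged_toy$TotalMinutesAsleep))     # days with no sleep data
with_sleep <- filter(merged_toy, !is.na(TotalMinutesAsleep))

# Correlation computed only on days that have both measurements
r_sleep_activity <- cor(with_sleep$TotalMinutesAsleep,
                        with_sleep$VeryActiveMinutes)

n_missing
nrow(with_sleep)
```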

Recommendations

1. Habit reminders in the app could prompt users to walk and move more, encouraging them to get more daily steps in.

2. To improve sleep tracking, Bellabeat could add a reminder option to the app that encourages users to log their sleep consistently.