Introduction:

In this case study, I will perform a real-world task of a junior data analyst on the marketing analyst team at Bellabeat, a high-tech manufacturer of smart products. Urska Srsen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company.

I have been tasked with analyzing smart device data to gain insight into how consumers are using their smart devices. The insights discovered will then help guide marketing strategy for the company.

Company Summary:

Bellabeat is a high-tech manufacturer of health-focused smart products for women. Founded in 2013 by Urska Srsen and Sando Mur, the company combines technology with beautiful design to empower women with knowledge about their health. Bellabeat’s products collect data on activity, sleep, stress, and reproductive health to help users make informed decisions about their wellness. Over the years, Bellabeat has grown rapidly and positioned itself as a tech-driven wellness company for women. The company sells its products through its own e-commerce channel and various online retailers. Bellabeat utilizes traditional and digital marketing strategies to engage with consumers.

Phase 1: Ask

1.1 Questions for analysis:

  1. What are some trends in smart device usage?
  2. How could these trends apply to Bellabeat customers?
  3. How could these trends help influence Bellabeat marketing strategy?

1.2 Business task:

Analyze consumer data on non-Bellabeat smart device usage to spot trends, identify opportunities for growth, and improve Bellabeat marketing strategy.

1.3 Key stakeholders:

  1. Urska Srsen - Bellabeat’s co-founder and Chief Creative Officer
  2. Sando Mur - Mathematician and Bellabeat’s co-founder, member of the Bellabeat executive team
  3. Bellabeat marketing analytics team - Responsible for collecting, analyzing, and reporting data to guide Bellabeat’s marketing strategy.

Phase 2: Prepare

2.1 Data source:

Srsen provided the following dataset: Fitbit Fitness Tracker Data (CC0: Public Domain, dataset made available via Kaggle user Möbius).

2.2 Data privacy and accessibility:

This Kaggle data set contains personal fitness tracker data from thirty Fitbit users. According to Möbius, the dataset was generated by respondents to a distributed survey by Amazon Mechanical Turk between 03.12.2016-05.12.2016. The thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.

2.3 ROCCC Evaluation:

Reliability: 30 persons responded to a survey. Demographics of participants are unknown. The sample is small, and the potential for sampling bias is high.

Original: 30 persons responded to a survey distributed by Amazon Mechanical Turk between 03.12.2016 - 05.12.2016 to generate the data.

Comprehensive: The collection includes data on daily activity, daily steps, daily calorie burn, weight logs, and daily sleep records.

Current: Since the data was gathered between March 2016 and May 2016, it is not current; as a result, current user habits and behavior may vary.

Cited: The data has a cited source.

2.4 Data limitations:

  • Small sample size
  • Location data unknown (for market context)
  • No demographic information (e.g. sex, age, ethnicity)
    • demographic information is essential considering Bellabeat’s target market
  • Data is not current (2016-04-12 – 2016-05-12)

I would have liked access to more raw data to strengthen data integrity. However, most of the available raw data was behind paywalls.

2.5 Data organization:

Data is organized in 18 CSV files. It has both long and wide formats.

The data folder (Fitbit Data 4.12.16-5.12.16 2) was downloaded and saved on my desktop.

First, I set the working directory using the ‘setwd()’ function and uploaded the data to the global environment.

All 18 CSV files were then opened in RStudio and checked for unique participant IDs using the ‘n_distinct()’ function.
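The loading step itself is not shown here. A minimal sketch of the pattern, assuming each CSV has an `Id` column: the temporary file and its three rows are made up for illustration and stand in for a real file in the working directory. Base R is used so the sketch runs without extra packages; `length(unique(x))` is the base R counterpart of `n_distinct(x)`.

```r
# Hypothetical stand-in for one of the CSV files (toy rows, not real data)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Id,ActivityDate,TotalSteps",
             "1503960366,4/12/2016,13162",
             "1503960366,4/13/2016,10735",
             "1624580081,4/12/2016,8163"),
           csv_path)

# Read the file into the global environment
daily_activity_demo <- read.csv(csv_path)

# Count unique participant IDs
# (base R equivalent of dplyr::n_distinct(daily_activity_demo$Id))
n_ids <- length(unique(daily_activity_demo$Id))
print(n_ids)
```

Repeating the count for every file gives the grouping of files by number of participants.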

Results of unique participant IDs

  1. 33 ID: daily_activity, daily_calories, daily_intensities, daily_steps, heartrate_seconds, hourly_calories, hourly_intensities, hourly_steps, minute_calories_narrow, minute_calories_wide, minute_intensities_narrow, minute_intensities_wide, minute_mets_narrow

  2. 24 ID: minute_sleep, sleep_day

  3. 8 ID: weight_log

For our analysis we will use the following CSV files:

  1. 33 ID: daily_activity, daily_calories, daily_intensities, daily_steps, hourly_calories, hourly_steps.
  2. 24 ID: sleep_day
  3. We will not use the weight_log data: only 8 participants logged information, so the sample size is too small to draw any conclusions.

We want to narrow the focus of this analysis to daily activity, daily calories burned, daily intensities, daily steps, duration of activities, and time of day. We hypothesize that our customer base has an interest in their health and tries to fit in moments of brisk exercise during the work week, whether before work, on lunch breaks, or after work. We also expect participants in this study to have more intense workouts or cover more distance on the weekend, when they are not working.

Remove unused dataframes from Global Environment

rm(heartrate_seconds, hourly_intensities, minute_calories_narrow, minute_calories_wide, minute_intensities_narrow, minute_intensities_wide, minute_mets_narrow, minute_steps_narrow, minute_steps_wide, minute_sleep, weight_log)

Phase 3: Process

Data exploration:

The analysis will be done in R and shared with key stakeholders.

First, we will install and load the packages. Then we will explore, transform, and analyze.

Install & Load Packages

library(lubridate)
library(tidyverse)
library(janitor)
library(ggplot2)
library(readr)
library(sqldf)
library(skimr)
library(dplyr)
library(deeptime)

View daily_activity, daily_calories, daily_intensities, daily_steps

glimpse(daily_activity)
## Rows: 940
## Columns: 15
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate             <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ TotalSteps               <int> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes        <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes      <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes     <int> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes         <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories                 <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
glimpse(daily_calories)
## Rows: 940
## Columns: 3
## $ Id          <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960366…
## $ ActivityDay <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/2016", "4/16/…
## $ Calories    <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2035, 1786, 1775…
glimpse(daily_intensities)
## Rows: 940
## Columns: 10
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDay              <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ SedentaryMinutes         <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ LightlyActiveMinutes     <int> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ FairlyActiveMinutes      <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ VeryActiveMinutes        <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
glimpse(daily_steps)
## Rows: 940
## Columns: 3
## $ Id          <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960366…
## $ ActivityDay <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/2016", "4/16/…
## $ StepTotal   <int> 13162, 10735, 10460, 9762, 12669, 9705, 13019, 15506, 1054…

Upon viewing the data frames, I noticed that the data type is incorrectly set for both ‘Id’ and the date column (‘ActivityDate’ in ‘daily_activity’, ‘ActivityDay’ in the other three). ‘Id’ should be set as character (chr), and the date column should be set as date.

Clean column names | change date format

# Cleaning: clean column names
daily_activity <- daily_activity %>%
  clean_names()
daily_calories <- daily_calories %>%
  clean_names()
daily_intensities <- daily_intensities %>%
  clean_names()
daily_steps <- daily_steps %>%
  clean_names()


# Changing data type for 'id' to character (chr)
daily_activity$id <- as.character(daily_activity$id)
daily_calories$id <- as.character(daily_calories$id)
daily_intensities$id <- as.character(daily_intensities$id)
daily_steps$id <- as.character(daily_steps$id)

# Cleaning: change column name from activity_date to date and edit date format
daily_activity <- daily_activity %>%
   rename(date=activity_date)

daily_calories <- daily_calories %>%
  rename(date=activity_day)

daily_intensities <- daily_intensities %>%
   rename(date=activity_day)

daily_steps <- daily_steps %>%
  rename(date=activity_day)
  

# Changing date format
daily_activity$date <- as.Date(daily_activity$date, format = "%m/%d/%Y")
daily_calories$date <- as.Date(daily_calories$date, format = "%m/%d/%Y")
daily_intensities$date <- as.Date(daily_intensities$date, format = "%m/%d/%Y")
daily_steps$date <- as.Date(daily_steps$date, format = "%m/%d/%Y")


# Check to see format changed
glimpse(daily_activity)
## Rows: 940
## Columns: 15
## $ id                         <chr> "1503960366", "1503960366", "1503960366", "…
## $ date                       <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-0…
## $ total_steps                <int> 13162, 10735, 10460, 9762, 12669, 9705, 130…
## $ total_distance             <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9…
## $ tracker_distance           <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9…
## $ logged_activities_distance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_distance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3…
## $ moderately_active_distance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1…
## $ light_active_distance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5…
## $ sedentary_active_distance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_minutes        <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66,…
## $ fairly_active_minutes      <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, …
## $ lightly_active_minutes     <int> 328, 217, 181, 209, 221, 164, 233, 264, 205…
## $ sedentary_minutes          <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 8…
## $ calories                   <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2…

Checking for redundancy

Using the ‘sqldf()’ function with the INTERSECT operator

Upon exploration of the ‘daily_activity’ dataset it appears that the dataset provides the same information as the ‘daily_steps’ df, ‘daily_intensities’ df, and ‘daily_calories’ df.

Here is a side-by-side look at the ‘daily_activity’ df vs the ‘daily_steps’ df:

as_tibble(daily_activity)
## # A tibble: 940 × 15
##    id         date       total_steps total_distance tracker_distance
##    <chr>      <date>           <int>          <dbl>            <dbl>
##  1 1503960366 2016-04-12       13162           8.5              8.5 
##  2 1503960366 2016-04-13       10735           6.97             6.97
##  3 1503960366 2016-04-14       10460           6.74             6.74
##  4 1503960366 2016-04-15        9762           6.28             6.28
##  5 1503960366 2016-04-16       12669           8.16             8.16
##  6 1503960366 2016-04-17        9705           6.48             6.48
##  7 1503960366 2016-04-18       13019           8.59             8.59
##  8 1503960366 2016-04-19       15506           9.88             9.88
##  9 1503960366 2016-04-20       10544           6.68             6.68
## 10 1503960366 2016-04-21        9819           6.34             6.34
## # ℹ 930 more rows
## # ℹ 10 more variables: logged_activities_distance <dbl>,
## #   very_active_distance <dbl>, moderately_active_distance <dbl>,
## #   light_active_distance <dbl>, sedentary_active_distance <dbl>,
## #   very_active_minutes <int>, fairly_active_minutes <int>,
## #   lightly_active_minutes <int>, sedentary_minutes <int>, calories <int>
as_tibble(daily_steps)
## # A tibble: 940 × 3
##    id         date       step_total
##    <chr>      <date>          <int>
##  1 1503960366 2016-04-12      13162
##  2 1503960366 2016-04-13      10735
##  3 1503960366 2016-04-14      10460
##  4 1503960366 2016-04-15       9762
##  5 1503960366 2016-04-16      12669
##  6 1503960366 2016-04-17       9705
##  7 1503960366 2016-04-18      13019
##  8 1503960366 2016-04-19      15506
##  9 1503960366 2016-04-20      10544
## 10 1503960366 2016-04-21       9819
## # ℹ 930 more rows

Next we will test whether the three data frames (‘daily_calories’, ‘daily_intensities’, and ‘daily_steps’) duplicate information already in the ‘daily_activity’ df. If they do, we can remove the redundant data frames and keep a clean global environment.

We can confirm if they are identical by using the ‘sqldf()’ function with an INTERSECT and then counting distinct values using ‘n_distinct()’ function, see below:

daily_activity_2 <- daily_activity %>%
  select(id, date, calories)


head(daily_activity_2)
##           id       date calories
## 1 1503960366 2016-04-12     1985
## 2 1503960366 2016-04-13     1797
## 3 1503960366 2016-04-14     1776
## 4 1503960366 2016-04-15     1745
## 5 1503960366 2016-04-16     1863
## 6 1503960366 2016-04-17     1728
check_calories <- sqldf("SELECT *
                        FROM daily_activity_2
                        INTERSECT 
                        SELECT *
                        FROM daily_calories")
head(check_calories)
##           id       date calories
## 1 1503960366 2016-04-12     1985
## 2 1503960366 2016-04-13     1797
## 3 1503960366 2016-04-14     1776
## 4 1503960366 2016-04-15     1745
## 5 1503960366 2016-04-16     1863
## 6 1503960366 2016-04-17     1728
n_distinct(daily_activity_2$id)
## [1] 33
n_distinct(check_calories$id)
## [1] 33
nrow(daily_activity_2)
## [1] 940
nrow(check_calories)
## [1] 940

Confirmed redundancy

INTERSECT Operator

The INTERSECT operator retrieves the records common to its left and right queries:

  • The data types must be the same, or at least compatible
  • The number and order of the columns must be the same in both queries
  • INTERSECT filters duplicates and returns only the distinct rows that are common

Thus we can conclude that ‘daily_activity’ df contains all the information from the ‘daily_calories’ df. I use the same process to confirm if the other two data frames are also redundant.

daily_activity_3 <- daily_activity %>%
  select(id, date, sedentary_minutes, lightly_active_minutes, fairly_active_minutes, 
         very_active_minutes, sedentary_active_distance, light_active_distance, 
         moderately_active_distance, very_active_distance)

head(daily_activity_3)
##           id       date sedentary_minutes lightly_active_minutes
## 1 1503960366 2016-04-12               728                    328
## 2 1503960366 2016-04-13               776                    217
## 3 1503960366 2016-04-14              1218                    181
## 4 1503960366 2016-04-15               726                    209
## 5 1503960366 2016-04-16               773                    221
## 6 1503960366 2016-04-17               539                    164
##   fairly_active_minutes very_active_minutes sedentary_active_distance
## 1                    13                  25                         0
## 2                    19                  21                         0
## 3                    11                  30                         0
## 4                    34                  29                         0
## 5                    10                  36                         0
## 6                    20                  38                         0
##   light_active_distance moderately_active_distance very_active_distance
## 1                  6.06                       0.55                 1.88
## 2                  4.71                       0.69                 1.57
## 3                  3.91                       0.40                 2.44
## 4                  2.83                       1.26                 2.14
## 5                  5.04                       0.41                 2.71
## 6                  2.51                       0.78                 3.19
check_intensities <- sqldf("SELECT *
                           FROM daily_activity_3
                           INTERSECT
                           SELECT *
                           FROM daily_intensities")
head(check_intensities)
##           id       date sedentary_minutes lightly_active_minutes
## 1 1503960366 2016-04-12               728                    328
## 2 1503960366 2016-04-13               776                    217
## 3 1503960366 2016-04-14              1218                    181
## 4 1503960366 2016-04-15               726                    209
## 5 1503960366 2016-04-16               773                    221
## 6 1503960366 2016-04-17               539                    164
##   fairly_active_minutes very_active_minutes sedentary_active_distance
## 1                    13                  25                         0
## 2                    19                  21                         0
## 3                    11                  30                         0
## 4                    34                  29                         0
## 5                    10                  36                         0
## 6                    20                  38                         0
##   light_active_distance moderately_active_distance very_active_distance
## 1                  6.06                       0.55                 1.88
## 2                  4.71                       0.69                 1.57
## 3                  3.91                       0.40                 2.44
## 4                  2.83                       1.26                 2.14
## 5                  5.04                       0.41                 2.71
## 6                  2.51                       0.78                 3.19
n_distinct(daily_activity_3$id)
## [1] 33
n_distinct(check_intensities$id)
## [1] 33
nrow(daily_activity_3)
## [1] 940
nrow(check_intensities)
## [1] 940

Confirmed redundancy

daily_activity_4 <- daily_activity %>%
  select(id, date, total_steps)


head(daily_activity_4)
##           id       date total_steps
## 1 1503960366 2016-04-12       13162
## 2 1503960366 2016-04-13       10735
## 3 1503960366 2016-04-14       10460
## 4 1503960366 2016-04-15        9762
## 5 1503960366 2016-04-16       12669
## 6 1503960366 2016-04-17        9705
check_daily_steps <- sqldf("SELECT *
                           FROM daily_activity_4
                           INTERSECT
                           SELECT *
                           FROM daily_steps")

head(check_daily_steps)
##           id       date total_steps
## 1 1503960366 2016-04-12       13162
## 2 1503960366 2016-04-13       10735
## 3 1503960366 2016-04-14       10460
## 4 1503960366 2016-04-15        9762
## 5 1503960366 2016-04-16       12669
## 6 1503960366 2016-04-17        9705
n_distinct(daily_activity_4$id)
## [1] 33
n_distinct(check_daily_steps$id)
## [1] 33
nrow(daily_activity_4)
## [1] 940
nrow(check_daily_steps)
## [1] 940

Confirmed redundancy

Each SQL check returns 940 observations and 33 distinct IDs, the same as the corresponding columns of ‘daily_activity’. This indicates that the ‘daily_activity’ df contains all the information from the ‘daily_calories’, ‘daily_intensities’, and ‘daily_steps’ data frames.
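The three checks follow one pattern, which can be sketched as a small helper. This is a base R illustration on made-up rows, not the code used in the analysis: `rows_in_common()` is a hypothetical name, and `merge()` followed by `unique()` stands in for SQL INTERSECT. Columns are matched by position, as INTERSECT does, which is why a `total_steps` column can be compared with a `step_total` column.

```r
# Hypothetical helper: rows present in both data frames, compared by column
# position (like SQL INTERSECT), with duplicate rows removed.
rows_in_common <- function(a, b) {
  stopifnot(ncol(a) == ncol(b))
  names(b) <- names(a)   # align names so merge() joins on every column
  unique(merge(a, b))
}

# Toy data (made up): a wide-table subset and a narrow table
activity <- data.frame(
  id = c("1", "1", "2"),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  total_steps = c(13162L, 10735L, 8163L)
)
steps <- data.frame(
  id = c("1", "1", "2"),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  step_total = c(13162L, 10735L, 8163L)
)

common <- rows_in_common(activity, steps)

# The narrow table is redundant if every distinct row of both tables
# appears in the intersection
is_redundant <- nrow(common) == nrow(unique(activity)) &&
  nrow(common) == nrow(unique(steps))
print(is_redundant)
```

Comparing row counts on both sides is what makes the check conclusive: matching ID counts alone would not rule out rows that differ in their values.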

Remove redundant dataframes

rm(daily_calories, daily_intensities, daily_steps, daily_activity_2, daily_activity_3, daily_activity_4, check_calories, check_daily_steps, check_intensities)

Cleaning: check for duplicates

sum(duplicated(daily_activity))
## [1] 0
# Check unique ID numbers
n_unique(daily_activity$id)
## [1] 33

Phase 4-5: Analyze & Share Data

Daily activity mean values

##       id                 date             total_steps    total_distance  
##  Length:940         Min.   :2016-04-12   Min.   :    0   Min.   : 0.000  
##  Class :character   1st Qu.:2016-04-19   1st Qu.: 3790   1st Qu.: 2.620  
##  Mode  :character   Median :2016-04-26   Median : 7406   Median : 5.245  
##                     Mean   :2016-04-26   Mean   : 7638   Mean   : 5.490  
##                     3rd Qu.:2016-05-04   3rd Qu.:10727   3rd Qu.: 7.713  
##                     Max.   :2016-05-12   Max.   :36019   Max.   :28.030  
##  tracker_distance logged_activities_distance very_active_distance
##  Min.   : 0.000   Min.   :0.0000             Min.   : 0.000      
##  1st Qu.: 2.620   1st Qu.:0.0000             1st Qu.: 0.000      
##  Median : 5.245   Median :0.0000             Median : 0.210      
##  Mean   : 5.475   Mean   :0.1082             Mean   : 1.503      
##  3rd Qu.: 7.710   3rd Qu.:0.0000             3rd Qu.: 2.053      
##  Max.   :28.030   Max.   :4.9421             Max.   :21.920      
##  moderately_active_distance light_active_distance sedentary_active_distance
##  Min.   :0.0000             Min.   : 0.000        Min.   :0.000000         
##  1st Qu.:0.0000             1st Qu.: 1.945        1st Qu.:0.000000         
##  Median :0.2400             Median : 3.365        Median :0.000000         
##  Mean   :0.5675             Mean   : 3.341        Mean   :0.001606         
##  3rd Qu.:0.8000             3rd Qu.: 4.782        3rd Qu.:0.000000         
##  Max.   :6.4800             Max.   :10.710        Max.   :0.110000         
##  very_active_minutes fairly_active_minutes lightly_active_minutes
##  Min.   :  0.00      Min.   :  0.00        Min.   :  0.0         
##  1st Qu.:  0.00      1st Qu.:  0.00        1st Qu.:127.0         
##  Median :  4.00      Median :  6.00        Median :199.0         
##  Mean   : 21.16      Mean   : 13.56        Mean   :192.8         
##  3rd Qu.: 32.00      3rd Qu.: 19.00        3rd Qu.:264.0         
##  Max.   :210.00      Max.   :143.00        Max.   :518.0         
##  sedentary_minutes    calories   
##  Min.   :   0.0    Min.   :   0  
##  1st Qu.: 729.8    1st Qu.:1828  
##  Median :1057.5    Median :2134  
##  Mean   : 991.2    Mean   :2304  
##  3rd Qu.:1229.5    3rd Qu.:2793  
##  Max.   :1440.0    Max.   :4900

Mean sedentary time is very high at 991 minutes, or approximately 16.5 hours per day.

According to the Centers for Disease Control and Prevention (CDC), adults younger than 60 can help lower their risk of premature death from all causes by taking 8,000 to 10,000 steps per day; the CDC-recommended activity level is 10,000 steps per day (CDC.GOV).

Activity level by category

I want to see the distribution of users based on their daily step count. With the CDC-recommended 10,000 steps per day as the ultimate goal for our customers, we first group participants into physical activity level categories based on research by Tudor-Locke C (2009, How many steps are enough).

## # A tibble: 6 × 2
##   id         mean_total_steps
##   <chr>                 <dbl>
## 1 1503960366            12117
## 2 1624580081             5744
## 3 1644430081             7283
## 4 1844505072             2580
## 5 1927972279              916
## 6 2022484408            11371
## # A tibble: 6 × 3
##   id         mean_total_steps activity_type
##   <chr>                 <dbl> <chr>        
## 1 1503960366            12117 active       
## 2 1624580081             5744 low activity 
## 3 1644430081             7283 low activity 
## 4 1844505072             2580 sedentary    
## 5 1927972279              916 sedentary    
## 6 2022484408            11371 active
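The code behind these two tables is not shown. A minimal sketch of one way to produce them, on made-up rows: the 7,500, 10,000, and 12,500 cut-offs are the ones named in this write-up, while the 5,000-step boundary between "sedentary" and "low activity" is an assumption (it is consistent with the rows above but is not stated here). Base R `aggregate()` and `cut()` are used so the sketch runs without extra packages.

```r
# Toy daily records (made up); the real input would be daily_activity
daily <- data.frame(
  id = rep(c("A", "B", "C"), each = 2),
  total_steps = c(12000, 12234, 5000, 6488, 900, 932)
)

# Mean daily steps per participant
user_means <- aggregate(total_steps ~ id, data = daily, FUN = mean)
names(user_means)[2] <- "mean_total_steps"

# Assign an activity level; each interval includes its lower bound.
# The 5000 boundary is an assumption, see the note above.
user_means$activity_type <- cut(
  user_means$mean_total_steps,
  breaks = c(-Inf, 5000, 7500, 10000, 12500, Inf),
  labels = c("sedentary", "low activity", "somewhat active",
             "active", "highly active"),
  right = FALSE
)
print(user_means)
```

`cut()` returns a factor whose levels are already ordered from least to most active, which keeps the legend of a later chart in a sensible order.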

Let’s see the percentage of participants by activity type

activity_user_type_percent <- activity_user_type %>%
  group_by(activity_type) %>%
  summarize(total=n()) %>%
  mutate(totals= sum(total)) %>%
  group_by(activity_type) %>%
  summarise(total_percent = total/totals) %>%
  mutate(percent= scales :: percent(total_percent)) %>%
  arrange(desc(total_percent))

activity_user_type_percent$activity_type <- factor(activity_user_type_percent$activity_type, 
levels= c("sedentary", "low activity", "somewhat active", "active", "highly active"))

head(activity_user_type_percent)
## # A tibble: 5 × 3
##   activity_type   total_percent percent
##   <fct>                   <dbl> <chr>  
## 1 low activity           0.273  27.3%  
## 2 somewhat active        0.273  27.3%  
## 3 sedentary              0.242  24.2%  
## 4 active                 0.152  15.2%  
## 5 highly active          0.0606 6.1%

User type by activity

ggplot(data= activity_user_type_percent, aes(x="", y= total_percent, fill= activity_type)) +
  geom_bar(stat="identity", width=1, color= "white") +
  coord_polar("y", start=0) +
  scale_fill_brewer(palette = 'Blues') +
  theme_void() + #removes background, grid, numeric labels
  theme(plot.title = element_text(hjust= 0.5, vjust= 0, size= 22, face= "bold")) +
  geom_text(aes(label= percent, x=1.2), position= position_stack(vjust= 0.5)) +
  #adds text in aes pie chart
  labs(title="User type by activity") + 
  guides(fill= guide_legend(title="activity type"))

The data reveals the following:

  • 24.2% of participants have a sedentary lifestyle
  • 51.5% of participants can be categorized as sedentary or low activity
    • meaning they take fewer than 7,500 steps per day
  • 21.2% of participants (7 of 33) can be categorized as active or highly active
    • meaning they take at least 10,000 (active) or at least 12,500 (highly active) steps per day

Visualize the data

Relationship between ‘total_steps’ and ‘calories’:

ggplot(data= daily_activity, aes(x= total_steps, y= calories)) +
  geom_point(size=3, alpha=.5) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_bw() +
  labs(title= "Calories burned explained by Daily Activity Total Steps",
       x= "Total Steps",
       y= "Calories burned")
## `geom_smooth()` using formula = 'y ~ x'

Each point in the plot above represents one participant-day: the total steps and the calories burned recorded for one participant on one day. As daily total steps increase, so does daily calorie burn (seemingly in a linear fashion). The blue line is the line of best fit; it is our model. Next we will create a linear model and run a summary analysis to quantify how much of the variation in calories burned can be explained by total steps.

Simple linear regression

Create a model

A linear model with this data can be used in two ways:

  1. To determine how much of the variation in daily caloric burn can be explained by daily total steps.
  2. To predict the daily caloric burn for a given daily total step count (even for step counts not observed in the data).

To create a model we simply use the ‘lm()’ function. Once we have a model, we can look at its details using the ‘summary()’ function. Let’s take a look.

Create the model

Generate the summary

## 
## Call:
## lm(formula = calories ~ total_steps, data = daily_activity)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1983.81  -373.52   -10.63   431.50  1864.81 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.666e+03  3.410e+01   48.85   <2e-16 ***
## total_steps 8.351e-02  3.716e-03   22.47   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 579.3 on 938 degrees of freedom
## Multiple R-squared:   0.35,  Adjusted R-squared:  0.3493 
## F-statistic:   505 on 1 and 938 DF,  p-value: < 2.2e-16

Interpreting the results: the summary reports a p-value < 2.2e-16. This is far less than 0.05, so we can reject the null hypothesis that there is no relationship (\(\beta\) = 0). There is a statistically significant relationship between ‘calories’ and ‘total_steps’ in this linear model.

Moreover, the Multiple R-squared of 0.35 tells us that 35% of the variation in calories can be explained by total steps.
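The code that produced this summary is not shown; the `Call:` line indicates a call of the form `lm(calories ~ total_steps, data = daily_activity)`. The sketch below runs the same kind of call on simulated data (the numbers are made up, so the fit will not match the output above) and shows how to read off R-squared and make a prediction. As a worked example with the reported coefficients, a 10,000-step day gives a predicted burn of about 1666 + 0.08351 × 10,000 ≈ 2,501 calories.

```r
set.seed(42)

# Simulated stand-in for daily_activity (made-up values)
toy <- data.frame(total_steps = round(runif(200, 0, 20000)))
toy$calories <- 1700 + 0.08 * toy$total_steps + rnorm(200, sd = 500)

# Fit the simple linear model and summarise it
model <- lm(calories ~ total_steps, data = toy)
model_summary <- summary(model)

# Share of the variation in calories explained by total_steps
r_squared <- model_summary$r.squared

# Predicted calories for a 10,000-step day
pred <- predict(model, newdata = data.frame(total_steps = 10000))
print(c(r_squared = r_squared, predicted = unname(pred)))
```

`summary(model)` prints the full table of coefficients, while `model_summary$r.squared` and `predict()` give the individual numbers needed for reporting.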

Multiple linear regression

ggplot(data= daily_activity, aes(x= total_steps, y= calories, color= total_distance)) +
  geom_point(size=3, alpha=.5) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_bw() +
  labs(title= "Calories burned explained by Total Steps and Total Distance")
## `geom_smooth()` using formula = 'y ~ x'

The above scatter plot shows that there is a positive correlation between calories burned and total steps and total distance covered. Next we will create a linear model and run a summary analysis to quantify how much of the change in calories burned can be explained by total steps and total distance covered.

Create the model

Generate the summary

## 
## Call:
## lm(formula = calories ~ total_steps + total_distance, data = daily_activity)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2361.15  -293.47   -51.04   323.07  2209.25 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1789.47073   31.47189   56.86   <2e-16 ***
## total_steps      -0.21363    0.01947  -10.97   <2e-16 ***
## total_distance  390.88142   25.23232   15.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 517.2 on 937 degrees of freedom
## Multiple R-squared:  0.4825, Adjusted R-squared:  0.4814 
## F-statistic: 436.8 on 2 and 937 DF,  p-value: < 2.2e-16

Interpreting the results: the summary reports a p-value < 2.2e-16. This is far less than 0.05, so we can reject the null hypothesis that the coefficients are zero (\(\beta\) = 0). There is a statistically significant relationship between ‘calories’ and the predictors ‘total_steps’ and ‘total_distance’ in this multiple regression model.

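Moreover, the Adjusted R-squared of 0.4814 tells us that 48.14% of the variation in calories can be explained by total steps and total distance covered. Notice that in our new model we have a coefficient for ‘total_distance’. Remember that each coefficient represents the change we would expect to see in the outcome variable for a one-unit change in that explanatory variable, holding the other explanatory variable constant. In this case, for every additional unit of distance covered (miles), we can expect approximately 391 more calories burned. Note also that the ‘total_steps’ coefficient is negative (-0.21) in this model. Steps and distance are likely to be strongly correlated with each other, so the individual coefficients are hard to interpret in isolation and the sign should not be read as steps reducing calorie burn.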
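Again, the model code is not shown; the `Call:` line indicates `lm(calories ~ total_steps + total_distance, data = daily_activity)`. The sketch below fits the same kind of model on simulated data (made-up values, so the numbers differ from the output above) and adds a check of how strongly the two predictors are related. When predictors are highly correlated, individual coefficients can become unstable and may even change sign, which is one possible reading of the negative ‘total_steps’ estimate in the summary above.

```r
set.seed(1)

# Simulated data (made up): distance is almost a fixed multiple of steps
toy <- data.frame(total_steps = round(runif(300, 0, 20000)))
toy$total_distance <- toy$total_steps * 0.0007 + rnorm(300, sd = 0.3)
toy$calories <- 1700 + 0.08 * toy$total_steps + rnorm(300, sd = 500)

# How strongly are the two predictors related?
predictor_cor <- cor(toy$total_steps, toy$total_distance)

# Fit the multiple regression and pull out the adjusted R-squared
multi_model <- lm(calories ~ total_steps + total_distance, data = toy)
adj_r_squared <- summary(multi_model)$adj.r.squared

print(predictor_cor)
print(coef(multi_model))
```

Running `cor(daily_activity$total_steps, daily_activity$total_distance)` on the real data would show whether this concern applies here.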

  • We see a positive correlation between calories burned and total steps and distance covered.
  • To help Bellabeat users improve their overall health, the Bellabeat app should send notifications that nudge people to walk more, perhaps building up to the CDC’s recommended 10,000 steps per day.
  • Although the interpretation follows from the output, the results may be skewed by the small sample size and the lack of additional raw data. More data is needed to support these findings.

The above results are what was expected: the more people move and the greater the distance covered, the more calories they burn. Let’s see if we can find out when our participants are most active.

Hourly steps and calories

Here we analyze how user activity is distributed throughout the day. In this section we will explore, clean, and analyze data from the ‘hourly_steps’ and ‘hourly_calories’ data frames.

##           Id          ActivityHour StepTotal
## 1 1503960366 4/12/2016 12:00:00 AM       373
## 2 1503960366  4/12/2016 1:00:00 AM       160
## 3 1503960366  4/12/2016 2:00:00 AM       151
## 4 1503960366  4/12/2016 3:00:00 AM         0
## 5 1503960366  4/12/2016 4:00:00 AM         0
## 6 1503960366  4/12/2016 5:00:00 AM         0
##           Id          ActivityHour Calories
## 1 1503960366 4/12/2016 12:00:00 AM       81
## 2 1503960366  4/12/2016 1:00:00 AM       61
## 3 1503960366  4/12/2016 2:00:00 AM       59
## 4 1503960366  4/12/2016 3:00:00 AM       47
## 5 1503960366  4/12/2016 4:00:00 AM       48
## 6 1503960366  4/12/2016 5:00:00 AM       48

Cleaning column names | correcting data type

# Cleaning column names
hourly_steps <- hourly_steps %>%
  clean_names()
hourly_calories <- hourly_calories %>%
  clean_names()
# Cleaning: correcting data type
hourly_steps$id <- as.character(hourly_steps$id)
hourly_calories$id <- as.character(hourly_calories$id)

# Cleaning: correcting date format

hourly_steps <- hourly_steps %>%
  rename(date= activity_hour)

# Parse the 12-hour timestamps directly. Pasting the column onto itself is not
# needed: as.POSIXct() reads only the part of the string that matches the format.
hourly_steps$date <- as.POSIXct(hourly_steps$date,
                                format= "%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())


hourly_calories <- hourly_calories %>%
  rename(date=activity_hour) %>%
  mutate(date=as.POSIXct(date, format= "%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone()))

Check for duplicates

## [1] 0
## [1] 0
## [1] 33
## [1] 33
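The code behind these checks is not shown in the report. A minimal sketch of how such counts could be produced, assuming the first pair of numbers counts duplicated rows and the second pair counts distinct participants, is given below. The toy data frame ‘hourly_example’ and its values are illustrative only.

```r
library(dplyr)

# Toy stand-in for an hourly data frame (values are illustrative only)
hourly_example <- data.frame(
  id = c("A", "A", "B", "B"),
  date = as.POSIXct(c("2016-04-12 00:00:00", "2016-04-12 01:00:00",
                      "2016-04-12 00:00:00", "2016-04-12 01:00:00"),
                    tz = "UTC"),
  step_total = c(373, 160, 0, 25)
)

# Number of fully duplicated rows
n_duplicates <- sum(duplicated(hourly_example))

# Number of unique participant ids
n_participants <- n_distinct(hourly_example$id)

print(n_duplicates)
print(n_participants)
```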

Merging ‘hourly_steps’ and ‘hourly_calories’ data frames

Now we take the merged data frame with hourly steps and hourly calories burned data and create a column with time of day only.

##           id                date step_total calories  time
## 1 1503960366 2016-04-12 00:00:00        373       81 00:00
## 2 1503960366 2016-04-12 01:00:00        160       61 01:00
## 3 1503960366 2016-04-12 02:00:00        151       59 02:00
## 4 1503960366 2016-04-12 03:00:00          0       47 03:00
## 5 1503960366 2016-04-12 04:00:00          0       48 04:00
## 6 1503960366 2016-04-12 05:00:00          0       48 05:00
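The merge and the time-of-day column are not shown as code in the report. One way they could be done is sketched below, assuming an inner join on ‘id’ and ‘date’ and using format() to extract the hour and minute. The two small data frames are hypothetical stand-ins built from the first rows shown above.

```r
library(dplyr)

# Hypothetical stand-ins for the cleaned hourly data frames
stamps <- as.POSIXct(c("2016-04-12 00:00:00", "2016-04-12 01:00:00"), tz = "UTC")
steps_example    <- data.frame(id = c("A", "A"), date = stamps, step_total = c(373, 160))
calories_example <- data.frame(id = c("A", "A"), date = stamps, calories = c(81, 61))

# Keep only rows present in both data frames, then add a time-of-day column
merged_example <- steps_example %>%
  inner_join(calories_example, by = c("id", "date")) %>%
  mutate(time = format(date, format = "%H:%M"))

print(merged_example)
```

Both date columns need the same type and time zone for the join keys to match, which is why the cleaning step converts them in the same way.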

Check for duplicates

## [1] 0
## [1] 33

View the data: ‘hourly_steps_calories_merged’

## Rows: 24
## Columns: 3
## $ time             <chr> "00:00", "01:00", "02:00", "03:00", "04:00", "05:00",…
## $ mean_total_steps <dbl> 42.188437, 23.102894, 17.110397, 6.426581, 12.699571,…
## $ mean_calories    <dbl> 71.80514, 70.16506, 69.18650, 67.53805, 68.26180, 81.…
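A 24-row summary like the one above could be obtained by grouping on the ‘time’ column and averaging. The sketch below assumes that approach and uses a made-up four-row data frame in place of the merged data.

```r
library(dplyr)

# Made-up rows standing in for the merged hourly data
merged_example <- data.frame(
  time       = c("00:00", "00:00", "01:00", "01:00"),
  step_total = c(40, 60, 10, 30),
  calories   = c(70, 74, 68, 72)
)

# One row per time of day, with the mean steps and calories for that hour
hourly_means <- merged_example %>%
  group_by(time) %>%
  summarize(mean_total_steps = mean(step_total),
            mean_calories    = mean(calories))

print(hourly_means)
```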

Visualizing the data

The column charts above show that our participants are most active from 12pm to 2pm and from 5pm to 7pm. One plausible explanation is that many participants have sedentary jobs, with activity picking up around lunchtime and after work.
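The plotting code for the column charts is not included in the report. A minimal sketch of one such chart, assuming geom_col() on the hourly means, is shown below. The values in ‘hourly_means_example’ are generated from a formula purely for illustration and do not reproduce the real data.

```r
library(ggplot2)

# Illustrative hourly means (synthetic values, not the study data)
hourly_means_example <- data.frame(
  time = sprintf("%02d:00", 0:23),
  mean_total_steps = round(300 + 250 * sin((0:23 - 9) * pi / 12))
)

# Column chart of mean steps by time of day
steps_plot <- ggplot(hourly_means_example, aes(x = time, y = mean_total_steps)) +
  geom_col() +
  labs(title = "Average steps by time of day",
       x = "Time of day", y = "Mean total steps") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90))

print(steps_plot)
```

The same pattern with ‘mean_calories’ on the y axis would give the matching calories chart.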

Daily Sleep

The CDC recommends 7 or more hours of sleep per night for adults aged 18-60 (CDC.gov). Let’s see how our participants’ average sleep time compares to the recommended amount.

We are going to analyze daily sleep for the 24 participants that shared their sleep data. Our hypothesis is that those participants who are more active get more sleep time.

## Rows: 413
## Columns: 5
## $ Id                 <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150…
## $ SleepDay           <chr> "4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM", "…
## $ TotalSleepRecords  <int> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ TotalMinutesAsleep <int> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2…
## $ TotalTimeInBed     <int> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3…

Clean ‘sleep_day’ data frame

# Cleaning column names
sleep_day <- sleep_day %>%
  clean_names()

# Cleaning duplicates if applicable

sum(duplicated(sleep_day))
## [1] 3

There are 3 duplicate rows of data in the ‘sleep_day’ data frame.

# Cleaning: removing duplicates

sleep_day <- sleep_day %>%
  distinct() %>%
  drop_na()

sum(duplicated(sleep_day))
## [1] 0
## Rows: 410
## Columns: 5
## $ id                   <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1…
## $ sleep_day            <chr> "4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM",…
## $ total_sleep_records  <int> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ total_minutes_asleep <int> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430,…
## $ total_time_in_bed    <int> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449,…
# Cleaning: correct data type

sleep_day$id= as.character(sleep_day$id)

# Cleaning: date format
sleep_day <- sleep_day %>%
  rename(date=sleep_day) %>%
  mutate(date= as_datetime(date, format= "%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone()))
# Cleaning: check changes
glimpse(sleep_day)
## Rows: 410
## Columns: 5
## $ id                   <chr> "1503960366", "1503960366", "1503960366", "150396…
## $ date                 <dttm> 2016-04-12, 2016-04-13, 2016-04-15, 2016-04-16, …
## $ total_sleep_records  <int> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ total_minutes_asleep <int> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430,…
## $ total_time_in_bed    <int> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449,…
sleep_day_mean <- sleep_day %>%
  group_by(id) %>%
  drop_na() %>%
  summarize(mean_total_minutes_asleep= mean(total_minutes_asleep), 
            mean_time_in_bed= mean(total_time_in_bed))

sleep_day_mean <- sleep_day_mean %>%
  mutate(time_asleep_percent_rec = round((mean_total_minutes_asleep/420)*100))

glimpse(sleep_day_mean)
## Rows: 24
## Columns: 4
## $ id                        <chr> "1503960366", "1644430081", "1844505072", "1…
## $ mean_total_minutes_asleep <dbl> 360.2800, 294.0000, 652.0000, 417.0000, 506.…
## $ mean_time_in_bed          <dbl> 383.2000, 346.0000, 961.0000, 437.8000, 537.…
## $ time_asleep_percent_rec   <dbl> 86, 70, 155, 99, 121, 15, 106, 70, 83, 113, …

Summary of Sleep Day ‘total_minutes_asleep’

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    58.0   361.0   432.5   419.2   490.0   796.0

Visualizing the data

ggplot(data=sleep_day_mean) +
  geom_point(aes(mean_total_minutes_asleep, time_asleep_percent_rec)) +
  labs(title= "Average sleep time", subtitle = "Percentage of recommended 7 hours +", x="Average min asleep", y="Percentage from 420 min") +
  geom_hline(yintercept = 100, color="blue") +
  theme_bw()+
  theme(plot.title= element_text(size= 20),
        plot.subtitle = element_text(size = 15, color= "blue"),
        axis.title.x= element_text(size= 15),
        axis.title.y= element_text(size=15))

The above scatter plot shows each participant’s average minutes asleep as a percentage of the recommended 7 hrs (420 minutes). Points below the blue line at 100% fall short of the recommendation. Here we see that over half of the users sleep less than the recommended amount.

## [1] 13

Of the 24 participants that shared their sleep data, 13 average less than the recommended amount, so fewer than half are sleeping the recommended 7 hrs.
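A count like this can be obtained by filtering on the percentage column. The sketch below assumes a cut-off of 100% and uses only the first four values visible in the glimpse() output above, so its result differs from the full-data count.

```r
library(dplyr)

# First four values of time_asleep_percent_rec shown above (ids are placeholders)
sleep_mean_example <- data.frame(
  id = c("A", "B", "C", "D"),
  time_asleep_percent_rec = c(86, 70, 155, 99)
)

# Participants whose average sleep is below 100% of the 420-minute recommendation
n_below <- sleep_mean_example %>%
  filter(time_asleep_percent_rec < 100) %>%
  nrow()

share_below <- n_below / nrow(sleep_mean_example)

print(n_below)
print(share_below)
```

On the full data set, 13 of 24 participants is about 54%.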

Daily sleep vs steps

Here we examine the relationship between daily total steps and daily minutes of sleep.

## 'data.frame':    410 obs. of  18 variables:
##  $ id                        : chr  "1503960366" "1503960366" "1503960366" "1503960366" ...
##  $ date                      : Date, format: "2016-04-12" "2016-04-13" ...
##  $ total_steps               : int  13162 10735 9762 12669 9705 15506 10544 9819 14371 10039 ...
##  $ total_distance            : num  8.5 6.97 6.28 8.16 6.48 ...
##  $ tracker_distance          : num  8.5 6.97 6.28 8.16 6.48 ...
##  $ logged_activities_distance: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ very_active_distance      : num  1.88 1.57 2.14 2.71 3.19 ...
##  $ moderately_active_distance: num  0.55 0.69 1.26 0.41 0.78 ...
##  $ light_active_distance     : num  6.06 4.71 2.83 5.04 2.51 ...
##  $ sedentary_active_distance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ very_active_minutes       : int  25 21 29 36 38 50 28 19 41 39 ...
##  $ fairly_active_minutes     : int  13 19 34 10 20 31 12 8 21 5 ...
##  $ lightly_active_minutes    : int  328 217 209 221 164 264 205 211 262 238 ...
##  $ sedentary_minutes         : int  728 776 726 773 539 775 818 838 732 709 ...
##  $ calories                  : int  1985 1797 1745 1863 1728 2035 1786 1775 1949 1788 ...
##  $ total_sleep_records       : int  1 2 1 2 1 1 1 1 1 1 ...
##  $ total_minutes_asleep      : int  327 384 412 340 700 304 360 325 361 430 ...
##  $ total_time_in_bed         : int  346 407 442 367 712 320 377 364 384 449 ...
## [1] 0
## [1] 24
## [1] 410

Note: daily_activity had 33 participants and 940 observations, while sleep_day had 24 participants and 410 observations.
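The drop from 940 to 410 observations is what an inner join would produce, since only days with both activity and sleep records survive. The sketch below assumes that kind of join on ‘id’ and ‘date’, with both date columns already converted to Date. The data frames are hypothetical miniatures.

```r
library(dplyr)

# Hypothetical miniature of the daily activity data (3 rows, 2 participants)
activity_example <- data.frame(
  id = c("A", "A", "B"),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  total_steps = c(13162, 10735, 9762)
)

# Hypothetical miniature of the sleep data (participant B shared no sleep data)
sleep_example <- data.frame(
  id = c("A", "A"),
  date = as.Date(c("2016-04-12", "2016-04-13")),
  total_minutes_asleep = c(327, 384)
)

# Rows without a match in both data frames are dropped
merged_daily <- inner_join(activity_example, sleep_example, by = c("id", "date"))

print(nrow(merged_daily))
print(n_distinct(merged_daily$id))
```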

Finding relationship between ‘minutes asleep’ and ‘total steps’

ggplot(data= daily_steps_sleep_day_merged, aes(x= total_steps, y= total_minutes_asleep)) +
  geom_point(size=3, alpha=.5) +
  geom_smooth(method= "lm", se= FALSE) +
  theme_bw() +
  labs(title= "Minutes Asleep explained by Daily Total Steps",
       x= "Total Steps",
       y= "Minutes Asleep")
## `geom_smooth()` using formula = 'y ~ x'

The above scatter plot shows that there is a slight negative correlation between minutes asleep and total steps. Next we will create a linear model and run a summary analysis to quantify how much of the change in minutes asleep can be explained by total steps.

Create a model

A linear model with this data can be used in two ways:

  1. To determine how much of the change in daily minutes of sleep can be explained by the change in daily total steps.
  2. To predict the minutes of sleep, given the daily total steps.

To create a model we simply use the ‘lm()’ function. Once we have a model we can look at its details using the ‘summary()’ function. Let’s take a look.

Create the model

Generate the summary

## 
## Call:
## lm(formula = total_minutes_asleep ~ total_steps, data = daily_steps_sleep_day_merged)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -386.32  -56.67   13.31   70.32  350.99 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 465.423623  13.138907  35.423  < 2e-16 ***
## total_steps  -0.005432   0.001387  -3.916 0.000105 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 116.6 on 408 degrees of freedom
## Multiple R-squared:  0.03623,    Adjusted R-squared:  0.03387 
## F-statistic: 15.34 on 1 and 408 DF,  p-value: 0.0001054

This summary shows that ‘total_steps’ explains very little of the variation in total minutes of sleep.

Although there is a slight negative correlation between ‘minutes asleep’ and ‘total steps’, and the slope is statistically significant (p = 0.0001), the Multiple R-squared of 0.03623 suggests that only 3.6% of the variation in ‘minutes asleep’ can be explained by ‘total steps’.
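The model-fitting code is hidden in the report. The sketch below shows the general lm() and summary() pattern on a small made-up data frame, so the numbers it prints are not those of the study. The names ‘sleep_steps_example’ and ‘sleep_model’ are introduced here for illustration.

```r
# Made-up data standing in for daily_steps_sleep_day_merged
sleep_steps_example <- data.frame(
  total_steps          = c(2000, 4000, 6000, 8000, 10000, 12000),
  total_minutes_asleep = c(450, 470, 420, 440, 400, 430)
)

# Fit a simple linear regression: minutes asleep explained by total steps
sleep_model <- lm(total_minutes_asleep ~ total_steps, data = sleep_steps_example)

# summary() reports coefficients, p-values, and R-squared
model_summary <- summary(sleep_model)
print(model_summary)

# Individual pieces can be pulled out of the fitted objects
slope     <- unname(coef(sleep_model)["total_steps"])
r_squared <- model_summary$r.squared
```

Swapping ‘total_steps’ for another column in the formula gives the corresponding model for that predictor.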

Finding relationship between ‘minutes asleep’ and ‘total distance’

ggplot(data= daily_steps_sleep_day_merged, aes(x= total_distance, y= total_minutes_asleep)) +
  geom_point(size=3, alpha=.5) +
  geom_smooth(method= "lm", se= FALSE) +
  theme_bw() +
  labs(title= "Total Minutes Asleep Explained by Total Distance")
## `geom_smooth()` using formula = 'y ~ x'

The above scatter plot shows that there is a slight negative correlation between ‘minutes asleep’ and ‘total distance’. Next we will create a linear model and run a summary analysis to quantify how much of the change in minutes asleep can be explained by total distance.

Create the model

Generate the summary

## 
## Call:
## lm(formula = total_minutes_asleep ~ total_distance, data = daily_steps_sleep_day_merged)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -384.53  -54.44   13.41   70.27  354.19 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     460.635     12.795  36.002  < 2e-16 ***
## total_distance   -6.896      1.899  -3.631 0.000318 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 116.9 on 408 degrees of freedom
## Multiple R-squared:  0.03131,    Adjusted R-squared:  0.02893 
## F-statistic: 13.19 on 1 and 408 DF,  p-value: 0.000318

This summary shows that ‘total_distance’ explains very little of the variation in ‘total_minutes_asleep’.

Although there is a slight negative correlation between ‘total_minutes_asleep’ and ‘total_distance’, the Multiple R-squared of 0.03131 suggests that only 3.1% of the variation in ‘total_minutes_asleep’ can be explained by ‘total_distance’.

Since total steps and total distance each explain only about 3% of the variation in daily sleep, let’s move on and see if there is a relationship between daily sleep and activity minutes.

Daily sleep vs Activity minutes

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The four scatter plots above show the relationship between ‘Minutes Asleep’ and each activity type (i.e. sedentary, very active, lightly active, fairly active). The black horizontal line represents the CDC’s recommended 7+ hours of sleep.
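The code for these plots is not shown. A sketch of how one of them might be built is given below, assuming a loess smoother (as the messages above indicate) and a horizontal line at 420 minutes. The data are simulated with set.seed() for illustration and are not the study data.

```r
library(ggplot2)

# Simulated data for illustration only
set.seed(1)
sedentary_example <- data.frame(sedentary_minutes = seq(400, 1100, length.out = 30))
sedentary_example$total_minutes_asleep <-
  725 - 0.43 * sedentary_example$sedentary_minutes + rnorm(30, sd = 40)

# Scatter plot with a loess smoother and the 7-hour (420-minute) reference line
sedentary_plot <- ggplot(sedentary_example,
                         aes(x = sedentary_minutes, y = total_minutes_asleep)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  geom_hline(yintercept = 420, color = "black") +
  labs(title = "Sedentary Minutes vs Minutes Asleep",
       x = "Sedentary Minutes", y = "Minutes Asleep") +
  theme_bw()

print(sedentary_plot)
```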

According to the first scatter plot, labeled “Sedentary Minutes vs Minutes Asleep”, there appears to be a negative correlation between ‘sedentary minutes’ and ‘minutes asleep’. Let’s create a model to examine the relationship in further detail.

Create the model

## 
## Call:
## lm(formula = total_minutes_asleep ~ sedentary_minutes, data = daily_steps_sleep_day_merged)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -422.74  -50.04    8.05   57.68  363.91 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       724.74079   20.65331   35.09   <2e-16 ***
## sedentary_minutes  -0.42911    0.02825  -15.19   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 94.93 on 408 degrees of freedom
## Multiple R-squared:  0.3613, Adjusted R-squared:  0.3597 
## F-statistic: 230.8 on 1 and 408 DF,  p-value: < 2.2e-16

Interpreting the results: here you’ll see p-value < 2.2e-16. This is far below 0.05, so we can reject the null hypothesis that there is no relationship between sedentary minutes and minutes asleep (\(\beta\) = 0). In other words, there is a significant relationship between the variables ‘total_minutes_asleep’ and ‘sedentary_minutes’ in this simple regression model. Moreover, the Multiple R-squared of 0.3613 tells us that 36.13% of the variation in minutes asleep can be explained by sedentary minutes.

  • We can clearly see a negative relationship between sedentary minutes and sleep time.
  • In order to help Bellabeat users improve their sleep time, the Bellabeat app should recommend reducing sedentary time.
  • Although the interpretation of the results is accurate, the results may be skewed by the small sample size and limited raw data. More raw data is needed to support these findings.
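To make the size of this relationship concrete, the sketch below uses the coefficients from the summary above to compare two hypothetical days that differ by 60 sedentary minutes. The input values and the helper ‘predict_sleep’ are assumptions for illustration, and the result describes an association in this sample rather than a guaranteed effect.

```r
# Coefficients copied from the sedentary_minutes model summary above
b_intercept <- 724.74079
b_sedentary <- -0.42911

# Hypothetical helper: predicted minutes asleep for a given sedentary time
predict_sleep <- function(sedentary_minutes) {
  b_intercept + b_sedentary * sedentary_minutes
}

# Two illustrative days: 12 hours vs 11 hours of sedentary time
sleep_at_720 <- predict_sleep(720)
sleep_at_660 <- predict_sleep(660)

# Predicted difference in minutes asleep for 60 fewer sedentary minutes
diff_60 <- sleep_at_660 - sleep_at_720
print(round(diff_60, 1))
```

Under this model, a day with 60 fewer sedentary minutes is associated with roughly 26 more minutes of sleep.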

Phase 6: Act

Key insights

  • The average sedentary time of the participants is 16.5 hours.

  • The average daily step count is 7,638, well below the CDC’s recommended 10,000 steps.

    • 24.2% of the participants have a sedentary lifestyle (<5,000 steps per day).
    • 27.3% of the participants can be categorized as having low activity (5,000-7,500 steps per day).

    This means that 51.5% of the participants are taking fewer than 7,500 steps per day.

  • Participants were most active during 12pm to 2pm and from 5pm till 7pm.

  • The average sleep time of the participants is just below the CDC’s recommended 7+ hours.
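The step-count categories in the key insights can be derived with cut(). The sketch below assumes the 5,000 and 7,500 thresholds stated above. The 10,000 threshold, the two upper labels, and the four example values are assumptions added here for illustration.

```r
# Illustrative per-participant average daily steps (made-up values)
mean_steps <- c(3200, 6100, 8900, 11500)

# Bin into activity levels; right = FALSE puts exactly 5,000 in "low active"
activity_level <- cut(
  mean_steps,
  breaks = c(-Inf, 5000, 7500, 10000, Inf),
  labels = c("sedentary", "low active", "somewhat active", "active"),
  right = FALSE
)

# Share of participants in each category, in percent
share <- round(prop.table(table(activity_level)) * 100, 1)
print(share)
```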

Our analysis shows that there is a significant positive correlation between calories burned and total daily steps and distance covered. Therefore, the Bellabeat app should send notifications to help its customers increase their daily activity.

Moreover, our analysis shows that sedentary time and sleep time have a negative correlation. Therefore, the Bellabeat app should send notifications to help customers reduce sedentary time, which may help them sleep better and promote a healthier lifestyle.

Ideas

-Work Day Work Out Challenge: simple fast exercises that can be completed on a lunch break

-Notifications: send notifications every two hours that show progress toward the goal of 10,000 steps as a percentage, with ideas to help users attain their daily goal. Also send notifications to wind down before bedtime. The app could suggest new daily habits that help customers wind down, such as 15-20 minutes of meditation or reading.

-Friend Support: Customers can link up with friends and share their personal goals and achievements.

-Community Support: in-app community support can help keep users motivated, especially by sharing success stories.

-Preventative Health Care: reminders about age-appropriate check-ups (e.g. breast exams).

-Healthy Diet: (additional payment) the app can provide an easy-to-use food log to help customers track caloric intake vs daily caloric budget that is set based on weight goals.

Thank You for reading!

Please note: this is my first project using R. I appreciate any constructive feedback. Special thanks to the YouTube channel R Programming 101, which helped me understand linear models and regression analysis.