A. INTRODUCTION:

This is an analysis project for a growing company called Bellabeat that focuses on female wellness technology products. Their products include:

Bellabeat app: An app that provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.
Leaf: Bellabeat’s classic wellness tracker which can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.
Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.
Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.
Bellabeat membership: Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.

The stakeholders involved:

Urska Srsen, Co-Founder and Chief Creative Officer of Bellabeat.
Sando Mur, a Mathematician and Co-Founder of Bellabeat and key executive member.
The Bellabeat Marketing Analytics Team.

Summary Business Task:

To discover trends in smart device usage using a dataset from a similar, more established company, relate those trends to Bellabeat customers, and make strategic product marketing decisions based on the observed usage trends.

Description of Data Sources:

The data has been sourced from https://www.kaggle.com/datasets/arashnic/fitbit (CC0: Public Domain, dataset made available through Mobius). It is a Kaggle dataset that contains personal fitness tracker data from thirty Fitbit users who consented to the submission of their tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.

Based on the above information, the data is available for public use as a public domain dataset. It was recorded by Fitbit devices, a credible source, which supports its reliability, and it is a cited dataset, which supports its trustworthiness.

However, it may not be fully ROCCC-compliant, where ROCCC stands for reliable, original, comprehensive, current, and cited. There is no information about the demographics of the participants, such as age, sex, or the presence of any health conditions, and a lot of data is missing for some measured variables. The dataset is therefore incomplete and could contain some bias. In addition, the data is not very current: it was collected in 2016 and is being used in 2024, and many factors relating to the measured variables or the methods of measurement may have changed.

Therefore, judgements made from this dataset would have to be further verified using more current and comprehensive datasets.

B. Preparation of Data For Exploration:

1. Installing Packages:

install.packages('tidyverse')

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)

install.packages('lubridate')

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)

install.packages('dplyr')

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)

install.packages('ggplot2')

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)

install.packages('tidyr')

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)

2. Loading Packages

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate)
library(dplyr)
library(ggplot2)
library(tidyr)

3. Loading the CSV files:

daily_activity <- read.csv("/cloud/project/Bellabeat_Dataset/dailyActivity_merged1.csv")
daily_sleep <- read.csv("/cloud/project/Bellabeat_Dataset/sleepDay_merged.csv")
daily_calories <- read.csv("/cloud/project/Bellabeat_Dataset/dailyCalories_merged.csv")
daily_intensities <- read.csv("/cloud/project/Bellabeat_Dataset/dailyIntensities_merged.csv")
daily_steps <- read.csv("/cloud/project/Bellabeat_Dataset/dailySteps_merged.csv")

4. Data exploration using the head() and colnames() function.

head(daily_activity)

##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366    4/12/2016      13162          8.50            8.50
## 2 1503960366    4/13/2016      10735          6.97            6.97
## 3 1503960366    4/14/2016      10460          6.74            6.74
## 4 1503960366    4/15/2016       9762          6.28            6.28
## 5 1503960366    4/16/2016      12669          8.16            8.16
## 6 1503960366    4/17/2016       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728

colnames(daily_activity)

##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"

colnames(daily_sleep)

## [1] "Id"                 "SleepDay"           "TotalSleepRecords" 
## [4] "TotalMinutesAsleep" "TotalTimeInBed"

head(daily_sleep)

##           Id              SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304
##   TotalTimeInBed
## 1            346
## 2            407
## 3            442
## 4            367
## 5            712
## 6            320

head(daily_calories)

##           Id ActivityDay Calories
## 1 1503960366   4/12/2016     1985
## 2 1503960366   4/13/2016     1797
## 3 1503960366   4/14/2016     1776
## 4 1503960366   4/15/2016     1745
## 5 1503960366   4/16/2016     1863
## 6 1503960366   4/17/2016     1728

head(daily_intensities)

##           Id ActivityDay SedentaryMinutes LightlyActiveMinutes
## 1 1503960366   4/12/2016              728                  328
## 2 1503960366   4/13/2016              776                  217
## 3 1503960366   4/14/2016             1218                  181
## 4 1503960366   4/15/2016              726                  209
## 5 1503960366   4/16/2016              773                  221
## 6 1503960366   4/17/2016              539                  164
##   FairlyActiveMinutes VeryActiveMinutes SedentaryActiveDistance
## 1                  13                25                       0
## 2                  19                21                       0
## 3                  11                30                       0
## 4                  34                29                       0
## 5                  10                36                       0
## 6                  20                38                       0
##   LightActiveDistance ModeratelyActiveDistance VeryActiveDistance
## 1                6.06                     0.55               1.88
## 2                4.71                     0.69               1.57
## 3                3.91                     0.40               2.44
## 4                2.83                     1.26               2.14
## 5                5.04                     0.41               2.71
## 6                2.51                     0.78               3.19

head(daily_steps)

##           Id ActivityDay StepTotal
## 1 1503960366   4/12/2016     13162
## 2 1503960366   4/13/2016     10735
## 3 1503960366   4/14/2016     10460
## 4 1503960366   4/15/2016      9762
## 5 1503960366   4/16/2016     12669
## 6 1503960366   4/17/2016      9705

While exploring the data, I noticed that the date and time formats are not the same across the dataframes, although each dataframe has an Id column and a date column. I will therefore make the date/time formats consistent across the dataframes before merging some of them on these shared columns for further exploration.

C. Further Data Processing:

1. Converting Dates on each dataframe for consistency across dataframes.

daily_intensities

# ActivityDay holds a date only (e.g. "4/12/2016"); a format with a time part would parse to NA
daily_intensities$ActivityDay <- as.POSIXct(daily_intensities$ActivityDay, format = "%m/%d/%Y", tz = Sys.timezone())
daily_intensities$date <- format(daily_intensities$ActivityDay, format = "%m/%d/%y")

daily_calories

# ActivityDay holds a date only (e.g. "4/12/2016"); a format with a time part would parse to NA
daily_calories$ActivityDay <- as.POSIXct(daily_calories$ActivityDay, format = "%m/%d/%Y", tz = Sys.timezone())
daily_calories$date <- format(daily_calories$ActivityDay, format = "%m/%d/%y")

daily_activity

daily_activity$ActivityDate <- as.POSIXct(daily_activity$ActivityDate, format = "%m/%d/%Y", tz = Sys.timezone())
daily_activity$date <- format(daily_activity$ActivityDate, format = "%m/%d/%y")

daily_sleep

daily_sleep$SleepDay <- as.POSIXct(daily_sleep$SleepDay, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone())
daily_sleep$date <- format(daily_sleep$SleepDay, format = "%m/%d/%y")

daily_steps

# ActivityDay holds a date only (e.g. "4/12/2016"); a format with a time part would parse to NA
daily_steps$ActivityDay <- as.POSIXct(daily_steps$ActivityDay, format = "%m/%d/%Y", tz = Sys.timezone())
daily_steps$date <- format(daily_steps$ActivityDay, format = "%m/%d/%y")
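
As a minimal sketch of how both date layouts can be brought into one representation, the snippet below assumes only the two string layouts visible in the head() output ("4/12/2016" and "4/12/2016 12:00:00 AM"). The helper name parse_fitbit_date is hypothetical and not part of the original analysis. It relies on base R's as.Date(), which ignores any characters after the part matched by the format.

```r
# Hypothetical helper: parse Fitbit-style date strings, with or without a
# trailing time, into Date values.
parse_fitbit_date <- function(x) {
  # as.Date() reads only as much of the string as the format needs,
  # so a trailing " 12:00:00 AM" is ignored.
  as.Date(x, format = "%m/%d/%Y")
}

activity_dates <- parse_fitbit_date(c("4/12/2016", "4/13/2016"))
sleep_dates    <- parse_fitbit_date(c("4/12/2016 12:00:00 AM",
                                      "4/13/2016 12:00:00 AM"))

# Both sources now share one representation and can be used as a join key.
same_dates <- identical(activity_dates, sleep_dates)

# A format that expects a time component does not match a date-only string,
# so the result is NA rather than a timestamp.
mismatch <- as.POSIXct("4/12/2016", format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")
is.na(mismatch)
```

Checking for NA values right after a conversion, for example with sum(is.na(...)), is a cheap way to catch a format string that does not match the data.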

2. Some Summary Statistics of Each of the Dataframes:

n_distinct(daily_activity$Id)

## [1] 33

n_distinct(daily_intensities$Id)

## [1] 33

n_distinct(daily_calories$Id)

## [1] 33

n_distinct(daily_sleep$Id)

## [1] 24

n_distinct(daily_steps$Id)

## [1] 33

All the dataframes have 33 distinct participants (three more than the thirty stated in the dataset description) except the daily_sleep dataframe, which has only 24. Twenty-four participants is a small sample, so conclusions drawn from the daily_sleep dataframe may not be statistically reliable.
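
To see which participants have no sleep records at all, the two Id sets can be compared directly. The sketch below uses made-up Ids apart from 1503960366, so the vectors are illustrative assumptions rather than the real data. The last line computes the share of participants with sleep data from the counts reported above.

```r
# Illustrative Id vectors; only 1503960366 appears in the source output,
# the other two are invented for this example.
activity_ids <- c(1503960366, 1111111111, 2222222222)
sleep_ids    <- c(1503960366, 2222222222)

# Participants present in the activity data but absent from the sleep data
ids_without_sleep <- setdiff(activity_ids, sleep_ids)
n_without_sleep   <- length(ids_without_sleep)

# Share of participants with sleep data, using the counts reported above
sleep_coverage_pct <- round(100 * 24 / 33, 1)
```

With the real dataframes, the same call would be setdiff(unique(daily_activity$Id), unique(daily_sleep$Id)).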

3. Continuation of Summary Statistics:

Number of observations in each dataframe:

nrow(daily_activity)

## [1] 940

nrow(daily_intensities)

## [1] 940

nrow(daily_calories)

## [1] 940

nrow(daily_sleep)

## [1] 413

nrow(daily_steps)

## [1] 940

Again, each dataframe has 940 observations except the daily_sleep dataframe which has just 413 observations.
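
Row counts alone do not show whether a file contains exact repeated rows, and any such rows would be carried through a later merge. The sketch below shows one way to check for them, using a small invented dataframe (sleep_toy is a hypothetical name and its values are made up for illustration).

```r
# Toy sleep records with one exact duplicate (rows 2 and 3)
sleep_toy <- data.frame(
  Id = c(1, 1, 1, 2),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM",
               "4/13/2016 12:00:00 AM", "4/12/2016 12:00:00 AM"),
  TotalMinutesAsleep = c(327, 384, 384, 400)
)

# duplicated() flags every row that repeats an earlier row exactly
n_duplicates <- sum(duplicated(sleep_toy))

# unique() keeps the first occurrence of each distinct row
sleep_unique <- unique(sleep_toy)
nrow(sleep_unique)
```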

4. More Summary Stats: Mean, Median, Min, Max, Percentile stats.

#### a. For the daily activity dataframe:
daily_activity %>%  
  select(TotalSteps,
         TotalDistance,
         SedentaryMinutes) %>%
  summary()

##    TotalSteps    TotalDistance    SedentaryMinutes
##  Min.   :    0   Min.   : 0.000   Min.   :   0.0  
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 729.8  
##  Median : 7406   Median : 5.245   Median :1057.5  
##  Mean   : 7638   Mean   : 5.490   Mean   : 991.2  
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:1229.5  
##  Max.   :36019   Max.   :28.030   Max.   :1440.0

#### b. For the daily intensities data frame:
daily_intensities %>%
  select(SedentaryMinutes,
         LightlyActiveMinutes,
         FairlyActiveMinutes,
         VeryActiveMinutes) %>%
  summary()

##  SedentaryMinutes LightlyActiveMinutes FairlyActiveMinutes VeryActiveMinutes
##  Min.   :   0.0   Min.   :  0.0        Min.   :  0.00      Min.   :  0.00   
##  1st Qu.: 729.8   1st Qu.:127.0        1st Qu.:  0.00      1st Qu.:  0.00   
##  Median :1057.5   Median :199.0        Median :  6.00      Median :  4.00   
##  Mean   : 991.2   Mean   :192.8        Mean   : 13.56      Mean   : 21.16   
##  3rd Qu.:1229.5   3rd Qu.:264.0        3rd Qu.: 19.00      3rd Qu.: 32.00   
##  Max.   :1440.0   Max.   :518.0        Max.   :143.00      Max.   :210.00

#### c. For the daily calories dataframe:
daily_calories %>%
  select(Calories) %>%
  summary()

##     Calories   
##  Min.   :   0  
##  1st Qu.:1828  
##  Median :2134  
##  Mean   :2304  
##  3rd Qu.:2793  
##  Max.   :4900

#### d. For the daily steps dataframe:
daily_steps %>%
  select(StepTotal) %>%
  summary()

##    StepTotal    
##  Min.   :    0  
##  1st Qu.: 3790  
##  Median : 7406  
##  Mean   : 7638  
##  3rd Qu.:10727  
##  Max.   :36019

#### e. For the sleep dataframe:
daily_sleep %>%  
  select(TotalSleepRecords,
         TotalMinutesAsleep,
         TotalTimeInBed) %>%
  summary()

##  TotalSleepRecords TotalMinutesAsleep TotalTimeInBed 
##  Min.   :1.000     Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:1.000     1st Qu.:361.0      1st Qu.:403.0  
##  Median :1.000     Median :433.0      Median :463.0  
##  Mean   :1.119     Mean   :419.5      Mean   :458.6  
##  3rd Qu.:1.000     3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :3.000     Max.   :796.0      Max.   :961.0

5. Some Interesting Insights From the Summary Statistics:

The mean total steps across participants was 7,638 per day, slightly below the threshold of about 8,000 steps that, according to University of Granada-led research from 2023, helps maintain a healthy cardiovascular system and reduce all-cause mortality. Many of the participants would need to take more steps and, in doing so, reduce their sedentary time.
Mean sedentary time was a whopping 16.5 hours per day (991.2 minutes). According to a study by a team of researchers from the University of Mississippi and Pusan National University, less than 6 hours of daily sedentary time is required for healthy living and the prevention of obesity and heart disease.
On average, participants spent roughly 40 minutes more in bed than asleep (a mean of 458.6 minutes in bed against 419.5 minutes asleep), which serves here as a rough proxy for the lag between going to bed and actually falling asleep. This could be improved: according to SleepFoundation.org, an average healthy individual takes about 15 to 20 minutes to fall asleep after going to bed.
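
The figures quoted in these insights follow from the summary output with simple arithmetic. The sketch below reproduces them from the reported means; the variable names are chosen for this example only. Note that the difference between the two sleep means is total time awake in bed, so it is an upper bound on the time taken to fall asleep rather than a direct measurement of it.

```r
# Means copied from the summary() output above
mean_steps         <- 7638
mean_sedentary_min <- 991.2
mean_time_in_bed   <- 458.6
mean_min_asleep    <- 419.5

# Sedentary minutes converted to hours per day
sedentary_hours <- mean_sedentary_min / 60

# Average minutes spent in bed but not asleep
awake_in_bed_min <- mean_time_in_bed - mean_min_asleep

# Gap between the mean step count and the 8,000-step threshold
steps_shortfall <- 8000 - mean_steps

round(c(sedentary_hours, awake_in_bed_min, steps_shortfall), 1)
```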

Thus, I would like to find out how the active or sedentary lifestyle of the participants relates to their sleep patterns and to other factors such as calories.

This would likely help the company target its marketing at those who need better sleep quality, that is, a smaller gap between the time spent in bed and the time actually asleep. Notifications to increase movement throughout the day can help those who need to burn more calories to stay healthy or maintain a healthy weight.

D. Further Transformation of the Dataset for More Insights:

I will merge the daily_sleep and daily_activity dataframes to get better insights through visualizations, since they are the two most distinct dataframes: daily_activity contains most of what is in the others, but none of the data in daily_sleep. I will use a full outer join because daily_sleep has fewer rows than daily_activity, and an outer join retains all of the data from both dataframes.

1. Merging dataframes:

merged <- merge(daily_sleep, daily_activity, by = c("Id","date"), all = TRUE)
head(merged)

##           Id     date   SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 04/12/16 2016-04-12                 1                327
## 2 1503960366 04/13/16 2016-04-13                 2                384
## 3 1503960366 04/14/16       <NA>                NA                 NA
## 4 1503960366 04/15/16 2016-04-15                 1                412
## 5 1503960366 04/16/16 2016-04-16                 2                340
## 6 1503960366 04/17/16 2016-04-17                 1                700
##   TotalTimeInBed ActivityDate TotalSteps TotalDistance TrackerDistance
## 1            346   2016-04-12      13162          8.50            8.50
## 2            407   2016-04-13      10735          6.97            6.97
## 3             NA   2016-04-14      10460          6.74            6.74
## 4            442   2016-04-15       9762          6.28            6.28
## 5            367   2016-04-16      12669          8.16            8.16
## 6            712   2016-04-17       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728

To confirm that the merge went well, I will check how many participants are in the combined dataset.

2. Confirmation of number of participants:

n_distinct(merged$Id)

## [1] 33
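
Beyond the participant count, comparing row counts before and after the join shows how many rows came from only one side. The sketch below contrasts an outer join (all = TRUE) with an inner join on two small dataframes; sleep_toy and activity_toy are hypothetical names, and their values are taken from the first rows of the output above.

```r
sleep_toy <- data.frame(
  Id = c(1503960366, 1503960366),
  date = c("04/12/16", "04/13/16"),
  TotalMinutesAsleep = c(327, 384)
)
activity_toy <- data.frame(
  Id = c(1503960366, 1503960366, 1503960366),
  date = c("04/12/16", "04/13/16", "04/14/16"),
  TotalSteps = c(13162, 10735, 10460)
)

# Outer join: keeps every Id/date pair found in either dataframe
outer_join <- merge(sleep_toy, activity_toy, by = c("Id", "date"), all = TRUE)

# Inner join (the default): keeps only pairs found in both
inner_join <- merge(sleep_toy, activity_toy, by = c("Id", "date"))

# Rows that exist only on the activity side have NA in the sleep columns
n_missing_sleep <- sum(is.na(outer_join$TotalMinutesAsleep))
c(outer = nrow(outer_join), inner = nrow(inner_join), missing = n_missing_sleep)
```

The outer join keeps the 04/14/16 activity row with NA sleep values, which is the same pattern visible in row 3 of the merged output.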

Before plotting any visualizations, I noticed a few redundant columns that should be removed to make the dataframe cleaner. I remove them from the merged dataframe as follows:

3. Removal of redundant columns

final_merged <- merged %>% select(-c(TrackerDistance, LoggedActivitiesDistance, SleepDay, ActivityDate))
head(final_merged)

##           Id     date TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1 1503960366 04/12/16                 1                327            346
## 2 1503960366 04/13/16                 2                384            407
## 3 1503960366 04/14/16                NA                 NA             NA
## 4 1503960366 04/15/16                 1                412            442
## 5 1503960366 04/16/16                 2                340            367
## 6 1503960366 04/17/16                 1                700            712
##   TotalSteps TotalDistance VeryActiveDistance ModeratelyActiveDistance
## 1      13162          8.50               1.88                     0.55
## 2      10735          6.97               1.57                     0.69
## 3      10460          6.74               2.44                     0.40
## 4       9762          6.28               2.14                     1.26
## 5      12669          8.16               2.71                     0.41
## 6       9705          6.48               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728

I would like the total active minutes for each participant on each day so that I can use it for some comparisons:

4. Getting the sum of all active minutes per participant per day

merged_final <- final_merged %>%
    mutate(TotalActiveMinutes = rowSums(across(c(LightlyActiveMinutes,FairlyActiveMinutes,VeryActiveMinutes))))
head(merged_final)

##           Id     date TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1 1503960366 04/12/16                 1                327            346
## 2 1503960366 04/13/16                 2                384            407
## 3 1503960366 04/14/16                NA                 NA             NA
## 4 1503960366 04/15/16                 1                412            442
## 5 1503960366 04/16/16                 2                340            367
## 6 1503960366 04/17/16                 1                700            712
##   TotalSteps TotalDistance VeryActiveDistance ModeratelyActiveDistance
## 1      13162          8.50               1.88                     0.55
## 2      10735          6.97               1.57                     0.69
## 3      10460          6.74               2.44                     0.40
## 4       9762          6.28               2.14                     1.26
## 5      12669          8.16               2.71                     0.41
## 6       9705          6.48               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728
##   TotalActiveMinutes
## 1                366
## 2                257
## 3                222
## 4                272
## 5                267
## 6                222
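
As a worked check of the row-wise sum, the sketch below recomputes TotalActiveMinutes for the first two rows of the output using base R only, then averages it per participant. The dataframe name toy is an assumption for this example, and the per-participant mean is an optional extra step that the original analysis does not perform.

```r
# First two rows of the merged output above
toy <- data.frame(
  Id = c(1503960366, 1503960366),
  LightlyActiveMinutes = c(328, 217),
  FairlyActiveMinutes  = c(13, 19),
  VeryActiveMinutes    = c(25, 21)
)

# Row-wise sum of the three activity-minute columns (one value per day)
active_cols <- c("LightlyActiveMinutes", "FairlyActiveMinutes", "VeryActiveMinutes")
toy$TotalActiveMinutes <- rowSums(toy[, active_cols])

# Optional: average daily active minutes per participant
per_participant <- aggregate(TotalActiveMinutes ~ Id, data = toy, FUN = mean)
```

The two row sums, 366 and 257, match rows 1 and 2 of the TotalActiveMinutes column shown above.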

E. Analysis and Visualizations:

I would like to know whether the participants' sleep depended on the amount of time they spent active.

1. Total active minutes against total minutes asleep:

ggplot(data = merged_final, aes(x = TotalActiveMinutes, y = TotalMinutesAsleep)) +
  geom_point() + 
  geom_smooth() + 
  labs(title="Total Active Minutes vs. Total Minutes Asleep", caption = 'Data Source: FitBit Fitness Tracker Data')

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

## Warning: Removed 530 rows containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 530 rows containing missing values or values outside the scale range
## (`geom_point()`).

Insights: Being more active overall appeared to be more beneficial than being sedentary with respect to sleeping longer, although being lightly active seemed to be the most beneficial form of activity in this regard.
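
The warnings above arise because rows without sleep data have NA in TotalMinutesAsleep and cannot be drawn. One option, sketched below on a toy dataframe built from the first four merged rows, is to drop those rows explicitly before plotting so that the number of excluded rows is recorded rather than left to a warning. The names toy, has_sleep, and plot_data are assumptions for this example.

```r
# First four rows of the merged data; row 3 has no sleep record
toy <- data.frame(
  TotalActiveMinutes = c(366, 257, 222, 272),
  TotalMinutesAsleep = c(327, 384, NA, 412)
)

# Logical flag for rows that have a sleep measurement
has_sleep <- !is.na(toy$TotalMinutesAsleep)

# Keep only plottable rows and record how many were excluded
plot_data <- toy[has_sleep, ]
n_dropped <- sum(!has_sleep)
```

Passing plot_data to ggplot() instead of the full dataframe would produce the same points without the "Removed ... rows" warnings.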

Even though any activity is better than no activity at all, it is more beneficial to perform moderate- to high-intensity physical activity for at least 30 minutes daily for optimum heart health. Several researchers, including those from Johns Hopkins Medicine, also agree that moderate-intensity physical activity may be beneficial for sleep quality.

Hence, I move on to find out if these observations apply to the participants in this dataset and how that may be of advantage to making strategic marketing decisions on some Bellabeat products.

Is there any relationship between the minutes of moderate- to high-intensity activity and the length of time participants actually spend asleep?

I will begin by getting the sum of the more active minutes in a new column, thus:

2. Calculating the sum of fairly to very active minutes:

merged_finally <- final_merged %>%
  mutate(SumMoreActiveMinutes = rowSums(across(c(FairlyActiveMinutes, VeryActiveMinutes))))

head(merged_finally)

##           Id     date TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1 1503960366 04/12/16                 1                327            346
## 2 1503960366 04/13/16                 2                384            407
## 3 1503960366 04/14/16                NA                 NA             NA
## 4 1503960366 04/15/16                 1                412            442
## 5 1503960366 04/16/16                 2                340            367
## 6 1503960366 04/17/16                 1                700            712
##   TotalSteps TotalDistance VeryActiveDistance ModeratelyActiveDistance
## 1      13162          8.50               1.88                     0.55
## 2      10735          6.97               1.57                     0.69
## 3      10460          6.74               2.44                     0.40
## 4       9762          6.28               2.14                     1.26
## 5      12669          8.16               2.71                     0.41
## 6       9705          6.48               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728
##   SumMoreActiveMinutes
## 1                   38
## 2                   40
## 3                   41
## 4                   63
## 5                   46
## 6                   58

I will then be plotting the sum of more active minutes versus the total minutes asleep:

3. Plot of More Active Minutes vs. Total Minutes Asleep:

ggplot(data = merged_finally, aes(x = SumMoreActiveMinutes, y = TotalMinutesAsleep)) +
  geom_point()+
  geom_smooth() +
  labs(title="Sum of More Active Time vs. Total Minutes Asleep", 
       caption = 'Data Source: FitBit Fitness Tracker Data')

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

## Warning: Removed 530 rows containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 530 rows containing missing values or values outside the scale range
## (`geom_point()`).

Insights: It seems that spending more time active per day was slightly beneficial to spending more time asleep. However, those who did little to no activity slept almost as long, and this result may be biased given that the sleep dataset had many missing values.
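
A visual impression such as "slightly beneficial" can be backed by a correlation coefficient. The sketch below shows the call on the six merged rows printed earlier; six rows from a single participant say nothing about the full dataset, so this only illustrates the mechanics. The use = "complete.obs" argument drops the row with a missing sleep value.

```r
# Six rows copied from the head() output above (row 3 has no sleep record)
toy <- data.frame(
  SumMoreActiveMinutes = c(38, 40, 41, 63, 46, 58),
  TotalMinutesAsleep   = c(327, 384, NA, 412, 340, 700)
)

# Pearson correlation, ignoring rows where either value is missing
r_pearson <- cor(toy$SumMoreActiveMinutes, toy$TotalMinutesAsleep,
                 use = "complete.obs")

# Spearman rank correlation is less sensitive to outliers such as the 700
r_spearman <- cor(toy$SumMoreActiveMinutes, toy$TotalMinutesAsleep,
                  use = "complete.obs", method = "spearman")
```

A value near 0 would support the observation that inactive participants slept almost as long, while a clearly positive value would support a benefit of activity.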

Furthermore, I want to check whether sedentary time has an influence on the length of time asleep.

So I plot sedentary minutes with total minutes asleep.

4. Sedentary Minutes vs Total Time Asleep:

ggplot(data=merged_finally, aes(x=SedentaryMinutes, y = TotalMinutesAsleep)) + 
  geom_point(color='navyblue') + 
  geom_smooth() +
  labs(title="Sedentary Minutes vs Total Time Asleep",
       caption = 'Data Source: FitBit Fitness Tracker Data')

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

## Warning: Removed 530 rows containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 530 rows containing missing values or values outside the scale range
## (`geom_point()`).

Insights: A sedentary lifestyle was associated with shorter sleep: participants who spent more time sedentary tended to sleep less than 7 hours. The more sedentary the day, the shorter the time asleep.
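
The direction of this relationship can also be expressed as the slope of a simple linear fit. The sketch below fits lm() to the six merged rows printed earlier, purely to illustrate the call; the variable names fit and slope are assumptions, and a straight line is a simplification of the loess curve used in the plot. By default lm() omits rows with missing values.

```r
# Six rows copied from the head() output above (row 3 has no sleep record)
toy <- data.frame(
  SedentaryMinutes   = c(728, 776, 1218, 726, 773, 539),
  TotalMinutesAsleep = c(327, 384, NA, 412, 340, 700)
)

# Linear model: minutes asleep as a function of sedentary minutes
fit <- lm(TotalMinutesAsleep ~ SedentaryMinutes, data = toy)

# Change in minutes asleep per additional sedentary minute
slope <- coef(fit)[["SedentaryMinutes"]]

# Number of rows actually used after dropping the NA row
n_used <- nobs(fit)
```

A negative slope means that more sedentary minutes go with fewer minutes asleep in the fitted sample. It describes an association and does not by itself establish cause.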

Again, I want to know the relationship between the distance moved throughout the day and sleep length as well as its relationship with the amount of calories burnt per day. So I plot, first, total distance vs total sleep minutes.

5. Total distance vs total sleep minutes:

ggplot(data=merged_finally, aes(y=TotalMinutesAsleep, x=TotalDistance)) + 
  geom_point(color='darkslategray') + 
  geom_smooth() +
  labs(title="Minutes Asleep vs. Total Distance", 
       caption = 'Data Source: FitBit Fitness Tracker Data')

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

## Warning: Removed 530 rows containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 530 rows containing missing values or values outside the scale range
## (`geom_point()`).

Insights: Distance covered in the course of physical activity didn’t seem to have any significant effect on the length of sleep time for participants.

6. Total distance vs Calories: Does an increase in distance affect the amount of calories burned?

ggplot(data=merged_finally, aes(y=Calories, x=TotalDistance)) + 
  geom_point(color='darkblue') + 
  geom_smooth() +
  labs(title="Calories vs. Total Distance",
       caption = 'Data Source: FitBit Fitness Tracker Data')

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Insight: Unsurprisingly, participants who covered more distance burned more calories.
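A simple linear fit gives a rough sense of how many extra calories go with each additional unit of distance. This is a minimal sketch, not the analysis used for the plot: it fits `lm()` to the six rows shown by `head(merged_finally)` in this report, and the names `six_rows`, `fit` and `slope` are assumptions.

```r
# Six rows copied from the head(merged_finally) output in this report.
six_rows <- data.frame(
  TotalDistance = c(8.50, 6.97, 6.74, 6.28, 8.16, 6.48),
  Calories      = c(1985, 1797, 1776, 1745, 1863, 1728)
)

# Ordinary least squares: Calories as a linear function of TotalDistance.
fit <- lm(Calories ~ TotalDistance, data = six_rows)

# The slope is the estimated change in calories per extra unit of distance.
slope <- coef(fit)[["TotalDistance"]]
print(slope)
```

On the full dataset, `summary(fit)` would also report how much of the variation in calories the distance explains.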

Still exploring the data, I would like to know whether sleep quality (efficiency) affects the physical activity of the participants. This matters because, if the effect is meaningful, the overall health of participants with poor sleep quality could be at risk. Such customers could be targeted with reminders or notifications to go to bed earlier, so that they get more sleep, are more active during the day, and improve their overall health outcomes.

I will start by calculating sleep efficiency (or quality), which is the percentage of total time in bed that is actually spent asleep.

7. Sleep Efficiency Calculation:

merged_finally = merged_finally %>%
  mutate(SleepEfficiency = ((TotalMinutesAsleep/TotalTimeInBed)*100)) 
head(merged_finally)

##           Id     date TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1 1503960366 04/12/16                 1                327            346
## 2 1503960366 04/13/16                 2                384            407
## 3 1503960366 04/14/16                NA                 NA             NA
## 4 1503960366 04/15/16                 1                412            442
## 5 1503960366 04/16/16                 2                340            367
## 6 1503960366 04/17/16                 1                700            712
##   TotalSteps TotalDistance VeryActiveDistance ModeratelyActiveDistance
## 1      13162          8.50               1.88                     0.55
## 2      10735          6.97               1.57                     0.69
## 3      10460          6.74               2.44                     0.40
## 4       9762          6.28               2.14                     1.26
## 5      12669          8.16               2.71                     0.41
## 6       9705          6.48               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728
##   SumMoreActiveMinutes SleepEfficiency
## 1                   38        94.50867
## 2                   40        94.34889
## 3                   41              NA
## 4                   63        93.21267
## 5                   46        92.64305
## 6                   58        98.31461
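The calculation above can be checked by hand against the printed rows. The sketch below wraps the same formula in a hypothetical helper, `sleep_efficiency()` (not part of the original analysis), which also returns `NA` when time in bed is missing or zero so that no `NaN` or `Inf` values reach the plots.

```r
# Hypothetical helper: percentage of time in bed spent asleep.
# Returns NA when time in bed is missing or zero.
sleep_efficiency <- function(minutes_asleep, minutes_in_bed) {
  ifelse(is.na(minutes_in_bed) | minutes_in_bed <= 0,
         NA_real_,
         minutes_asleep / minutes_in_bed * 100)
}

# First three rows of the head() output above.
eff <- sleep_efficiency(c(327, 384, NA), c(346, 407, NA))
print(round(eff, 5))
```

The first two values reproduce the 94.50867 and 94.34889 shown in the output above (327/346 and 384/407, each multiplied by 100), and the third stays `NA` because that day has no sleep record.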

8. Plotting Sleep Efficiency vs Total Steps:

How does good-quality sleep relate to the number of steps taken per day?

ggplot(data=merged_finally, aes(x=TotalSteps, y=SleepEfficiency)) + 
  geom_line() + 
  geom_smooth() +
  labs(title="Sleep Efficiency vs. Total Steps",
       caption = 'Data Source: FitBit Fitness Tracker Data')

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

## Warning: Removed 530 rows containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 87 rows containing missing values or values outside the scale range
## (`geom_line()`).

Insights: In our plot, sleep efficiency appeared to improve from around 15000 steps upward. This suggests once again that higher activity is more beneficial to sleep than no activity at all. On average, participants took fewer than 8000 steps per day, roughly half the number of steps associated with better sleep in this case.

9. Plotting Sleep Efficiency/Quality by Total time of Activity

# Sum the lightly, fairly, and very active minutes for each participant-day:
merged_finally2 <- merged_finally %>%
    mutate(TotalActiveMinutes = rowSums(across(c(LightlyActiveMinutes,FairlyActiveMinutes,VeryActiveMinutes))), 
           SleepEfficiency = ((TotalMinutesAsleep/TotalTimeInBed)*100))
head(merged_finally2)

##           Id     date TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1 1503960366 04/12/16                 1                327            346
## 2 1503960366 04/13/16                 2                384            407
## 3 1503960366 04/14/16                NA                 NA             NA
## 4 1503960366 04/15/16                 1                412            442
## 5 1503960366 04/16/16                 2                340            367
## 6 1503960366 04/17/16                 1                700            712
##   TotalSteps TotalDistance VeryActiveDistance ModeratelyActiveDistance
## 1      13162          8.50               1.88                     0.55
## 2      10735          6.97               1.57                     0.69
## 3      10460          6.74               2.44                     0.40
## 4       9762          6.28               2.14                     1.26
## 5      12669          8.16               2.71                     0.41
## 6       9705          6.48               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728
##   SumMoreActiveMinutes SleepEfficiency TotalActiveMinutes
## 1                   38        94.50867                366
## 2                   40        94.34889                257
## 3                   41              NA                222
## 4                   63        93.21267                272
## 5                   46        92.64305                267
## 6                   58        98.31461                222

#Plot:
merged_finally2 %>% 
  select(SleepEfficiency, TotalActiveMinutes) %>% 
  # Explicit NA fallbacks keep ifelse() from failing if a value falls outside every band
  mutate(sleep_quality = ifelse(SleepEfficiency < 60, 'Poor sleep',
                                ifelse(SleepEfficiency < 80, 'Good sleep',
                                       ifelse(SleepEfficiency <= 100, 'Excellent sleep', NA)))) %>% 
  mutate(active_level = ifelse(TotalActiveMinutes >= 150,'High activity',
                               ifelse(TotalActiveMinutes >= 30,'Moderate activity',
                                      ifelse(TotalActiveMinutes >=1, 'Low activity',
                                             ifelse(TotalActiveMinutes >= 0, 'Sedentary', NA))))) %>% 
  select(-c(SleepEfficiency, TotalActiveMinutes)) %>% 
  drop_na() %>% 
  group_by(sleep_quality, active_level) %>% 
  summarise(counts = n()) %>% 
  mutate(active_level = factor(active_level, 
                               levels = c('Sedentary','Low activity',
                                          'Moderate activity',
                                          'High activity'))) %>% 
  mutate(sleep_quality = factor(sleep_quality, 
                                levels = c('Poor sleep','Good sleep',
                                           'Excellent sleep'))) %>% 
  ggplot(aes(x = sleep_quality, 
             y = counts, 
             fill = sleep_quality)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values=c("gold", "darkblue", "darkred")) +
  facet_wrap(~active_level, nrow = 1) +
  theme(legend.position = "none") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(strip.text = element_text(colour = 'black', size = 8)) +
  theme(strip.background = element_rect(fill = "antiquewhite1", color = 'black'))+
  labs(
    title = "Sleep quality by Level of Activity",
    x = "Sleep quality",
    y = "Count",
    caption = 'Data Source: FitBit Fitness Tracker Data')

## `summarise()` has grouped output by 'sleep_quality'. You can override using the
## `.groups` argument.

Insights: With the limitations of our dataset in mind, those who were more active experienced better sleep quality overall: the more active, the better the quality of sleep. In the plot, poor sleep represents sleep efficiency below 60%, good sleep represents 60% to just under 80%, and excellent sleep represents 80% to 100%.
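The three sleep-quality bands can also be assigned with base R's `cut()`, which keeps the thresholds in one place. This is an alternative sketch, not the code used for the plot above; the `efficiency` values are invented to land on each boundary.

```r
# Hypothetical efficiency values chosen to hit each boundary.
efficiency <- c(45, 59.9, 60, 79.9, 80, 98.3, 100, NA)

sleep_quality <- cut(
  efficiency,
  breaks = c(0, 60, 80, 100),
  labels = c("Poor sleep", "Good sleep", "Excellent sleep"),
  right = FALSE,          # bands are [0, 60), [60, 80), [80, 100]
  include.lowest = TRUE   # with right = FALSE this closes the top band at 100
)

# Count of values per band, with missing values shown separately.
print(table(sleep_quality, useNA = "ifany"))
```

Because `cut()` returns an ordered set of factor levels, the bands would already appear in the intended order on a plot axis without a separate `factor()` call.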

Will this be the same for length of time asleep? Let’s check it out:

10. A Plot of Length of Sleep with Intensity of Activity

merged_finally2 %>% 
  select(TotalMinutesAsleep, TotalActiveMinutes ) %>% 
  mutate(sleep_quality = ifelse(TotalMinutesAsleep <= 420, 'Poor Sleep',
                                ifelse(TotalMinutesAsleep <= 540, 'Optimal Sleep', 
                                       'Excess Sleep'))) %>% 
  mutate(active_level = ifelse(TotalActiveMinutes >= 150,'High activity',
                               ifelse(TotalActiveMinutes >=30 ,'Moderate activity',
                                             ifelse(TotalActiveMinutes >= 1,'Low activity', 'Sedentary')))) %>% 
  select(-c(TotalMinutesAsleep, TotalActiveMinutes)) %>% 
  drop_na() %>% 
  group_by(sleep_quality, active_level) %>% 
  summarise(counts = n()) %>% 
  mutate(active_level = factor(active_level, 
                               levels = c('Sedentary','Low activity',
                                          'Moderate activity',
                                          'High activity'))) %>% 
  mutate(sleep_quality = factor(sleep_quality, 
                                levels = c('Poor Sleep','Optimal Sleep',
                                           'Excess Sleep'))) %>% 
  ggplot(aes(x = sleep_quality, 
             y = counts, 
             fill = sleep_quality)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values=c("#99CCFF", "#336699", "#000066")) +
  facet_wrap(~active_level, nrow = 1) +
  theme(legend.position = "none") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(strip.text = element_text(colour = 'black', size = 8)) +
  theme(strip.background = element_rect(fill = "#CCFFFF", color = 'black'))+
  labs(
    title = "Length of Sleep by Level of Activity",
    x = "Length of Sleep",
    y = "Count",
    caption = 'Data Source: FitBit Fitness Tracker Data')

## `summarise()` has grouped output by 'sleep_quality'. You can override using the
## `.groups` argument.

Insights: Once again, more active time seemed to go with longer sleep. On some occasions, however, it did not benefit length of sleep. One possible explanation is that increased activity close to bedtime may make it harder to wind down and fall asleep once in bed. In the plot, poor sleep is a total sleep time of 7 hours or less, optimal sleep is more than 7 and up to 9 hours, and excess sleep is more than 9 hours.
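Raw counts can mislead when the activity groups differ in size, so it may help to compare shares within each activity level instead. The sketch below shows one way to do that with `prop.table()`; the `toy` labels are invented for illustration and are not results from the Fitbit data.

```r
# Invented labels for illustration only; not results from the Fitbit data.
toy <- data.frame(
  active_level = c("High activity", "High activity", "High activity",
                   "High activity", "Moderate activity", "Moderate activity"),
  sleep_length = c("Poor Sleep", "Optimal Sleep", "Optimal Sleep",
                   "Excess Sleep", "Poor Sleep", "Optimal Sleep")
)

# Cross-tabulate counts, then convert each row to shares.
counts <- table(toy$active_level, toy$sleep_length)
shares <- prop.table(counts, margin = 1)  # margin = 1: each activity level sums to 1

print(round(shares, 2))
```

Plotting these shares rather than the counts would make the panels directly comparable even if one activity level has many more records than another.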

F. Summary of Trends and Insights From the Analysis:

Mean total steps across participants was 7638, slightly below the minimum of about 8000 steps per day that University of Granada-led research (2023) reported for maintaining a healthy cardiovascular system and reducing all-cause mortality. Many of the participants would have to increase their total steps and hence reduce sedentary time. This analysis further suggested that taking more steps was associated with better-quality sleep, though the length of sleep was not strongly affected.
Most people spent a large part of their day sedentary, an average of 16.5 hours daily. There is no information about the age, gender or health status of the participants, all of which can affect an individual's overall activity level. However, based on the available data, most users of the tracker appear to have sedentary lifestyles.
On average, there was a lag of approximately 40 minutes between time in bed and actual sleep time. For comparison, an average healthy individual takes about 15 to 20 minutes to fall asleep after going to bed, according to SleepFoundation.org.
Most active time was spent on activities of light intensity. Little time was spent on moderate to very active physical activity.
Being more active overall was more beneficial than being sedentary for both longer and better-quality sleep. However, it may help to reduce activity close to bedtime in order to benefit more in terms of length of sleep.
As expected, covering more distance went with burning more calories, which could be beneficial especially to customers who are interested in maintaining a healthy weight.
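The lag between time in bed and time asleep mentioned in the summary is a simple difference of two columns. The sketch below works through that arithmetic on the six rows printed by `head(merged_finally)` earlier in this report; because those rows belong to a single participant, the result is not expected to match the dataset-wide average. The names `six_rows`, `lag_minutes` and `mean_lag` are assumptions.

```r
# Six rows copied from the head(merged_finally) output in this report.
six_rows <- data.frame(
  TotalMinutesAsleep = c(327, 384, NA, 412, 340, 700),
  TotalTimeInBed     = c(346, 407, NA, 442, 367, 712)
)

# Minutes spent in bed but not asleep, per day.
lag_minutes <- six_rows$TotalTimeInBed - six_rows$TotalMinutesAsleep

# Average lag, skipping the day with no sleep record.
mean_lag <- mean(lag_minutes, na.rm = TRUE)
print(mean_lag)
```

Applying the same two lines to the full merged table, and then averaging per participant with `tapply()`, would show whether the long lag is typical or driven by a few users.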

G. Recommendations based on Observations:

Overweight or obese individuals are advised to spend fewer than 6 of their waking hours sedentary (Carnevale et al., 2023). Overweight or obese customers could be encouraged to set healthy weight goals and achieve them by taking a targeted number of steps daily, tracked with the Bellabeat app on any of their smart devices.
Because of the high level of sedentary behaviour among users, the Bellabeat app could regularly nudge customers to get up and take a few steps at intervals to increase daily activity. More generally, customers could be encouraged to take more steps, raising their total to at least 8000 steps daily, which studies have shown to be the optimal number of daily steps for healthy living. Any activity is better than no activity at all.
The WHO and many other health bodies around the world agree that a minimum of 150 minutes of moderate-to-high intensity activity per week, or 30 minutes of such activity on 5 days a week, is beneficial for heart health. Maintaining this level of weekly activity is widely agreed to reduce the risk of all-cause and certain disease-specific mortality.

Thus, customers can be encouraged to use the Bellabeat app to intentionally schedule and perform suggested moderate-to-high intensity activities for at least 30 minutes daily, supported by daily reminder notifications.
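The 150-minute weekly target can be checked per participant by summing the daily moderate-to-high intensity minutes over a week. This is a hedged sketch only: the `toy` data frame, its two participants "A" and "B", and their values are invented, and it assumes that `SumMoreActiveMinutes` holds the daily fairly active plus very active minutes.

```r
# Invented one-week records for two hypothetical participants.
toy <- data.frame(
  Id = rep(c("A", "B"), each = 7),
  SumMoreActiveMinutes = c(38, 40, 41, 63, 46, 58, 20,  # participant A
                           0, 10, 5, 0, 15, 0, 10)      # participant B
)

# Weekly total of moderate-to-high intensity minutes per participant.
weekly_minutes <- tapply(toy$SumMoreActiveMinutes, toy$Id, sum)

# TRUE where the weekly total reaches the 150-minute target.
meets_target <- weekly_minutes >= 150

print(weekly_minutes)
print(meets_target)
```

A flag like `meets_target` is the kind of signal the app could use to decide which customers receive an activity reminder.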

To encourage better sleep, in addition to encouraging customers to increase activity throughout the day, the app could send reminders to start winding down towards bedtime, so that customers benefit optimally from the increased daytime activity.

H. References:

Carnevale, V., Macciocchi, D., & Sessa, M. (2023). Daily sedentary time of less than six hours is beneficial for the prevention of obesity in US adults. SEMS Journal. https://doi.org/10.34045/SEMS/2023/19
World Health Organization. (2020). WHO guidelines on physical activity and sedentary behaviour. World Health Organization. https://www.who.int/publications/i/item/9789240015128
University of Granada. (2023, October 26). Scientists show for the first time how many steps to take each day to reduce the risk of premature death: 8,000. ScienceDaily. https://www.sciencedaily.com/releases/2023/10/231026131551.htm
Rausch-Phung, E., & Rehman, A. (2023, December 19). How long should it take to fall asleep? SleepFoundation.org. https://www.sleepfoundation.org/sleep-faqs/how-long-should-it-take-to-fall-asleep

Johns Hopkins Medicine. (n.d.). Exercising for better sleep. HopkinsMedicine.org. Retrieved July 11, 2024, from https://www.hopkinsmedicine.org/health/wellness-and-prevention/exercising-for-better-sleep

Bellabeat Project in R

Chidimma R. Nnadozie

2024-07-11