1 Introduction

Bellabeat is a high-tech company that manufactures health-focused smart products. The company collects data on activity, sleep, stress, and reproductive health to empower women with knowledge about their own health and habits. It is believed that analyzing smart device fitness data could help unlock new growth opportunities for the company.

Bussines taks: This project aims to analyze data from non-Bellabeat smart devices, such as smart watches and trackers, to understand consumer usage patterns. By identifying these trends, we can enhance Bellabeat’s marketing strategies and product offerings to better support women’s health and wellness.

1.1 Data source

To address the aforementioned task, we can analyze the open-source datasets offered by Mobius that are available on Kaggle. The data was obtained with the help of eligible users who used a FitBit Fitness Tracker and consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. We possess 29 CSV files collected from 33 FitBit users through a survey. The data encompasses a time window between 03/12/2016 and 05/12/2016.

2 Preparing data

To conduct this analysis, we will focus on datasets that provide key information such as calories burned, levels of physical activity, and sleep patterns. By examining these variables, we can gain insights into individuals’ physical activity behaviors. The specific datasets we will utilize include:

i) dailyActivity_merged: Contains data on daily steps, active minutes, and total distance traveled.

ii) dailySteps_merged: Provides information on the number of steps taken per day.

iii) dailyCalories_merged: Records the total calories burned per day.

iv) WeightLogInfo_merged: Includes weight data reported by users of FitBit devices.

These datasets will enable us to analyze and understand users’ physical and sleep activity patterns effectively.

2.1 Load packges in R

To start working with data structures in R it is necessary to install and load the following packages:

To install the packages we excecute install.packages(‘package-name’) and load the as library(package-name).

2.2 Importing datasets

We pull our datasets from the data source:

daily_activity <- read.csv('dailyActivity_merged.csv')
daily_steps <- read.csv('dailySteps_merged.csv')
daily_sleep <- read.csv('sleepDay_merged.csv')
weight_evolution <- read.csv('weightLogInfo_merged.csv')

2.3 Dataset description

To obtain an optimal description of our data, we use the skim_without_charts function. This function offers a comprehensive overview of datasets, including key statistical information such as minimum, maximum, standard deviation, and mean. It also displays the count of missing values and the data types of variables within the dataset. This functionality is particularly useful for obtaining a statistical summary of our data.

2.3.1 Daily activity

The dataset comprises 940 observations (rows) and 15 variables (columns), with 31 unique entries. Notably, the column names exhibit inconsistencies, including variations in letter casing. Additionally, the ‘activity date’ column is currently formatted as a string; it should be converted to a date data type for accurate analysis.

head(daily_activity)
skimr::skim_without_charts(daily_activity)
Data summary
Name daily_activity
Number of rows 940
Number of columns 15
_______________________
Column type frequency:
character 1
numeric 14
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ActivityDate 0 1 8 9 0 31 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
Id 0 1 4.855407e+09 2.424805e+09 1503960366 2.320127e+09 4.445115e+09 6.962181e+09 8.877689e+09
TotalSteps 0 1 7.637910e+03 5.087150e+03 0 3.789750e+03 7.405500e+03 1.072700e+04 3.601900e+04
TotalDistance 0 1 5.490000e+00 3.920000e+00 0 2.620000e+00 5.240000e+00 7.710000e+00 2.803000e+01
TrackerDistance 0 1 5.480000e+00 3.910000e+00 0 2.620000e+00 5.240000e+00 7.710000e+00 2.803000e+01
LoggedActivitiesDistance 0 1 1.100000e-01 6.200000e-01 0 0.000000e+00 0.000000e+00 0.000000e+00 4.940000e+00
VeryActiveDistance 0 1 1.500000e+00 2.660000e+00 0 0.000000e+00 2.100000e-01 2.050000e+00 2.192000e+01
ModeratelyActiveDistance 0 1 5.700000e-01 8.800000e-01 0 0.000000e+00 2.400000e-01 8.000000e-01 6.480000e+00
LightActiveDistance 0 1 3.340000e+00 2.040000e+00 0 1.950000e+00 3.360000e+00 4.780000e+00 1.071000e+01
SedentaryActiveDistance 0 1 0.000000e+00 1.000000e-02 0 0.000000e+00 0.000000e+00 0.000000e+00 1.100000e-01
VeryActiveMinutes 0 1 2.116000e+01 3.284000e+01 0 0.000000e+00 4.000000e+00 3.200000e+01 2.100000e+02
FairlyActiveMinutes 0 1 1.356000e+01 1.999000e+01 0 0.000000e+00 6.000000e+00 1.900000e+01 1.430000e+02
LightlyActiveMinutes 0 1 1.928100e+02 1.091700e+02 0 1.270000e+02 1.990000e+02 2.640000e+02 5.180000e+02
SedentaryMinutes 0 1 9.912100e+02 3.012700e+02 0 7.297500e+02 1.057500e+03 1.229500e+03 1.440000e+03
Calories 0 1 2.303610e+03 7.181700e+02 0 1.828500e+03 2.134000e+03 2.793250e+03 4.900000e+03

2.3.2 Daily sleep behavior

This dataset contains 413 entries and 31 observations. We will need to merge this data later on with the daily_data dataset by the ‘id’ column for easier manipulation. Furthermore, the sleep_day column should be transformed into a date type for proper analysis and processing.

head(daily_sleep)
skimr::skim_without_charts(daily_sleep)
Data summary
Name daily_sleep
Number of rows 413
Number of columns 5
_______________________
Column type frequency:
character 1
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
SleepDay 0 1 20 21 0 31 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
Id 0 1 5.000979e+09 2.06036e+09 1503960366 3977333714 4702921684 6962181067 8792009665
TotalSleepRecords 0 1 1.120000e+00 3.50000e-01 1 1 1 1 3
TotalMinutesAsleep 0 1 4.194700e+02 1.18340e+02 58 361 433 490 796
TotalTimeInBed 0 1 4.586400e+02 1.27100e+02 61 403 463 526 961

2.3.3 Daily steps

This field has 940 rows and three columns and 31 unique values. Column names should be cleaned and data type of ActivityDay should be formatted as date data-type.

head(daily_steps)
skimr::skim_without_charts(daily_steps)
Data summary
Name daily_steps
Number of rows 940
Number of columns 3
_______________________
Column type frequency:
character 1
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ActivityDay 0 1 8 9 0 31 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
Id 0 1 4.855407e+09 2.424805e+09 1503960366 2.320127e+09 4445114986.0 6962181067 8877689391
StepTotal 0 1 7.637910e+03 5.087150e+03 0 3.789750e+03 7405.5 10727 36019

2.3.4 Weight evolution

Compared to the other datasets, this one contains relatively few entries —67 rows with 56 unique values. Despite the smaller size, we will analyze the data to identify any potential trends related to weight evolution.

head(weight_evolution)
skimr::skim_without_charts(weight_evolution)
Data summary
Name weight_evolution
Number of rows 67
Number of columns 8
_______________________
Column type frequency:
character 2
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Date 0 1 19 21 0 56 0
IsManualReport 0 1 4 5 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
Id 0 1.00 7.009282e+09 1.950322e+09 1.503960e+09 6.962181e+09 6.962181e+09 8.877689e+09 8.877689e+09
WeightKg 0 1.00 7.204000e+01 1.392000e+01 5.260000e+01 6.140000e+01 6.250000e+01 8.505000e+01 1.335000e+02
WeightPounds 0 1.00 1.588100e+02 3.070000e+01 1.159600e+02 1.353600e+02 1.377900e+02 1.875000e+02 2.943200e+02
Fat 65 0.03 2.350000e+01 2.120000e+00 2.200000e+01 2.275000e+01 2.350000e+01 2.425000e+01 2.500000e+01
BMI 0 1.00 2.519000e+01 3.070000e+00 2.145000e+01 2.396000e+01 2.439000e+01 2.556000e+01 4.754000e+01
LogId 0 1.00 1.461772e+12 7.829948e+08 1.460444e+12 1.461079e+12 1.461802e+12 1.462375e+12 1.463098e+12

3 Processing our data

In this part of the project we will focus on structuring, formatting, checking for duplicates, and cleaning the data.

3.1 Structure

We will start our cleaning process by giving a consistent name to our columns:

daily_activity <- janitor::clean_names(daily_activity)
daily_sleep <- janitor::clean_names(daily_sleep)
daily_steps <- janitor::clean_names(daily_steps)
weight_evolution <- janitor::clean_names(weight_evolution)

3.2 Formatting dates

Our skim_without_charts and head tables show that the data type of date column of our data sets is a string. Therefore we have to format it as date type.

daily_activity <- daily_activity %>% dplyr::mutate(activity_date = lubridate::mdy(daily_activity$activity_date))

daily_sleep <- daily_sleep %>% dplyr::mutate(sleep_day = lubridate::as_date(daily_sleep$sleep_day, format ="%m/%d/%Y %I:%M:%S %p" , tz = NULL))

daily_steps <- daily_steps %>% dplyr::mutate(activity_day = lubridate::mdy(daily_steps$activity_day))

weight_evolution<- weight_evolution %>% dplyr::mutate(date = lubridate::as_date(weight_evolution$date,format ="%m/%d/%Y %I:%M:%S %p" , tz = NULL))

3.3 Rename date-type columns

It is essential that we designate all variables containing date-type data as ‘date’ to keep naming consistency across the dataset.

daily_activity <- daily_activity %>% rename(date = activity_date)
daily_steps <- rename(daily_steps, date = activity_day)
daily_sleep <- rename(daily_sleep, date = sleep_day)

3.4 Null values

After executing the code below, we observe that the weight_evolution dataset contains 65 missing (NA) values in the ‘fat’ column. We will exclude this variable from our analysis. Additionally, due to the limited number of records, the weight_evolution dataset will only be analyzed for correlation purposes.

weight_evolution <- weight_evolution %>% select(-fat)
sum(is.na(daily_activity))
## [1] 0
sum(is.na(daily_sleep))
## [1] 0
sum(is.na(daily_steps))
## [1] 0
sum(is.na(weight_evolution))
## [1] 0

3.5 Find duplicates

To identify and remove duplicate observations in our datasets, we can use the duplicated() function. This function returns a logical vector indicating which rows are duplicates. By applying it to our dataset, we can create a table that shows the count of duplicate and unique values. After running the code we find out that daily_sleep has 3 repeats that should be removed.

daily_activity_repeats<- print(table(duplicated(daily_activity)))
## 
## FALSE 
##   940
daily_sleep_repeats <- print(table(duplicated(daily_sleep)))
## 
## FALSE  TRUE 
##   410     3
daily_steps_repeats <- print(table(duplicated(daily_steps)))
## 
## FALSE 
##   940
weight_evolution_repeats <-print(table(duplicated(weight_evolution)))
## 
## FALSE 
##    67

3.6 Eliminate duplicates

To remove the existing duplicates, we can use the !duplicated() function, which returns a logical vector indicating which rows are not duplicates. The number of rows is then equal to 410, which is the number of the unique values, indicating that repeats have been removed.

daily_sleep <- daily_sleep[!duplicated(daily_sleep), ]
print(nrow(daily_sleep))
## [1] 410

3.7 Merging datasets

To perform calculations using data from different datasets, it’s essential to merge them. In R, this can be accomplished using the merge() function as follows:

daily_activity_all <- merge(daily_activity, daily_sleep, by = c('id', 'date'))

4 Analysis of datasets

We will begin our data analysis by calculating metrics such as speed and pace. These measurements help us examine the movement patterns of the participants.

We check our cleaned datasets.

head(daily_activity)
head(daily_steps)
head(daily_sleep)
head(weight_evolution)
head(daily_activity_all)

4.1 Relative speed

Relative speed is categorized as high, moderate, or light, based on the distances(km) and durations (in hours) spent in very active, moderately active, and lightly active states. From this, we derive three speed levels —high, moderate, and light. The average of these speed levels is then calculated and used as a reference value to qualitatively classify users’ movement patterns.

  • High speed:
daily_activity_all<- mutate(daily_activity_all, high_speed = very_active_distance/(very_active_minutes/60))
  • Moderate speed:
daily_activity_all<- mutate(daily_activity_all, moderate_speed = moderately_active_distance/(fairly_active_minutes/60))
  • Light speed:
daily_activity_all <- daily_activity_all %>% mutate(low_speed = light_active_distance/(lightly_active_minutes/60))
  • Average speed:
daily_activity_all<- daily_activity_all %>% mutate(average_speed = (high_speed + moderate_speed + low_speed)/3)
speeds <- daily_activity_all %>% 
na.omit() %>% 
select(id, date, high_speed, moderate_speed, low_speed, average_speed)

head(speeds)

Now we can calculate the average speed of every speed level:

daily_average_sp <-daily_activity_all %>% 
  na.omit() %>% 
  group_by(id) %>% 
summarise(mean_high_speed = mean(high_speed), mean_high_speed = mean(high_speed), mean_moderate_speed = mean(moderate_speed), mean_low_speed = mean(low_speed))
head(daily_average_sp)
# We calculate de mean of every speed level
summary(daily_average_sp)
##        id            mean_high_speed mean_moderate_speed mean_low_speed  
##  Min.   :1.504e+09   Min.   :1.895   Min.   :0.9267      Min.   :0.7000  
##  1st Qu.:4.010e+09   1st Qu.:3.604   1st Qu.:2.3422      1st Qu.:0.9769  
##  Median :4.631e+09   Median :4.186   Median :2.6216      Median :1.0132  
##  Mean   :5.062e+09   Mean   :3.962   Mean   :2.4585      Mean   :1.0434  
##  3rd Qu.:6.822e+09   3rd Qu.:4.443   3rd Qu.:2.7649      3rd Qu.:1.1630  
##  Max.   :8.792e+09   Max.   :6.215   Max.   :3.1559      Max.   :1.5323

4.1.1 Analysis of relative speed

To define a categorical parameter for classifying user movement speed, we calculate the average speed for each category and assign levels such as fast, medium, slow, and very slow. This approach allows us to establish velocity ranges, enabling a qualitative visualization of the data.

Average speed classification:

  • high_speed = 3.96 km/h
  • moderate_speed = 2.46 km
  • low_speed = 1.04 km/h

These values are used to create speed intervals (levels):

  • fast: speed ≥ \(3.96 km/h\)
  • medium: \(2.46 km/h\) ≤ speed < \(3.96 km/h\)
  • slow: \(1.04 km/h\) ≤ speed < \(2.46 km/h\)
  • very slow: speed < \(1.043 km/h\)

In this way, we can qualitatively classify each user’s average speed as follows:

speed_level <- daily_activity_all %>% 
  tidyr::drop_na() %>%
  group_by(id) %>%
  mutate(speed_level = case_when(
    average_speed >=3.96 ~ "fast", 
    average_speed <3.96 & average_speed >= 2.46 ~ "medium",
    average_speed < 2.46 & average_speed >= 1.04 ~ "slow",
    average_speed <1.043 ~ "very slow")) %>% 
  select(id,date, speed_level)
head(speed_level)

We plot the the results to determine the speed distribution of participants:

library(ggplot2)
ggplot2::ggplot(data = speed_level ) +  
ggplot2::geom_bar(mapping = aes(x = date, fill = speed_level)) + 
  labs(title = 'Overall speed level distribution', x = 'Date', y = 'Frequency')+
  scale_x_date(date_labels = "%b,%d", breaks = 'week', date_minor_breaks = "1 day")

It is also useful to visualize the speed distribution of users based on the day of the week.

speed_level_weekday <-speed_level %>% 
  group_by(id) %>% 
  mutate(weekday = weekdays(date))

speed_level_weekday$weekday <- factor(speed_level_weekday$weekday, c("Monday", "Tuesday", "Wednesday", "Thursday","Friday","Saturday", "Sunday"))

ggplot(data = speed_level_weekday) + 
geom_bar(mapping = aes(x = weekday, fill = speed_level)) +
labs(title = 'Speed intensity across the weekdays', x = 'Weekday', y = 'Count')

Here, the speed categories are expressed as percentages, and a Pareto chart is used to enhance the visualization of these results.

percentage_sp <- speed_level %>% 
  group_by(speed_level) %>% 
  summarise(counts = n()) %>% 
  mutate(total = sum(counts)) %>% 
  group_by(speed_level) %>% 
  summarise(percent = counts/total*100, counts)
head(percentage_sp)
ggplot2::ggplot(percentage_sp, aes(x = speed_level, y = counts)) +
 ggQC::stat_pareto(point.color = "red",
                   point.size = 3,
                   line.color = "black",
                   bars.fill = c("orange", "green")) +
  labs(title = 'Pareto chart of speed categories ', x = 'Speed level', y = 'Count') + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5, size=10))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Based on the plots above, we can conclude the following:

  1. Approximately one-third of users moved at speeds between \(1.04 km/h\) and \(2.46 km/h\), while around two-thirds moved between \(2.46 km/h\) and \(3.96 km/h\).

  2. The majority of users fell into the fast and medium speed categories.

  3. Tuesday had the highest activity level among participants, whereas Sunday had the lowest.

  4. The fastest recorded speed among participants occurred on Monday.

4.2 Relative pace

Pace quantifies the time required to travel a unit distance and is defined as: \(pace=time (min)/distance (km)\). Relative pace is categorized into high, moderate, and light. High pace is derived using “very active time” and “very active distance” data, while moderate and light pace are calculated similarly. Additionally, the average pace is determined to serve as a reference value for the qualitative classification of users’ pace.

  • High pace:
daily_activity_all<- daily_activity_all %>% mutate(high_pace = very_active_minutes/very_active_distance)
  • Moderate pace:
daily_activity_all<- daily_activity_all %>% mutate(moderate_pace = fairly_active_minutes/moderately_active_distance)
  • Low pace:
daily_activity_all<- daily_activity_all %>% mutate(low_pace = lightly_active_minutes/light_active_distance)
  • Average pace:
daily_activity_all<- daily_activity_all %>% mutate(average_pace = (high_pace + moderate_pace + low_pace)/3)
pace <- daily_activity_all %>% 
na.omit() %>% 
select(id, date, high_pace, moderate_pace, low_pace, average_pace)

head(pace)

We calculate the average pace for each category (high, moderate, and light). These values are then used to establish pace intervals for further analysis.

daily_avg_pace <-daily_activity_all %>% 
  na.omit() %>% 
  group_by(id) %>% 
summarise(mean_high_pace = mean(high_pace), mean_moderate_pace = mean(moderate_pace), mean_low_pace = mean(low_pace))
head(daily_avg_pace)
# We calculate de mean of every pace level
summary(daily_avg_pace)
##        id            mean_high_pace   mean_moderate_pace mean_low_pace  
##  Min.   :1.504e+09   Min.   : 9.662   Min.   :19.54      Min.   :40.01  
##  1st Qu.:4.010e+09   1st Qu.:13.600   1st Qu.:21.77      1st Qu.:52.66  
##  Median :4.631e+09   Median :15.614   Median :23.66      Median :60.27  
##  Mean   :5.062e+09   Mean   :19.424   Mean   :29.16      Mean   :61.14  
##  3rd Qu.:6.822e+09   3rd Qu.:18.903   3rd Qu.:30.07      3rd Qu.:63.04  
##  Max.   :8.792e+09   Max.   :51.254   Max.   :69.81      Max.   :85.88

4.2.1 Analysis of relative pace

For better visualization of our pace data, we define levels of pace (slow, medium, high, and very high). To create such levels we use the mean of the three different pace categories:

Average pace classification :

  • high_pace = \(19.42 min/km\)
  • moderate_pace = \(29.16 min/km\)
  • low_pace = \(61.14 min/km\)

The pace intervals were defined as follows:

  • slow: > \(61.42 min/km\)
  • medium: \(29.16 min/km\) < pace ≤ \(61.14 min/km\)
  • high: \(19.42 min/km\)< pace ≤ \(29.16 min/km\)
  • very high: ≤ \(19.42 min/km\):
pace_level <- daily_activity_all %>% 
  tidyr::drop_na() %>% 
  group_by(id) %>% 
  mutate(pace_degree = case_when(
     average_pace > 61.14 ~ "slow",
     average_pace > 29.16 & average_pace <= 61.14 ~ "medium",
     average_pace > 19.42 & average_pace <= 29.16 ~ "high",
     average_pace <= 19.42 ~ "very high")) %>% 
  select(id,date,average_pace, pace_degree)
head(pace_level)

We plot the results to determine the pace distribution of the participants per day.

library(ggplot2)
ggplot2::ggplot(data = pace_level ) +  
ggplot2::geom_bar(mapping = aes(x = date, fill = pace_degree)) + 
  labs(title = 'Overall pace distribution', x = 'Date', y = 'Count')+
  scale_x_date(date_labels = "%b,%d", breaks = 'week', date_minor_breaks = "1 day")

The pace level distribution can also be visualized for each day of the week:

pace_level_weekday <-pace_level %>%
  group_by(id) %>% 
  mutate(weekday = weekdays(date))

pace_level_weekday$weekday <- factor(speed_level_weekday$weekday, levels = c("Monday", "Tuesday", "Wednesday", "Thursday","Friday","Saturday", "Sunday"))

ggplot(data = pace_level_weekday) + 
geom_bar(mapping = aes(x = weekday, fill = pace_degree)) +
labs(title = 'Pace intensity across the Weekdays', x = 'Weekday', y = 'Count')

Each pace level can be expressed as a percentage and visualized using a Pareto chart; thus, it is straightforward to observe the proportional relationships among the different pace levels.

percentage_pace <- pace_level %>% 
  group_by(pace_degree)%>% 
  summarise(counts = n()) %>%
  mutate(total = sum(counts)) %>%
  group_by(pace_degree) %>%
  summarise(total_percent = round(counts / total*100, digits = 2), counts) 
head(percentage_pace)
ggplot(percentage_pace, aes(x = pace_degree, y = counts)) +
ggQC::stat_pareto(point.color = "red",
                   point.size = 3,
                   line.color = "black",
                   bars.fill = c("orange", "green")) +
  labs(title = 'Pareto chart of pace level', x = 'Pace level', y = 'Count')+ 
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5, size=10))

Based on the pace data analysis, we can state the following:

  1. The majority of users moved at a fast or medium pace. Fast pace accounts for 33% of participants, while medium pace is the most prevalent rendering 65%.

  2. Tuesday and Wednesday are the most active days, whereas Sunday and Monday have the lowest activity levels.

4.3 Analysis of daily steps

To analyze step data, we classify daily step records into levels based on the Graduated Step Index published in the Applied Physiology, Nutrition, and Metabolism journal Tudor-Locke et al. (2013):

  • < 5000 sedentary

  • 5000 - 7499 low active

  • 7500 - 9999 somewhat active

  • 10000 - 12499 active

  • ≥ 12500 highly active.

For the steps-per-day classification, we will use the daily average step count of each user.

steps_day_av <- daily_activity_all %>% 
  group_by(id) %>% 
  summarise(steps_per_day = mean(total_steps)) %>% 
  mutate(steps_level = case_when(steps_per_day < 5000 ~ 'sedentary',
                                 steps_per_day >= 5000 & steps_per_day <= 7499 ~ 'low active',
                                 steps_per_day >= 7500 & steps_per_day <= 9999 ~ 'somewhat active',
                                 steps_per_day >= 10000 & steps_per_day <= 12499 ~ 'active',
                                 steps_per_day >= 12500 ~ 'highly active'))
head(steps_day_av)

The step levels can be represented as percentages for better visualization.

percentage_steps <- steps_day_av %>% 
  group_by(steps_level) %>% 
  summarise(counts = n()) %>% 
  mutate(total = sum(counts)) %>% 
  group_by(steps_level) %>% 
  summarise(total_percent = round(counts/total*100, digits = 2), counts)
head(percentage_steps)

We can visualize the percentages and frequencies by creating a Pareto chart.

ggplot2::ggplot(percentage_steps, aes(x = steps_level, y = counts)) +
 ggQC::stat_pareto(point.color = "red",
                   point.size = 3,
                   line.color = "black",
                   bars.fill = c("orange", "green")) +
  labs(title = 'Pareto chart of steps activity', x = 'Pace activity', y = 'Count') + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust=1, size=10))

We generate another data frame to analyze weekly step records and step activity trends.

steps_week <- daily_activity_all %>% 
  mutate(weekday = weekdays(date)) %>% 
  mutate(steps_level_week = case_when(total_steps < 5000 ~ 'sedentary',
                                 total_steps>= 5000 & total_steps <= 7499 ~ 'low active',
                                 total_steps >= 7500 & total_steps <= 9999 ~ 'somewhat active',
                                 total_steps >= 10000 & total_steps <= 12499 ~ 'active',
                                 total_steps >= 12500 ~ 'highly active')) %>% 
select(id,date, weekday, steps_level_week)

head(steps_week)
steps_week$weekday <- factor(steps_week$weekday, levels = c("Monday", "Tuesday", "Wednesday", "Thursday","Friday","Saturday", "Sunday"))

ggplot(steps_week) + geom_bar(mapping = aes(x = weekday, fill = steps_level_week)) +
  labs(title= 'Step intensity by weekday', x = 'Weekday', y = 'Count')

From the daily steps analysis, we can conclude the following:

  1. Only a small percentage of users (4%) are classified as highly active. The majority (37%) fall into the somewhat active category, while the low active, sedentary, and active groups each make up approximately 20%.

  2. The highly active group is largest on Saturday and smallest on Friday, whereas sedentary behavior is most prevalent on Sunday.

4.4 Analysis of the daily sleep

To evaluate users’ sleep quality, we calculate the sleep efficiency (SE) as the ratio of total sleep time (TST) to time in bed (TIB) multiplied by 100. This metric is based on the methodology outlined in Reed and Sacco (2016): \(SE = TST/TIM * 100\).

To classify the sleep quality qualitatively, we use our SE values and rank them according tothe Pittsburgh Sleep Quality Index (PSQI) scale Buysse et al. (1989):

  • > 85 % very good

  • 75 - 84 % fairly good

  • 65 - 74 % fairly bad

  • < 65 % very bad

sleep_quality <- daily_activity_all %>% 
  mutate(weekday = weekdays(date)) %>% 
  mutate(sleep_efficiency = total_minutes_asleep/total_time_in_bed * 100) %>% 
  mutate(sleep_duration_h = total_minutes_asleep/60) %>% 
  mutate(sleep_level = case_when(sleep_efficiency >= 85 ~ 'very good',
                                 sleep_efficiency >= 75 & sleep_efficiency <= 84 ~ 'fairly good',
                                 sleep_efficiency >= 65 & sleep_efficiency <= 74 ~ 'fairly bad',
                                 sleep_efficiency <= 64 ~ 'very bad')) %>% 
  na.omit() %>% 
  select(id, date, weekday, sleep_efficiency, sleep_level, sleep_duration_h, calories, total_steps, total_distance, average_speed, average_pace)
head(sleep_quality)

We aim to classify sleep efficiency for each user and determine the percentage distribution of each sleep level category. This classification is based on the average sleep efficiency values of individual users.

sleep_efficiency_user <- sleep_quality %>% 
  group_by(id) %>% 
  summarise(sleep_efficiency_av = mean(sleep_efficiency)) %>% 
  mutate(sleep_level = case_when(sleep_efficiency_av >= 85 ~ 'very good',
                                 sleep_efficiency_av >= 75 & sleep_efficiency_av <= 84 ~ 'fairly good',
                                 sleep_efficiency_av >= 65 & sleep_efficiency_av <= 74 ~ 'fairly bad',
                                 sleep_efficiency_av <= 64 ~ 'very bad'))
head(sleep_efficiency_user)
sleep_level_percentage <- sleep_efficiency_user %>% 
  group_by(sleep_level) %>% 
  summarise(counts = n()) %>% 
  mutate(total_percent = counts/sum(counts) * 100) %>% 
  group_by(sleep_level)
head(sleep_level_percentage)

The following plots are used to analyze sleep quality:

  1. Sleep quality vs Weekday – Examines variations in sleep quality across different days of the week.
  2. Sleep efficiency vs Sleep duration (Density plot) – Visualizes the distribution of sleep efficiency concerning sleep duration.
  3. Sleep Efficiency vs Sleep duration (Box plot) – Displays the spread and variability of sleep efficiency across different sleep durations.
sleep_quality$weekday <- factor(sleep_quality$weekday, levels = c("Monday", "Tuesday", "Wednesday", "Thursday","Friday","Saturday", "Sunday"))

# Sleep quality distribution per day of the week
ggplot(data = sleep_quality) + geom_bar(mapping = aes(x = weekday, fill = sleep_level)) + facet_wrap(~sleep_level, nrow = 2, scale = 'free') + 
labs(title = 'Sleep quality', x = 'Weekday', y= 'Count') +
theme(axis.text.x = element_text(angle = 45,vjust = 1, hjust = 1))

# Density plot  of sleep quality
d1 <- ggplot(sleep_quality, aes(x = sleep_duration_h, y = sleep_efficiency)) + 
  geom_point(alpha = 0.3) + 
  ylim(80, 100) +
  xlim(4,11) +
  geom_density_2d_filled(alpha = 0.7) +
  labs(title = 'Density (a) and Box (b) plots: Sleep efficiency vs Sleep duration', subtitle = 'a)', x = 'Sleep duration (h)', y = 'Sleep efficiency (%)')
 
# Box plot of sleep quality 
d2 <- ggplot(sleep_quality, aes(x = sleep_duration_h, y = sleep_efficiency, fill = sleep_level)) + 
  geom_boxplot() +
  labs(subtitle ='b)', x = 'Sleep duration (h)', y = 'Sleep efficiency (%)')

# Generation of the planel plot 
egg::ggarrange(d1, d2, heights = NULL, widths= NULL, ncol = 1, nrow = 2)

The results of the daily sleep analysis reveal that:

  1. Most users (95%) experience very good sleep quality overall. However, Sunday has the poorest sleep quality, with the smallest proportion of users in the “very good” category and the highest proportion in the “very bad” category.

  2. The density plot indicates that the largest segment of participants achieves very good sleep efficiency when sleeping between six and eight hours. Additionally, the box plot reveals that users tend to sleep very poorly when sleeping less than 5.5 hours and only “fairly good” when sleeping for more than 8.5 hours. The plot also suggests that the highest sleep efficiency is achieved with approximately 7 hours of sleep.

4.5 Correlations

To examine potential correlations among the studied variables, we use ‘calories’ as the independent variable and plot it against ‘total_distance,’ ‘average_speed,’ ‘average_pace,’ ‘sleep_duration,’ and ‘weight_evolution.’ To assess the relationships between these variables, we apply a simple linear regression model.

4.5.1 Normality assessment and calories predictor model

We aim to develop a model capable of predicting calories burned based on the distance traveled by the user. To achieve this, we will analyze the variables calories and total_distance. The initial step in building our model is to verify whether the data for the predictor (total_distance) and the response (calories) follow a normal distribution. We can assess the normality of the data using the following methods:

  • Building a histogram to visually inspect the distribution,

  • Examining quantiles to check for deviations from normality, and

  • Performing the Kolmogorov-Smirnov test, a statistical test for normality. This method was selected because our sample is very large.

For the Kolmogorov-Smirnov test, we set up the hypotheses as follows:

  • \(H_0\) (Null Hypothesis): The data are normally distributed (\(p > 0.05\)).

  • \(H_1\) (Alternative Hypothesis): The data are not normally distributed (\(p < 0.05\)).

# Visual Inspection - Histograms
a<-ggplot(daily_activity_all, aes(x = calories)) + 
  geom_histogram(binwidth = 500, fill = "mediumturquoise", color = "black") +
  labs(title = "Histogram of Calories", x = "Calories", y = "Count")

b<-ggplot(daily_activity_all, aes(x = total_distance)) + 
  geom_histogram(binwidth = 1, fill = "pink", color = "black") +
  labs(title= "Histogram of Total distance", x = "Total distance", y = "Count")

# Generate panel plot
ggpubr::ggarrange(a,b)

# Visual Inspection - Q-Q Plots
par(mfrow = c(1, 2))
c<-qqnorm(daily_activity_all$calories, main = "Q-Q Plot of Calories", xlab = "Theoretical quantiles", ylab = "Sample quantiles")
qqline(daily_activity_all$calories, col = "red")

d<-qqnorm(daily_activity_all$total_distance, main = "Q-Q Plot of Total distance", xlab = "Theoretical quantiles", ylab = "Sample quantiles")
qqline(daily_activity_all$total_distance, col = "red")

# Standardize the variables
calories_z <- scale(daily_activity_all$calories) %>% na.omit()
total_distance_z <- scale(daily_activity_all$total_distance)  %>% na.omit()

# Kolmogorov-Smirnov Test for Calories
ks.test(calories_z, "pnorm")
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  calories_z
## D = 0.10536, p-value = 0.0002229
## alternative hypothesis: two-sided
# Kolmogorov-Smirnov Test for Total distance
ks.test(total_distance_z, "pnorm")
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  total_distance_z
## D = 0.053047, p-value = 0.1988
## alternative hypothesis: two-sided

According to the results of the Kolmogorov-Smirnov test, the null hypothesis (\(H_0\)) is rejected for the variable ‘calories’, indicating that the data are not normally distributed. Nonetheless, the Q-Q plot and histogram of this variable indicate a pattern that leans toward normality. For ‘total_distance’, both the statistical test and visual inspection confirm normal distribution (we accept \(H_0\)). Non-normality in the data can impact the robustness of our linear model. Consequently, alternative modeling strategies, such as nonlinear regression or data transformations, may be required to better fit the data.

4.5.2 Lineal correlation model of Total distance vs Calories

For hypothesis testing in our linear regression models(\(y = β_0 + β_1x_1\)), we define the hypotheses as follows:

  • \(H_0\) (Null Hypothesis): \(β_1=0\), \(p>0.05\). The independent variable does not significantly affect the dependent variable.

  • \(H_1\) (Alternative Hypothesis): \(β_1≠0\), \(p<0.05\). The independent variable does significantly affect the dependent variable.

If the p-value is less than 0.05, we reject \(H_0\) and conclude that the independent variable has a significant effect on the dependent variable.

# a) We check the linear correlation of total_distance vs calories
ggplot(data = sleep_quality, aes(x = total_distance, y = calories)) + geom_point() + geom_smooth(method = 'lm', color = 'red') + ggpubr::stat_regline_equation(label.x = 2, label.y = 4600, size = 4) +
  labs(title = 'Total distance vs Burned calories', x = 'Total distance (km)', y = 'Calories') + 
  annotate('text', x = 7, y = 4600, label = 'R^2 = 0.13', color = 'blue', size = 4)
## `geom_smooth()` using formula = 'y ~ x'

# b) We determine the linear regression coefficients:
model <- lm(sleep_quality$calories ~ sleep_quality$total_distance, data=sleep_quality)
summary(model)
## 
## Call:
## lm(formula = sleep_quality$calories ~ sleep_quality$total_distance, 
##     data = sleep_quality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1272.7  -605.3   -24.1   489.3  1745.3 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    1785.8      142.0  12.579  < 2e-16 ***
## sleep_quality$total_distance    111.7       18.0   6.207 2.14e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 728 on 259 degrees of freedom
## Multiple R-squared:  0.1295, Adjusted R-squared:  0.1261 
## F-statistic: 38.52 on 1 and 259 DF,  p-value: 2.138e-09

Considering the threshold of p < 0.05, the p-values of our regression coefficients indicate that we must reject the null hypothesis (\(H_0\)). Therefore, we accept the alternative hypothesis (\(H_1\)) and conclude that the dependent variable calories is significantly affected by the independent variable total_distance (i.e., \(β_1≠0\)). However, the correlation coefficient \(R^2\) of our model is very low. This suggests that the linear regression model does not fit the data points well, and we cannot accurately predict calories values using this model.

4.5.3 Calories vs Average speed, Average pace, Sleep duration, and Weight evolution

In the following plot panel, the correlations between the dependent variable “calories” and the other variables of interest are displayed.

# a) average_speed vs calories
a <- ggplot(data = sleep_quality, aes(x = average_speed, y = calories)) + geom_point() + 
  geom_smooth(method = 'lm', color = 'red') + ggpubr::stat_regline_equation(label.x = 0, label.y = 4600, size = 4) +
  labs(title = 'Average speed vs Burned calories', x = 'Average speed (km/h)', y = 'Calories') + 
  annotate('text', x = 3, y = 4600, label = 'R^2 = 0.13', color = 'blue', size = 4)

# b) average_pace vs calories
b<- ggplot(data = sleep_quality, aes(x = average_pace, y = calories)) + geom_point() + geom_smooth(method = 'lm', color = 'red') + ggpubr::stat_regline_equation(label.x = 1, label.y = 4600, size = 4) +
  labs(title = 'Average pace vs Burned calories', x = 'Average pace (min/km)', y = 'Calories') + 
  annotate('text', x = 70, y = 4600, label = 'R^2 = 0.03', color = 'blue', size = 4)

# c) sleep_duration vs calories
c<- ggplot(data = sleep_quality, aes(x = sleep_duration_h, y = calories)) + geom_point() + geom_smooth(method = 'lm', color = 'red') + ggpubr::stat_regline_equation(label.x = 1, label.y = 4600, size = 4) +
  labs(title = 'Sleep duration vs Burned calories', x = 'Sleep duration (h)', y = 'Calories') + 
  annotate('text', x = 9, y = 4600, label = 'R^2 = 0.15', color = 'blue', size = 4)

# We merge weight_evolution table with day_activity_all to plot calories vs weight_evolution

weight_calories <- merge(daily_activity_all, weight_evolution, by = c('id', 'date'))

# d) Calories vs Weight evolution
d <- ggplot(data = weight_calories, aes(x = weight_kg , y = calories)) + geom_point() + geom_smooth(method = 'lm', color = 'red') + ggpubr::stat_regline_equation(label.x = 50, label.y = 4000, size = 4) +
  xlim(50, 110) +
  labs(title = 'Weight vs Burned calories', x = 'Weight (kg)', y = 'Calories') + 
  annotate('text', x = 90, y = 4000, label = 'R^2 = 0.1', color = 'blue', size = 4)

# Generation of the panel plot
ggpubr::ggarrange(a,b,c,d)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

The null hypothesis (\(H_0\)) was rejected in all cases except for Calories vs. Weight. Rejection of the null hypothesis indicates that the independent variables (average speed, average pace, sleep duration) have an effect on the burned calories. However, the very low \(R^2\) values suggest that the linear model does not adequately describe the data, therefore we cannot predict calories accurately with this model.

In the case of Calories vs. Weight, we accept the null hypothesis (\(H_0\)) (\(p = 0.05294\)) and conclude that there is no significant relationship between burned calories and weight (i.e., \(β_1=0\)).

5 Conclusions

After our analysis, we identified some insights that could help Bellabeat make informed business decisions and unlock new growth opportunities:

  1. Speed: Participants were classified into four categories (fast, medium, slow, very slow). We found that users tend to move relatively slowly, with the average fast speed calculated as \(3.96 km/h\), which is quite low. This suggests that most users primarily walk. We recommend that Bellabeat could use a marketing campaign to encourage users to increase the intensity of their movements, aiming for higher activity levels.

  2. Pace: This variable was also classified into four categories (slow, medium, high, very high). The results show that users take between \(30\) minutes and \(1\) hour to cover \(1 km\). These results align with the speed findings and further indicate that, on average, users are not very active.

  3. Daily Steps: The analysis shows that most users are somewhat active, walking \(7,500\) to \(9,999\) steps daily. However, on Sundays, the sedentary group (around \(5,000\) steps) is notably higher. This indicates a high-potential opportunity to motivate users to stay active throughout the week.

  4. Daily Sleep: Using the Pittsburgh Sleep Quality Index, we classified users’ sleep quality. The results reveal that \(95%\) of participants sleep very well (with over \(90%\) sleep efficiency). The best sleep quality occurs when users sleep between \(6\) and \(8.5\) hours. Bellabeat users should be made aware that longer sleep durations do not necessarily equate to better sleep quality.

  5. Correlations: Through hypothesis testing, we found relationships between burned calories and traveled distance, speed, pace, and sleep duration. However, a linear model does not appropriately describe the data due to a low Pearson correlation coefficient (\(R^2 < 0.2\)). We can explore data transformation methods to find a more suitable model that better fits the data.

6 Recomendations

Based on the results from our data analysis, we suggest the following recommendations:

  1. Demographic Information: Since we do not have access to demographic information about the participants, the results of our analysis might be biased. For future projects, it is essential to include demographic data to obtain a more comprehensive and representative analysis.

  2. Sample Size: In this project, we worked with a small sample of 33 participants. To identify trends within a larger segment of the population and target them as potential Bellabeat customers, a larger sample size and data collection over a longer time period are needed.

  3. Data Collection: It is highly recommended to collect data while users engage in different physical activities (e.g., walking, running, swimming, etc.). This would allow us to identify trends that can help advise users on how to burn calories more effectively and optimally.

  4. Correlations: Since linear models do not adequately describe the data, further investigation into alternative models or data transformations may be necessary to improve prediction accuracy and ensure a better fit for our data.

References

Buysse, Daniel J., Charles F. Reynolds, Timothy H. Monk, Susan R. Berman, and David J. Kupfer. 1989. “The Pittsburgh Sleep Quality Index: A New Instrument for Psychiatric Practice and Research.” Psychiatry Research 28 (2): 193–213. https://doi.org/https://doi.org/10.1016/0165-1781(89)90047-4.
Firke, Sam. 2024. Janitor: Simple Tools for Examining and Cleaning Dirty Data. https://github.com/sfirke/janitor.
Grolemund, Garrett, and Hadley Wickham. 2011. “Dates and Times Made Easy with lubridate.” Journal of Statistical Software 40 (3): 1–25. https://www.jstatsoft.org/v40/i03/.
Müller, Kirill, and Jennifer Bryan. 2017. Here Package. https://here.r-lib.org/index.html.
Reed, David L., and William P. Sacco. 2016. “Measuring Sleep Efficiency: What Should the Denominator Be?” Journal of Clinical Sleep Medicine 12 (02): 263–66. https://doi.org/10.5664/jcsm.5498.
Tudor-Locke, Catrine, Cora L. Craig, John P. Thyfault, and John C. Spence. 2013. “A Step-Defined Sedentary Lifestyle Index: <5000 Steps/Day.” Applied Physiology, Nutrition, and Metabolism 38 (2): 100–114. https://doi.org/10.1139/apnm-2012-0235.
Waring, Elin, Michael Quinn, Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu, and Shannon Ellis. 2025. Skimr: Compact and Flexible Summaries of Data. https://docs.ropensci.org/skimr/.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.