Bellabeat is a high-tech company that manufactures health-focused smart products. The company collects data on activity, sleep, stress, and reproductive health to empower women with knowledge about their own health and habits. It is believed that analyzing smart device fitness data could help unlock new growth opportunities for the company.
Bussines taks: This project aims to analyze data from non-Bellabeat smart devices, such as smart watches and trackers, to understand consumer usage patterns. By identifying these trends, we can enhance Bellabeat’s marketing strategies and product offerings to better support women’s health and wellness.
To address the aforementioned task, we can analyze the open-source datasets offered by Mobius that are available on Kaggle. The data was obtained with the help of eligible users who used a FitBit Fitness Tracker and consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. We possess 29 CSV files collected from 33 FitBit users through a survey. The data encompasses a time window between 03/12/2016 and 05/12/2016.
To conduct this analysis, we will focus on datasets that provide key information such as calories burned, levels of physical activity, and sleep patterns. By examining these variables, we can gain insights into individuals’ physical activity behaviors. The specific datasets we will utilize include:
i) dailyActivity_merged: Contains data on daily steps, active minutes, and total distance traveled.
ii) dailySteps_merged: Provides information on the number of steps taken per day.
iii) dailyCalories_merged: Records the total calories burned per day.
iv) WeightLogInfo_merged: Includes weight data reported by users of FitBit devices.
These datasets will enable us to analyze and understand users’ physical and sleep activity patterns effectively.
To start working with data structures in R it is necessary to install and load the following packages:
To install the packages we excecute install.packages(‘package-name’) and load the as library(package-name).
We pull our datasets from the data source:
daily_activity <- read.csv('dailyActivity_merged.csv')
daily_steps <- read.csv('dailySteps_merged.csv')
daily_sleep <- read.csv('sleepDay_merged.csv')
weight_evolution <- read.csv('weightLogInfo_merged.csv')
To obtain an optimal description of our data, we use the
skim_without_charts function. This function offers a
comprehensive overview of datasets, including key statistical
information such as minimum, maximum, standard deviation, and mean. It
also displays the count of missing values and the data types of
variables within the dataset. This functionality is particularly useful
for obtaining a statistical summary of our data.
The dataset comprises 940 observations (rows) and 15 variables (columns), with 31 unique entries. Notably, the column names exhibit inconsistencies, including variations in letter casing. Additionally, the ‘activity date’ column is currently formatted as a string; it should be converted to a date data type for accurate analysis.
head(daily_activity)
skimr::skim_without_charts(daily_activity)
| Name | daily_activity |
| Number of rows | 940 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 14 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ActivityDate | 0 | 1 | 8 | 9 | 0 | 31 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| Id | 0 | 1 | 4.855407e+09 | 2.424805e+09 | 1503960366 | 2.320127e+09 | 4.445115e+09 | 6.962181e+09 | 8.877689e+09 |
| TotalSteps | 0 | 1 | 7.637910e+03 | 5.087150e+03 | 0 | 3.789750e+03 | 7.405500e+03 | 1.072700e+04 | 3.601900e+04 |
| TotalDistance | 0 | 1 | 5.490000e+00 | 3.920000e+00 | 0 | 2.620000e+00 | 5.240000e+00 | 7.710000e+00 | 2.803000e+01 |
| TrackerDistance | 0 | 1 | 5.480000e+00 | 3.910000e+00 | 0 | 2.620000e+00 | 5.240000e+00 | 7.710000e+00 | 2.803000e+01 |
| LoggedActivitiesDistance | 0 | 1 | 1.100000e-01 | 6.200000e-01 | 0 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 4.940000e+00 |
| VeryActiveDistance | 0 | 1 | 1.500000e+00 | 2.660000e+00 | 0 | 0.000000e+00 | 2.100000e-01 | 2.050000e+00 | 2.192000e+01 |
| ModeratelyActiveDistance | 0 | 1 | 5.700000e-01 | 8.800000e-01 | 0 | 0.000000e+00 | 2.400000e-01 | 8.000000e-01 | 6.480000e+00 |
| LightActiveDistance | 0 | 1 | 3.340000e+00 | 2.040000e+00 | 0 | 1.950000e+00 | 3.360000e+00 | 4.780000e+00 | 1.071000e+01 |
| SedentaryActiveDistance | 0 | 1 | 0.000000e+00 | 1.000000e-02 | 0 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.100000e-01 |
| VeryActiveMinutes | 0 | 1 | 2.116000e+01 | 3.284000e+01 | 0 | 0.000000e+00 | 4.000000e+00 | 3.200000e+01 | 2.100000e+02 |
| FairlyActiveMinutes | 0 | 1 | 1.356000e+01 | 1.999000e+01 | 0 | 0.000000e+00 | 6.000000e+00 | 1.900000e+01 | 1.430000e+02 |
| LightlyActiveMinutes | 0 | 1 | 1.928100e+02 | 1.091700e+02 | 0 | 1.270000e+02 | 1.990000e+02 | 2.640000e+02 | 5.180000e+02 |
| SedentaryMinutes | 0 | 1 | 9.912100e+02 | 3.012700e+02 | 0 | 7.297500e+02 | 1.057500e+03 | 1.229500e+03 | 1.440000e+03 |
| Calories | 0 | 1 | 2.303610e+03 | 7.181700e+02 | 0 | 1.828500e+03 | 2.134000e+03 | 2.793250e+03 | 4.900000e+03 |
This dataset contains 413 entries and 31 observations. We will need to merge this data later on with the daily_data dataset by the ‘id’ column for easier manipulation. Furthermore, the sleep_day column should be transformed into a date type for proper analysis and processing.
head(daily_sleep)
skimr::skim_without_charts(daily_sleep)
| Name | daily_sleep |
| Number of rows | 413 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| SleepDay | 0 | 1 | 20 | 21 | 0 | 31 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| Id | 0 | 1 | 5.000979e+09 | 2.06036e+09 | 1503960366 | 3977333714 | 4702921684 | 6962181067 | 8792009665 |
| TotalSleepRecords | 0 | 1 | 1.120000e+00 | 3.50000e-01 | 1 | 1 | 1 | 1 | 3 |
| TotalMinutesAsleep | 0 | 1 | 4.194700e+02 | 1.18340e+02 | 58 | 361 | 433 | 490 | 796 |
| TotalTimeInBed | 0 | 1 | 4.586400e+02 | 1.27100e+02 | 61 | 403 | 463 | 526 | 961 |
This field has 940 rows and three columns and 31 unique values. Column names should be cleaned and data type of ActivityDay should be formatted as date data-type.
head(daily_steps)
skimr::skim_without_charts(daily_steps)
| Name | daily_steps |
| Number of rows | 940 |
| Number of columns | 3 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ActivityDay | 0 | 1 | 8 | 9 | 0 | 31 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| Id | 0 | 1 | 4.855407e+09 | 2.424805e+09 | 1503960366 | 2.320127e+09 | 4445114986.0 | 6962181067 | 8877689391 |
| StepTotal | 0 | 1 | 7.637910e+03 | 5.087150e+03 | 0 | 3.789750e+03 | 7405.5 | 10727 | 36019 |
Compared to the other datasets, this one contains relatively few entries —67 rows with 56 unique values. Despite the smaller size, we will analyze the data to identify any potential trends related to weight evolution.
head(weight_evolution)
skimr::skim_without_charts(weight_evolution)
| Name | weight_evolution |
| Number of rows | 67 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Date | 0 | 1 | 19 | 21 | 0 | 56 | 0 |
| IsManualReport | 0 | 1 | 4 | 5 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| Id | 0 | 1.00 | 7.009282e+09 | 1.950322e+09 | 1.503960e+09 | 6.962181e+09 | 6.962181e+09 | 8.877689e+09 | 8.877689e+09 |
| WeightKg | 0 | 1.00 | 7.204000e+01 | 1.392000e+01 | 5.260000e+01 | 6.140000e+01 | 6.250000e+01 | 8.505000e+01 | 1.335000e+02 |
| WeightPounds | 0 | 1.00 | 1.588100e+02 | 3.070000e+01 | 1.159600e+02 | 1.353600e+02 | 1.377900e+02 | 1.875000e+02 | 2.943200e+02 |
| Fat | 65 | 0.03 | 2.350000e+01 | 2.120000e+00 | 2.200000e+01 | 2.275000e+01 | 2.350000e+01 | 2.425000e+01 | 2.500000e+01 |
| BMI | 0 | 1.00 | 2.519000e+01 | 3.070000e+00 | 2.145000e+01 | 2.396000e+01 | 2.439000e+01 | 2.556000e+01 | 4.754000e+01 |
| LogId | 0 | 1.00 | 1.461772e+12 | 7.829948e+08 | 1.460444e+12 | 1.461079e+12 | 1.461802e+12 | 1.462375e+12 | 1.463098e+12 |
In this part of the project we will focus on structuring, formatting, checking for duplicates, and cleaning the data.
We will start our cleaning process by giving a consistent name to our columns:
daily_activity <- janitor::clean_names(daily_activity)
daily_sleep <- janitor::clean_names(daily_sleep)
daily_steps <- janitor::clean_names(daily_steps)
weight_evolution <- janitor::clean_names(weight_evolution)
Our skim_without_charts and head tables
show that the data type of date column of our data sets is a string.
Therefore we have to format it as date type.
daily_activity <- daily_activity %>% dplyr::mutate(activity_date = lubridate::mdy(daily_activity$activity_date))
daily_sleep <- daily_sleep %>% dplyr::mutate(sleep_day = lubridate::as_date(daily_sleep$sleep_day, format ="%m/%d/%Y %I:%M:%S %p" , tz = NULL))
daily_steps <- daily_steps %>% dplyr::mutate(activity_day = lubridate::mdy(daily_steps$activity_day))
weight_evolution<- weight_evolution %>% dplyr::mutate(date = lubridate::as_date(weight_evolution$date,format ="%m/%d/%Y %I:%M:%S %p" , tz = NULL))
It is essential that we designate all variables containing date-type data as ‘date’ to keep naming consistency across the dataset.
daily_activity <- daily_activity %>% rename(date = activity_date)
daily_steps <- rename(daily_steps, date = activity_day)
daily_sleep <- rename(daily_sleep, date = sleep_day)
After executing the code below, we observe that the
weight_evolution dataset contains 65 missing (NA) values in
the ‘fat’ column. We will exclude this variable from our analysis.
Additionally, due to the limited number of records, the
weight_evolution dataset will only be analyzed for
correlation purposes.
weight_evolution <- weight_evolution %>% select(-fat)
sum(is.na(daily_activity))
## [1] 0
sum(is.na(daily_sleep))
## [1] 0
sum(is.na(daily_steps))
## [1] 0
sum(is.na(weight_evolution))
## [1] 0
To identify and remove duplicate observations in our datasets, we can
use the duplicated() function. This function returns a
logical vector indicating which rows are duplicates. By applying it to
our dataset, we can create a table that shows the count of duplicate and
unique values. After running the code we find out that daily_sleep has 3
repeats that should be removed.
daily_activity_repeats<- print(table(duplicated(daily_activity)))
##
## FALSE
## 940
daily_sleep_repeats <- print(table(duplicated(daily_sleep)))
##
## FALSE TRUE
## 410 3
daily_steps_repeats <- print(table(duplicated(daily_steps)))
##
## FALSE
## 940
weight_evolution_repeats <-print(table(duplicated(weight_evolution)))
##
## FALSE
## 67
To remove the existing duplicates, we can use the
!duplicated() function, which returns a logical vector
indicating which rows are not duplicates. The number of rows is then
equal to 410, which is the number of the unique values, indicating that
repeats have been removed.
daily_sleep <- daily_sleep[!duplicated(daily_sleep), ]
print(nrow(daily_sleep))
## [1] 410
To perform calculations using data from different datasets, it’s
essential to merge them. In R, this can be accomplished using the
merge() function as follows:
daily_activity_all <- merge(daily_activity, daily_sleep, by = c('id', 'date'))
We will begin our data analysis by calculating metrics such as speed and pace. These measurements help us examine the movement patterns of the participants.
We check our cleaned datasets.
head(daily_activity)
head(daily_steps)
head(daily_sleep)
head(weight_evolution)
head(daily_activity_all)
Relative speed is categorized as high, moderate, or light, based on the distances(km) and durations (in hours) spent in very active, moderately active, and lightly active states. From this, we derive three speed levels —high, moderate, and light. The average of these speed levels is then calculated and used as a reference value to qualitatively classify users’ movement patterns.
daily_activity_all<- mutate(daily_activity_all, high_speed = very_active_distance/(very_active_minutes/60))
daily_activity_all<- mutate(daily_activity_all, moderate_speed = moderately_active_distance/(fairly_active_minutes/60))
daily_activity_all <- daily_activity_all %>% mutate(low_speed = light_active_distance/(lightly_active_minutes/60))
daily_activity_all<- daily_activity_all %>% mutate(average_speed = (high_speed + moderate_speed + low_speed)/3)
speeds <- daily_activity_all %>%
na.omit() %>%
select(id, date, high_speed, moderate_speed, low_speed, average_speed)
head(speeds)
Now we can calculate the average speed of every speed level:
daily_average_sp <-daily_activity_all %>%
na.omit() %>%
group_by(id) %>%
summarise(mean_high_speed = mean(high_speed), mean_high_speed = mean(high_speed), mean_moderate_speed = mean(moderate_speed), mean_low_speed = mean(low_speed))
head(daily_average_sp)
# We calculate de mean of every speed level
summary(daily_average_sp)
## id mean_high_speed mean_moderate_speed mean_low_speed
## Min. :1.504e+09 Min. :1.895 Min. :0.9267 Min. :0.7000
## 1st Qu.:4.010e+09 1st Qu.:3.604 1st Qu.:2.3422 1st Qu.:0.9769
## Median :4.631e+09 Median :4.186 Median :2.6216 Median :1.0132
## Mean :5.062e+09 Mean :3.962 Mean :2.4585 Mean :1.0434
## 3rd Qu.:6.822e+09 3rd Qu.:4.443 3rd Qu.:2.7649 3rd Qu.:1.1630
## Max. :8.792e+09 Max. :6.215 Max. :3.1559 Max. :1.5323
To define a categorical parameter for classifying user movement speed, we calculate the average speed for each category and assign levels such as fast, medium, slow, and very slow. This approach allows us to establish velocity ranges, enabling a qualitative visualization of the data.
Average speed classification:
These values are used to create speed intervals (levels):
In this way, we can qualitatively classify each user’s average speed as follows:
speed_level <- daily_activity_all %>%
tidyr::drop_na() %>%
group_by(id) %>%
mutate(speed_level = case_when(
average_speed >=3.96 ~ "fast",
average_speed <3.96 & average_speed >= 2.46 ~ "medium",
average_speed < 2.46 & average_speed >= 1.04 ~ "slow",
average_speed <1.043 ~ "very slow")) %>%
select(id,date, speed_level)
head(speed_level)
We plot the the results to determine the speed distribution of participants:
library(ggplot2)
ggplot2::ggplot(data = speed_level ) +
ggplot2::geom_bar(mapping = aes(x = date, fill = speed_level)) +
labs(title = 'Overall speed level distribution', x = 'Date', y = 'Frequency')+
scale_x_date(date_labels = "%b,%d", breaks = 'week', date_minor_breaks = "1 day")
It is also useful to visualize the speed distribution of users based on the day of the week.
speed_level_weekday <-speed_level %>%
group_by(id) %>%
mutate(weekday = weekdays(date))
speed_level_weekday$weekday <- factor(speed_level_weekday$weekday, c("Monday", "Tuesday", "Wednesday", "Thursday","Friday","Saturday", "Sunday"))
ggplot(data = speed_level_weekday) +
geom_bar(mapping = aes(x = weekday, fill = speed_level)) +
labs(title = 'Speed intensity across the weekdays', x = 'Weekday', y = 'Count')
Here, the speed categories are expressed as percentages, and a Pareto chart is used to enhance the visualization of these results.
percentage_sp <- speed_level %>%
group_by(speed_level) %>%
summarise(counts = n()) %>%
mutate(total = sum(counts)) %>%
group_by(speed_level) %>%
summarise(percent = counts/total*100, counts)
head(percentage_sp)
ggplot2::ggplot(percentage_sp, aes(x = speed_level, y = counts)) +
ggQC::stat_pareto(point.color = "red",
point.size = 3,
line.color = "black",
bars.fill = c("orange", "green")) +
labs(title = 'Pareto chart of speed categories ', x = 'Speed level', y = 'Count') +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5, size=10))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Based on the plots above, we can conclude the following:
Approximately one-third of users moved at speeds between \(1.04 km/h\) and \(2.46 km/h\), while around two-thirds moved between \(2.46 km/h\) and \(3.96 km/h\).
The majority of users fell into the fast and medium speed categories.
Tuesday had the highest activity level among participants, whereas Sunday had the lowest.
The fastest recorded speed among participants occurred on Monday.
Pace quantifies the time required to travel a unit distance and is defined as: \(pace=time (min)/distance (km)\). Relative pace is categorized into high, moderate, and light. High pace is derived using “very active time” and “very active distance” data, while moderate and light pace are calculated similarly. Additionally, the average pace is determined to serve as a reference value for the qualitative classification of users’ pace.
daily_activity_all<- daily_activity_all %>% mutate(high_pace = very_active_minutes/very_active_distance)
daily_activity_all<- daily_activity_all %>% mutate(moderate_pace = fairly_active_minutes/moderately_active_distance)
daily_activity_all<- daily_activity_all %>% mutate(low_pace = lightly_active_minutes/light_active_distance)
daily_activity_all<- daily_activity_all %>% mutate(average_pace = (high_pace + moderate_pace + low_pace)/3)
pace <- daily_activity_all %>%
na.omit() %>%
select(id, date, high_pace, moderate_pace, low_pace, average_pace)
head(pace)
We calculate the average pace for each category (high, moderate, and light). These values are then used to establish pace intervals for further analysis.
daily_avg_pace <-daily_activity_all %>%
na.omit() %>%
group_by(id) %>%
summarise(mean_high_pace = mean(high_pace), mean_moderate_pace = mean(moderate_pace), mean_low_pace = mean(low_pace))
head(daily_avg_pace)
# We calculate de mean of every pace level
summary(daily_avg_pace)
## id mean_high_pace mean_moderate_pace mean_low_pace
## Min. :1.504e+09 Min. : 9.662 Min. :19.54 Min. :40.01
## 1st Qu.:4.010e+09 1st Qu.:13.600 1st Qu.:21.77 1st Qu.:52.66
## Median :4.631e+09 Median :15.614 Median :23.66 Median :60.27
## Mean :5.062e+09 Mean :19.424 Mean :29.16 Mean :61.14
## 3rd Qu.:6.822e+09 3rd Qu.:18.903 3rd Qu.:30.07 3rd Qu.:63.04
## Max. :8.792e+09 Max. :51.254 Max. :69.81 Max. :85.88
For better visualization of our pace data, we define levels of pace (slow, medium, high, and very high). To create such levels we use the mean of the three different pace categories:
Average pace classification :
The pace intervals were defined as follows:
pace_level <- daily_activity_all %>%
tidyr::drop_na() %>%
group_by(id) %>%
mutate(pace_degree = case_when(
average_pace > 61.14 ~ "slow",
average_pace > 29.16 & average_pace <= 61.14 ~ "medium",
average_pace > 19.42 & average_pace <= 29.16 ~ "high",
average_pace <= 19.42 ~ "very high")) %>%
select(id,date,average_pace, pace_degree)
head(pace_level)
We plot the results to determine the pace distribution of the participants per day.
library(ggplot2)
ggplot2::ggplot(data = pace_level ) +
ggplot2::geom_bar(mapping = aes(x = date, fill = pace_degree)) +
labs(title = 'Overall pace distribution', x = 'Date', y = 'Count')+
scale_x_date(date_labels = "%b,%d", breaks = 'week', date_minor_breaks = "1 day")
The pace level distribution can also be visualized for each day of the week:
pace_level_weekday <-pace_level %>%
group_by(id) %>%
mutate(weekday = weekdays(date))
pace_level_weekday$weekday <- factor(speed_level_weekday$weekday, levels = c("Monday", "Tuesday", "Wednesday", "Thursday","Friday","Saturday", "Sunday"))
ggplot(data = pace_level_weekday) +
geom_bar(mapping = aes(x = weekday, fill = pace_degree)) +
labs(title = 'Pace intensity across the Weekdays', x = 'Weekday', y = 'Count')
Each pace level can be expressed as a percentage and visualized using a Pareto chart; thus, it is straightforward to observe the proportional relationships among the different pace levels.
percentage_pace <- pace_level %>%
group_by(pace_degree)%>%
summarise(counts = n()) %>%
mutate(total = sum(counts)) %>%
group_by(pace_degree) %>%
summarise(total_percent = round(counts / total*100, digits = 2), counts)
head(percentage_pace)
ggplot(percentage_pace, aes(x = pace_degree, y = counts)) +
ggQC::stat_pareto(point.color = "red",
point.size = 3,
line.color = "black",
bars.fill = c("orange", "green")) +
labs(title = 'Pareto chart of pace level', x = 'Pace level', y = 'Count')+
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5, size=10))
Based on the pace data analysis, we can state the following:
The majority of users moved at a fast or medium pace. Fast pace accounts for 33% of participants, while medium pace is the most prevalent rendering 65%.
Tuesday and Wednesday are the most active days, whereas Sunday and Monday have the lowest activity levels.
To analyze step data, we classify daily step records into levels based on the Graduated Step Index published in the Applied Physiology, Nutrition, and Metabolism journal Tudor-Locke et al. (2013):
< 5000 sedentary
5000 - 7499 low active
7500 - 9999 somewhat active
10000 - 12499 active
≥ 12500 highly active.
For the steps-per-day classification, we will use the daily average step count of each user.
steps_day_av <- daily_activity_all %>%
group_by(id) %>%
summarise(steps_per_day = mean(total_steps)) %>%
mutate(steps_level = case_when(steps_per_day < 5000 ~ 'sedentary',
steps_per_day >= 5000 & steps_per_day <= 7499 ~ 'low active',
steps_per_day >= 7500 & steps_per_day <= 9999 ~ 'somewhat active',
steps_per_day >= 10000 & steps_per_day <= 12499 ~ 'active',
steps_per_day >= 12500 ~ 'highly active'))
head(steps_day_av)
The step levels can be represented as percentages for better visualization.
percentage_steps <- steps_day_av %>%
group_by(steps_level) %>%
summarise(counts = n()) %>%
mutate(total = sum(counts)) %>%
group_by(steps_level) %>%
summarise(total_percent = round(counts/total*100, digits = 2), counts)
head(percentage_steps)
We can visualize the percentages and frequencies by creating a Pareto chart.
ggplot2::ggplot(percentage_steps, aes(x = steps_level, y = counts)) +
ggQC::stat_pareto(point.color = "red",
point.size = 3,
line.color = "black",
bars.fill = c("orange", "green")) +
labs(title = 'Pareto chart of steps activity', x = 'Pace activity', y = 'Count') +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust=1, size=10))
We generate another data frame to analyze weekly step records and step activity trends.
steps_week <- daily_activity_all %>%
mutate(weekday = weekdays(date)) %>%
mutate(steps_level_week = case_when(total_steps < 5000 ~ 'sedentary',
total_steps>= 5000 & total_steps <= 7499 ~ 'low active',
total_steps >= 7500 & total_steps <= 9999 ~ 'somewhat active',
total_steps >= 10000 & total_steps <= 12499 ~ 'active',
total_steps >= 12500 ~ 'highly active')) %>%
select(id,date, weekday, steps_level_week)
head(steps_week)
steps_week$weekday <- factor(steps_week$weekday, levels = c("Monday", "Tuesday", "Wednesday", "Thursday","Friday","Saturday", "Sunday"))
ggplot(steps_week) + geom_bar(mapping = aes(x = weekday, fill = steps_level_week)) +
labs(title= 'Step intensity by weekday', x = 'Weekday', y = 'Count')
From the daily steps analysis, we can conclude the following:
Only a small percentage of users (4%) are classified as highly active. The majority (37%) fall into the somewhat active category, while the low active, sedentary, and active groups each make up approximately 20%.
The highly active group is largest on Saturday and smallest on Friday, whereas sedentary behavior is most prevalent on Sunday.
To evaluate users’ sleep quality, we calculate the sleep efficiency (SE) as the ratio of total sleep time (TST) to time in bed (TIB) multiplied by 100. This metric is based on the methodology outlined in Reed and Sacco (2016): \(SE = TST/TIM * 100\).
To classify the sleep quality qualitatively, we use our SE values and rank them according tothe Pittsburgh Sleep Quality Index (PSQI) scale Buysse et al. (1989):
> 85 % very good
75 - 84 % fairly good
65 - 74 % fairly bad
< 65 % very bad
sleep_quality <- daily_activity_all %>%
mutate(weekday = weekdays(date)) %>%
mutate(sleep_efficiency = total_minutes_asleep/total_time_in_bed * 100) %>%
mutate(sleep_duration_h = total_minutes_asleep/60) %>%
mutate(sleep_level = case_when(sleep_efficiency >= 85 ~ 'very good',
sleep_efficiency >= 75 & sleep_efficiency <= 84 ~ 'fairly good',
sleep_efficiency >= 65 & sleep_efficiency <= 74 ~ 'fairly bad',
sleep_efficiency <= 64 ~ 'very bad')) %>%
na.omit() %>%
select(id, date, weekday, sleep_efficiency, sleep_level, sleep_duration_h, calories, total_steps, total_distance, average_speed, average_pace)
head(sleep_quality)
We aim to classify sleep efficiency for each user and determine the percentage distribution of each sleep level category. This classification is based on the average sleep efficiency values of individual users.
sleep_efficiency_user <- sleep_quality %>%
group_by(id) %>%
summarise(sleep_efficiency_av = mean(sleep_efficiency)) %>%
mutate(sleep_level = case_when(sleep_efficiency_av >= 85 ~ 'very good',
sleep_efficiency_av >= 75 & sleep_efficiency_av <= 84 ~ 'fairly good',
sleep_efficiency_av >= 65 & sleep_efficiency_av <= 74 ~ 'fairly bad',
sleep_efficiency_av <= 64 ~ 'very bad'))
head(sleep_efficiency_user)
sleep_level_percentage <- sleep_efficiency_user %>%
group_by(sleep_level) %>%
summarise(counts = n()) %>%
mutate(total_percent = counts/sum(counts) * 100) %>%
group_by(sleep_level)
head(sleep_level_percentage)
The following plots are used to analyze sleep quality:
sleep_quality$weekday <- factor(sleep_quality$weekday, levels = c("Monday", "Tuesday", "Wednesday", "Thursday","Friday","Saturday", "Sunday"))
# Sleep quality distribution per day of the week
ggplot(data = sleep_quality) + geom_bar(mapping = aes(x = weekday, fill = sleep_level)) + facet_wrap(~sleep_level, nrow = 2, scale = 'free') +
labs(title = 'Sleep quality', x = 'Weekday', y= 'Count') +
theme(axis.text.x = element_text(angle = 45,vjust = 1, hjust = 1))
# Density plot of sleep quality
d1 <- ggplot(sleep_quality, aes(x = sleep_duration_h, y = sleep_efficiency)) +
geom_point(alpha = 0.3) +
ylim(80, 100) +
xlim(4,11) +
geom_density_2d_filled(alpha = 0.7) +
labs(title = 'Density (a) and Box (b) plots: Sleep efficiency vs Sleep duration', subtitle = 'a)', x = 'Sleep duration (h)', y = 'Sleep efficiency (%)')
# Box plot of sleep quality
d2 <- ggplot(sleep_quality, aes(x = sleep_duration_h, y = sleep_efficiency, fill = sleep_level)) +
geom_boxplot() +
labs(subtitle ='b)', x = 'Sleep duration (h)', y = 'Sleep efficiency (%)')
# Generation of the planel plot
egg::ggarrange(d1, d2, heights = NULL, widths= NULL, ncol = 1, nrow = 2)
The results of the daily sleep analysis reveal that:
Most users (95%) experience very good sleep quality overall. However, Sunday has the poorest sleep quality, with the smallest proportion of users in the “very good” category and the highest proportion in the “very bad” category.
The density plot indicates that the largest segment of participants achieves very good sleep efficiency when sleeping between six and eight hours. Additionally, the box plot reveals that users tend to sleep very poorly when sleeping less than 5.5 hours and only “fairly good” when sleeping for more than 8.5 hours. The plot also suggests that the highest sleep efficiency is achieved with approximately 7 hours of sleep.
To examine potential correlations among the studied variables, we use ‘calories’ as the independent variable and plot it against ‘total_distance,’ ‘average_speed,’ ‘average_pace,’ ‘sleep_duration,’ and ‘weight_evolution.’ To assess the relationships between these variables, we apply a simple linear regression model.
We aim to develop a model capable of predicting calories burned based on the distance traveled by the user. To achieve this, we will analyze the variables calories and total_distance. The initial step in building our model is to verify whether the data for the predictor (total_distance) and the response (calories) follow a normal distribution. We can assess the normality of the data using the following methods:
Building a histogram to visually inspect the distribution,
Examining quantiles to check for deviations from normality, and
Performing the Kolmogorov-Smirnov test, a statistical test for normality. This method was selected because our sample is very large.
For the Kolmogorov-Smirnov test, we set up the hypotheses as follows:
\(H_0\) (Null Hypothesis): The data are normally distributed (\(p > 0.05\)).
\(H_1\) (Alternative Hypothesis): The data are not normally distributed (\(p < 0.05\)).
# Visual Inspection - Histograms
a<-ggplot(daily_activity_all, aes(x = calories)) +
geom_histogram(binwidth = 500, fill = "mediumturquoise", color = "black") +
labs(title = "Histogram of Calories", x = "Calories", y = "Count")
b<-ggplot(daily_activity_all, aes(x = total_distance)) +
geom_histogram(binwidth = 1, fill = "pink", color = "black") +
labs(title= "Histogram of Total distance", x = "Total distance", y = "Count")
# Generate panel plot
ggpubr::ggarrange(a,b)
# Visual Inspection - Q-Q Plots
par(mfrow = c(1, 2))
c<-qqnorm(daily_activity_all$calories, main = "Q-Q Plot of Calories", xlab = "Theoretical quantiles", ylab = "Sample quantiles")
qqline(daily_activity_all$calories, col = "red")
d<-qqnorm(daily_activity_all$total_distance, main = "Q-Q Plot of Total distance", xlab = "Theoretical quantiles", ylab = "Sample quantiles")
qqline(daily_activity_all$total_distance, col = "red")
# Standardize the variables
calories_z <- scale(daily_activity_all$calories) %>% na.omit()
total_distance_z <- scale(daily_activity_all$total_distance) %>% na.omit()
# Kolmogorov-Smirnov Test for Calories
ks.test(calories_z, "pnorm")
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: calories_z
## D = 0.10536, p-value = 0.0002229
## alternative hypothesis: two-sided
# Kolmogorov-Smirnov Test for Total distance
ks.test(total_distance_z, "pnorm")
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: total_distance_z
## D = 0.053047, p-value = 0.1988
## alternative hypothesis: two-sided
According to the results of the Kolmogorov-Smirnov test, the null hypothesis (\(H_0\)) is rejected for the variable ‘calories’, indicating that the data are not normally distributed. Nonetheless, the Q-Q plot and histogram of this variable indicate a pattern that leans toward normality. For ‘total_distance’, both the statistical test and visual inspection confirm normal distribution (we accept \(H_0\)). Non-normality in the data can impact the robustness of our linear model. Consequently, alternative modeling strategies, such as nonlinear regression or data transformations, may be required to better fit the data.
For hypothesis testing in our linear regression models(\(y = β_0 + β_1x_1\)), we define the hypotheses as follows:
\(H_0\) (Null Hypothesis): \(β_1=0\), \(p>0.05\). The independent variable does not significantly affect the dependent variable.
\(H_1\) (Alternative Hypothesis): \(β_1≠0\), \(p<0.05\). The independent variable does significantly affect the dependent variable.
If the p-value is less than 0.05, we reject \(H_0\) and conclude that the independent variable has a significant effect on the dependent variable.
# a) We check the linear correlation of total_distance vs calories
ggplot(data = sleep_quality, aes(x = total_distance, y = calories)) + geom_point() + geom_smooth(method = 'lm', color = 'red') + ggpubr::stat_regline_equation(label.x = 2, label.y = 4600, size = 4) +
labs(title = 'Total distance vs Burned calories', x = 'Total distance (km)', y = 'Calories') +
annotate('text', x = 7, y = 4600, label = 'R^2 = 0.13', color = 'blue', size = 4)
## `geom_smooth()` using formula = 'y ~ x'
# b) We determine the linear regression coefficients:
model <- lm(sleep_quality$calories ~ sleep_quality$total_distance, data=sleep_quality)
summary(model)
##
## Call:
## lm(formula = sleep_quality$calories ~ sleep_quality$total_distance,
## data = sleep_quality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1272.7 -605.3 -24.1 489.3 1745.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1785.8 142.0 12.579 < 2e-16 ***
## sleep_quality$total_distance 111.7 18.0 6.207 2.14e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 728 on 259 degrees of freedom
## Multiple R-squared: 0.1295, Adjusted R-squared: 0.1261
## F-statistic: 38.52 on 1 and 259 DF, p-value: 2.138e-09
Considering the threshold of p < 0.05, the p-values of our regression coefficients indicate that we must reject the null hypothesis (\(H_0\)). Therefore, we accept the alternative hypothesis (\(H_1\)) and conclude that the dependent variable calories is significantly affected by the independent variable total_distance (i.e., \(β_1≠0\)). However, the correlation coefficient \(R^2\) of our model is very low. This suggests that the linear regression model does not fit the data points well, and we cannot accurately predict calories values using this model.
In the following plot panel, the correlations between the dependent variable “calories” and the other variables of interest are displayed.
# a) average_speed vs calories
a <- ggplot(data = sleep_quality, aes(x = average_speed, y = calories)) + geom_point() +
geom_smooth(method = 'lm', color = 'red') + ggpubr::stat_regline_equation(label.x = 0, label.y = 4600, size = 4) +
labs(title = 'Average speed vs Burned calories', x = 'Average speed (km/h)', y = 'Calories') +
annotate('text', x = 3, y = 4600, label = 'R^2 = 0.13', color = 'blue', size = 4)
# b) average_pace vs calories
b<- ggplot(data = sleep_quality, aes(x = average_pace, y = calories)) + geom_point() + geom_smooth(method = 'lm', color = 'red') + ggpubr::stat_regline_equation(label.x = 1, label.y = 4600, size = 4) +
labs(title = 'Average pace vs Burned calories', x = 'Average pace (min/km)', y = 'Calories') +
annotate('text', x = 70, y = 4600, label = 'R^2 = 0.03', color = 'blue', size = 4)
# c) sleep_duration vs calories
c<- ggplot(data = sleep_quality, aes(x = sleep_duration_h, y = calories)) + geom_point() + geom_smooth(method = 'lm', color = 'red') + ggpubr::stat_regline_equation(label.x = 1, label.y = 4600, size = 4) +
labs(title = 'Sleep duration vs Burned calories', x = 'Sleep duration (h)', y = 'Calories') +
annotate('text', x = 9, y = 4600, label = 'R^2 = 0.15', color = 'blue', size = 4)
# We merge weight_evolution table with day_activity_all to plot calories vs weight_evolution
weight_calories <- merge(daily_activity_all, weight_evolution, by = c('id', 'date'))
# d) Calories vs Weight evolution
d <- ggplot(data = weight_calories, aes(x = weight_kg , y = calories)) + geom_point() + geom_smooth(method = 'lm', color = 'red') + ggpubr::stat_regline_equation(label.x = 50, label.y = 4000, size = 4) +
xlim(50, 110) +
labs(title = 'Weight vs Burned calories', x = 'Weight (kg)', y = 'Calories') +
annotate('text', x = 90, y = 4000, label = 'R^2 = 0.1', color = 'blue', size = 4)
# Generation of the panel plot
ggpubr::ggarrange(a,b,c,d)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
The null hypothesis (\(H_0\)) was rejected in all cases except for Calories vs. Weight. Rejection of the null hypothesis indicates that the independent variables (average speed, average pace, sleep duration) have an effect on the burned calories. However, the very low \(R^2\) values suggest that the linear model does not adequately describe the data, therefore we cannot predict calories accurately with this model.
In the case of Calories vs. Weight, we accept the null hypothesis (\(H_0\)) (\(p = 0.05294\)) and conclude that there is no significant relationship between burned calories and weight (i.e., \(β_1=0\)).
After our analysis, we identified some insights that could help Bellabeat make informed business decisions and unlock new growth opportunities:
Speed: Participants were classified into four categories (fast, medium, slow, very slow). We found that users tend to move relatively slowly, with the average fast speed calculated as \(3.96 km/h\), which is quite low. This suggests that most users primarily walk. We recommend that Bellabeat could use a marketing campaign to encourage users to increase the intensity of their movements, aiming for higher activity levels.
Pace: This variable was also classified into four categories (slow, medium, high, very high). The results show that users take between \(30\) minutes and \(1\) hour to cover \(1 km\). These results align with the speed findings and further indicate that, on average, users are not very active.
Daily Steps: The analysis shows that most users are somewhat active, walking \(7,500\) to \(9,999\) steps daily. However, on Sundays, the sedentary group (around \(5,000\) steps) is notably higher. This indicates a high-potential opportunity to motivate users to stay active throughout the week.
Daily Sleep: Using the Pittsburgh Sleep Quality Index, we classified users’ sleep quality. The results reveal that \(95%\) of participants sleep very well (with over \(90%\) sleep efficiency). The best sleep quality occurs when users sleep between \(6\) and \(8.5\) hours. Bellabeat users should be made aware that longer sleep durations do not necessarily equate to better sleep quality.
Correlations: Through hypothesis testing, we found relationships between burned calories and traveled distance, speed, pace, and sleep duration. However, a linear model does not appropriately describe the data due to a low Pearson correlation coefficient (\(R^2 < 0.2\)). We can explore data transformation methods to find a more suitable model that better fits the data.
Based on the results from our data analysis, we suggest the following recommendations:
Demographic Information: Since we do not have access to demographic information about the participants, the results of our analysis might be biased. For future projects, it is essential to include demographic data to obtain a more comprehensive and representative analysis.
Sample Size: In this project, we worked with a small sample of 33 participants. To identify trends within a larger segment of the population and target them as potential Bellabeat customers, a larger sample size and data collection over a longer time period are needed.
Data Collection: It is highly recommended to collect data while users engage in different physical activities (e.g., walking, running, swimming, etc.). This would allow us to identify trends that can help advise users on how to burn calories more effectively and optimally.
Correlations: Since linear models do not adequately describe the data, further investigation into alternative models or data transformations may be necessary to improve prediction accuracy and ensure a better fit for our data.