Bellabeat is a successful small company, a high-tech manufacturer of health-focused products for women but they have the potential to become a larger player in the global small device market. Urška Sršen, co-founder and Chief Creative Officer of Bellabeat, believes that analyzing small device fitness data could help unlock new growth opportunities for the company.
Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.
Key questions to ask;
Within Bellabeat’s marketing analyst team, I, as a junior Data Analyst, am responsible for analyzing non-Bellabeat smart device usage data. The trends discovered are integrated into a presentation to guide the marketing strategy for Bellabeat customers.
Urška Srše: Bellabeat’s co-founder and Chief Creative Officer
Sando Mur: Mathematician and Bellabeat’s co-founder; key member of the Bellabeat executive team
Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy.
FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius): This Kaggle data set contains personal fitness tracker from thirty Fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.
In this phase, I downloaded the sample dataset and stored in my computer and examined it by using Google Sheets and Excel to identify various variables. I also verified the data and reached the conclusion that all the files are easily accessible in Google Sheets, Excel, BigQuery, and R.
Data is structured in both wide and long formats. I lean towards the long format, as it proves more convenient when handling dates.
Regarding the data being bias, we lack information about the individuals and the criteria used for their selection in the survey, making it uncertain whether the survey is biased.
About the ROOC;
Reliability: data is coming from a reliable source, so it is credible,
Original: data is original, shared by Fitbit,
Comprehensive: Yes, most of the data is complete and detailed but in a sample size,
Current: Unfortunately data is not current, from 2016, out of date,
Citied: It is considered credible since it originates from an authentic source, Fitbit users, and a reputable website.
I chose to focus on R to enhance my proficiency and understanding in this particular data processing tool while I am proficient in using Sheets, Excel, and SQL due to my existing knowledge.
The dataset exhibits missing values, and not all users contributed to certain analyses. Additionally, there are abnormalities in heart rate and sleeping time lead me to believe that certain CSV files may contain incomplete data, making accurate analysis challenging.
I realized that the data was mostly clean, requiring my attention mainly towards converting column variables into a clear, understandable and readable format, including the conversion of date & time variables to their appropriate form and handling decimal numbers.
The datasets I’m currently analyzing to gain insights for this project are:
dailyActivity_merged.csv: The file consolidates daily activity logs, combining information from different files,
heartrate_seconds_merged.csv: The data in this document records the heart rates of individuals at five-second intervals,
hourlyIntensities_merged.csv: Correlation between overall and average intensities in relation to hourly analyses,
weightLogInfo_merged.csv: Comparison of an individual’s daily weight,
dailyIntensities_merged.csv: daily physical activity intensities,
sleepDay_merged.csv: The duration of time spent in bed for
sleeping.
Packages and Libraries used in the Project
install.packages(‘tidyverse’)
install.packages(“janitor”)
install.packages(‘readr’)
install.packages(“ggplot2”)
install.packages(“dplyr”)
install.packages(‘colorspace’)
library(colorspace)
library(tidyverse)
library(readr)
library(tidyr)
library(dplyr)
library(lubridate)
library(ggplot2)
library(babynames)
library(janitor)
library(tinytex)
library(DescTools)
# Importing the dailyActivity_merged.csv file
daily_activity_df <- read_csv('~/Desktop/Case Study/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv')
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Cleaning all the columns to make sure all clear, making them lower Case
daily_activity_df<- clean_names(daily_activity_df)
head(daily_activity_df) # Checking the table and variables as a tibble
## # A tibble: 6 × 15
## id activity_date total_steps total_distance tracker_distance
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 13162 8.5 8.5
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## # ℹ 10 more variables: logged_activities_distance <dbl>,
## # very_active_distance <dbl>, moderately_active_distance <dbl>,
## # light_active_distance <dbl>, sedentary_active_distance <dbl>,
## # very_active_minutes <dbl>, fairly_active_minutes <dbl>,
## # lightly_active_minutes <dbl>, sedentary_minutes <dbl>, calories <dbl>
View(daily_activity_df) # Viewing the contents of our data frame
str(daily_activity_df) # Understanding the internal structure of data frames, lists, and other R objects
## spc_tbl_ [940 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ activity_date : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ total_steps : num [1:940] 13162 10735 10460 9762 12669 ...
## $ total_distance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ tracker_distance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ logged_activities_distance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ very_active_distance : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
## $ moderately_active_distance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
## $ light_active_distance : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
## $ sedentary_active_distance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ very_active_minutes : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
## $ fairly_active_minutes : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
## $ lightly_active_minutes : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
## $ sedentary_minutes : num [1:940] 728 776 1218 726 773 ...
## $ calories : num [1:940] 1985 1797 1776 1745 1863 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. ActivityDate = col_character(),
## .. TotalSteps = col_double(),
## .. TotalDistance = col_double(),
## .. TrackerDistance = col_double(),
## .. LoggedActivitiesDistance = col_double(),
## .. VeryActiveDistance = col_double(),
## .. ModeratelyActiveDistance = col_double(),
## .. LightActiveDistance = col_double(),
## .. SedentaryActiveDistance = col_double(),
## .. VeryActiveMinutes = col_double(),
## .. FairlyActiveMinutes = col_double(),
## .. LightlyActiveMinutes = col_double(),
## .. SedentaryMinutes = col_double(),
## .. Calories = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
glimpse(daily_activity_df) # Inspecting the data type and the first few rows of each variable in a data set
## Rows: 940
## Columns: 15
## $ id <dbl> 1503960366, 1503960366, 1503960366, 1503960…
## $ activity_date <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/1…
## $ total_steps <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 130…
## $ total_distance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9…
## $ tracker_distance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9…
## $ logged_activities_distance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_distance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3…
## $ moderately_active_distance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1…
## $ light_active_distance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5…
## $ sedentary_active_distance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_minutes <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66,…
## $ fairly_active_minutes <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, …
## $ lightly_active_minutes <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205…
## $ sedentary_minutes <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 8…
## $ calories <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2…
colnames(daily_activity_df) # Making sure we see all the Column names
## [1] "id" "activity_date"
## [3] "total_steps" "total_distance"
## [5] "tracker_distance" "logged_activities_distance"
## [7] "very_active_distance" "moderately_active_distance"
## [9] "light_active_distance" "sedentary_active_distance"
## [11] "very_active_minutes" "fairly_active_minutes"
## [13] "lightly_active_minutes" "sedentary_minutes"
## [15] "calories"
The goal is to identify the number of users engaged and the duration of this activity over a series of days,
# Using unique & n_distinct functions to be able to retrieve Unique user names
unique(daily_activity_df$id)
## [1] 1503960366 1624580081 1644430081 1844505072 1927972279 2022484408
## [7] 2026352035 2320127002 2347167796 2873212765 3372868164 3977333714
## [13] 4020332650 4057192912 4319703577 4388161847 4445114986 4558609924
## [19] 4702921684 5553957443 5577150313 6117666160 6290855005 6775888955
## [25] 6962181067 7007744171 7086361926 8053475328 8253242879 8378563200
## [31] 8583815059 8792009665 8877689391
# Here is the combined total number of users and the total number of days shown in this activity
user_id <- n_distinct(daily_activity_df$id) # Unique user Ids in total
days_total <- n_distinct(daily_activity_df$activity_date) # Examining the count of days in this activity
descr_columns <- c('Number_of_ users', 'Number_of_days')
numbers_total <- c(user_id, days_total)
number_of_users <- data.frame(descr_columns, numbers_total)
number_of_users
## descr_columns numbers_total
## 1 Number_of_ users 33
## 2 Number_of_days 31
Changing the format of the activity_date column into DATE format in order to display the start and end date,
# Converting the format of the activity_date column into DATE format
daily_activity_df_v1 <- daily_activity_df %>%
mutate(activity_date = as.Date(activity_date, format = '%m/%d/%Y'))
min_date <- min(daily_activity_df_v1$activity_date)
max_date <- max(daily_activity_df_v1$activity_date)
base::paste0('-Starting Date: ', min_date, ' -Ending Date: ', max_date)
## [1] "-Starting Date: 2016-04-12 -Ending Date: 2016-05-12"
head(daily_activity_df_v1)
## # A tibble: 6 × 15
## id activity_date total_steps total_distance tracker_distance
## <dbl> <date> <dbl> <dbl> <dbl>
## 1 1503960366 2016-04-12 13162 8.5 8.5
## 2 1503960366 2016-04-13 10735 6.97 6.97
## 3 1503960366 2016-04-14 10460 6.74 6.74
## 4 1503960366 2016-04-15 9762 6.28 6.28
## 5 1503960366 2016-04-16 12669 8.16 8.16
## 6 1503960366 2016-04-17 9705 6.48 6.48
## # ℹ 10 more variables: logged_activities_distance <dbl>,
## # very_active_distance <dbl>, moderately_active_distance <dbl>,
## # light_active_distance <dbl>, sedentary_active_distance <dbl>,
## # very_active_minutes <dbl>, fairly_active_minutes <dbl>,
## # lightly_active_minutes <dbl>, sedentary_minutes <dbl>, calories <dbl>
View(daily_activity_df_v1)
#Relationship between Total Distance vs Total Steps
ggplot(data = daily_activity_df_v1) +
geom_point(mapping = aes(x = total_distance, y = total_steps, color = calories)) +
geom_smooth(mapping = aes(x = total_distance, y = total_steps)) +
labs(title = 'Total Distance vs. Total Steps',
x = 'Total Distance',
y = 'Total Steps') +
theme(axis.text.x = element_text(size = 10, color = 'gray40'),
axis.text.y = element_text(size = 10, color = 'gray40'),
axis.title = element_text(size = 10, color = 'grey20'),
plot.title = element_text(size = 14, color = 'darkblue', face = 'bold'))
#Relationship between Total Steps vs Calories burnt
# cut() function here is used to create a new column/variable,'calories_category' in this dataset
daily_activity_df_v2 <- daily_activity_df_v1
daily_activity_df_v2$calories_category <- cut(daily_activity_df_v1$calories,
breaks = 3,
labels = c("Low", "Medium", "High"))
ggplot(data = daily_activity_df_v2) +
geom_point(mapping = aes(x = total_steps, y = calories, color = calories_category),
alpha = 0.7, stat = 'identity') +
labs(title = 'Relationship between Steps and Calories',
x = 'Total Steps',
y = 'Total Calories') +
theme(axis.text.x = element_text(size = 10, color = 'gray40'),
axis.text.y = element_text(size = 10, color = 'gray40'),
axis.title = element_text(size = 10, color = 'grey20'),
plot.title = element_text(size = 14, color = 'darkblue', face = 'bold'))
#Analyzing Total Steps vs Total Calories
daily_activity_df_v3 <-daily_activity_df_v1
ggplot(data = daily_activity_df_v3) +
geom_point(mapping = aes(x = calories, y = total_steps, fill = calories)) +
geom_jitter(mapping = aes(x = calories, y= total_steps, color = total_steps, alpha = 0.35), na.rm = TRUE) +
geom_smooth(mapping = aes(x = calories, y= total_steps )) +
labs(x = 'Calories', y = 'Total Steps', title= 'Total Steps vs. Calories burnt', size =14, color = 'red') +
theme(axis.text.x = element_text( size = 10, color = 'gray40'),
axis.title = element_text(size = 10, color = 'grey20'),
plot.title = element_text(size = 14, color = 'darkblue', face = 'bold')) +
scale_color_gradient(low = "blue", high = "red") +
guides(fill = FALSE) # not to display the legend for the fill color
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
As you see here there is a direct connection is observed between calories burnt and the total steps taken, with an increase in steps corresponding to a higher calorie burn.
Next viz, we are going for an alternative approach, presenting three distinct methods for categorizing calories burnt.
#Relationship between Distance and Calories (Separate plots for each category)
daily_activity_df_v2$calories_category <- cut(daily_activity_df_v1$calories,
breaks = 3, labels = c("Low", "Medium", "High"))
ggplot(data = daily_activity_df_v2) +
geom_point(mapping = aes(x = total_distance, y = calories, color = calories_category), stat = 'identity') +
labs(x = 'Total Distance', y = 'Total Calories',
title = 'Relationship between Distance vs Calories',
subtitle ='Each category is shown individually here') +
theme(axis.text.x = element_text(angle = 90, size = 10, color = 'gray40'),
axis.text.y = element_text(size = 10, color = 'gray40'),
axis.title = element_text(size = 10, color = 'gray20'),
plot.title = element_text(size = 14, color = 'darkblue', face = 'bold'),
plot.subtitle = element_text(size = 12, color = 'gray30')) +
facet_grid(~calories_category) + coord_flip()
As illustrated, I organized calorie levels into Low, Medium, and High categories to facilitate a comparison of distances, uncovering a positive correlation.
#Average Distance vs Per User
user_avg_activity_df<- daily_activity_df_v2 %>%
group_by(id) %>%
summarise(avg_distance = mean(total_distance),
avg_calories = mean(calories),
avg_steps = mean(total_steps))
user_avg_activity_df$id <- factor(user_avg_activity_df$id)
user_avg_activity_df$id <- reorder(user_avg_activity_df$id, user_avg_activity_df$avg_distance)
# type of average given mpre weight to smaller values to be more accurate
harmonic_mean_dist <- round(1 / mean(1 / user_avg_activity_df$avg_distance))
subtitle = 'On average, individuals achieve a distance of'
ggplot(user_avg_activity_df) + #bar plot
geom_bar(mapping = aes(x = avg_distance, y = id, fill= avg_distance ), stat = "identity") +
geom_vline(xintercept = harmonic_mean_dist, linetype = "dashed", color = "brown") +
labs(y = "Individuals", x = "Average Distance", title = "Average distance by each Individual",
subtitle = paste(subtitle, harmonic_mean_dist, 'miles in their exercise routines')) +
theme(axis.text.x =element_text(angle= 90, size = 9),
axis.text.y = element_text(angle = 0, size = 9, color = 'gray20'),
axis.title = element_text(size = 10, color = 'gray40'),
plot.title = element_text(size = 14, color = 'darkblue', face = 'bold')) +
# Annotate the plot with a label positioned at coordinates (2150, 10) indicating the average calorie per individual
annotate( x =4, y= 7, label = 'mean distance per individual', geom = 'label', alpha = 0.2, hjust = 0, vjust = 1 ) +
scale_fill_gradient(low = 'green', high = 'red') + coord_flip()
#Average Calories per individual
user_avg_activity_df$avg_calories <- round(user_avg_activity_df$avg_calories)
user_avg_activity_df$id <- reorder(user_avg_activity_df$id, user_avg_activity_df$avg_calories)
harmonic_mean_cal <- round(1 / mean(1 / user_avg_activity_df$avg_calories))
subtitle <-'On average, individuals burn'
# geom_vline() is creating a vertical dashed line to represent the average.
ggplot(data = user_avg_activity_df) +
geom_bar(mapping = aes(x = avg_calories, y = id, fill = avg_calories), stat = 'identity') +
geom_vline(xintercept = harmonic_mean_cal, linetype = "dashed", color = "brown") +
labs(x = "Average Calories", y = 'Individuals', title = "Average Calories by each Individual",
subtitle = paste(subtitle, harmonic_mean_cal, 'calories')) +
theme(axis.text.x =element_text(angle= 90, size = 9),
axis.text.y = element_text(angle = 0, size = 9, color = 'gray20'),
axis.title = element_text(size = 10, color = 'gray40'),
plot.title = element_text(size = 14, color = 'darkblue', face = 'bold')) +
annotate( x =2150, y= 10, label = 'average calorie per individual', geom = 'label',
alpha = 0.2, hjust = 0, vjust = 1 ) +
scale_fill_gradient(low = 'green', high = 'red') + coord_flip()
The average distance covered per person is around 4 miles, with an average calorie burn of 2150 calories per individual. This suggests that, on average, participants were engaging in a 4-mile activity and burning 2150 calories. Additionally, a positive correlation was noticed between increased distance and higher calorie expenditure. Nevertheless, details about age, gender, and environmental factors are lacking.
We are analyzing to see which are the most active days of the month,
daily_activity_df_v_1 <- daily_activity_df_v1 %>%
mutate(day_of_week = format(daily_activity_df_v1$activity_date, '%A'))
View(daily_activity_df_v_1)
# Removed the rows showing the values '0' in ordert to have more accurate readings
daily_activity_df_v_2 <- daily_activity_df_v_1[daily_activity_df_v_1$total_steps != 0, ]
daily_activity_df_v_1 <- daily_activity_df_v_2
# Convert day_of_week to a factor with ordered weekdays
daily_activity_df_v_1$day_of_week <- factor(daily_activity_df_v_1$day_of_week,
levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"), ordered = TRUE)
# Group by day_of_week and calculate total activity duration for each day
total_steps_by_day <- daily_activity_df_v_1 %>%
group_by(day_of_week) %>%
summarise(avg_steps = round(mean(total_steps, na.rm = TRUE)))
print(total_steps_by_day, n = 7)
## # A tibble: 7 × 2
## day_of_week avg_steps
## <ord> <dbl>
## 1 Monday 8488
## 2 Tuesday 8949
## 3 Wednesday 8158
## 4 Thursday 8185
## 5 Friday 7821
## 6 Saturday 8947
## 7 Sunday 7627
most_steps <- mean(total_steps_by_day$avg_steps)
df_over <- total_steps_by_day %>%
filter(avg_steps >= most_steps) %>%
pull(day_of_week)
few_steps_over <- total_steps_by_day %>%
filter(avg_steps >= most_steps) %>%
pull(avg_steps)
# Print the results
cat('Most steps taken by Day:', as.character(df_over), "\n")
## Most steps taken by Day: Monday Tuesday Saturday
cat('Monthly step count for those days:', few_steps_over, "\n")
## Monthly step count for those days: 8488 8949 8947
ggplot(data = total_steps_by_day) +
geom_bar(mapping = aes(x = day_of_week, y = avg_steps, fill = avg_steps), stat = 'identity') +
scale_fill_gradient(low = "lightblue", high = "coral") +
# Added arrows to show most active days
geom_segment(data = filter(total_steps_by_day, day_of_week %in% df_over),
aes(x = day_of_week, y = avg_steps, xend = day_of_week, yend = avg_steps + 5000 ),
arrow = arrow(length = unit(0.2, "cm")), color = "black") +
labs(title = 'Average steps per day over the month', subtitle = 'Each arrow signifies the day with the highest step count over the month',
y = 'Average Steps') +
theme_minimal()
We are starting with cleaning the dataset making sure that column names in an easy format to process.
# Importing the heartrate_seconds_merged.csv file to RStudio
heart_rate_df <- read_csv('~/Desktop/Case Study/Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv')
# Standardizing column names by converting them to lowercase for consistency and clarity
heart_rate_df <- clean_names(heart_rate_df)
heart_rate_df_v1 <- heart_rate_df # We will use this later in another viz
head(heart_rate_df_v1)
## # A tibble: 6 × 3
## id time value
## <dbl> <chr> <dbl>
## 1 2022484408 4/12/2016 7:21:00 AM 97
## 2 2022484408 4/12/2016 7:21:05 AM 102
## 3 2022484408 4/12/2016 7:21:10 AM 105
## 4 2022484408 4/12/2016 7:21:20 AM 103
## 5 2022484408 4/12/2016 7:21:25 AM 101
## 6 2022484408 4/12/2016 7:22:05 AM 95
#Date conversion
heart_rate_df <- heart_rate_df %>%
mutate(time = as.Date(time, format = '%m/%d/%Y'))
start_date <- min(heart_rate_df$time)
end_date <- max(heart_rate_df$time)
base::paste0('Starting Date: ', min_date, ' Ending Date:', max_date)
## [1] "Starting Date: 2016-04-12 Ending Date:2016-05-12"
# Determine and list the number of unique users in this dataset
paste0('Total number of unique ids: ',n_distinct(heart_rate_df$id) )
## [1] "Total number of unique ids: 14"
# Utilize the matrix() and list() functions to display the IDs in a easly readable two-column format
list_names <- list('Unique id #s listed below:', matrix(unique(heart_rate_df$id), ncol = 2))
list_names
## [[1]]
## [1] "Unique id #s listed below:"
##
## [[2]]
## [,1] [,2]
## [1,] 2022484408 5577150313
## [2,] 2026352035 6117666160
## [3,] 2347167796 6775888955
## [4,] 4020332650 6962181067
## [5,] 4388161847 7007744171
## [6,] 4558609924 8792009665
## [7,] 5553957443 8877689391
c('Column names : ', colnames(heart_rate_df)) # printing column names, easily readable
## [1] "Column names : " "id" "time" "value"
Heart rate readings are captured every 5 seconds,
#Analyzing Individual User's Avg Heart Rate Readings
avg_heartrate_readings <- heart_rate_df %>%
group_by(id) %>%
summarise(avg_heartrate = round(mean(value))) %>%
arrange(factor(avg_heartrate))
# Reorder the id factor based on avg_heartrate
avg_heartrate_readings$id <- factor(avg_heartrate_readings$id, levels = avg_heartrate_readings$id)
ggplot(data = avg_heartrate_readings) +
geom_bar(mapping = aes( x = factor(id), y = avg_heartrate, fill= avg_heartrate),
stat = 'identity') +
labs(
title = 'Average Heart Rate by Person',
y = "Average Beats " ,
x = 'Individual Person'
) +
theme(
axis.text.x = element_text(angle = 90),
plot.title = element_text(color = 'darkblue'),
axis.title.x = element_text( color = 'gray30'),
axis.title.y = element_text(color = 'gray30')
) +
scale_fill_gradient(low ='green', high='red')
This plot allows for a visual comparison of average heart rates across different individuals. Observing the distribution of average heart rates can provide insights into the variability among individuals. and additional information is required to determine if there is an ongoing activity or identify the specific time of day when the average heart rate exceeds 100 beats per minute.
showcasing the.minimum, maximum, and average heart rates corresponding to each individual.
heart_rate_summary <- heart_rate_df %>%
group_by(id) %>%
summarise(
min_rate = as.integer(min(value)),
max_rate = as.integer(max(value)),
avg_rate = as.integer(mean(value)),
sd_rate = as.integer(sd(value))
)
heart_rate_summary$id <- reorder(heart_rate_summary$id,heart_rate_summary$avg_rate)
ggplot(data = heart_rate_summary) +
geom_boxplot(mapping = aes(x = factor(id), ymin = min_rate, lower = avg_rate,
middle = avg_rate, upper = avg_rate,
ymax = max_rate ),
stat = 'identity',
color = "gray20" ) +
geom_point(data = filter(heart_rate_df, value > 100),
mapping = aes(x = factor(id), y = value),
color = "coral", position = position_dodge(width = 0.8 )
) +
labs(
title = 'Individual Heart Rate Assessment',
subtitle = 'Min, max, avg rates summarized',
x = 'Individual Users',
y = 'Heart Rate bpm ( Red for bpm > 100 ) \n Min, Avg, and Max bpm rates shown by the gray line'
) +
theme(
plot.title = element_text(color = 'darkblue'),
axis.title.x = element_text(color = 'gray30'),
axis.title.y = element_text(color = 'gray30')
) +
coord_flip() + scale_fill_identity() + scale_color_identity()
The plot serves to give an overview of the heart rate distribution among different users, emphasizing instances of elevated heart rates. The line within the box represents the median heart rate. The whiskers extend to the minimum and maximum values, excluding outliers.
Additionally, red points on the plot highlight instances where the heart rate exceeds 100 bpm. This provides a visual cue for elevated heart rates in certain observations.
heart_rate_df <- read_csv('~/Desktop/Case Study/Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv')
heart_rate_df <- clean_names(heart_rate_df)
heart_rate_df_v1 <- heart_rate_df
heart_rate_df_v1$time <- mdy_hms(heart_rate_df_v1$time)
heart_rate_date_df <- heart_rate_df_v1 %>%
mutate(time = as.Date(time))
cat('Start Date: ', min(as.character(heart_rate_date_df$time)),
'End Date: ', max(as.character(heart_rate_date_df$time))
)
## Start Date: 2016-04-12 End Date: 2016-05-12
hourly_heartrate_df <- heart_rate_df_v1 %>%
group_by(day = as.Date(time), hour=hour(time)) %>%
summarise( hourly_hrate = round(mean(value)) )
min_value <- min(heart_rate_df_v1$value)
max_value <- max(heart_rate_df_v1$value)
ggplot(data = hourly_heartrate_df) +
geom_boxplot(mapping = aes(x = factor(hour), y = hourly_hrate),
outlier.color = 'red', outlier.shape = 16) +
geom_vline( xintercept =19, linetype ='dashed', color = 'brown' ) +
annotate( x = 11, y= 60, label='max heart rate w/o outliers ',
geom = 'label', alpha = 0.2, hjust=0, vjust=1 ) +
annotate('rect', xmin = 18.5, xmax = 19.5, ymin = -Inf, ymax = Inf,
fill = 'lightcoral', alpha = 0.3) +
labs(title = "Monthly Average Heart Rate (Box Plot with Outliers)",
subtitle = 'by Hour - red dots are outliers',
x = "Hour of the Day",
y = "Average Heart Rate") + theme_minimal()
Here, we used box plot. The line inside the box represents the median of the data. Lines extending from the box, whiskers, show the range of the data.
Red dots are used here to emphasize outliers that falls outside of the range, extend beyond the whiskers, providing additional clarity to the box plot representation.
Outliers Below represent hours where the average heart rate is unusually low compared to the majority of the data.
Outliers Above represent hours where the average heart rate is unusually high compared to the majority of the data
Outliers can provide valuable information about the data distribution. They might indicate unusual events, measurement errors, or specific patterns that differ from the overall trend.
Please see hourly intensity chart to compare intensity by hour comparasion on 4.3.1
As usual, the initial step is to import the data and then proceed with data cleaning, focusing on simplifying column names.
# Importing the data, hourlyintensities_merged.csv
hourly_intensities_df <- read_csv('~/Desktop/Case Study/Fitabase Data 4.12.16-5.12.16/hourlyIntensities_merged.csv')
# Cleaning the data, making the names unique and in lower case
hourly_intensities_df <- clean_names(hourly_intensities_df)
str(hourly_intensities_df)
## spc_tbl_ [22,099 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:22099] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ activity_hour : chr [1:22099] "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
## $ total_intensity : num [1:22099] 20 8 7 0 0 0 0 0 13 30 ...
## $ average_intensity: num [1:22099] 0.333 0.133 0.117 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. ActivityHour = col_character(),
## .. TotalIntensity = col_double(),
## .. AverageIntensity = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
c('Column names : ', colnames(hourly_intensities_df)) # check column names here
## [1] "Column names : " "id" "activity_hour"
## [4] "total_intensity" "average_intensity"
# Let's see how many unique Individuals here with their unique ids
list_names <- list('Unique id #s listed below:', matrix(unique(hourly_intensities_df$id), ncol = 3, byrow = TRUE))
list_names
## [[1]]
## [1] "Unique id #s listed below:"
##
## [[2]]
## [,1] [,2] [,3]
## [1,] 1503960366 1624580081 1644430081
## [2,] 1844505072 1927972279 2022484408
## [3,] 2026352035 2320127002 2347167796
## [4,] 2873212765 3372868164 3977333714
## [5,] 4020332650 4057192912 4319703577
## [6,] 4388161847 4445114986 4558609924
## [7,] 4702921684 5553957443 5577150313
## [8,] 6117666160 6290855005 6775888955
## [9,] 6962181067 7007744171 7086361926
## [10,] 8053475328 8253242879 8378563200
## [11,] 8583815059 8792009665 8877689391
paste0('Total # of individuals: ', n_distinct(hourly_intensities_df$id))
## [1] "Total # of individuals: 33"
Including peak and low activity hours.
# Analyzing Hourly activities vs Individuals
# Starting with converting activity_hour to POSTXct time
hourly_intensities_df$activity_hour <- mdy_hms(hourly_intensities_df$activity_hour)
hour_to_pos <- hourly_intensities_df$activity_hour
intensities_perhour_df <- hourly_intensities_df %>%
group_by(hour = hour(hour_to_pos) )%>%
summarise(avg_intensity = mean(total_intensity))
min_active <- min(round(intensities_perhour_df$avg_intensity, digits = 2))
max_active <- max(round(intensities_perhour_df$avg_intensity, digits = 2))
gg <- ggplot(data = intensities_perhour_df) +
geom_area(mapping = aes(x = hour, y = avg_intensity, fill= 'hour'), stat = 'identity', color = 'blue') +
geom_vline(xintercept = 3.2, linetype = "dashed", size =1 , color = 'darkblue') +
geom_vline(xintercept = 18, linetype = "dashed", size = 1, color = 'brown') +
scale_x_continuous(labels = scales::date_format("%H"),
breaks = unique(intensities_perhour_df$hour)) +
labs(
title = 'Hourly intensity chart',
x = "Hour" ,
y = 'Intensity '
) +
theme(
plot.title = element_text(color = 'darkblue'),
axis.title.x = element_text( color = 'gray30'),
axis.title.y = element_text(color = 'gray30'),
) +
annotate( x =3.2, y= 7, label = 'least active', geom = 'label', alpha = 0.2,
hjust = 0, vjust = 1 ) +
annotate( x =18, y= 7, label = 'most active', geom = 'label', alpha = 0.2,
hjust = 0, vjust = 1 ) +
guides(fill = FALSE) +
scale_fill_manual(values = c('hour' = 'lightpink'))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
print(gg)
Based on the earlier box plot depicting Hourly Average Heart Rate, it was observed that heart rate tends to be elevated during the noon to afternoon period, reaching its peak at 6 pm. To gain further insights, I compared this data with an area plot illustrating peak activity hours. In order to have a deper understanding, I compared this data with an area plot here. Notably, I observed that the peak hours align, indicating a direct correlation between heart rate and activity levels.
According to our findings individuals exhibit the lowest levels of activity at 3:00 am, reach their highest activity levels between 10:00 am and 8:00 pm, with the peak of activity occurring at 6:00 pm as well as the peak heart rate per minute, please see chart 4.2.3, previous viz.
The goal is to assess individual daily weight charts to determine whether there are notable variations on a weekly basis.
Refining and manipulating data involves, the process of cleaning and transforming, raw data to ensure accuracy, consistency.
# Importing weightLoginfo_merged here
weight_log_df <- read_csv('~/Desktop/Case Study/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv')
#Cleaning the data, making the names unique and in lower case
weight_log_df <- clean_names(weight_log_df)
# Converting date to POSTXct format
weight_log_df$date <- mdy_hms(weight_log_df$date)
# Looking at the beginning and end date
min_date <- min(as.Date(weight_log_df$date))
max_date <- max(as.Date(weight_log_df$date))
paste0('Starting Date: ', min_date, ' - ', 'End Date:', max_date)
## [1] "Starting Date: 2016-04-12 - End Date:2016-05-12"
# Let's see how many unique Individuals with unique ids and the total entries for each individual
entry_count_df <- weight_log_df %>%
group_by(id) %>%
summarise(entry_count = n())
result_df <- data.frame(Number_of_Users = unique(entry_count_df$id),
Entries = entry_count_df$entry_count
)
print(result_df)
## Number_of_Users Entries
## 1 1503960366 2
## 2 1927972279 1
## 3 2873212765 2
## 4 4319703577 2
## 5 4558609924 5
## 6 5577150313 1
## 7 6962181067 30
## 8 8877689391 24
In this database, weight recordings are available for a total of 8 individuals, out of 33 participant. However, the reliability of the sample is compromised as 6 of them recorded their weight less than 5 days a month. This limitation limits our ability to analyze changes in average monthly weight based on the total steps taken by each individual.
# Weight vs daily area chart for per Individual -There is no enaugh data here...
weight_log_df_v1 <- weight_log_df %>%
group_by(date = date(date)) %>%
select(-weight_kg, -fat, -bmi, -is_manual_report, -log_id) %>%
mutate(date = date(date))
ggplot(data = weight_log_df_v1) +
geom_point( mapping = aes(x = date(date), y = round(weight_pounds,
digit = 1), color = 'coral'), stat = 'identity') +
facet_grid(~id) +
labs(title = 'Daily Weight Chart', x = 'Date', y = 'Weight') +
theme(
plot.title = element_text(color = 'darkblue'),
axis.title.x = element_text(color = 'gray30'),
axis.title.y = element_text(color = 'gray30'),
axis.text.x = element_text(angle = 90)
) +
scale_fill_identity() + scale_color_identity()
Insufficient data prevents a detailed analysis of weightLogInfo_merged.csv.
Nonetheless, the provided chart displays daily weight trends for individuals for the month. Notably, two participants consistently document their weight, indicating a stable and active lifestyle aimed at maintaining a nearly constant weight.
# Importing the data, dailyintensities_merged.csv
daily_intensities_df <- read_csv('~/Desktop/Case Study/Fitabase Data 4.12.16-5.12.16/dailyIntensities_merged.csv')
# Cleaning the data, making the names unique and in lower case
daily_intensities_df <- clean_names(daily_intensities_df)
c('Column names : ', colnames(daily_intensities_df)) # check column names here
## [1] "Column names : " "id"
## [3] "activity_day" "sedentary_minutes"
## [5] "lightly_active_minutes" "fairly_active_minutes"
## [7] "very_active_minutes" "sedentary_active_distance"
## [9] "light_active_distance" "moderately_active_distance"
## [11] "very_active_distance"
# Let's see how many unique Individuals here with their unique ids
list_dnames <- list('Unique id #s listed below:', matrix(unique(daily_intensities_df$id), ncol = 3, byrow = TRUE))
list_dnames
## [[1]]
## [1] "Unique id #s listed below:"
##
## [[2]]
## [,1] [,2] [,3]
## [1,] 1503960366 1624580081 1644430081
## [2,] 1844505072 1927972279 2022484408
## [3,] 2026352035 2320127002 2347167796
## [4,] 2873212765 3372868164 3977333714
## [5,] 4020332650 4057192912 4319703577
## [6,] 4388161847 4445114986 4558609924
## [7,] 4702921684 5553957443 5577150313
## [8,] 6117666160 6290855005 6775888955
## [9,] 6962181067 7007744171 7086361926
## [10,] 8053475328 8253242879 8378563200
## [11,] 8583815059 8792009665 8877689391
paste0('Total # of individuals: ', n_distinct(daily_intensities_df$id))
## [1] "Total # of individuals: 33"
# finding start and end date
date_range <- paste("-Starting Date:", min(daily_intensities_df$activity_day),
" -Ending Date:",
max(daily_intensities_df$activity_day))
date_range
## [1] "-Starting Date: 4/12/2016 -Ending Date: 5/9/2016"
Let’s explore the average duration of different activity types for each participant over the course of the month,
# Importing the data, dailyintensities_merged.csv
daily_intensities_df <- read_csv('~/Desktop/Case Study/Fitabase Data 4.12.16-5.12.16/dailyIntensities_merged.csv')
# Cleaning the data, making the names unique and in lower case
daily_intensities_df <- clean_names(daily_intensities_df)
daily_intensities_df$activity_day = as.Date(daily_intensities_df$activity_day, format = '%m/%d/%Y')
# Removed the rows showing the values '0' in order to have a accurate readings
daily_intensities_df_v4 <- daily_intensities_df[daily_intensities_df$sedentary_minutes != 0, ]
daily_intensities_df_v5 <- daily_intensities_df_v4[daily_intensities_df_v4$sedentary_minutes != 2, ]
daily_intensities_df_v6 <-
daily_intensities_df_v5[daily_intensities_df_v5$lightly_active_minutes != 0, ]
daily_intensities_df_v7 <-
daily_intensities_df_v6[daily_intensities_df_v6$fairly_active_minutes != 0, ]
daily_intensities_df_v8 <-
daily_intensities_df_v7[daily_intensities_df_v7$very_active_minutes != 0, ]
daily_intensities_df <- daily_intensities_df_v8
daily_intensities_df_v1 <- daily_intensities_df
avg_intens_by_id_df <- daily_intensities_df_v1 %>%
group_by(id) %>%
summarise(
sedentary = mean(sedentary_minutes),
lightly_active = mean(lightly_active_minutes),
fairly_active = mean(fairly_active_minutes),
very_active = mean(very_active_minutes)
)
avg_intens_long <- avg_intens_by_id_df %>%
gather(key = "avg_activity_type", value = "avg_minutes", -id)
View(avg_intens_long )
avg_intens_long$id <- reorder(avg_intens_long$id,
avg_intens_long$avg_activity_type,
FUN = function(x) mean(avg_intens_long$avg_minutes))
least_seden <- min(avg_intens_by_id_df$sedentary, na.rm = TRUE)
most_seden <- max(avg_intens_by_id_df$sedentary, na.rm = TRUE)
total_activity_individuals_df <- avg_intens_by_id_df %>%
group_by(id) %>%
summarise(total_active_individual = sum(lightly_active + fairly_active + very_active)
)
print(total_activity_individuals_df, n = 33)
## # A tibble: 33 × 2
## id total_active_individual
## <dbl> <dbl>
## 1 1503960366 287.
## 2 1624580081 239
## 3 1644430081 257.
## 4 1844505072 266
## 5 1927972279 95
## 6 2022484408 321.
## 7 2026352035 192
## 8 2320127002 199.
## 9 2347167796 304.
## 10 2873212765 307.
## 11 3372868164 354.
## 12 3977333714 263.
## 13 4020332650 234
## 14 4057192912 97
## 15 4319703577 294.
## 16 4388161847 291.
## 17 4445114986 218.
## 18 4558609924 322.
## 19 4702921684 296.
## 20 5553957443 284.
## 21 5577150313 284.
## 22 6117666160 416.
## 23 6290855005 304.
## 24 6775888955 117.
## 25 6962181067 316.
## 26 7007744171 387.
## 27 7086361926 235.
## 28 8053475328 258.
## 29 8253242879 166.
## 30 8378563200 243.
## 31 8583815059 229.
## 32 8792009665 252.
## 33 8877689391 318.
most_active_individual <- max(total_activity_individuals_df$total_active_individual)
least_active_individual <- min(total_activity_individuals_df$total_active_individual)
# Filter rows where total_active_individual is equal to the maximum value
most_active_id <- total_activity_individuals_df %>%
filter(total_active_individual == most_active_individual) %>%
pull(id)
# Filter rows where total_active_individual is equal to the minimum value
least_active_id <- total_activity_individuals_df %>%
filter(total_active_individual == least_active_individual) %>%
pull(id)
cat("Most Active User ID:", most_active_id, "with total activity:", most_active_individual, "\n")
## Most Active User ID: 6117666160 with total activity: 416.3333
cat("Least Active User ID:", least_active_id, "with total activity:", least_active_individual, "\n")
## Least Active User ID: 1927972279 with total activity: 95
ggplot(data = avg_intens_long) +
geom_bar( mapping = aes( x = id, y = avg_minutes,
fill = avg_activity_type), stat = 'identity', position = "dodge") +
labs(title = 'Monthly activity type vs Average duration per participants', subtitle = 'Monthly activity durations classified based on the type of activity', x = 'Participants', y = 'Avg minutes by activity category') +
annotate( x =28, y= 1280, label = 'most active',
geom = 'label', alpha = 0.2, vjust = 0, hjust = 1 ) +
annotate( x =29.9, y= 1185, label = 'least sedentary',
geom = 'label', alpha = 0.2, vjust = 0, hjust = 1 ) +
annotate( x =11.2, y= 1280, label = 'least active',
geom = 'label', alpha = 0.2, vjust = 0, hjust = 1 ) +
annotate( x =13.3, y= 1185, label = 'most sedentary',
geom = 'label', alpha = 0.2, vjust = 0, hjust = 1 ) +
annotate('rect', xmin = 4.5, xmax = 5.7, ymin = -Inf, ymax = Inf,
fill = 'yellow', alpha = 0.3) +
annotate('rect', xmin = 21.5, xmax = 22.5, ymin = -Inf, ymax = Inf,
fill = 'yellow', alpha = 0.3) +
theme(
axis.text.x = element_text(angle = 90, size = 10, color = 'gray20'),
axis.text.y = element_text(size = 10, color = 'gray20'),
axis.title = element_text(size = 10, color = 'gray20'),
plot.title = element_text(size = 14, color = 'darkblue', face = 'bold'),
plot.subtitle = element_text(size = 12, color = 'gray30'),
panel.background = element_rect(fill = "white" )
)
Participants with lower sedentary behavior tend to be more active compared to those with higher sedentary behavior. For instance, individuals who spend less time sitting often engage in higher levels of physical activity than those who have a more sedentary lifestyle.
varies on weekdays over the course of the month, pinpointing the day exhibiting the highest activity intensity,
#Weekdays vs Activity type
daily_intensities_df <- read_csv('~/Desktop/Case Study/Fitabase Data 4.12.16-5.12.16/dailyIntensities_merged.csv')
# Cleaning the data, making the names unique and in lower case
daily_intensities_df <- clean_names(daily_intensities_df)
daily_intensities_df_v6 <-
daily_intensities_df[daily_intensities_df$lightly_active_minutes != 0, ]
daily_intensities_df_v7 <-
daily_intensities_df_v6[daily_intensities_df_v6$fairly_active_minutes != 0, ]
daily_intensities_df_v8 <-
daily_intensities_df_v7[daily_intensities_df_v7$very_active_minutes != 0, ]
least_seden <- min(avg_intens_by_id_df$sedentary, na.rm = TRUE)
daily_intensities_df <- daily_intensities_df_v8
daily_intensities_df$activity_day = as.Date(daily_intensities_df$activity_day,
format = '%m/%d/%Y')
daily_intensities_df_v1 <- daily_intensities_df %>%
mutate(day_of_week = format(daily_intensities_df$activity_day, '%A'))
# Modifying column names - placing 1st column followed by the 10th
daily_intensities_df_v1 <- daily_intensities_df_v1[, c(1, 11, 2:10)]
# Convert day_of_week to a factor with ordered weekdays
daily_intensities_df_v1$day_of_week <- factor(daily_intensities_df_v1$day_of_week, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"), ordered = TRUE)
avg_intens_by_day_long <- daily_intensities_df_v1 %>%
group_by(day_of_week) %>%
summarise(
lightly_active = mean(lightly_active_minutes),
fairly_active = mean(fairly_active_minutes),
very_active = mean(very_active_minutes) ) %>%
pivot_longer(cols = -day_of_week, names_to = 'activity_type' , values_to = 'avg_minutes')
total_activity_daily <- avg_intens_by_day_long %>%
group_by(day_of_week) %>%
summarise(
total_activity_minutes = sum (avg_minutes )
)
# Identify the most and least active days
least_act = min(total_activity_daily$total_activity_minutes)
most_act = max(total_activity_daily$total_activity_minutes)
most_active_day <- total_activity_daily %>%
filter(total_activity_minutes == max(total_activity_minutes)) %>%
pull(day_of_week)
least_active_day <- total_activity_daily %>%
filter(total_activity_minutes == min(total_activity_minutes)) %>%
pull(day_of_week)
# Convert the numbers to day of week
most_active_day <- as.character(most_active_day)
least_active_day <- as.character(least_active_day)
cat('Most Active Day:', most_active_day, 'with minutes', most_act, "\n")
## Most Active Day: Saturday with minutes 303.7576
cat('Least Active Day:', least_active_day, 'with minutes', least_act, "\n")
## Least Active Day: Sunday with minutes 260.2373
head(total_activity_daily, n = 7 )
## # A tibble: 7 × 2
## day_of_week total_activity_minutes
## <ord> <dbl>
## 1 Monday 263.
## 2 Tuesday 290.
## 3 Wednesday 272.
## 4 Thursday 271.
## 5 Friday 275.
## 6 Saturday 304.
## 7 Sunday 260.
ggplot(data = avg_intens_by_day_long) +
geom_bar(mapping = aes(x = day_of_week, y = avg_minutes, fill = activity_type ), stat = 'identity', position = "dodge" ) +
geom_vline(xintercept = most_active_day, linetype = "dashed", size = 1, color = "brown") +
geom_vline(xintercept = least_active_day, linetype = "dashed",size = 1, color = "brown") +
annotate(x =5, y= 400, label = 'most active', geom = 'label', alpha = 0.2,
hjust = 0, vjust = 1 ) +
annotate(x =6, y= 350, label = 'least active', geom = 'label', alpha = 0.2,
hjust = 0, vjust = 1 ) +
labs(title = "Weekdays vs Activity type average",
x = "Weekdays",
y = "Average Active Times (minutes)") +
theme(
axis.text.x = element_text(size = 9, color = 'gray20'),
axis.text.y = element_text(size = 10, color = 'gray20'),
axis.title = element_text(size = 10, color = 'gray20'),
plot.title = element_text(size = 14, color = 'darkblue', face = 'bold'),
plot.subtitle = element_text(size = 12, color = 'gray30'),
panel.background = element_rect(fill = 'white'),
)
#Same viz with stack plot
ggplot(data = avg_intens_by_day_long) +
geom_col(aes(x = day_of_week, y = avg_minutes, fill = activity_type), position = "stack") +
labs(title = "Weekdays vs Activity type average, stack version",
x = "Weekdays",
y = "Average Active Times (minutes)") +
geom_vline(xintercept = most_active_day, linetype = "dashed",
size = 1, color = "brown") +
geom_vline(xintercept = least_active_day, linetype = "dashed",
size = 1, color = "brown") +
annotate(x =5, y= 400, label = 'most active', geom = 'label', alpha = 0.2,
hjust = 0, vjust = 1 ) +
annotate(x =6, y= 350, label = 'least active', geom = 'label', alpha = 0.2,
hjust = 0, vjust = 1 ) +
scale_fill_brewer(palette = "Dark2") + theme_minimal()
According to the dataset, our calculations indicate that Tuesday recorded the highest activity level with 26,119 minutes, while Sunday marked the lowest activity, totaling 15,354 minutes throughout the month.
my goal is to determine whether there exists a direct and positive correlation between the time individuals spend sleeping and their physical activity duration. This will be explored by merging information from the ‘dailyActivity_merged.csv’ and ‘sleepDay_merged.csv’ datasets.
# Importing both csv files
daily_act_df <- read_csv('~/Desktop/Case Study/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv')
sleep_day_df <- read_csv('~/Desktop/Case Study/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv')
daily_act_df <- clean_names(daily_act_df)
sleep_day_df <- clean_names(sleep_day_df)
cat("# of users -Daily Activity: ", n_distinct(daily_act_df$id),
'-Sleep Activity: ', n_distinct(sleep_day_df$id), '\n')
## # of users -Daily Activity: 33 -Sleep Activity: 24
# Quick summary() for this two datasets
daily_act_df %>%
select ( total_steps,
total_distance,
sedentary_minutes
) %>%
summary()
## total_steps total_distance sedentary_minutes
## Min. : 0 Min. : 0.000 Min. : 0.0
## 1st Qu.: 3790 1st Qu.: 2.620 1st Qu.: 729.8
## Median : 7406 Median : 5.245 Median :1057.5
## Mean : 7638 Mean : 5.490 Mean : 991.2
## 3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.:1229.5
## Max. :36019 Max. :28.030 Max. :1440.0
sleep_day_df %>%
select( total_minutes_asleep,
total_time_in_bed
) %>%
summary()
## total_minutes_asleep total_time_in_bed
## Min. : 58.0 Min. : 61.0
## 1st Qu.:361.0 1st Qu.:403.0
## Median :433.0 Median :463.0
## Mean :419.5 Mean :458.6
## 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :796.0 Max. :961.0
Bed time duration and actual sleep time comparesion.
ggplot(data=sleep_day_df, aes(x=total_minutes_asleep, y=total_time_in_bed)) +
geom_point() + geom_smooth() +
labs(title = 'Asleep Time Analysis Relative to Total Bed Rest',
subtitle = 'duration in minutes',
x = 'Total Sleep Duration',
y = 'Bedtime duration') + theme_minimal()
There is a positive correlation between bedtime duration and actual sleep time, This means that people who spend more time in bed tend to get more sleep on average.
Creating a scatter plot here to analyse the correlation between sedentary time and steps taken.
# Creating a scatter plot with a smooth line to explore the positive correlation between sedentary time and total steps
ggplot(data = daily_act_df) +
geom_smooth( mapping = aes(x=total_steps, y= sedentary_minutes)) +
labs(title = 'Total steps vs Sedentary time',
x = 'Total Steps',
y = 'Sedentary minutes') +
theme_bw()
There is a positive correlation observed between the number of steps taken and sedentary time. It appears that individuals who take more steps tend to have lower sedentary time.
We are merging both datasets by the ‘id’ numbers; dailyActivity_merged.csv’ and ‘sleepDay_merged.csv’ datasets.
#Inner join the datasets by id numbers
combined_data_df <- merge(sleep_day_df, daily_act_df)
colnames(combined_data_df)
## [1] "id" "sleep_day"
## [3] "total_sleep_records" "total_minutes_asleep"
## [5] "total_time_in_bed" "activity_date"
## [7] "total_steps" "total_distance"
## [9] "tracker_distance" "logged_activities_distance"
## [11] "very_active_distance" "moderately_active_distance"
## [13] "light_active_distance" "sedentary_active_distance"
## [15] "very_active_minutes" "fairly_active_minutes"
## [17] "lightly_active_minutes" "sedentary_minutes"
## [19] "calories"
cat('# of participants:', n_distinct(combined_data_df$id))
## # of participants: 24
# date starts and end
cat('Data -Start date: ', min(combined_data_df$activity_date),
' -End date',max(combined_data_df$activity_date ))
## Data -Start date: 4/12/2016 -End date 5/9/2016
#_______Compared Sleep duration vs Steps taken________
sleep_vs_steps_df <- combined_data_df %>%
group_by(id) %>%
summarise(
avg_steps_t = median(total_steps),
avg_a_sleep = median(total_minutes_asleep)
)
head(sleep_vs_steps_df) # lets see a tibble here
## # A tibble: 6 × 3
## id avg_steps_t avg_a_sleep
## <dbl> <dbl> <dbl>
## 1 1503960366 12207 340
## 2 1644430081 6684. 130.
## 3 1844505072 2237 644
## 4 1927972279 152 398
## 5 2026352035 5528 516.
## 6 2320127002 5057 61
model <- lm(avg_a_sleep ~ avg_steps_t, data = sleep_vs_steps_df) # Fit a linear model
residuals <- residuals(model) # Calculate residuals
# Identify outliers using a threshold - sd(residuals)
outlier_threshold <- .80 * sd(residuals) #the sensitivity of the outlier detection
sleep_vs_steps_df$outlier <- ifelse(abs(residuals) > outlier_threshold, "red", "black")
# Scatter plot with outliers in red
ggplot(data = sleep_vs_steps_df, aes(x = avg_steps_t, y = avg_a_sleep)) +
geom_point(aes(color = outlier), size = 3) +
geom_smooth() +
labs(title = "Scatter Plot: Average Steps vs. Average Sleep",
x = "Average Steps ", y = "Average Sleep") +
scale_color_manual(values = c("black", 'red')) + theme_bw()
# Density design
ggplot(data = sleep_vs_steps_df, aes(x = avg_steps_t, y = avg_a_sleep)) +
geom_point(alpha = 0.5) +
geom_density_2d() +
labs(title = "Density Plot: Average Steps vs. Average Sleep",
x = "Average Steps", y = "Average Sleep") +
theme_bw()
The study on the relationship between sleep duration and total steps demonstrates a weak positive correlation. The prevalence of outliers, in red, indicates a significant data spread, showing us the need for more information, particularly when referencing the density chart.
Bellabeat should provide applications designed specifically for different genders can be highly beneficial in promoting health and wellness, addressing the distinct needs and challenges faced by individuals of different genders.
Bellabeat app ahould ask users questions about their gender, weight, height, and general health upon installation in order to personalize their app experience and tailoring app features to individual needs. However, be transparent about how it will be used. Users maybe more willing to share information if they understand its purpose and how it benefits them.
Beyond the basic step counter, here is how Bellabeat users can benefit improving their habits for step counting and distance tracking:
Personalized goals and challenges: customers should be able to set daily, weekly, or monthly step goals based on individual fitness levels and aspirations. Bellabeat app shoud be able to offer fun and engaging challenges with friends or groups, or personalized “streak” challenges for consistency.
Variety and gamification: Bellabeat app should integrate mini-games or activities within the app that reward steps and distance covered. Offer different “environments” or virtual rewards to keep users engaged.
Progress visualization: Bellabeat app should have a clear and visually appealing dashboards showing daily, weekly, and monthly step trends, distance covered on maps, and achievement badges , and gently remind users periodically throughout the day about their progress toward achieving their goals.
Social interaction and support: Users should be able to share progress and achievements with friends or join online communities for encouragement and motivation. Compete in friendly challenges or team up for shared goals.
Rewards and integration: Bellabeat app should partner with local businesses or fitness services to offer discounts or rewards based on reaching step goals. Integrate with other health and fitness apps to create a holistic picture of progress.
Wearable integration: Bellabeat should make the app compatible with various wearables for seamless data tracking and convenience.
Smart reminders: Bellabeat app should send
personalized notifications at times when users are most likely to be
inactive, offering gentle nudges to get moving
making sure that ’focus on progress, not perfection’
andencourage consistency and celebrate small wins rather than dwelling
on missed goals.
Community events and activities: App should be interactive, be able to organize local walking groups, challenges, or scavenger hunts that use the app for tracking and engagement.
Beyond basic sleep tracking, here are some ideas to motivate users towards better sleep hygiene:
Personalized sleep goals and analysis: Bellabeat users should be able to set customized sleep schedules based on their needs and be able analyze sleep quality through metrics like sleep stages, heart rate variability, and snoring detection.
Relaxation and mindfulness tools: users should be able to access in app futures like, breathing exercises, and calming soundscapes to help them wind down before bed.
Bedroom environment monitoring: Bellabeat app should be able to integrate with smart devices to track temperature, light, and noise levels, and offer personalized recommendations for optimal sleep environment.
Progress visualization: App showing sleep trends over time, highlight improvements, and offer challenges to maintain healthy sleep patterns.
Integration with other health apps: Connect with fitness trackers, mood trackers, and stress management apps to understand how other factors influence sleep.
Here some ideas,
Comfortable sleep tracker: Design a discreet and comfortable tracker,
It could take the form of a chest, bracelet, or ankle tracker—something lightweight that people can comfortably wear throughout the night.
Snoring detection and analysis: Help users identify potential sleep apnea.
It can be a michrophone built in to smart sleep tracker by utilizing algorithms that identfy the specific frequency range of snoring sounds at the same time filter out other background noise.
Heart-Relates must have App featuers:
Monitoring and Tracking:
Resting heart rate (RHR) tracking: Bellabeat app should be able to monitor daily, hourly RHR and offer personalized insights on stress levels, fitness progress, and overall heart health.
Electrocardiogram (ECG) capabilities: Bellabeat app, as an advanced app, can incorporate wearable ECG technology for on-demand heart rhythm monitoring, potentially detecting arrhythmias.
Activity tracking: Combinig with the steps and distance traker app, Bellabeat app should offer to monitor steps, distance, and active minutes, highlighting their impact on heart health and motivating consistent movement.
Weight and body composition tracking: Optional feature to understand weight-heart health connection and encourage healthy weight management.
Also, it can have an integrated feature to connect users with doctors or telehealth services for personalized guidance and consultations if there is a potentially concerns and need immediate attention.
In summary, It is important to note that the key is to find approaches that are engaging, personalized, and sustainable for individual users.
Combined with sleep and step tracker apps, Bellabeat app can monitor sleep duration and quality, steps taken highlighting their impact on heart health and providing tips for improvement.
By offering a combination of technology, community, and healthy habit integration, Bellabeat app can help people develop lasting habits for step counting, distance tracking, and sleeping quality that lead to a healthier and happier life.