Another Capstone Project - Bellabeat case Study

Information about this Case Study

Bellabeat is a successful small company, a high-tech manufacturer of health-focused products for women but they have the potential to become a larger player in the global small device market. Urška Sršen, co-founder and Chief Creative Officer of Bellabeat, believes that analyzing small device fitness data could help unlock new growth opportunities for the company.

Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.

1. ASK

Key questions to ask;

What are some trends in smart device usage?
How could these trends apply to Bellabeat customers?
How could these trends help influence Bellabeat marketing strategy?

1.1. Business Task

Within Bellabeat’s marketing analyst team, I, as a junior Data Analyst, am responsible for analyzing non-Bellabeat smart device usage data. The trends discovered are integrated into a presentation to guide the marketing strategy for Bellabeat customers.

1.2. Stakeholders

Urška Srše: Bellabeat’s co-founder and Chief Creative Officer

Sando Mur: Mathematician and Bellabeat’s co-founder; key member of the Bellabeat executive team

Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy.

2. PREPARE

2.1. Where is the Data stored?

FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius): This Kaggle data set contains personal fitness tracker from thirty Fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.

In this phase, I downloaded the sample dataset and stored in my computer and examined it by using Google Sheets and Excel to identify various variables. I also verified the data and reached the conclusion that all the files are easily accessible in Google Sheets, Excel, BigQuery, and R.

2.2. How is the data organized? Is it in long or wide format?

Data is structured in both wide and long formats. I lean towards the long format, as it proves more convenient when handling dates.

2.3. Are there issues with bias or credibility in this data? Does your data ROCCC?

Regarding the data being bias, we lack information about the individuals and the criteria used for their selection in the survey, making it uncertain whether the survey is biased.

About the ROOC;

Reliability: data is coming from a reliable source, so it is credible,

Original: data is original, shared by Fitbit,

Comprehensive: Yes, most of the data is complete and detailed but in a sample size,

Current: Unfortunately data is not current, from 2016, out of date,

Citied: It is considered credible since it originates from an authentic source, Fitbit users, and a reputable website.

3. PROCESS

3.1. Tools:

I chose to focus on R to enhance my proficiency and understanding in this particular data processing tool while I am proficient in using Sheets, Excel, and SQL due to my existing knowledge.

3.2. Data Integrity:

The dataset exhibits missing values, and not all users contributed to certain analyses. Additionally, there are abnormalities in heart rate and sleeping time lead me to believe that certain CSV files may contain incomplete data, making accurate analysis challenging.

3.3. Data Manipulation

I realized that the data was mostly clean, requiring my attention mainly towards converting column variables into a clear, understandable and readable format, including the conversion of date & time variables to their appropriate form and handling decimal numbers.

The datasets I’m currently analyzing to gain insights for this project are:

dailyActivity_merged.csv: The file consolidates daily activity logs, combining information from different files,

heartrate_seconds_merged.csv: The data in this document records the heart rates of individuals at five-second intervals,

hourlyIntensities_merged.csv: Correlation between overall and average intensities in relation to hourly analyses,

weightLogInfo_merged.csv: Comparison of an individual’s daily weight,

dailyIntensities_merged.csv: daily physical activity intensities,

sleepDay_merged.csv: The duration of time spent in bed for sleeping.

4. ANALYZE & VISUALIZATION

Packages and Libraries used in the Project

install.packages(‘tidyverse’)
install.packages(“janitor”)
install.packages(‘readr’)
install.packages(“ggplot2”)
install.packages(“dplyr”)
install.packages(‘colorspace’)
library(colorspace)
library(tidyverse)
library(readr)
library(tidyr)
library(dplyr)
library(lubridate)
library(ggplot2)
library(babynames)
library(janitor)
library(tinytex)
library(DescTools)

4.1. dailyActivity_merged.csv

4.1.1 Daily Activities vs Calories burnt,

# Importing the dailyActivity_merged.csv file 
daily_activity_df <- read_csv('~/Desktop/Case Study/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv')

## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Cleaning all the columns to make sure all clear, making them lower Case 
daily_activity_df<- clean_names(daily_activity_df)
head(daily_activity_df)    # Checking the table and variables as a tibble

## # A tibble: 6 × 15
##           id activity_date total_steps total_distance tracker_distance
##        <dbl> <chr>               <dbl>          <dbl>            <dbl>
## 1 1503960366 4/12/2016           13162           8.5              8.5 
## 2 1503960366 4/13/2016           10735           6.97             6.97
## 3 1503960366 4/14/2016           10460           6.74             6.74
## 4 1503960366 4/15/2016            9762           6.28             6.28
## 5 1503960366 4/16/2016           12669           8.16             8.16
## 6 1503960366 4/17/2016            9705           6.48             6.48
## # ℹ 10 more variables: logged_activities_distance <dbl>,
## #   very_active_distance <dbl>, moderately_active_distance <dbl>,
## #   light_active_distance <dbl>, sedentary_active_distance <dbl>,
## #   very_active_minutes <dbl>, fairly_active_minutes <dbl>,
## #   lightly_active_minutes <dbl>, sedentary_minutes <dbl>, calories <dbl>

View(daily_activity_df)    # Viewing the contents of our data frame 
str(daily_activity_df)      # Understanding the internal structure of data frames, lists, and other R objects

## spc_tbl_ [940 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id                        : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ activity_date             : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ total_steps               : num [1:940] 13162 10735 10460 9762 12669 ...
##  $ total_distance            : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ tracker_distance          : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ logged_activities_distance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ very_active_distance      : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
##  $ moderately_active_distance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
##  $ light_active_distance     : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
##  $ sedentary_active_distance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ very_active_minutes       : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
##  $ fairly_active_minutes     : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
##  $ lightly_active_minutes    : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
##  $ sedentary_minutes         : num [1:940] 728 776 1218 726 773 ...
##  $ calories                  : num [1:940] 1985 1797 1776 1745 1863 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   ActivityDate = col_character(),
##   ..   TotalSteps = col_double(),
##   ..   TotalDistance = col_double(),
##   ..   TrackerDistance = col_double(),
##   ..   LoggedActivitiesDistance = col_double(),
##   ..   VeryActiveDistance = col_double(),
##   ..   ModeratelyActiveDistance = col_double(),
##   ..   LightActiveDistance = col_double(),
##   ..   SedentaryActiveDistance = col_double(),
##   ..   VeryActiveMinutes = col_double(),
##   ..   FairlyActiveMinutes = col_double(),
##   ..   LightlyActiveMinutes = col_double(),
##   ..   SedentaryMinutes = col_double(),
##   ..   Calories = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

glimpse(daily_activity_df)  # Inspecting the data type and the first few rows of each variable in a data set

## Rows: 940
## Columns: 15
## $ id                         <dbl> 1503960366, 1503960366, 1503960366, 1503960…
## $ activity_date              <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/1…
## $ total_steps                <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 130…
## $ total_distance             <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9…
## $ tracker_distance           <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9…
## $ logged_activities_distance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_distance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3…
## $ moderately_active_distance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1…
## $ light_active_distance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5…
## $ sedentary_active_distance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_minutes        <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66,…
## $ fairly_active_minutes      <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, …
## $ lightly_active_minutes     <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205…
## $ sedentary_minutes          <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 8…
## $ calories                   <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2…

colnames(daily_activity_df) # Making sure we see all the Column names

##  [1] "id"                         "activity_date"             
##  [3] "total_steps"                "total_distance"            
##  [5] "tracker_distance"           "logged_activities_distance"
##  [7] "very_active_distance"       "moderately_active_distance"
##  [9] "light_active_distance"      "sedentary_active_distance" 
## [11] "very_active_minutes"        "fairly_active_minutes"     
## [13] "lightly_active_minutes"     "sedentary_minutes"         
## [15] "calories"

The goal is to identify the number of users engaged and the duration of this activity over a series of days,

# Using unique & n_distinct functions to be able to retrieve Unique user names
unique(daily_activity_df$id)

##  [1] 1503960366 1624580081 1644430081 1844505072 1927972279 2022484408
##  [7] 2026352035 2320127002 2347167796 2873212765 3372868164 3977333714
## [13] 4020332650 4057192912 4319703577 4388161847 4445114986 4558609924
## [19] 4702921684 5553957443 5577150313 6117666160 6290855005 6775888955
## [25] 6962181067 7007744171 7086361926 8053475328 8253242879 8378563200
## [31] 8583815059 8792009665 8877689391

# Here is the combined total number of users and the total number of days shown in this activity
user_id <- n_distinct(daily_activity_df$id) # Unique user Ids in total
days_total <- n_distinct(daily_activity_df$activity_date) # Examining the count of days in this activity
descr_columns <- c('Number_of_ users', 'Number_of_days')
numbers_total <- c(user_id, days_total)
number_of_users <- data.frame(descr_columns, numbers_total)
number_of_users

##      descr_columns numbers_total
## 1 Number_of_ users            33
## 2   Number_of_days            31

Changing the format of the activity_date column into DATE format in order to display the start and end date,

# Converting the format of the activity_date column into DATE format
daily_activity_df_v1 <- daily_activity_df %>%
  mutate(activity_date = as.Date(activity_date, format = '%m/%d/%Y')) 
  

min_date <- min(daily_activity_df_v1$activity_date)
max_date <- max(daily_activity_df_v1$activity_date)
base::paste0('-Starting Date: ', min_date, ' -Ending Date: ', max_date)

## [1] "-Starting Date: 2016-04-12 -Ending Date: 2016-05-12"

head(daily_activity_df_v1)

## # A tibble: 6 × 15
##           id activity_date total_steps total_distance tracker_distance
##        <dbl> <date>              <dbl>          <dbl>            <dbl>
## 1 1503960366 2016-04-12          13162           8.5              8.5 
## 2 1503960366 2016-04-13          10735           6.97             6.97
## 3 1503960366 2016-04-14          10460           6.74             6.74
## 4 1503960366 2016-04-15           9762           6.28             6.28
## 5 1503960366 2016-04-16          12669           8.16             8.16
## 6 1503960366 2016-04-17           9705           6.48             6.48
## # ℹ 10 more variables: logged_activities_distance <dbl>,
## #   very_active_distance <dbl>, moderately_active_distance <dbl>,
## #   light_active_distance <dbl>, sedentary_active_distance <dbl>,
## #   very_active_minutes <dbl>, fairly_active_minutes <dbl>,
## #   lightly_active_minutes <dbl>, sedentary_minutes <dbl>, calories <dbl>

View(daily_activity_df_v1)

4.1.2. Relationship between Total Distance vs Total Steps,

#Relationship between Total Distance vs Total Steps
ggplot(data = daily_activity_df_v1) +
  geom_point(mapping = aes(x = total_distance, y = total_steps, color = calories)) +
 geom_smooth(mapping = aes(x = total_distance, y = total_steps)) +
  labs(title = 'Total Distance vs. Total Steps',
           x = 'Total Distance',
           y = 'Total Steps')  +
theme(axis.text.x =  element_text(size = 10, color = 'gray40'),
      axis.text.y =  element_text(size = 10, color = 'gray40'),
      axis.title = element_text(size = 10, color = 'grey20'),
      plot.title = element_text(size = 14, color = 'darkblue', face = 'bold'))

4.1.3. Relationship between Total Steps vs Calories burnt,

#Relationship between Total Steps vs Calories burnt

# cut() function here is used to create a new column/variable,'calories_category' in this dataset
daily_activity_df_v2 <- daily_activity_df_v1
daily_activity_df_v2$calories_category <- cut(daily_activity_df_v1$calories, 
                                       breaks = 3, 
                                       labels = c("Low", "Medium", "High"))
ggplot(data = daily_activity_df_v2) +
  geom_point(mapping = aes(x = total_steps, y = calories, color = calories_category), 
                           alpha = 0.7, stat = 'identity') +
  labs(title = 'Relationship between Steps and Calories', 
               x = 'Total Steps', 
               y = 'Total Calories') +
  theme(axis.text.x =  element_text(size = 10, color = 'gray40'),
        axis.text.y =  element_text(size = 10, color = 'gray40'),
        axis.title = element_text(size = 10, color = 'grey20'),
                plot.title = element_text(size = 14, color = 'darkblue', face = 'bold'))

#Analyzing Total Steps vs Total Calories 
daily_activity_df_v3 <-daily_activity_df_v1
                                              
ggplot(data = daily_activity_df_v3) +
  geom_point(mapping = aes(x = calories, y = total_steps, fill = calories)) +
  geom_jitter(mapping = aes(x = calories, y= total_steps, color = total_steps, alpha = 0.35), na.rm = TRUE) +
  geom_smooth(mapping = aes(x = calories, y= total_steps )) +
labs(x = 'Calories', y = 'Total Steps', title= 'Total Steps vs. Calories burnt', size =14, color = 'red') +
theme(axis.text.x =  element_text( size = 10, color = 'gray40'),
       axis.title = element_text(size = 10, color = 'grey20'),
        plot.title = element_text(size = 14, color = 'darkblue', face = 'bold')) +
scale_color_gradient(low = "blue", high = "red") +  
  guides(fill = FALSE) # not to display the legend for the fill color

## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

As you see here there is a direct connection is observed between calories burnt and the total steps taken, with an increase in steps corresponding to a higher calorie burn.

4.1.4. Relationship between Distance and Calories (Separate plots for each category),

Next viz, we are going for an alternative approach, presenting three distinct methods for categorizing calories burnt.

#Relationship between Distance and Calories (Separate plots for each category)
daily_activity_df_v2$calories_category <- cut(daily_activity_df_v1$calories, 
                                              breaks = 3, labels = c("Low", "Medium", "High"))
ggplot(data = daily_activity_df_v2) +
  geom_point(mapping = aes(x = total_distance, y = calories, color = calories_category), stat = 'identity') +
  labs(x = 'Total Distance', y = 'Total Calories', 
       title = 'Relationship between Distance vs Calories',
       subtitle ='Each category is shown individually here') +
  theme(axis.text.x =  element_text(angle = 90, size = 10, color = 'gray40'),
        axis.text.y =  element_text(size = 10, color = 'gray40'),
        axis.title = element_text(size = 10, color = 'gray20'),
        plot.title = element_text(size = 14, color = 'darkblue', face = 'bold'),
        plot.subtitle = element_text(size = 12, color = 'gray30')) +
  facet_grid(~calories_category) + coord_flip()

As illustrated, I organized calorie levels into Low, Medium, and High categories to facilitate a comparison of distances, uncovering a positive correlation.

4.1.5. Average distance and calorie metrics for each person, calculated with mean values,

#Average Distance vs Per User
user_avg_activity_df<- daily_activity_df_v2 %>%
  group_by(id) %>%
  summarise(avg_distance = mean(total_distance),
            avg_calories = mean(calories),
            avg_steps = mean(total_steps))

user_avg_activity_df$id <- factor(user_avg_activity_df$id)
user_avg_activity_df$id <- reorder(user_avg_activity_df$id, user_avg_activity_df$avg_distance)
# type of average given mpre weight to smaller values to be more accurate  
harmonic_mean_dist <- round(1 / mean(1 / user_avg_activity_df$avg_distance))
subtitle = 'On average, individuals achieve a distance of'

ggplot(user_avg_activity_df) + #bar plot
  geom_bar(mapping = aes(x = avg_distance, y = id, fill= avg_distance ), stat = "identity") +
  geom_vline(xintercept = harmonic_mean_dist, linetype = "dashed", color = "brown") +
  
    labs(y = "Individuals", x = "Average Distance", title = "Average distance by each Individual",
         subtitle = paste(subtitle, harmonic_mean_dist, 'miles in their exercise routines')) +
  theme(axis.text.x =element_text(angle= 90, size = 9),
        axis.text.y =  element_text(angle = 0, size = 9, color = 'gray20'),
        axis.title = element_text(size = 10, color = 'gray40'),
        plot.title = element_text(size = 14, color = 'darkblue', face = 'bold')) + 
# Annotate the plot with a label positioned at coordinates (2150, 10) indicating the average calorie per individual 
     annotate( x =4, y= 7, label = 'mean distance per individual', geom = 'label', alpha = 0.2, hjust = 0, vjust = 1 ) + 
        scale_fill_gradient(low = 'green', high = 'red') + coord_flip()

#Average Calories per individual
user_avg_activity_df$avg_calories <- round(user_avg_activity_df$avg_calories)
user_avg_activity_df$id <- reorder(user_avg_activity_df$id, user_avg_activity_df$avg_calories)

harmonic_mean_cal <- round(1 / mean(1 / user_avg_activity_df$avg_calories))
subtitle <-'On average, individuals burn'

# geom_vline() is creating a vertical dashed line to represent the average.
ggplot(data = user_avg_activity_df) + 
  geom_bar(mapping = aes(x = avg_calories, y = id, fill = avg_calories), stat = 'identity') +
  geom_vline(xintercept = harmonic_mean_cal, linetype = "dashed", color = "brown") +
 
labs(x = "Average Calories", y = 'Individuals', title = "Average Calories by each Individual",
     subtitle = paste(subtitle, harmonic_mean_cal, 'calories')) +
  theme(axis.text.x =element_text(angle= 90, size = 9),
        axis.text.y =  element_text(angle = 0, size = 9, color = 'gray20'),
        axis.title = element_text(size = 10, color = 'gray40'),
        plot.title = element_text(size = 14, color = 'darkblue', face = 'bold')) +
  annotate( x =2150, y= 10, label = 'average calorie per individual', geom = 'label', 
        alpha = 0.2, hjust = 0, vjust = 1 ) +
     scale_fill_gradient(low = 'green', high = 'red') + coord_flip()

The average distance covered per person is around 4 miles, with an average calorie burn of 2150 calories per individual. This suggests that, on average, participants were engaging in a 4-mile activity and burning 2150 calories. Additionally, a positive correlation was noticed between increased distance and higher calorie expenditure. Nevertheless, details about age, gender, and environmental factors are lacking.

4.1.6. Total Steps by days,

We are analyzing to see which are the most active days of the month,

daily_activity_df_v_1 <- daily_activity_df_v1 %>%
  mutate(day_of_week = format(daily_activity_df_v1$activity_date, '%A')) 


View(daily_activity_df_v_1)

# Removed the  rows showing the values '0' in ordert to have more accurate readings 
daily_activity_df_v_2 <- daily_activity_df_v_1[daily_activity_df_v_1$total_steps != 0, ]
daily_activity_df_v_1 <- daily_activity_df_v_2


# Convert day_of_week to a factor with ordered weekdays
daily_activity_df_v_1$day_of_week <- factor(daily_activity_df_v_1$day_of_week, 
              levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"), ordered = TRUE)

# Group by day_of_week and calculate total activity duration for each day
total_steps_by_day <- daily_activity_df_v_1 %>%
  group_by(day_of_week) %>%
  summarise(avg_steps = round(mean(total_steps, na.rm = TRUE)))

print(total_steps_by_day, n = 7)

## # A tibble: 7 × 2
##   day_of_week avg_steps
##   <ord>           <dbl>
## 1 Monday           8488
## 2 Tuesday          8949
## 3 Wednesday        8158
## 4 Thursday         8185
## 5 Friday           7821
## 6 Saturday         8947
## 7 Sunday           7627

most_steps <- mean(total_steps_by_day$avg_steps)

df_over <- total_steps_by_day %>%
  filter(avg_steps >= most_steps) %>%
 pull(day_of_week)


few_steps_over <- total_steps_by_day %>%
  filter(avg_steps >= most_steps) %>%
  pull(avg_steps)

# Print the results
cat('Most steps taken by Day:', as.character(df_over), "\n")

## Most steps taken by Day: Monday Tuesday Saturday

cat('Monthly step count for those days:', few_steps_over, "\n")

## Monthly step count for those days: 8488 8949 8947

 ggplot(data = total_steps_by_day) +
   geom_bar(mapping = aes(x = day_of_week, y = avg_steps, fill = avg_steps), stat =  'identity') +
   scale_fill_gradient(low = "lightblue", high = "coral") +
   
   # Added arrows to show most active days
   geom_segment(data = filter(total_steps_by_day, day_of_week %in% df_over),
                aes(x = day_of_week, y = avg_steps, xend = day_of_week, yend = avg_steps + 5000 ),
                arrow = arrow(length = unit(0.2, "cm")), color = "black") +
   
   labs(title = 'Average steps per day over the month', subtitle = 'Each arrow signifies the day with the highest step count over the month',
        y = 'Average Steps') +
   theme_minimal()

4.2. heartrate_seconds_merged.csv- Individuals’ average heart rate readings,

We are starting with cleaning the dataset making sure that column names in an easy format to process.

# Importing the heartrate_seconds_merged.csv file to RStudio
heart_rate_df <- read_csv('~/Desktop/Case Study/Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv')

# Standardizing column names by converting them to lowercase for consistency and clarity
heart_rate_df <- clean_names(heart_rate_df)
heart_rate_df_v1 <- heart_rate_df # We will use this later in another viz
head(heart_rate_df_v1)

## # A tibble: 6 × 3
##           id time                 value
##        <dbl> <chr>                <dbl>
## 1 2022484408 4/12/2016 7:21:00 AM    97
## 2 2022484408 4/12/2016 7:21:05 AM   102
## 3 2022484408 4/12/2016 7:21:10 AM   105
## 4 2022484408 4/12/2016 7:21:20 AM   103
## 5 2022484408 4/12/2016 7:21:25 AM   101
## 6 2022484408 4/12/2016 7:22:05 AM    95

#Date conversion

heart_rate_df <- heart_rate_df %>%
  mutate(time = as.Date(time, format = '%m/%d/%Y')) 
  

start_date <- min(heart_rate_df$time)
end_date <- max(heart_rate_df$time)
base::paste0('Starting Date: ', min_date, '  Ending Date:', max_date)

## [1] "Starting Date: 2016-04-12  Ending Date:2016-05-12"

# Determine and list the number of unique users in this dataset
paste0('Total number of unique ids: ',n_distinct(heart_rate_df$id) )

## [1] "Total number of unique ids: 14"

# Utilize the matrix() and list() functions to display the IDs in a easly readable two-column format
list_names <- list('Unique id #s listed below:', matrix(unique(heart_rate_df$id), ncol = 2))
list_names

## [[1]]
## [1] "Unique id #s listed below:"
## 
## [[2]]
##            [,1]       [,2]
## [1,] 2022484408 5577150313
## [2,] 2026352035 6117666160
## [3,] 2347167796 6775888955
## [4,] 4020332650 6962181067
## [5,] 4388161847 7007744171
## [6,] 4558609924 8792009665
## [7,] 5553957443 8877689391

c('Column names : ', colnames(heart_rate_df)) # printing column names, easily readable

## [1] "Column names : " "id"              "time"            "value"

4.2.1 Examining the average heart rate readings for individual users,

Heart rate readings are captured every 5 seconds,

#Analyzing Individual User's Avg Heart Rate Readings  
avg_heartrate_readings <- heart_rate_df %>%
  group_by(id) %>%
  summarise(avg_heartrate = round(mean(value))) %>%
  arrange(factor(avg_heartrate))
# Reorder the id factor based on avg_heartrate
avg_heartrate_readings$id <- factor(avg_heartrate_readings$id, levels = avg_heartrate_readings$id)


 ggplot(data = avg_heartrate_readings) +
  geom_bar(mapping = aes( x = factor(id), y = avg_heartrate, fill= avg_heartrate),
           stat = 'identity') +
   
     labs(
       title = 'Average Heart Rate by Person',
           y = "Average Beats " , 
           x = 'Individual Person'
       ) +
       theme(
         axis.text.x = element_text(angle = 90),
         plot.title = element_text(color = 'darkblue'),
         axis.title.x = element_text( color = 'gray30'), 
         axis.title.y = element_text(color = 'gray30')
          ) + 
scale_fill_gradient(low ='green', high='red')

This plot allows for a visual comparison of average heart rates across different individuals. Observing the distribution of average heart rates can provide insights into the variability among individuals. and additional information is required to determine if there is an ongoing activity or identify the specific time of day when the average heart rate exceeds 100 beats per minute.

4.2.2. Plot points for individual heart rates greater than 100,

showcasing the.minimum, maximum, and average heart rates corresponding to each individual.

heart_rate_summary <- heart_rate_df %>%
    group_by(id) %>%
         summarise(
          min_rate = as.integer(min(value)),
          max_rate = as.integer(max(value)),
          avg_rate = as.integer(mean(value)),
          sd_rate  = as.integer(sd(value))
          )
  heart_rate_summary$id <- reorder(heart_rate_summary$id,heart_rate_summary$avg_rate)
  ggplot(data = heart_rate_summary) +
        geom_boxplot(mapping = aes(x = factor(id), ymin = min_rate, lower = avg_rate, 
                  middle = avg_rate, upper = avg_rate, 
                    ymax = max_rate ),          
                    stat = 'identity', 
                   color = "gray20" ) + 

    geom_point(data = filter(heart_rate_df, value > 100),
            mapping = aes(x = factor(id), y = value),
              color = "coral", position = position_dodge(width = 0.8 )
    ) +    
        labs(
        title = 'Individual Heart Rate Assessment', 
     subtitle = 'Min, max, avg rates summarized', 
            x = 'Individual Users',
            y = 'Heart Rate bpm ( Red for bpm > 100 ) \n Min, Avg, and Max bpm rates shown by the gray line'
      ) + 
    theme(
      plot.title = element_text(color = 'darkblue'),
      axis.title.x = element_text(color = 'gray30'),
      axis.title.y = element_text(color = 'gray30')
      ) +
  coord_flip() + scale_fill_identity() + scale_color_identity()

The plot serves to give an overview of the heart rate distribution among different users, emphasizing instances of elevated heart rates. The line within the box represents the median heart rate. The whiskers extend to the minimum and maximum values, excluding outliers.

Additionally, red points on the plot highlight instances where the heart rate exceeds 100 bpm. This provides a visual cue for elevated heart rates in certain observations.

4.2.3 Daily heartrate visulization,

heart_rate_df <- read_csv('~/Desktop/Case Study/Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv')
heart_rate_df <- clean_names(heart_rate_df)
heart_rate_df_v1 <- heart_rate_df 

  heart_rate_df_v1$time <- mdy_hms(heart_rate_df_v1$time)

    heart_rate_date_df <- heart_rate_df_v1 %>%
      mutate(time = as.Date(time))
 
    cat('Start Date: ', min(as.character(heart_rate_date_df$time)), 
          'End Date: ', max(as.character(heart_rate_date_df$time))
           )

## Start Date:  2016-04-12 End Date:  2016-05-12

   hourly_heartrate_df  <- heart_rate_df_v1  %>%
    group_by(day = as.Date(time), hour=hour(time))  %>%
    summarise( hourly_hrate = round(mean(value)) )

    min_value <- min(heart_rate_df_v1$value)
 max_value <- max(heart_rate_df_v1$value) 

ggplot(data = hourly_heartrate_df) +
  geom_boxplot(mapping =  aes(x = factor(hour), y = hourly_hrate), 
        outlier.color = 'red', outlier.shape = 16)  +
  geom_vline( xintercept =19, linetype ='dashed', color = 'brown' ) +
  annotate( x = 11, y= 60, label='max heart rate w/o outliers ', 
         geom = 'label', alpha = 0.2, hjust=0, vjust=1 ) +
  annotate('rect', xmin = 18.5, xmax = 19.5, ymin = -Inf, ymax = Inf, 
           fill = 'lightcoral', alpha = 0.3) +
  labs(title = "Monthly Average Heart Rate (Box Plot with Outliers)",
       subtitle = 'by Hour - red dots are outliers',
       x = "Hour of the Day",
       y = "Average Heart Rate") + theme_minimal()

Here, we used box plot. The line inside the box represents the median of the data. Lines extending from the box, whiskers, show the range of the data.

Red dots are used here to emphasize outliers that falls outside of the range, extend beyond the whiskers, providing additional clarity to the box plot representation.

Outliers Below represent hours where the average heart rate is unusually low compared to the majority of the data.
Outliers Above represent hours where the average heart rate is unusually high compared to the majority of the data

Outliers can provide valuable information about the data distribution. They might indicate unusual events, measurement errors, or specific patterns that differ from the overall trend.

Please see hourly intensity chart to compare intensity by hour comparasion on 4.3.1

4.3. hourlyIntensities_merged.csv,

As usual, the initial step is to import the data and then proceed with data cleaning, focusing on simplifying column names.

# Importing the data, hourlyintensities_merged.csv
hourly_intensities_df <- read_csv('~/Desktop/Case Study/Fitabase Data 4.12.16-5.12.16/hourlyIntensities_merged.csv')

# Cleaning the data, making the names unique and in lower case
hourly_intensities_df <- clean_names(hourly_intensities_df)

str(hourly_intensities_df)

## spc_tbl_ [22,099 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id               : num [1:22099] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ activity_hour    : chr [1:22099] "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
##  $ total_intensity  : num [1:22099] 20 8 7 0 0 0 0 0 13 30 ...
##  $ average_intensity: num [1:22099] 0.333 0.133 0.117 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   ActivityHour = col_character(),
##   ..   TotalIntensity = col_double(),
##   ..   AverageIntensity = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

c('Column names : ', colnames(hourly_intensities_df)) # check column names here

## [1] "Column names : "   "id"                "activity_hour"    
## [4] "total_intensity"   "average_intensity"

# Let's see how many unique Individuals here with their unique ids
list_names <- list('Unique id #s listed below:', matrix(unique(hourly_intensities_df$id), ncol = 3, byrow = TRUE))
list_names

## [[1]]
## [1] "Unique id #s listed below:"
## 
## [[2]]
##             [,1]       [,2]       [,3]
##  [1,] 1503960366 1624580081 1644430081
##  [2,] 1844505072 1927972279 2022484408
##  [3,] 2026352035 2320127002 2347167796
##  [4,] 2873212765 3372868164 3977333714
##  [5,] 4020332650 4057192912 4319703577
##  [6,] 4388161847 4445114986 4558609924
##  [7,] 4702921684 5553957443 5577150313
##  [8,] 6117666160 6290855005 6775888955
##  [9,] 6962181067 7007744171 7086361926
## [10,] 8053475328 8253242879 8378563200
## [11,] 8583815059 8792009665 8877689391

paste0('Total # of individuals: ', n_distinct(hourly_intensities_df$id))

## [1] "Total # of individuals: 33"

4.3.1 Analyzing Hourly activities vs Individuals,

Including peak and low activity hours.

# Analyzing Hourly activities vs Individuals
# Starting with converting activity_hour to POSTXct time 
hourly_intensities_df$activity_hour <- mdy_hms(hourly_intensities_df$activity_hour)
hour_to_pos <- hourly_intensities_df$activity_hour

 intensities_perhour_df <- hourly_intensities_df %>%
   group_by(hour = hour(hour_to_pos) )%>% 
   summarise(avg_intensity = mean(total_intensity)) 
  min_active <- min(round(intensities_perhour_df$avg_intensity, digits = 2))
  max_active <- max(round(intensities_perhour_df$avg_intensity, digits = 2))
 
 gg <- ggplot(data = intensities_perhour_df) +
   geom_area(mapping = aes(x = hour, y = avg_intensity,  fill= 'hour'), stat = 'identity', color = 'blue') +
   geom_vline(xintercept = 3.2, linetype = "dashed", size =1 , color = 'darkblue') +
   geom_vline(xintercept = 18, linetype = "dashed", size = 1, color = 'brown') +
   
 scale_x_continuous(labels = scales::date_format("%H"), 
                   breaks = unique(intensities_perhour_df$hour)) +
 labs(
   title = 'Hourly intensity chart',
   x = "Hour" , 
   y = 'Intensity '
 ) +
   theme(
        plot.title = element_text(color = 'darkblue'),
      axis.title.x = element_text( color = 'gray30'), 
      axis.title.y = element_text(color = 'gray30'),
     ) + 
 annotate( x =3.2, y= 7, label = 'least active', geom = 'label', alpha = 0.2,
             hjust = 0, vjust = 1 ) +
 annotate( x =18, y= 7, label = 'most active', geom = 'label', alpha = 0.2,
             hjust = 0, vjust = 1  ) +
guides(fill = FALSE) +  
scale_fill_manual(values = c('hour' = 'lightpink'))

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

print(gg)

Based on the earlier box plot depicting Hourly Average Heart Rate, it was observed that heart rate tends to be elevated during the noon to afternoon period, reaching its peak at 6 pm. To gain further insights, I compared this data with an area plot illustrating peak activity hours. In order to have a deper understanding, I compared this data with an area plot here. Notably, I observed that the peak hours align, indicating a direct correlation between heart rate and activity levels.

According to our findings individuals exhibit the lowest levels of activity at 3:00 am, reach their highest activity levels between 10:00 am and 8:00 pm, with the peak of activity occurring at 6:00 pm as well as the peak heart rate per minute, please see chart 4.2.3, previous viz.

4.4. weightLogInfo_merged.csv,

The goal is to assess individual daily weight charts to determine whether there are notable variations on a weekly basis.

4.4.1 weightLogInfo_merged.csv

Refining and manipulating data involves, the process of cleaning and transforming, raw data to ensure accuracy, consistency.

# Importing weightLoginfo_merged here
weight_log_df <- read_csv('~/Desktop/Case Study/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv')

#Cleaning the data, making the names unique and in lower case
weight_log_df <- clean_names(weight_log_df) 
# Converting date to POSTXct format 
weight_log_df$date <- mdy_hms(weight_log_df$date)

# Looking at the beginning and end date
min_date <- min(as.Date(weight_log_df$date))
max_date <- max(as.Date(weight_log_df$date))
paste0('Starting Date: ', min_date, ' - ', 'End Date:', max_date)

## [1] "Starting Date: 2016-04-12 - End Date:2016-05-12"

# Let's see how many unique Individuals with unique ids and the total entries for each individual 
entry_count_df <- weight_log_df %>%
  group_by(id) %>%
 summarise(entry_count = n()) 

result_df <- data.frame(Number_of_Users = unique(entry_count_df$id), 
                                Entries = entry_count_df$entry_count
                     )

 print(result_df)

##   Number_of_Users Entries
## 1      1503960366       2
## 2      1927972279       1
## 3      2873212765       2
## 4      4319703577       2
## 5      4558609924       5
## 6      5577150313       1
## 7      6962181067      30
## 8      8877689391      24

In this database, weight recordings are available for a total of 8 individuals, out of 33 participant. However, the reliability of the sample is compromised as 6 of them recorded their weight less than 5 days a month. This limitation limits our ability to analyze changes in average monthly weight based on the total steps taken by each individual.

4.4.2. Weighttracking chart for eight participants on a daily basis,

# Weight vs daily area chart for per Individual -There is no enaugh data here...
weight_log_df_v1 <- weight_log_df %>%
  group_by(date = date(date)) %>%
  select(-weight_kg, -fat, -bmi, -is_manual_report, -log_id) %>%
  mutate(date = date(date))

ggplot(data = weight_log_df_v1) +
  geom_point( mapping = aes(x = date(date), y = round(weight_pounds, 
                        digit = 1), color = 'coral'), stat = 'identity') + 
   facet_grid(~id) +
  
labs(title = 'Daily Weight Chart', x = 'Date', y = 'Weight') +
   theme(
    plot.title = element_text(color = 'darkblue'),
    axis.title.x = element_text(color = 'gray30'),
    axis.title.y = element_text(color = 'gray30'),
    axis.text.x = element_text(angle = 90)
  ) +
 scale_fill_identity() + scale_color_identity()

Insufficient data prevents a detailed analysis of weightLogInfo_merged.csv.

Nonetheless, the provided chart displays daily weight trends for individuals for the month. Notably, two participants consistently document their weight, indicating a stable and active lifestyle aimed at maintaining a nearly constant weight.

4.5. dailyIntensities_merged.csv,

4.5.1. Initially, let’s focus on preparing and validating our dataset for analysis,

# Importing the data, dailyintensities_merged.csv
daily_intensities_df <- read_csv('~/Desktop/Case Study/Fitabase Data 4.12.16-5.12.16/dailyIntensities_merged.csv')
# Cleaning the data, making the names unique and in lower case
daily_intensities_df <- clean_names(daily_intensities_df)
c('Column names : ', colnames(daily_intensities_df)) # check column names here

##  [1] "Column names : "            "id"                        
##  [3] "activity_day"               "sedentary_minutes"         
##  [5] "lightly_active_minutes"     "fairly_active_minutes"     
##  [7] "very_active_minutes"        "sedentary_active_distance" 
##  [9] "light_active_distance"      "moderately_active_distance"
## [11] "very_active_distance"

# Let's see how many unique Individuals here with their unique ids
list_dnames <- list('Unique id #s listed below:', matrix(unique(daily_intensities_df$id), ncol = 3, byrow = TRUE))
list_dnames

## [[1]]
## [1] "Unique id #s listed below:"
## 
## [[2]]
##             [,1]       [,2]       [,3]
##  [1,] 1503960366 1624580081 1644430081
##  [2,] 1844505072 1927972279 2022484408
##  [3,] 2026352035 2320127002 2347167796
##  [4,] 2873212765 3372868164 3977333714
##  [5,] 4020332650 4057192912 4319703577
##  [6,] 4388161847 4445114986 4558609924
##  [7,] 4702921684 5553957443 5577150313
##  [8,] 6117666160 6290855005 6775888955
##  [9,] 6962181067 7007744171 7086361926
## [10,] 8053475328 8253242879 8378563200
## [11,] 8583815059 8792009665 8877689391

paste0('Total # of individuals: ', n_distinct(daily_intensities_df$id))

## [1] "Total # of individuals: 33"

# finding start and end date
date_range <- paste("-Starting Date:", min(daily_intensities_df$activity_day),
                    " -Ending Date:", 
                    max(daily_intensities_df$activity_day))
date_range

## [1] "-Starting Date: 4/12/2016  -Ending Date: 5/9/2016"

4.5.2. Analyzing the monthly physical activity patterns unique to each participant,

Let’s explore the average duration of different activity types for each participant over the course of the month,

# Importing the data, dailyintensities_merged.csv
daily_intensities_df <- read_csv('~/Desktop/Case Study/Fitabase Data 4.12.16-5.12.16/dailyIntensities_merged.csv')
# Cleaning the data, making the names unique and in lower case
daily_intensities_df <- clean_names(daily_intensities_df)
daily_intensities_df$activity_day = as.Date(daily_intensities_df$activity_day, format = '%m/%d/%Y')
# Removed the  rows showing the values '0' in order to have a accurate readings 
daily_intensities_df_v4 <- daily_intensities_df[daily_intensities_df$sedentary_minutes != 0, ]
daily_intensities_df_v5 <- daily_intensities_df_v4[daily_intensities_df_v4$sedentary_minutes != 2, ]
daily_intensities_df_v6 <- 
  daily_intensities_df_v5[daily_intensities_df_v5$lightly_active_minutes != 0, ]
daily_intensities_df_v7 <- 
  daily_intensities_df_v6[daily_intensities_df_v6$fairly_active_minutes != 0, ]
daily_intensities_df_v8 <- 
  daily_intensities_df_v7[daily_intensities_df_v7$very_active_minutes != 0, ]

daily_intensities_df <- daily_intensities_df_v8

daily_intensities_df_v1 <- daily_intensities_df 


avg_intens_by_id_df <- daily_intensities_df_v1 %>%
  group_by(id) %>% 
  summarise( 
    sedentary = mean(sedentary_minutes),
    lightly_active = mean(lightly_active_minutes),
    fairly_active = mean(fairly_active_minutes),
    very_active = mean(very_active_minutes)
  ) 

avg_intens_long <- avg_intens_by_id_df %>%
  gather(key = "avg_activity_type", value = "avg_minutes", -id)
View(avg_intens_long ) 
avg_intens_long$id <- reorder(avg_intens_long$id, 
          avg_intens_long$avg_activity_type,
          FUN = function(x) mean(avg_intens_long$avg_minutes))


least_seden <- min(avg_intens_by_id_df$sedentary, na.rm = TRUE)
most_seden <-  max(avg_intens_by_id_df$sedentary, na.rm = TRUE)

total_activity_individuals_df <- avg_intens_by_id_df %>%
  group_by(id) %>%
  summarise(total_active_individual = sum(lightly_active + fairly_active + very_active)
            )
print(total_activity_individuals_df, n = 33)

## # A tibble: 33 × 2
##            id total_active_individual
##         <dbl>                   <dbl>
##  1 1503960366                    287.
##  2 1624580081                    239 
##  3 1644430081                    257.
##  4 1844505072                    266 
##  5 1927972279                     95 
##  6 2022484408                    321.
##  7 2026352035                    192 
##  8 2320127002                    199.
##  9 2347167796                    304.
## 10 2873212765                    307.
## 11 3372868164                    354.
## 12 3977333714                    263.
## 13 4020332650                    234 
## 14 4057192912                     97 
## 15 4319703577                    294.
## 16 4388161847                    291.
## 17 4445114986                    218.
## 18 4558609924                    322.
## 19 4702921684                    296.
## 20 5553957443                    284.
## 21 5577150313                    284.
## 22 6117666160                    416.
## 23 6290855005                    304.
## 24 6775888955                    117.
## 25 6962181067                    316.
## 26 7007744171                    387.
## 27 7086361926                    235.
## 28 8053475328                    258.
## 29 8253242879                    166.
## 30 8378563200                    243.
## 31 8583815059                    229.
## 32 8792009665                    252.
## 33 8877689391                    318.

most_active_individual <- max(total_activity_individuals_df$total_active_individual)
least_active_individual <- min(total_activity_individuals_df$total_active_individual)

# Filter rows where total_active_individual is equal to the maximum value
most_active_id <- total_activity_individuals_df %>%
  filter(total_active_individual == most_active_individual) %>%
  pull(id)

# Filter rows where total_active_individual is equal to the minimum value
least_active_id <- total_activity_individuals_df %>%
  filter(total_active_individual == least_active_individual) %>%
  pull(id)

cat("Most Active User ID:", most_active_id, "with total activity:", most_active_individual, "\n")

## Most Active User ID: 6117666160 with total activity: 416.3333

cat("Least Active User ID:", least_active_id, "with total activity:", least_active_individual, "\n")

## Least Active User ID: 1927972279 with total activity: 95

ggplot(data = avg_intens_long) +

 geom_bar( mapping =  aes( x = id, y = avg_minutes, 
   fill =  avg_activity_type),  stat = 'identity', position = "dodge") + 
labs(title = 'Monthly activity type vs Average duration per participants', subtitle = 'Monthly activity durations classified based on the type of activity', x = 'Participants', y = 'Avg minutes by activity category') +
  annotate( x =28, y= 1280, label = 'most active', 
            geom = 'label', alpha = 0.2, vjust = 0, hjust = 1 ) +   
  annotate( x =29.9, y= 1185, label = 'least sedentary', 
            geom = 'label', alpha = 0.2, vjust = 0, hjust = 1 ) +   
  annotate( x =11.2, y= 1280, label = 'least active', 
            geom = 'label', alpha = 0.2, vjust = 0, hjust = 1 ) +   
  annotate( x =13.3, y= 1185, label = 'most sedentary', 
            geom = 'label', alpha = 0.2, vjust = 0, hjust = 1 ) +   
  annotate('rect', xmin = 4.5, xmax = 5.7, ymin = -Inf, ymax = Inf, 
           fill = 'yellow', alpha = 0.3) +
  
  annotate('rect', xmin = 21.5, xmax = 22.5, ymin = -Inf, ymax = Inf, 
           fill = 'yellow', alpha = 0.3) +
  
  theme(
    axis.text.x =  element_text(angle = 90, size = 10, color = 'gray20'),
    axis.text.y =  element_text(size = 10, color = 'gray20'),
    axis.title = element_text(size = 10, color = 'gray20'),
    plot.title = element_text(size = 14, color = 'darkblue', face = 'bold'),
    plot.subtitle = element_text(size = 12, color = 'gray30'),
    panel.background = element_rect(fill = "white" ) 
  )

Participants with lower sedentary behavior tend to be more active compared to those with higher sedentary behavior. For instance, individuals who spend less time sitting often engage in higher levels of physical activity than those who have a more sedentary lifestyle.

4.5.3. Our goal is to analyze how the duration of various activities,

varies on weekdays over the course of the month, pinpointing the day exhibiting the highest activity intensity,

#Weekdays vs Activity type 
daily_intensities_df <- read_csv('~/Desktop/Case Study/Fitabase Data 4.12.16-5.12.16/dailyIntensities_merged.csv')
# Cleaning the data, making the names unique and in lower case
daily_intensities_df <- clean_names(daily_intensities_df)


daily_intensities_df_v6 <- 
  daily_intensities_df[daily_intensities_df$lightly_active_minutes != 0, ]
daily_intensities_df_v7 <- 
  daily_intensities_df_v6[daily_intensities_df_v6$fairly_active_minutes != 0, ]
daily_intensities_df_v8 <- 
  daily_intensities_df_v7[daily_intensities_df_v7$very_active_minutes != 0, ]

 
 least_seden <- min(avg_intens_by_id_df$sedentary, na.rm = TRUE)
 

daily_intensities_df <- daily_intensities_df_v8
daily_intensities_df$activity_day = as.Date(daily_intensities_df$activity_day, 
                           format = '%m/%d/%Y')
 
daily_intensities_df_v1 <- daily_intensities_df %>%
  mutate(day_of_week = format(daily_intensities_df$activity_day, '%A')) 
# Modifying column names - placing 1st column followed by the 10th 
daily_intensities_df_v1 <- daily_intensities_df_v1[, c(1, 11, 2:10)]


# Convert day_of_week to a factor with ordered weekdays
daily_intensities_df_v1$day_of_week <- factor(daily_intensities_df_v1$day_of_week, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"), ordered = TRUE)

avg_intens_by_day_long <- daily_intensities_df_v1 %>%
  group_by(day_of_week) %>% 
  summarise( 
    lightly_active = mean(lightly_active_minutes),
    fairly_active = mean(fairly_active_minutes),
      very_active = mean(very_active_minutes) ) %>%
 pivot_longer(cols = -day_of_week, names_to = 'activity_type' , values_to = 'avg_minutes')

total_activity_daily <- avg_intens_by_day_long %>%
  group_by(day_of_week) %>%
  summarise(
    total_activity_minutes = sum (avg_minutes )
         )

# Identify the most and least active days
 least_act = min(total_activity_daily$total_activity_minutes)
 most_act = max(total_activity_daily$total_activity_minutes)

 most_active_day <- total_activity_daily %>%
 filter(total_activity_minutes == max(total_activity_minutes)) %>%
 pull(day_of_week)
 
 least_active_day <- total_activity_daily %>%            
 filter(total_activity_minutes == min(total_activity_minutes)) %>%
 pull(day_of_week)
# Convert the numbers to day of week
 most_active_day <- as.character(most_active_day) 
 least_active_day <- as.character(least_active_day)
cat('Most Active Day:', most_active_day, 'with minutes', most_act, "\n")

## Most Active Day: Saturday with minutes 303.7576

cat('Least Active Day:', least_active_day, 'with minutes', least_act, "\n")

## Least Active Day: Sunday with minutes 260.2373

head(total_activity_daily, n = 7 )

## # A tibble: 7 × 2
##   day_of_week total_activity_minutes
##   <ord>                        <dbl>
## 1 Monday                        263.
## 2 Tuesday                       290.
## 3 Wednesday                     272.
## 4 Thursday                      271.
## 5 Friday                        275.
## 6 Saturday                      304.
## 7 Sunday                        260.

 ggplot(data = avg_intens_by_day_long) +
   geom_bar(mapping = aes(x = day_of_week, y = avg_minutes, fill = activity_type ), stat = 'identity', position = "dodge"  ) +
   geom_vline(xintercept = most_active_day, linetype = "dashed", size = 1, color = "brown") +
   geom_vline(xintercept = least_active_day, linetype = "dashed",size = 1, color = "brown") +
    annotate(x =5, y= 400, label = 'most active', geom = 'label', alpha = 0.2,
               hjust = 0, vjust = 1 )   +
    annotate(x =6, y= 350, label = 'least active', geom = 'label', alpha = 0.2,
               hjust = 0, vjust = 1 )   +
   labs(title = "Weekdays vs Activity type average",
        x = "Weekdays",
        y = "Average Active Times (minutes)") + 
   
    theme(
     axis.text.x = element_text(size = 9, color = 'gray20'),
     axis.text.y = element_text(size = 10, color = 'gray20'),
     axis.title = element_text(size = 10, color = 'gray20'),
     plot.title = element_text(size = 14, color = 'darkblue', face = 'bold'),
     plot.subtitle = element_text(size = 12, color = 'gray30'),
     panel.background = element_rect(fill = 'white'),
    
   )

 #Same viz with stack plot
  ggplot(data = avg_intens_by_day_long) +
    geom_col(aes(x = day_of_week, y = avg_minutes, fill = activity_type), position = "stack") +
          labs(title = "Weekdays vs Activity type average, stack version",
          x = "Weekdays",
          y = "Average Active Times (minutes)") +
    geom_vline(xintercept = most_active_day, linetype = "dashed", 
                       size = 1, color = "brown") +
    geom_vline(xintercept = least_active_day, linetype = "dashed", 
                      size = 1, color = "brown") +
   annotate(x =5, y= 400, label = 'most active', geom = 'label', alpha = 0.2,
               hjust = 0, vjust = 1 )   +
   annotate(x =6, y= 350, label = 'least active', geom = 'label', alpha = 0.2,
               hjust = 0, vjust = 1 )   +
   scale_fill_brewer(palette = "Dark2") + theme_minimal()

According to the dataset, our calculations indicate that Tuesday recorded the highest activity level with 26,119 minutes, while Sunday marked the lowest activity, totaling 15,354 minutes throughout the month.

4.6. sleepDay_merged.csv,

4.61. Individuals have different daily bed and rest times,

my goal is to determine whether there exists a direct and positive correlation between the time individuals spend sleeping and their physical activity duration. This will be explored by merging information from the ‘dailyActivity_merged.csv’ and ‘sleepDay_merged.csv’ datasets.

# Importing both csv files
daily_act_df <-  read_csv('~/Desktop/Case Study/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv')
sleep_day_df <-  read_csv('~/Desktop/Case Study/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv')
  
daily_act_df <- clean_names(daily_act_df)
sleep_day_df <- clean_names(sleep_day_df)


cat("# of users -Daily Activity: ", n_distinct(daily_act_df$id), 
    '-Sleep Activity: ', n_distinct(sleep_day_df$id), '\n')

## # of users -Daily Activity:  33 -Sleep Activity:  24

# Quick summary() for this two datasets
daily_act_df %>%
  select ( total_steps,
           total_distance,
           sedentary_minutes
           ) %>%
  summary()

##   total_steps    total_distance   sedentary_minutes
##  Min.   :    0   Min.   : 0.000   Min.   :   0.0   
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 729.8   
##  Median : 7406   Median : 5.245   Median :1057.5   
##  Mean   : 7638   Mean   : 5.490   Mean   : 991.2   
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:1229.5   
##  Max.   :36019   Max.   :28.030   Max.   :1440.0

sleep_day_df %>%
  select( total_minutes_asleep,
         total_time_in_bed
         ) %>%
summary()

##  total_minutes_asleep total_time_in_bed
##  Min.   : 58.0        Min.   : 61.0    
##  1st Qu.:361.0        1st Qu.:403.0    
##  Median :433.0        Median :463.0    
##  Mean   :419.5        Mean   :458.6    
##  3rd Qu.:490.0        3rd Qu.:526.0    
##  Max.   :796.0        Max.   :961.0

4.6.2. Combined bedtime period,

Bed time duration and actual sleep time comparesion.

ggplot(data=sleep_day_df, aes(x=total_minutes_asleep, y=total_time_in_bed)) + 
  geom_point() + geom_smooth()  + 
  labs(title = 'Asleep Time Analysis Relative to Total Bed Rest', 
       subtitle = 'duration in minutes',
              x = 'Total Sleep Duration', 
               y = 'Bedtime duration') + theme_minimal()

There is a positive correlation between bedtime duration and actual sleep time, This means that people who spend more time in bed tend to get more sleep on average.

4.6.3. Comparing Total Steps to Sedentary Minutes,

Creating a scatter plot here to analyse the correlation between sedentary time and steps taken.

# Creating a scatter plot with a smooth line to explore the positive correlation between sedentary time and total steps
ggplot(data = daily_act_df) +
  geom_smooth( mapping = aes(x=total_steps, y= sedentary_minutes)) + 
  labs(title = 'Total steps vs Sedentary time', 
       x = 'Total Steps', 
       y = 'Sedentary minutes') + 
  theme_bw()

There is a positive correlation observed between the number of steps taken and sedentary time. It appears that individuals who take more steps tend to have lower sedentary time.

4.6.4 Analyzing the Relationship Between Sleep Duration and Total Step Count,

We are merging both datasets by the ‘id’ numbers; dailyActivity_merged.csv’ and ‘sleepDay_merged.csv’ datasets.

#Inner join the datasets by id numbers
combined_data_df <- merge(sleep_day_df, daily_act_df)
colnames(combined_data_df)

##  [1] "id"                         "sleep_day"                 
##  [3] "total_sleep_records"        "total_minutes_asleep"      
##  [5] "total_time_in_bed"          "activity_date"             
##  [7] "total_steps"                "total_distance"            
##  [9] "tracker_distance"           "logged_activities_distance"
## [11] "very_active_distance"       "moderately_active_distance"
## [13] "light_active_distance"      "sedentary_active_distance" 
## [15] "very_active_minutes"        "fairly_active_minutes"     
## [17] "lightly_active_minutes"     "sedentary_minutes"         
## [19] "calories"

cat('# of participants:', n_distinct(combined_data_df$id))

## # of participants: 24

# date starts and end
cat('Data -Start date: ', min(combined_data_df$activity_date),
    ' -End date',max(combined_data_df$activity_date ))

## Data -Start date:  4/12/2016  -End date 5/9/2016

#_______Compared Sleep duration vs Steps taken________

sleep_vs_steps_df <- combined_data_df %>%
  group_by(id) %>%
  summarise( 
    avg_steps_t = median(total_steps),
    avg_a_sleep = median(total_minutes_asleep)
  )
head(sleep_vs_steps_df) # lets see a tibble here

## # A tibble: 6 × 3
##           id avg_steps_t avg_a_sleep
##        <dbl>       <dbl>       <dbl>
## 1 1503960366      12207         340 
## 2 1644430081       6684.        130.
## 3 1844505072       2237         644 
## 4 1927972279        152         398 
## 5 2026352035       5528         516.
## 6 2320127002       5057          61

model <- lm(avg_a_sleep ~ avg_steps_t, data = sleep_vs_steps_df) # Fit a linear model

residuals <- residuals(model) # Calculate residuals

# Identify outliers using a threshold - sd(residuals)
outlier_threshold <- .80 * sd(residuals) #the sensitivity of the outlier detection
sleep_vs_steps_df$outlier <- ifelse(abs(residuals) > outlier_threshold, "red", "black")

# Scatter plot with outliers in red
ggplot(data = sleep_vs_steps_df, aes(x = avg_steps_t, y = avg_a_sleep)) +
  geom_point(aes(color = outlier), size = 3) +
  geom_smooth() +
  labs(title = "Scatter Plot: Average Steps vs. Average Sleep",
       x = "Average Steps ", y = "Average Sleep") +
  scale_color_manual(values = c("black", 'red'))  + theme_bw()

# Density design
ggplot(data = sleep_vs_steps_df, aes(x = avg_steps_t, y = avg_a_sleep)) +
  geom_point(alpha = 0.5) +
  geom_density_2d() +
  labs(title = "Density Plot: Average Steps vs. Average Sleep",
       x = "Average Steps", y = "Average Sleep") +
  theme_bw()

The study on the relationship between sleep duration and total steps demonstrates a weak positive correlation. The prevalence of outliers, in red, indicates a significant data spread, showing us the need for more information, particularly when referencing the density chart.

5. SHARE

5.1. dailyActivity_merged.csv,

There exists a positive correlation among distance covered, steps taken, and calories burned. Nevertheless, this sample dataset lacks crucial factors like age, gender, demographics, and environmental variables.

5.2 heartrate_seconds_merged.csv &

5.3 hourlyIntensities_merged.csv,

I discovered a positive correlation among heart rate, activity, and calories burnt when i combined the data. Specifically, when analyzing the Monthly Average Heart Rate chart, alongside the Hourly Intensity chart (see sections 4.2.3 and 4.3.1), the positive correlation is evident. The least active times display lower heart rates, while the most active times of the day show higher heart rates.

It’s important to note that the red dots in this model (refer to 4.2.3) represent outliers. Remember that outliers might signify unusual incidents, errors in measurements, or particular patterns that deviate from the broader trend.

5.4. weightLogInfo_merged.csv,

As previously shown, weight recordings are accessible for only 8 out of 33 participants in the database, with 6 individuals recording their weight less than 5 days a week. This limitation interfere our ability to analyze monthly weight changes based on total steps. The insufficient data in weightLogInfo_merged.csv prevents a detailed analysis.

However, the chart illustrates daily weight trends for the month, highlighting two participants with consistent weight documentation (refer to 4.4.2), suggesting a stable and active lifestyle helps at maintaining nearly constant weight.

5.5 dailyIntensities_merged.csv,

Those who limit their sedentary behavior by incorporating more movement into their daily routines typically participate in increased physical activity, unlike others who adopt a more sedentary lifestyle. Please refer to chart 4.5.2 and examine the highlighted section, where you can identify both the least and most active periods, along with the associated types of activities.

According to charts at section 4.5.3, I observed a significant disparity where the percentage of lightly_active instances is notably higher than the corresponding percentages for very active and fairly active states.

The dataset indicates that Tuesday recorded the highest activity level with 26,119 minutes, while Sunday marked the lowest activity, totaling 15,354 minutes throughout the month, not including sedentary_active cases.

5.6. sleepDay_merged.csv & dailyActivity_merged.csv

As I implied before, there is a positive correlation between bedtime duraton and actual sleep time. HoweverI observed a few outliers, but their values are only slightly beyond the expected range. Therefor the observed outliers are considered within the context of the incomplete information available in the dataset. However, their interpretation is constrained by the limited availability of data; regarding factors like age, health conditions, and stress levels.

The charts (refer to 4.6.4) seem to show a weak to mild positive correlation between average steps taken and total average sleep. This means that participants who took more steps on average also tended to sleep more on average, but the relationship is not very strong. The charts show correlation, but they don’t necessarily imply causation. It’s possible that other factors could be influencing both sleep and activity levels, and not directly causing one to affect the other.

6. ACT

Bellabeat should provide applications designed specifically for different genders can be highly beneficial in promoting health and wellness, addressing the distinct needs and challenges faced by individuals of different genders.

Bellabeat app ahould ask users questions about their gender, weight, height, and general health upon installation in order to personalize their app experience and tailoring app features to individual needs. However, be transparent about how it will be used. Users maybe more willing to share information if they understand its purpose and how it benefits them.

Motivating Steps and Distance Tracking: App Features and Beyond

Beyond the basic step counter, here is how Bellabeat users can benefit improving their habits for step counting and distance tracking:

Bellabeat App Features:

Personalized goals and challenges: customers should be able to set daily, weekly, or monthly step goals based on individual fitness levels and aspirations. Bellabeat app shoud be able to offer fun and engaging challenges with friends or groups, or personalized “streak” challenges for consistency.

Variety and gamification: Bellabeat app should integrate mini-games or activities within the app that reward steps and distance covered. Offer different “environments” or virtual rewards to keep users engaged.

Progress visualization: Bellabeat app should have a clear and visually appealing dashboards showing daily, weekly, and monthly step trends, distance covered on maps, and achievement badges , and gently remind users periodically throughout the day about their progress toward achieving their goals.

Social interaction and support: Users should be able to share progress and achievements with friends or join online communities for encouragement and motivation. Compete in friendly challenges or team up for shared goals.

Rewards and integration: Bellabeat app should partner with local businesses or fitness services to offer discounts or rewards based on reaching step goals. Integrate with other health and fitness apps to create a holistic picture of progress.

Beyond the Bellabit App:

Wearable integration: Bellabeat should make the app compatible with various wearables for seamless data tracking and convenience.

Smart reminders: Bellabeat app should send personalized notifications at times when users are most likely to be inactive, offering gentle nudges to get moving
making sure that ’focus on progress, not perfection’ andencourage consistency and celebrate small wins rather than dwelling on missed goals.

Community events and activities: App should be interactive, be able to organize local walking groups, challenges, or scavenger hunts that use the app for tracking and engagement.

Encouraging Healthy Sleep Habits

Beyond basic sleep tracking, here are some ideas to motivate users towards better sleep hygiene:

App Features:

Personalized sleep goals and analysis: Bellabeat users should be able to set customized sleep schedules based on their needs and be able analyze sleep quality through metrics like sleep stages, heart rate variability, and snoring detection.

Relaxation and mindfulness tools: users should be able to access in app futures like, breathing exercises, and calming soundscapes to help them wind down before bed.

Bedroom environment monitoring: Bellabeat app should be able to integrate with smart devices to track temperature, light, and noise levels, and offer personalized recommendations for optimal sleep environment.

Progress visualization: App showing sleep trends over time, highlight improvements, and offer challenges to maintain healthy sleep patterns.

Integration with other health apps: Connect with fitness trackers, mood trackers, and stress management apps to understand how other factors influence sleep.

Sleep Habit Recording:

Here some ideas,

Comfortable sleep tracker: Design a discreet and comfortable tracker,

It could take the form of a chest, bracelet, or ankle tracker—something lightweight that people can comfortably wear throughout the night.

Snoring detection and analysis: Help users identify potential sleep apnea.

It can be a michrophone built in to smart sleep tracker by utilizing algorithms that identfy the specific frequency range of snoring sounds at the same time filter out other background noise.

Heart-Relates must have App featuers:

Monitoring and Tracking:

Resting heart rate (RHR) tracking: Bellabeat app should be able to monitor daily, hourly RHR and offer personalized insights on stress levels, fitness progress, and overall heart health.

Electrocardiogram (ECG) capabilities: Bellabeat app, as an advanced app, can incorporate wearable ECG technology for on-demand heart rhythm monitoring, potentially detecting arrhythmias.

Activity tracking: Combinig with the steps and distance traker app, Bellabeat app should offer to monitor steps, distance, and active minutes, highlighting their impact on heart health and motivating consistent movement.

Weight and body composition tracking: Optional feature to understand weight-heart health connection and encourage healthy weight management.

Also, it can have an integrated feature to connect users with doctors or telehealth services for personalized guidance and consultations if there is a potentially concerns and need immediate attention.

In summary, It is important to note that the key is to find approaches that are engaging, personalized, and sustainable for individual users.

Combined with sleep and step tracker apps, Bellabeat app can monitor sleep duration and quality, steps taken highlighting their impact on heart health and providing tips for improvement.

By offering a combination of technology, community, and healthy habit integration, Bellabeat app can help people develop lasting habits for step counting, distance tracking, and sleeping quality that lead to a healthier and happier life.