ASK:

Business Task: I am analyzing Bellabeat’s data usage. They are a smart device manufacturing company targeted towards women’s health. The goal is to determine trends/consumer behavior from consumer data that prove useful to the marketing team’s strategy towards improved user engagement.

Features of Bellabeat’s smart device:
Alongside the watch, there is also the Bellabeat app which is downloadable, and they both include the following features:
Leaf: a tracker that records sleep, activity, and stress
Time: a watch feature that offers insight on daily wellness and tracks activity as well

Key Stakeholders:
Urška Sršen: Bellabeat’s co-founder, Chief Creative Officer
Sando Mur: Mathematician, Bellabeat’s cofounder, key member of the Bellabeat executive team
Bellabeat Marketing Analytics Team: Tasked with sifting through data and offering their insights to the marketing team

Objectives:
* find out how consumers use their product; through pattern recognition or trends
* suggest improvements for Bellabeat smart devices based on consumer data

PREPARE:

Data Source:
Kaggle’s ‘Fitbit Fitness Tracker Data’(CCO: Public Domain) is a dataset created by Mobius, where there is recorded information on 30 fitbit wearers, which includes daily activity, calories burned, and more that is related to health improvement.
Website Link: https://www.kaggle.com/datasets/arashnic/fitbit/data

Data Structure:
For this dataset, We are focusing on the four most important csv files: daily_steps, daily_sleep, daily_activity, and weight_log_info

#inserting necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(readr)
library(lubridate)
library(ggplot2)
## daily steps csv into data.frame
daily_steps <- read.csv("C:/Users/andre/Downloads/bellabeat_data/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailySteps_merged.csv")
#Key variables: Id, ActivityDay, StepTotal
head(daily_steps)
##           Id ActivityDay StepTotal
## 1 1503960366   4/12/2016     13162
## 2 1503960366   4/13/2016     10735
## 3 1503960366   4/14/2016     10460
## 4 1503960366   4/15/2016      9762
## 5 1503960366   4/16/2016     12669
## 6 1503960366   4/17/2016      9705
## daily sleep csv into data.frame
daily_sleep <- read.csv("C:/Users/andre/Downloads/bellabeat_data/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
#Key variables: Id, SleepDay, TotalMinutesAsleep
head(daily_sleep)
##           Id              SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304
##   TotalTimeInBed
## 1            346
## 2            407
## 3            442
## 4            367
## 5            712
## 6            320
## daily activity csv into data.frame
daily_activity <- read.csv("C:/Users/andre/Downloads/bellabeat_data/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
#Key variables: Id, ActivityDate, TotalSteps, SedentaryMinutes, Calories, VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes
head(daily_activity)
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366    4/12/2016      13162          8.50            8.50
## 2 1503960366    4/13/2016      10735          6.97            6.97
## 3 1503960366    4/14/2016      10460          6.74            6.74
## 4 1503960366    4/15/2016       9762          6.28            6.28
## 5 1503960366    4/16/2016      12669          8.16            8.16
## 6 1503960366    4/17/2016       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728
## weight log csv into data.frame
weight_log <- read.csv("C:/Users/andre/Downloads/bellabeat_data/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")
#Key variables: Id, Date, WeightPounds, IsManualReport
head(weight_log)
##           Id                  Date WeightKg WeightPounds Fat   BMI
## 1 1503960366  5/2/2016 11:59:59 PM     52.6     115.9631  22 22.65
## 2 1503960366  5/3/2016 11:59:59 PM     52.6     115.9631  NA 22.65
## 3 1927972279  4/13/2016 1:08:52 AM    133.5     294.3171  NA 47.54
## 4 2873212765 4/21/2016 11:59:59 PM     56.7     125.0021  NA 21.45
## 5 2873212765 5/12/2016 11:59:59 PM     57.3     126.3249  NA 21.69
## 6 4319703577 4/17/2016 11:59:59 PM     72.4     159.6147  25 27.45
##   IsManualReport        LogId
## 1           True 1.462234e+12
## 2           True 1.462320e+12
## 3          False 1.460510e+12
## 4           True 1.461283e+12
## 5           True 1.463098e+12
## 6           True 1.460938e+12

PROCESS:

Objective:
Checking for nulls, duplicates, and converting to correct data types for each csv.

Daily_steps:

#check for missing values
summary(daily_steps)
##        Id            ActivityDay          StepTotal    
##  Min.   :1.504e+09   Length:940         Min.   :    0  
##  1st Qu.:2.320e+09   Class :character   1st Qu.: 3790  
##  Median :4.445e+09   Mode  :character   Median : 7406  
##  Mean   :4.855e+09                      Mean   : 7638  
##  3rd Qu.:6.962e+09                      3rd Qu.:10727  
##  Max.   :8.878e+09                      Max.   :36019
#checking data type
class(daily_steps$Id)
## [1] "numeric"
class(daily_steps$ActivityDay)
## [1] "character"
#change from character to date data type
daily_steps$ActivityDay <- mdy(daily_steps$ActivityDay)
class(daily_steps$ActivityDay)
## [1] "Date"
class(daily_steps$StepTotal)
## [1] "integer"
cleaned_daily_steps <- daily_steps %>%
  group_by(Id) %>% # groups by ID
  drop_na %>% #removes missing values
  distinct() %>% #removes duplicates
  rename(Date = ActivityDay) %>%
  head()

Daily_sleep:

summary(daily_sleep)
##        Id              SleepDay         TotalSleepRecords TotalMinutesAsleep
##  Min.   :1.504e+09   Length:413         Min.   :1.000     Min.   : 58.0     
##  1st Qu.:3.977e+09   Class :character   1st Qu.:1.000     1st Qu.:361.0     
##  Median :4.703e+09   Mode  :character   Median :1.000     Median :433.0     
##  Mean   :5.001e+09                      Mean   :1.119     Mean   :419.5     
##  3rd Qu.:6.962e+09                      3rd Qu.:1.000     3rd Qu.:490.0     
##  Max.   :8.792e+09                      Max.   :3.000     Max.   :796.0     
##  TotalTimeInBed 
##  Min.   : 61.0  
##  1st Qu.:403.0  
##  Median :463.0  
##  Mean   :458.6  
##  3rd Qu.:526.0  
##  Max.   :961.0
#checking data type
class(daily_sleep$Id)
## [1] "numeric"
class(daily_sleep$SleepDay)
## [1] "character"
daily_sleep$SleepDay <- mdy_hms(daily_sleep$SleepDay)
class(daily_sleep$SleepDay)
## [1] "POSIXct" "POSIXt"
class(daily_sleep$TotalMinutesAsleep)
## [1] "integer"
cleaned_daily_sleep <- daily_sleep %>%
  select(Id, SleepDay, TotalMinutesAsleep) %>% ##selecting the important variables
  group_by(Id) %>% ## groups by ID
  drop_na() %>%## removes missing values
  distinct() %>% ##removes duplicates
  rename(Date = SleepDay) %>%
  head()

Daily_Activity:
- created a new variable, TotalMinutesActive, by summarizing the values of LightlyActiveMinutes, FairlyActiveMinutes, and VeryActiveMinutes from each observation
- covert ActivityDate from ‘chr’ to ‘date’

summary(daily_activity)
##        Id            ActivityDate         TotalSteps    TotalDistance   
##  Min.   :1.504e+09   Length:940         Min.   :    0   Min.   : 0.000  
##  1st Qu.:2.320e+09   Class :character   1st Qu.: 3790   1st Qu.: 2.620  
##  Median :4.445e+09   Mode  :character   Median : 7406   Median : 5.245  
##  Mean   :4.855e+09                      Mean   : 7638   Mean   : 5.490  
##  3rd Qu.:6.962e+09                      3rd Qu.:10727   3rd Qu.: 7.713  
##  Max.   :8.878e+09                      Max.   :36019   Max.   :28.030  
##  TrackerDistance  LoggedActivitiesDistance VeryActiveDistance
##  Min.   : 0.000   Min.   :0.0000           Min.   : 0.000    
##  1st Qu.: 2.620   1st Qu.:0.0000           1st Qu.: 0.000    
##  Median : 5.245   Median :0.0000           Median : 0.210    
##  Mean   : 5.475   Mean   :0.1082           Mean   : 1.503    
##  3rd Qu.: 7.710   3rd Qu.:0.0000           3rd Qu.: 2.053    
##  Max.   :28.030   Max.   :4.9421           Max.   :21.920    
##  ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
##  Min.   :0.0000           Min.   : 0.000      Min.   :0.000000       
##  1st Qu.:0.0000           1st Qu.: 1.945      1st Qu.:0.000000       
##  Median :0.2400           Median : 3.365      Median :0.000000       
##  Mean   :0.5675           Mean   : 3.341      Mean   :0.001606       
##  3rd Qu.:0.8000           3rd Qu.: 4.782      3rd Qu.:0.000000       
##  Max.   :6.4800           Max.   :10.710      Max.   :0.110000       
##  VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
##  Min.   :  0.00    Min.   :  0.00      Min.   :  0.0        Min.   :   0.0  
##  1st Qu.:  0.00    1st Qu.:  0.00      1st Qu.:127.0        1st Qu.: 729.8  
##  Median :  4.00    Median :  6.00      Median :199.0        Median :1057.5  
##  Mean   : 21.16    Mean   : 13.56      Mean   :192.8        Mean   : 991.2  
##  3rd Qu.: 32.00    3rd Qu.: 19.00      3rd Qu.:264.0        3rd Qu.:1229.5  
##  Max.   :210.00    Max.   :143.00      Max.   :518.0        Max.   :1440.0  
##     Calories   
##  Min.   :   0  
##  1st Qu.:1828  
##  Median :2134  
##  Mean   :2304  
##  3rd Qu.:2793  
##  Max.   :4900
class(daily_activity$Id)
## [1] "numeric"
class(daily_activity$ActivityDate)
## [1] "character"
daily_activity$ActivityDate <- mdy(daily_activity$ActivityDate)
class(daily_activity$ActivityDate)
## [1] "Date"
class(daily_activity$TotalSteps)
## [1] "integer"
class(daily_activity$Calories)
## [1] "integer"
class(daily_activity$SedentaryMinutes)
## [1] "integer"
class(daily_activity$TotalSteps)
## [1] "integer"
class(daily_activity$VeryActiveMinutes)
## [1] "integer"
class(daily_activity$FairlyActiveMinutes)
## [1] "integer"
class(daily_activity$LightlyActiveMinutes)
## [1] "integer"
cleaned_daily_activity <- daily_activity %>%
  select(Id, ActivityDate, TotalSteps, Calories, SedentaryMinutes, FairlyActiveMinutes, VeryActiveMinutes, LightlyActiveMinutes) %>%
  group_by(Id) %>% ## groups by ID
  drop_na() %>%## removes missing values
  distinct() %>% ##removes duplicates
  mutate(TotalMinutesActive = FairlyActiveMinutes + LightlyActiveMinutes + VeryActiveMinutes) %>%
  rename(Date = ActivityDate) %>%
  rename(StepTotal = TotalSteps) %>%
  head()

weight_log:

summary(weight_log)
##        Id                Date              WeightKg       WeightPounds  
##  Min.   :1.504e+09   Length:67          Min.   : 52.60   Min.   :116.0  
##  1st Qu.:6.962e+09   Class :character   1st Qu.: 61.40   1st Qu.:135.4  
##  Median :6.962e+09   Mode  :character   Median : 62.50   Median :137.8  
##  Mean   :7.009e+09                      Mean   : 72.04   Mean   :158.8  
##  3rd Qu.:8.878e+09                      3rd Qu.: 85.05   3rd Qu.:187.5  
##  Max.   :8.878e+09                      Max.   :133.50   Max.   :294.3  
##                                                                         
##       Fat             BMI        IsManualReport         LogId          
##  Min.   :22.00   Min.   :21.45   Length:67          Min.   :1.460e+12  
##  1st Qu.:22.75   1st Qu.:23.96   Class :character   1st Qu.:1.461e+12  
##  Median :23.50   Median :24.39   Mode  :character   Median :1.462e+12  
##  Mean   :23.50   Mean   :25.19                      Mean   :1.462e+12  
##  3rd Qu.:24.25   3rd Qu.:25.56                      3rd Qu.:1.462e+12  
##  Max.   :25.00   Max.   :47.54                      Max.   :1.463e+12  
##  NA's   :65
class(weight_log$Id)
## [1] "numeric"
class(weight_log$Date)
## [1] "character"
weight_log$Date <- mdy_hms(weight_log$Date)
class(weight_log$Date)
## [1] "POSIXct" "POSIXt"
class(weight_log$IsManualReport)
## [1] "character"
weight_log$IsManualReport <- as.logical(weight_log$IsManualReport)
class(weight_log$IsManualReport)
## [1] "logical"
cleaned_weight_log <- weight_log %>%
  mutate(changed_weight_date = as.Date(weight_log$Date,format = "%m/%d/%Y")) %>% #converting the dttm to Date data type
  select(Id, changed_weight_date, WeightPounds,IsManualReport) %>%
  group_by(Id) %>% # groups by ID
  drop_na() %>% #removes missing values
  distinct() %>% #removes duplicates
  rename(Date = changed_weight_date) %>% #renames to 'Date'
  head()

Merging:

merged_steps_sleep <- merge(x=cleaned_daily_steps, y=cleaned_daily_sleep, by = c("Id", "Date")) 
head(merged_steps_sleep)
##           Id       Date StepTotal TotalMinutesAsleep
## 1 1503960366 2016-04-12     13162                327
## 2 1503960366 2016-04-13     10735                384
## 3 1503960366 2016-04-15      9762                412
## 4 1503960366 2016-04-16     12669                340
## 5 1503960366 2016-04-17      9705                700
merged_steps_activity <- merge(x=cleaned_daily_steps, y=cleaned_daily_activity, by = c("Id", "Date", "StepTotal")) 
head(merged_steps_activity)
##           Id       Date StepTotal Calories SedentaryMinutes FairlyActiveMinutes
## 1 1503960366 2016-04-12     13162     1985              728                  13
## 2 1503960366 2016-04-13     10735     1797              776                  19
## 3 1503960366 2016-04-14     10460     1776             1218                  11
## 4 1503960366 2016-04-15      9762     1745              726                  34
## 5 1503960366 2016-04-16     12669     1863              773                  10
## 6 1503960366 2016-04-17      9705     1728              539                  20
##   VeryActiveMinutes LightlyActiveMinutes TotalMinutesActive
## 1                25                  328                366
## 2                21                  217                257
## 3                30                  181                222
## 4                29                  209                272
## 5                36                  221                267
## 6                38                  164                222
merged_activity_weight <- merge(x=cleaned_daily_activity, y=cleaned_weight_log, by = c("Id","Date"))
head(merged_activity_weight)
##  [1] Id                   Date                 StepTotal           
##  [4] Calories             SedentaryMinutes     FairlyActiveMinutes 
##  [7] VeryActiveMinutes    LightlyActiveMinutes TotalMinutesActive  
## [10] WeightPounds         IsManualReport      
## <0 rows> (or 0-length row.names)
merged_total <- merge(x=merged_steps_sleep, y=merged_activity_weight, by = c("Id","Date"))
head(merged_total)
##  [1] Id                   Date                 StepTotal.x         
##  [4] TotalMinutesAsleep   StepTotal.y          Calories            
##  [7] SedentaryMinutes     FairlyActiveMinutes  VeryActiveMinutes   
## [10] LightlyActiveMinutes TotalMinutesActive   WeightPounds        
## [13] IsManualReport      
## <0 rows> (or 0-length row.names)

ANALYZE/SHARE:

I graphed the Relationship between Steps and Minutes Asleep, and the trend line shows that there is no correlation between ‘Total Steps’ and “Total Minutes Asleep’. The steps vary even though the total minutes asleep was fairly constant. This shows that regardless of the amount of sleep each user gets, the ‘StepTotal’ is independent of that variable.

ggplot(merged_steps_sleep, aes(x = StepTotal, y = TotalMinutesAsleep)) +
  geom_point() + # Scatter points
  geom_smooth() + # Add a regression line
  labs(title = "Relationship between Steps and Minutes Asleep", x = "Total Steps", y = "Total Minutes Asleep") +
  scale_x_continuous(breaks = c(0, 5000, 10000, 15000, 20000)) + 
  scale_y_continuous(breaks = c(200, 400, 600, 800))


I graphed the relationship between TotalMinutesActive vs SedentaryMinutes Daily in order to be able to compare the daily patterns of users’ ‘Active Minutes’ to their ‘Sedentary Minutes’. The average amount of minutes sedentary was found to be ~991.2, and the average minutes active was ~227.5. These values are important because it shows that ~1/5 of the time, users are active in comparision to being sedentary.

#calculating the average amount of minutes active
average_active <- mean(merged_steps_activity$TotalMinutesActive)

#calculating the average amount of minutes sedentary
average_sedentary <- mean(merged_steps_activity$SedentaryMinutes)

#graphing the relationship between TotalMinutesActive vs SedentaryMinutes Daily
ggplot(merged_steps_activity, aes(x = Date)) +
  geom_line(aes(y = TotalMinutesActive, color = 'Active Minutes')) + 
  geom_line(aes(y= SedentaryMinutes, color = 'Sedentary Minutes')) + 
  labs(title = "Active and Sedentary Minutes Over Time", x = "Date", y = "Minutes") +
  geom_hline(aes(yintercept = average_active, color = paste('Average Active:', round(average_active, 1))),linetype = "dotted", size = 1.9) +
  geom_hline(aes(yintercept = average_sedentary, color = paste('Average Sedentary:', round(average_sedentary, 1))),linetype = "dotted", size = 1.9)


I graphed the relationship between ‘Average Steps by Weekday’. I found that Sunday was when most users averaged their least amount of recorded steps.

# Add a new column for the weekday
merged_steps_activity$Weekday <- weekdays(merged_steps_activity$Date)

# Bar plot showing average steps by weekday
ggplot(merged_steps_activity, aes(x = Weekday, y = StepTotal)) +
  stat_summary(fun = mean, geom = "bar", fill = "skyblue") +
  labs(title = "Average Steps by Weekday", x = "Weekday", y = "Average Steps"
  ) 


For every individual ‘Id’, I graphed a scatterplot between ‘Very Active Minutes and Sedentary Minutes’. I found that there was no strong correlation between the two variables despite hoping to see an inverse relationship, assuming that users with more active minutes would be less sedentary.

minutes_greater_filter <- merged_steps_activity %>%
  filter(VeryActiveMinutes >= 21) 
str(minutes_greater_filter)
## 'data.frame':    6 obs. of  9 variables:
##  $ Id                  : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ Date                : Date, format: "2016-04-12" "2016-04-13" ...
##  $ StepTotal           : int  13162 10735 10460 9762 12669 9705
##  $ Calories            : int  1985 1797 1776 1745 1863 1728
##  $ SedentaryMinutes    : int  728 776 1218 726 773 539
##  $ FairlyActiveMinutes : int  13 19 11 34 10 20
##  $ VeryActiveMinutes   : int  25 21 30 29 36 38
##  $ LightlyActiveMinutes: int  328 217 181 209 221 164
##  $ TotalMinutesActive  : int  366 257 222 272 267 222
# Filter the dataset to include only rows where TotalMinutesActive >= 30
filtered_merged_steps_activity <- minutes_greater_filter %>%
  filter(Id %in% unique(Id)[1:9]) # Select only the first 10 unique Ids
# Scatter plot of steps vs. active minutes, faceted by Id
ggplot(filtered_merged_steps_activity, aes(x = VeryActiveMinutes, y = SedentaryMinutes)) +
  geom_point() + # Scatter points
  geom_smooth(method = "lm", color = 'red', se = FALSE) + # Add a regression line
  labs(
    title = "Relationship between Very Active Minutes and Sedentary Minutes by User",
    x = "Very Active Minutes",
    y = "Sedentary Minutes"
  ) +
  facet_wrap(~Id)

Insights:

Steps_vs_Minutes_Asleep: Due to the fact that the ‘StepTotal’ has no impact on ‘Total Minutes Asleep’, it would be best for Bellabeat to prioritize developing apps towards factors affecting sleep. This would be more useful for users.

Active_vs_Sedentary_Minutes: Because of the overwelming average difference between Sedentary and Active minutes, Bellabeat should prioritize trying to urge users to split up some of their sedentary time with some minutes active, which would be beneficial towards users’ health, and it would help lessen the gap between the averages.

Average_Steps_by_Weekday:
The sudden dip in average steps on Sundays is to be expected due to that being most peoples’ days of rest. Bellabeat can try and increase this total by introducing incentive such as weekly goals or friendly competitions against others (via social media).

Very_Active_vs_Sedentary_minutes: The results suggest that being highly active at times does not necessarily reduce overall sedentary behavior. Because being active for at least 21 minutes is enough exercise daily, Bellabeat can try to incorporate mini workouts throughout the day to try and fulfill this goal, ensuring users maintain good health.
Source:https://www.cnbc.com/2022/08/30/this-is-how-much-you-should-exercise-to-achieve-the-maximum-benefit.html#:~:text=According%20to%20their%20research%2C%20funded,(21%2D42%20minutes%20daily)

ACT:

With the information recieved from analysis, we can conclude that Bellabeat should target improving current features, especially surrounding the Time feature because it would better benefit users’ overall health. It would also be in their best interest to add new features that compliment their existing ones such as an app that allows for designing periodic workouts, which would help break the wide degree of minutes being sedentary into smaller chunks. These suggestions would not only boost user interactivity, but satisfaction and higher ratings due to the heavy emphasis on health.