1. Ask

Bellabeat’s business task is to unlock new growth opportunities by analysing smart device fitness data. This analysis looks at how consumers use their smart devices, identifies trends, and explores how these insights can shape the firm’s marketing strategy. The goal is to provide high-level recommendations that help the marketing team make informed, data-driven decisions.

Problem Statement: The problem we are trying to solve is to gain insights into how consumers use non-Bellabeat smart devices and then apply these insights to improve the marketing strategy of the Leaf.

By analysing smart device usage data, we can understand how consumers interact with similar products in the market. This can provide valuable insights for Bellabeat to improve their own product offerings, marketing campaigns, and customer engagement. These insights can then help drive the following business decisions:

Product Enhancements: By understanding the features and functionalities that consumers value most in non-Bellabeat smart devices, we can identify areas for improvement and prioritise the development of new features or enhancements for the Bellabeat Leaf.

Marketing Strategy: Insights into consumer behaviors and preferences can inform the marketing strategy for the Leaf product. This includes identifying the most effective marketing channels, optimising digital advertising campaigns, and tailoring messaging and communication to resonate with the target audience.

Customer Engagement: Understanding how consumers use non-Bellabeat smart devices can help improve the overall customer experience of the Leaf wellness tracker. This may involve enhancing user interfaces, introducing new features based on consumer preferences, and providing personalised recommendations or insights to drive user engagement and retention.

Key stakeholders: Key stakeholders for this project include Urška Sršen (co-founder), Sando Mur (co-founder), the Marketing Analytics team, Product Development team, Marketing team, and potentially external partners or vendors involved in marketing activities.

2. Prepare

The data was collected through a survey distributed via Amazon Mechanical Turk between March 12th and May 12th, 2016. Thirty Fitbit users consented to share their personal tracker data at second, minute, hour, and day level, covering heart rate, calories burned, activity intensity, MET, steps, and sleep. Every record is timestamped and linked to its user by a unique ID. The two files analysed here span April 12th to May 12th, 2016, as the summaries in section 11 show.

Before diving into the insights, a key disclaimer: the data is a snapshot of thirty users within a two-month window in 2016. Times have changed, and trends may have evolved since then. The data also contains no demographic information, so there is no way to confirm that it represents Bellabeat’s female-focused audience. Nonetheless, let’s see what it can tell us.

Constraints & ROCCC: The data used in this analysis comes with some challenges that we need to address. One major concern is the lack of information about users’ gender, which could introduce bias and affect our conclusions. We’ll have to consider carefully how this might impact our insights.

Another aspect to keep in mind is that the data is not up-to-date, as it was collected back in 2016. Since user habits and behaviors might have evolved since then, we need to be cautious when drawing conclusions based on this older data.

For this analysis, we’ve chosen to work with two fascinating datasets: dailyActivity_merged and sleepDay_merged. These datasets hold valuable information about users’ activity patterns and sleep habits, and we’re excited to explore the trends and patterns hidden within them using the powerful R programming language in RStudio. Let’s dive into the data and uncover some valuable insights.

R Packages used: tidyverse, lubridate, dplyr, ggplot2, tidyr, rmarkdown

3. Installing packages

install.packages(c("tidyverse", "lubridate", "ggplot2", "tidyr", "dplyr"))

## Installing packages into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)

4. Loading packages

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate)

library(dplyr)

library(ggplot2)

library(tidyr)

5. Loading datasets in R

dailyActivity <- read.csv("dailyActivity_merged.csv")
sleep <- read.csv("sleepDay_merged.csv")

6. Reviewing Data, Columns and Structure

I will review the imported data to familiarise myself with the column headings and data structure.

head(dailyActivity)

##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366    4/12/2016      13162          8.50            8.50
## 2 1503960366    4/13/2016      10735          6.97            6.97
## 3 1503960366    4/14/2016      10460          6.74            6.74
## 4 1503960366    4/15/2016       9762          6.28            6.28
## 5 1503960366    4/16/2016      12669          8.16            8.16
## 6 1503960366    4/17/2016       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728

colnames(dailyActivity)

##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"

str(dailyActivity)

## 'data.frame':    940 obs. of  15 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ TotalSteps              : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
##  $ TotalDistance           : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ Calories                : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...

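Findings: The ActivityDate column in the dailyActivity data frame is stored as character (chr) rather than as a date, so it needs converting before any date-based analysis.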
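The str() output also shows Id stored as num and printed as 1.5e+09, which hides the actual identifier. A minimal sketch of one way around this, assuming it is acceptable to treat Id as a label rather than a number: read it as character with colClasses. The inline CSV text and the name activity_sample are illustrative stand-ins for the real file.

```r
# Toy CSV text standing in for dailyActivity_merged.csv (values copied from head() above)
csv_text <- "Id,ActivityDate,TotalSteps
1503960366,4/12/2016,13162
1503960366,4/13/2016,10735"

# colClasses forces Id to character so it is kept as a label, not a number
activity_sample <- read.csv(text = csv_text, colClasses = c(Id = "character"))

str(activity_sample)
```

With the real data, the same colClasses argument could be passed alongside the file name in read.csv().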

Cleaning Duplicates

any(duplicated(dailyActivity))

## [1] FALSE

any(duplicated(sleep))

## [1] TRUE

7. Identifying duplicates

which(duplicated(sleep))

## [1] 162 224 381

Findings: The sleep dataset contains duplicated rows at positions 162, 224, and 381.
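Before dropping duplicates it can help to look at both copies side by side. The sketch below uses a small made-up data frame (sleep_sample is hypothetical, not the real data) to show that duplicated() flags only the later copy, and that adding fromLast = TRUE picks up the earlier one too.

```r
# Hypothetical sample with one exact duplicate (rows 1 and 2)
sleep_sample <- data.frame(
  Id = c(1, 1, 2, 2),
  SleepDay = c("4/12/2016", "4/12/2016", "4/12/2016", "4/13/2016"),
  TotalMinutesAsleep = c(327, 327, 412, 340)
)

# duplicated() marks only the second occurrence of a repeated row
later_copies <- which(duplicated(sleep_sample))

# Combining both directions marks every row that has a twin
dup_mask <- duplicated(sleep_sample) | duplicated(sleep_sample, fromLast = TRUE)

sleep_sample[dup_mask, ]
```

Applied to the real sleep data frame, the same mask would list each duplicated record together with its original.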

8. Modifying date columns

dailyActivity$ActivityDate <- as.Date(dailyActivity$ActivityDate, format = "%m/%d/%Y")
sleep$SleepDay <- as.Date(sleep$SleepDay, format = "%m/%d/%Y")
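A quick sketch of what this conversion does, assuming the raw SleepDay strings carry a trailing time component (the sample values below are an assumption about the raw file, not copied from it). as.Date() reads only as much of the string as the format describes, and returns NA for anything it cannot parse, so checking for NA afterwards is a cheap safeguard.

```r
# Assumed raw format: month/day/year followed by a time of day
raw_sleep_day <- c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM")

# Only the "%m/%d/%Y" part is consumed; the trailing time is ignored
parsed <- as.Date(raw_sleep_day, format = "%m/%d/%Y")
parsed

# A value that does not match the format becomes NA instead of raising an error
bad <- as.Date("not a date", format = "%m/%d/%Y")

# Sanity check worth running after any date conversion
anyNA(parsed)
```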

9. Removing duplicates

clean_Sleep <- sleep[!duplicated(sleep), ]
str(clean_Sleep)

## 'data.frame':    410 obs. of  5 variables:
##  $ Id                : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ SleepDay          : Date, format: "2016-04-12" "2016-04-13" ...
##  $ TotalSleepRecords : int  1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: int  327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : int  346 407 442 367 712 320 377 364 384 449 ...

Findings: Three duplicate rows were removed from the dataset, taking the observation count from 413 to 410.

10. Identifying null or n/a values

sum(is.na(dailyActivity))

## [1] 0

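# is.null() only tests whether the object itself is NULL; missing cells are counted by is.na() above
sum(is.null(dailyActivity))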

## [1] 0

sum(is.na(clean_Sleep))

## [1] 0

# As above: this confirms the object exists, while is.na() is the check for missing values
sum(is.null(clean_Sleep))

## [1] 0

Findings: Neither dataset contains any NA values.
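When a dataset does contain gaps, a per-column count is more informative than a single total. A minimal sketch on a made-up data frame (sample_df and its values are hypothetical) showing the difference between is.null() and is.na(), plus a count of fully complete rows:

```r
# Hypothetical sample with one missing value in each column
sample_df <- data.frame(
  TotalSteps = c(13162, NA, 10460),
  Calories   = c(1985, 1797, NA)
)

# is.null() asks whether the object exists at all, so it is FALSE here
object_is_null <- is.null(sample_df)

# is.na() works cell by cell; colSums() turns that into a count per column
na_per_column <- colSums(is.na(sample_df))
na_per_column

# complete.cases() flags rows with no missing values in any column
complete_rows <- sum(complete.cases(sample_df))
```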

11. Analysing the Fitness Data

Summary of the data - summary of both data sets

summary(dailyActivity)

##        Id             ActivityDate          TotalSteps    TotalDistance   
##  Min.   :1.504e+09   Min.   :2016-04-12   Min.   :    0   Min.   : 0.000  
##  1st Qu.:2.320e+09   1st Qu.:2016-04-19   1st Qu.: 3790   1st Qu.: 2.620  
##  Median :4.445e+09   Median :2016-04-26   Median : 7406   Median : 5.245  
##  Mean   :4.855e+09   Mean   :2016-04-26   Mean   : 7638   Mean   : 5.490  
##  3rd Qu.:6.962e+09   3rd Qu.:2016-05-04   3rd Qu.:10727   3rd Qu.: 7.713  
##  Max.   :8.878e+09   Max.   :2016-05-12   Max.   :36019   Max.   :28.030  
##  TrackerDistance  LoggedActivitiesDistance VeryActiveDistance
##  Min.   : 0.000   Min.   :0.0000           Min.   : 0.000    
##  1st Qu.: 2.620   1st Qu.:0.0000           1st Qu.: 0.000    
##  Median : 5.245   Median :0.0000           Median : 0.210    
##  Mean   : 5.475   Mean   :0.1082           Mean   : 1.503    
##  3rd Qu.: 7.710   3rd Qu.:0.0000           3rd Qu.: 2.053    
##  Max.   :28.030   Max.   :4.9421           Max.   :21.920    
##  ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
##  Min.   :0.0000           Min.   : 0.000      Min.   :0.000000       
##  1st Qu.:0.0000           1st Qu.: 1.945      1st Qu.:0.000000       
##  Median :0.2400           Median : 3.365      Median :0.000000       
##  Mean   :0.5675           Mean   : 3.341      Mean   :0.001606       
##  3rd Qu.:0.8000           3rd Qu.: 4.782      3rd Qu.:0.000000       
##  Max.   :6.4800           Max.   :10.710      Max.   :0.110000       
##  VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
##  Min.   :  0.00    Min.   :  0.00      Min.   :  0.0        Min.   :   0.0  
##  1st Qu.:  0.00    1st Qu.:  0.00      1st Qu.:127.0        1st Qu.: 729.8  
##  Median :  4.00    Median :  6.00      Median :199.0        Median :1057.5  
##  Mean   : 21.16    Mean   : 13.56      Mean   :192.8        Mean   : 991.2  
##  3rd Qu.: 32.00    3rd Qu.: 19.00      3rd Qu.:264.0        3rd Qu.:1229.5  
##  Max.   :210.00    Max.   :143.00      Max.   :518.0        Max.   :1440.0  
##     Calories   
##  Min.   :   0  
##  1st Qu.:1828  
##  Median :2134  
##  Mean   :2304  
##  3rd Qu.:2793  
##  Max.   :4900

summary(clean_Sleep)

##        Id               SleepDay          TotalSleepRecords TotalMinutesAsleep
##  Min.   :1.504e+09   Min.   :2016-04-12   Min.   :1.00      Min.   : 58.0     
##  1st Qu.:3.977e+09   1st Qu.:2016-04-19   1st Qu.:1.00      1st Qu.:361.0     
##  Median :4.703e+09   Median :2016-04-27   Median :1.00      Median :432.5     
##  Mean   :4.995e+09   Mean   :2016-04-26   Mean   :1.12      Mean   :419.2     
##  3rd Qu.:6.962e+09   3rd Qu.:2016-05-04   3rd Qu.:1.00      3rd Qu.:490.0     
##  Max.   :8.792e+09   Max.   :2016-05-12   Max.   :3.00      Max.   :796.0     
##  TotalTimeInBed 
##  Min.   : 61.0  
##  1st Qu.:403.8  
##  Median :463.0  
##  Mean   :458.5  
##  3rd Qu.:526.0  
##  Max.   :961.0

12. Bar Chart showing Fitness smartwatch user activity

Calculating user activity by percentage

Total_Activity_Minutes <- sum(dailyActivity$SedentaryMinutes) + 
  sum(dailyActivity$LightlyActiveMinutes) + 
  sum(dailyActivity$FairlyActiveMinutes) + 
  sum(dailyActivity$VeryActiveMinutes)

Activity_Percent <- data.frame(
  Sedentary = sum(dailyActivity$SedentaryMinutes) / Total_Activity_Minutes * 100,
  LightlyActive = sum(dailyActivity$LightlyActiveMinutes) / Total_Activity_Minutes * 100,
  FairlyActive = sum(dailyActivity$FairlyActiveMinutes) / Total_Activity_Minutes * 100,
  VeryActive = sum(dailyActivity$VeryActiveMinutes) / Total_Activity_Minutes * 100
)
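The same shares can be computed in one step. The sketch below is an alternative, not the method used in this analysis: it relies on base R's colSums() and prop.table(), and runs on a two-row sample (activity_sample is hypothetical, with values taken from the head() output above).

```r
# Two sample days; column names match the real dataset
activity_sample <- data.frame(
  SedentaryMinutes     = c(728, 776),
  LightlyActiveMinutes = c(328, 217),
  FairlyActiveMinutes  = c(13, 19),
  VeryActiveMinutes    = c(25, 21)
)

# Total minutes logged in each intensity band
minute_totals <- colSums(activity_sample)

# prop.table() divides each total by the grand total, giving shares that sum to 1
activity_percent <- round(prop.table(minute_totals) * 100, 2)
activity_percent
```

Because the result is a named vector, names(activity_percent) can feed a chart's category labels directly.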

13. Creating vectors from the results

Activity_Type <- c("Sedentary", "Lightly_Active", "Fairly_Active", "Very_Active")
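# Taken from Activity_Percent rather than typed by hand, so the values always match the data
Activity_Type_percentage <- round(unlist(Activity_Percent, use.names = FALSE), 2)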

14. Creating data frame

Activity_Type_Data <- data.frame(
  Activity_Type, Activity_Type_percentage
)

15. Creating Activity Bar chart

Activity_Bar_Chart <- ggplot(
  Activity_Type_Data,
  aes(x = Activity_Type, y = Activity_Type_percentage, fill = Activity_Type)
) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 40)) +
  labs(title = "User Activity Percentage", fill = "Activity Type")

Activity_Bar_Chart

16. Stacked Bar Chart by Week Day

Day_Week1 <- dailyActivity
Day_Week1$ActivityDate <- weekdays(Day_Week1$ActivityDate)

17. Activity by week day (from Lightly Active to Very Active)

Activity_chart <- Day_Week1 %>% 
  group_by(ActivityDate) %>% 
  summarize(
    lightly_active = sum(LightlyActiveMinutes),
    fairly_active = sum(FairlyActiveMinutes),
    very_active = sum(VeryActiveMinutes)
  ) %>%
  pivot_longer(-ActivityDate, names_to = "Activities") %>% 
  ggplot(aes(ActivityDate, value, fill = Activities)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 40)) +
  labs(title = "User Activity by Day of the Week",
       y = "Minutes",
       x = "Days of the Week")

# Add the annotation on the right-hand side
Activity_chart +
  annotate("text", x = Inf, y = Inf, hjust = 1, vjust = 1, label = "Tuesday is the busiest with Sundays & Mondays being the least active.")
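One caveat on the chart above: it sums minutes per weekday, and the 12 April to 12 May 2016 window contains five Tuesdays, Wednesdays and Thursdays but only four of every other weekday, so totals can favour those days. A hedged sketch of an alternative, using the mean per weekday and a fixed Monday-to-Sunday order. The sample data, day_labels and mean_by_day are illustrative names, and base R's aggregate() is used so the sketch runs without extra packages.

```r
# Hypothetical sample: eight consecutive days for one user, starting Tuesday 12 April 2016
activity_sample <- data.frame(
  ActivityDate = as.Date("2016-04-12") + 0:7,
  VeryActiveMinutes = c(25, 21, 30, 29, 36, 38, 42, 50)
)

# "%u" gives the ISO weekday number (1 = Monday), which does not depend on the locale
activity_sample$weekday_num <- as.integer(format(activity_sample$ActivityDate, "%u"))

# Mean rather than sum, so a weekday that occurs more often is not inflated
mean_by_day <- aggregate(VeryActiveMinutes ~ weekday_num, data = activity_sample, FUN = mean)

# An ordered factor keeps charts in calendar order instead of alphabetical order
day_labels <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
mean_by_day$weekday <- factor(day_labels[mean_by_day$weekday_num], levels = day_labels)

mean_by_day
```

In the sample, Tuesday appears twice, yet its mean stays comparable with the other days, which is the point of averaging.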

18. Correlation between calories burned and average total steps

library(ggplot2)

# Average calories and steps per user; the summarised data is piped straight into ggplot()
dailyActivity %>% 
  group_by(Id) %>%
  summarise(mean_calories = mean(Calories), mean_total_steps = mean(TotalSteps)) %>% 
  ggplot(aes(mean_calories, mean_total_steps)) + 
  geom_point() +
  geom_smooth() + 
  theme_minimal() + 
  labs(
    title = "Calories vs Total Steps",
    x = "Average Calories",
    y = "Average Steps")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Finding: Average steps and average calories burned show a positive relationship: users who take more steps tend to burn more calories.
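A scatter plot shows the direction of a relationship; a correlation coefficient puts a number on its strength. A minimal sketch with cor(), assuming per-user averages like those plotted above. The values in user_means are invented for illustration and are not results from this dataset.

```r
# Hypothetical per-user averages (illustrative values only)
user_means <- data.frame(
  mean_total_steps = c(4000, 7500, 9000, 12000, 15000),
  mean_calories    = c(1800, 2100, 2300, 2600, 3100)
)

# Pearson correlation: +1 is a perfect positive linear relationship, 0 is none
r <- cor(user_means$mean_total_steps, user_means$mean_calories)

# r squared is the share of variance in one variable explained linearly by the other
r_squared <- r^2

c(r = r, r_squared = r_squared)
```

Running the same call on the summarised real data would let each "Findings" statement quote a coefficient alongside the chart.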

19. Correlation between Very Active minutes and Calories

# Correlation between calories and very active minutes
dailyActivity %>% 
  group_by(Id) %>%
  summarise(mean_calories = mean(Calories), mean_very_active = mean(VeryActiveMinutes)) %>% 
  ggplot(aes(mean_calories, mean_very_active)) + 
  geom_point() +
  geom_smooth(color = "green") + 
  theme_minimal() + 
  labs(
    title = "Calories vs Very Active Minutes",
    x = "Average Calories",
    y = "Average Very Active Minutes")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Findings: The plot shows a strong positive relationship between average very active minutes and average calories burned. In other words, the more very active minutes users log, the more calories they tend to burn.

20. Correlation between Fairly Active minutes and Calories

# Correlation between calories and fairly active minutes
dailyActivity %>% 
  group_by(Id) %>%
  summarise(mean_calories = mean(Calories), mean_fairly_active = mean(FairlyActiveMinutes)) %>% 
  ggplot(aes(mean_calories, mean_fairly_active)) + 
  geom_point() +
  geom_smooth(color = "purple") + 
  theme_minimal() + 
  labs(
    title = "Calories vs Fairly Active Minutes",
    x = "Average Calories",
    y = "Average Fairly Active Minutes")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Findings: Calorie expenditure falls as the intensity and duration of activity drop. This shows in the relationship between average calories and average fairly active minutes: users who engage in less intense, shorter activity burn fewer calories.

21. Correlation between Lightly Active Minutes and Calories

# Correlation between calories and lightly active minutes
dailyActivity %>% 
  group_by(Id) %>%
  summarise(mean_calories = mean(Calories), mean_lightly_active = mean(LightlyActiveMinutes)) %>% 
  ggplot(aes(mean_calories, mean_lightly_active)) + 
  geom_point() +
  geom_smooth(color = "red") + 
  theme_minimal() + 
  labs(
    title = "Calories vs Lightly Active Minutes",
    x = "Average Calories",
    y = "Average Lightly Active Minutes")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Findings: The relationship between lightly active minutes and average calories burned is no longer positive. Unlike very active and fairly active minutes, where more minutes went with more calories burned, lightly active minutes show no such trend. This calls for further investigation into what drives calorie expenditure during lighter activity.

22. Correlation between Sleep and Calories

Correlation between calories and sleep

  # Join on user and date so each activity day is matched only to the same day's sleep record
  AVG_time_wasted_in_bed <- inner_join(dailyActivity, clean_Sleep, by = c("Id", "ActivityDate" = "SleepDay"))
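A join on Id alone pairs every activity day of a user with every sleep day of the same user, which multiplies rows. The sketch below shows the effect on toy data; activity_sample and sleep_sample are hypothetical. Base R's merge() is used so the example runs without extra packages, and with dplyr the equivalent key is by = c("Id", "ActivityDate" = "SleepDay").

```r
# Hypothetical sample: user 1 has two days of data, user 2 has one
activity_sample <- data.frame(
  Id = c(1, 1, 2),
  ActivityDate = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  Calories = c(1985, 1797, 2100)
)
sleep_sample <- data.frame(
  Id = c(1, 1, 2),
  SleepDay = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  TotalMinutesAsleep = c(327, 384, 412)
)

# Id only: user 1 contributes 2 x 2 rows and user 2 contributes 1, giving 5 rows
by_id_only <- merge(activity_sample, sleep_sample, by = "Id")

# Id and date: one row per user per day, giving 3 rows
by_id_and_day <- merge(
  activity_sample, sleep_sample,
  by.x = c("Id", "ActivityDate"), by.y = c("Id", "SleepDay")
)

c(id_only = nrow(by_id_only), id_and_day = nrow(by_id_and_day))
```

Comparing the joined row count with the row counts of the inputs is a quick check that a join behaved as intended.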

  # Average calories and minutes asleep per user
  AVG_time_wasted_in_bed %>% 
    group_by(Id) %>% 
    summarise(mean_calories = mean(Calories), mean_min_asleep = mean(TotalMinutesAsleep)) %>% 
    ggplot(aes(mean_calories, mean_min_asleep)) +
    geom_point() +
    geom_smooth(color = "cyan") +
    labs(
      title = "Calories vs Time Asleep", 
      x = "Average Calories",
      y = "Average Minutes Asleep"
    ) + theme_minimal()

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Findings: There is very little relationship between the average minutes spent sleeping and the average calories burned.

23. Clustered Bar Chart Showing Sleep Activities by Week Days

# Grouped bar chart showing time spent in bed by day of the week
time_wasted_bar_chart <- as.data.frame(clean_Sleep)
time_wasted_bar_chart$SleepDay <- weekdays(time_wasted_bar_chart$SleepDay)

dodge_chart <- time_wasted_bar_chart %>% 
  group_by(SleepDay) %>% 
  summarise(
    total_minutes = sum(TotalTimeInBed),
    minutes_asleep = sum(TotalMinutesAsleep),
    wasted_minutes_in_bed = sum(TotalTimeInBed) - sum(TotalMinutesAsleep)
  ) %>% 
  pivot_longer(-SleepDay, names_to = "sleep_activities")

ggplot(dodge_chart, aes(fill = sleep_activities, x = SleepDay, y = value)) +
  geom_bar(position = "dodge", stat = "identity") +
  theme(axis.text.x = element_text(angle = 45)) +
  labs(
    title = "Time Spent In Bed",
    x = "Days of the Week", 
    y = "Minutes"
  )

Findings: The graph reveals an intriguing contrast. Tuesdays and Wednesdays are the days when users spend the most time in bed asleep, and they are also the most active days. On Monday, the day with the least sleep, users also seem to be the least active. Part of this pattern may reflect record counts rather than behaviour: the chart sums minutes, and the 12 April to 12 May window contains five Tuesdays, Wednesdays and Thursdays but only four of every other weekday. The factors behind this pattern of sleep and activity merit further study.

Recommendations

Let users create their own personalised activity goals, tailored to their interests. Whether it’s yoga, running, or dancing, the app can encourage users to take charge of their fitness journey on their own terms, supporting consistency and habit formation.

Offer personalised bedtime recommendations and gentle reminders so that users get the rest they need, particularly ahead of their most active days.

Create a reward programme for the most dedicated users. Consistent engagement earns points that can be traded in for perks such as discounts on products and services, or gift cards.

For users who are less active, offer targeted promotions on Sundays and Mondays, the least active days in this data, to help motivate them when activity tends to dip.

Credits & Thank you’s

Data used for analysis: https://www.kaggle.com/datasets/arashnic/fitbit

Bellabeat_Case_Study_by_Josiah-T_Kirton

2023-07-19