Bellabeat has an exciting business task ahead: unlocking new growth opportunities by analysing smart device fitness data. I will look at how consumers use their smart devices, identify trends, and explore how these insights can shape the firm’s marketing strategy. The goal is to provide three high-level recommendations that will help the marketing team make informed, data-driven decisions.
Problem Statement: The problem we are trying to solve is to gain insights into how consumers use non-Bellabeat smart devices and then apply these insights to improve the marketing strategy for the Leaf.
By analysing smart device usage data, we can understand how consumers interact with similar products in the market. This can provide valuable insights for Bellabeat to improve their own product offerings, marketing campaigns, and customer engagement. These insights can then help drive the following business decisions:
Product Enhancements: By understanding the features and functionalities that consumers value the most in non-Bellabeat smart devices, we can identify areas for improvement and prioritise the development of new features or enhancements for Bellabeat’s Leaf.
Marketing Strategy: Insights into consumer behaviours and preferences can inform the marketing strategy for the Leaf. This includes identifying the most effective marketing channels, optimising digital advertising campaigns, and tailoring messaging and communication to resonate with the target audience.
Customer Engagement: Understanding how consumers use non-Bellabeat smart devices can help improve the overall customer experience of the Leaf wellness tracker. This may involve enhancing user interfaces, introducing new features based on consumer preferences, and providing personalised recommendations or insights to drive user engagement and retention.
Key stakeholders: Key stakeholders for this project include Urška Sršen (co-founder), Sando Mur (co-founder), the Marketing Analytics team, Product Development team, Marketing team, and potentially external partners or vendors involved in marketing activities.
Fitbit users shared their personal tracker data through a survey distributed via Amazon Mechanical Turk. From March 12th to May 12th, 2016, thirty Fitbit users consented to share their second-level, minute-level, hour-level, and day-level data, covering heart rate, calories burned, activity intensity, MET, steps, and sleep. Each data point is timestamped and linked to the corresponding user by a unique ID.
Now, let’s dive into the treasure trove of insights, but first, a key disclaimer! Although the data packs a punch, it’s worth noting that it represents a snapshot of thirty users within a two-month window from 2016 (the two files analysed below cover 12 April to 12 May 2016). It’s crucial to recognise that times have changed, and trends might have evolved since then. Moreover, while this data might be a fitness aficionado’s dream, it doesn’t provide demographic information, so we cannot confirm that it represents Bellabeat’s female-focused audience. Nonetheless, let’s seek to uncover the gems within.
Constraints & ROCCC: The data used in this analysis comes with some challenges that we need to address. One of the major concerns is the lack of information about users’ gender, which could introduce bias and affect our conclusions. We’ll have to carefully consider how this might impact our insights.
Another aspect to keep in mind is that the data is not up to date, as it was collected back in 2016. Since user habits and behaviours might have evolved since then, we need to be cautious when drawing conclusions from this older data.
For this analysis, we’ve chosen to work with two fascinating datasets: dailyActivity_merged and sleepDay_merged. These datasets hold valuable information about users’ activity patterns and sleep habits, and we’re excited to explore the trends and patterns hidden within them using the powerful R programming language in RStudio. Let’s dive into the data and uncover some valuable insights.
R Packages used: tidyverse, lubridate, dplyr, ggplot2, tidyr, rmarkdown
install.packages(c("tidyverse", "lubridate", "ggplot2", "tidyr", "dplyr"))
## Installing packages into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(dplyr)
library(ggplot2)
library(tidyr)
dailyActivity <- read.csv("dailyActivity_merged.csv")
sleep <- read.csv("sleepDay_merged.csv")
I will review the imported data to familiarise myself with the column headings and data structure.
head(dailyActivity)
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 4/12/2016 13162 8.50 8.50
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 1.88 0.55
## 2 0 1.57 0.69
## 3 0 2.44 0.40
## 4 0 2.14 1.26
## 5 0 2.71 0.41
## 6 0 3.19 0.78
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 6.06 0 25
## 2 4.71 0 21
## 3 3.91 0 30
## 4 2.83 0 29
## 5 5.04 0 36
## 6 2.51 0 38
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 13 328 728 1985
## 2 19 217 776 1797
## 3 11 181 1218 1776
## 4 34 209 726 1745
## 5 10 221 773 1863
## 6 20 164 539 1728
colnames(dailyActivity)
## [1] "Id" "ActivityDate"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories"
str(dailyActivity)
## 'data.frame': 940 obs. of 15 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ TotalSteps : int 13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
## $ TotalDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : int 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : int 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : int 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : int 728 776 1218 726 773 539 1149 775 818 838 ...
## $ Calories : int 1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
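Findings: The ActivityDate column in the dailyActivity data frame is stored as character (chr), so it needs converting to a Date before any date-based analysis.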
Cleaning Duplicates
any(duplicated(dailyActivity))
## [1] FALSE
any(duplicated(sleep))
## [1] TRUE
which(duplicated(sleep))
## [1] 162 224 381
Findings: There are duplicated rows in the sleep dataset in rows 162, 224, & 381.
dailyActivity$ActivityDate <- as.Date(dailyActivity$ActivityDate, format = "%m/%d/%Y")
sleep$SleepDay <- as.Date(sleep$SleepDay, format = "%m/%d/%Y")
clean_Sleep <- sleep[!duplicated(sleep), ]
str(clean_Sleep)
## 'data.frame': 410 obs. of 5 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ SleepDay : Date, format: "2016-04-12" "2016-04-13" ...
## $ TotalSleepRecords : int 1 2 1 2 1 1 1 1 1 1 ...
## $ TotalMinutesAsleep: int 327 384 412 340 700 304 360 325 361 430 ...
## $ TotalTimeInBed : int 346 407 442 367 712 320 377 364 384 449 ...
Findings: Three duplicate rows were removed from the sleep dataset; the observation count went from 413 to 410.
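Before dropping duplicates it can help to look at every copy of a repeated row, not only the later repeats that duplicated() flags by default. The sketch below uses a small made-up data frame (toy_sleep and its values are assumptions for illustration, not rows from sleepDay_merged) and shows dplyr's distinct() as an alternative to the base R subsetting used above.

```r
library(dplyr)

# Toy stand-in for the sleep data; these values are invented for illustration.
toy_sleep <- data.frame(
  Id = c(1, 1, 2, 2, 2),
  SleepDay = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12",
                       "2016-04-12", "2016-04-13")),
  TotalMinutesAsleep = c(327, 384, 412, 412, 340)
)

# Flag every copy of a duplicated row (first occurrence and its repeats).
is_dup <- duplicated(toy_sleep) | duplicated(toy_sleep, fromLast = TRUE)
dup_rows <- toy_sleep[is_dup, ]
print(dup_rows)

# distinct() keeps the first occurrence of each unique row.
toy_clean <- distinct(toy_sleep)

# Number of rows removed by de-duplication.
rows_removed <- nrow(toy_sleep) - nrow(toy_clean)
print(rows_removed)
```

Seeing both copies side by side makes it easier to confirm that the rows really are exact repeats before they are removed.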
sum(is.na(dailyActivity))
## [1] 0
sum(is.null(dailyActivity))
## [1] 0
sum(is.na(clean_Sleep))
## [1] 0
sum(is.null(clean_Sleep))
## [1] 0
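Findings: Neither dataset contains NA values. Note that is.null() only tests whether the object itself is NULL, so it always returns FALSE for a data frame; the is.na() counts are the meaningful check here.
Summary statistics for both datasets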
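A per-column view of missing values can be more informative than a single total. The sketch below is a minimal example on made-up data (toy_activity and its values are assumptions). It also counts zero-step days, since the summary of dailyActivity shows minimum values of 0 for steps and calories; treating those as days when the tracker may not have been worn is our assumption, not something the data states.

```r
# Toy stand-in for the activity data; these values are invented for illustration.
toy_activity <- data.frame(
  Id = c(1, 1, 2),
  TotalSteps = c(13162, NA, 0),
  Calories = c(1985, 1797, 0)
)

# is.null() asks whether the object itself is NULL, so it is FALSE here.
object_is_null <- is.null(toy_activity)

# Count missing values column by column.
na_per_column <- colSums(is.na(toy_activity))
print(na_per_column)

# Zero-step days are not NA, but may be worth reviewing separately.
zero_step_days <- sum(toy_activity$TotalSteps == 0, na.rm = TRUE)
print(zero_step_days)
```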
summary(dailyActivity)
## Id ActivityDate TotalSteps TotalDistance
## Min. :1.504e+09 Min. :2016-04-12 Min. : 0 Min. : 0.000
## 1st Qu.:2.320e+09 1st Qu.:2016-04-19 1st Qu.: 3790 1st Qu.: 2.620
## Median :4.445e+09 Median :2016-04-26 Median : 7406 Median : 5.245
## Mean :4.855e+09 Mean :2016-04-26 Mean : 7638 Mean : 5.490
## 3rd Qu.:6.962e+09 3rd Qu.:2016-05-04 3rd Qu.:10727 3rd Qu.: 7.713
## Max. :8.878e+09 Max. :2016-05-12 Max. :36019 Max. :28.030
## TrackerDistance LoggedActivitiesDistance VeryActiveDistance
## Min. : 0.000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 2.620 1st Qu.:0.0000 1st Qu.: 0.000
## Median : 5.245 Median :0.0000 Median : 0.210
## Mean : 5.475 Mean :0.1082 Mean : 1.503
## 3rd Qu.: 7.710 3rd Qu.:0.0000 3rd Qu.: 2.053
## Max. :28.030 Max. :4.9421 Max. :21.920
## ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
## Min. :0.0000 Min. : 0.000 Min. :0.000000
## 1st Qu.:0.0000 1st Qu.: 1.945 1st Qu.:0.000000
## Median :0.2400 Median : 3.365 Median :0.000000
## Mean :0.5675 Mean : 3.341 Mean :0.001606
## 3rd Qu.:0.8000 3rd Qu.: 4.782 3rd Qu.:0.000000
## Max. :6.4800 Max. :10.710 Max. :0.110000
## VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:127.0 1st Qu.: 729.8
## Median : 4.00 Median : 6.00 Median :199.0 Median :1057.5
## Mean : 21.16 Mean : 13.56 Mean :192.8 Mean : 991.2
## 3rd Qu.: 32.00 3rd Qu.: 19.00 3rd Qu.:264.0 3rd Qu.:1229.5
## Max. :210.00 Max. :143.00 Max. :518.0 Max. :1440.0
## Calories
## Min. : 0
## 1st Qu.:1828
## Median :2134
## Mean :2304
## 3rd Qu.:2793
## Max. :4900
summary(clean_Sleep)
## Id SleepDay TotalSleepRecords TotalMinutesAsleep
## Min. :1.504e+09 Min. :2016-04-12 Min. :1.00 Min. : 58.0
## 1st Qu.:3.977e+09 1st Qu.:2016-04-19 1st Qu.:1.00 1st Qu.:361.0
## Median :4.703e+09 Median :2016-04-27 Median :1.00 Median :432.5
## Mean :4.995e+09 Mean :2016-04-26 Mean :1.12 Mean :419.2
## 3rd Qu.:6.962e+09 3rd Qu.:2016-05-04 3rd Qu.:1.00 3rd Qu.:490.0
## Max. :8.792e+09 Max. :2016-05-12 Max. :3.00 Max. :796.0
## TotalTimeInBed
## Min. : 61.0
## 1st Qu.:403.8
## Median :463.0
## Mean :458.5
## 3rd Qu.:526.0
## Max. :961.0
Calculating user activity by percentage
# Total tracked minutes across all four intensity levels
Total_Activity_Minutes <- sum(dailyActivity$SedentaryMinutes) +
  sum(dailyActivity$LightlyActiveMinutes) +
  sum(dailyActivity$FairlyActiveMinutes) +
  sum(dailyActivity$VeryActiveMinutes)
# Share of total minutes spent at each intensity level
Activity_Percent <- data.frame(
  Sedentary = sum(dailyActivity$SedentaryMinutes) / Total_Activity_Minutes * 100,
  LightlyActive = sum(dailyActivity$LightlyActiveMinutes) / Total_Activity_Minutes * 100,
  FairlyActive = sum(dailyActivity$FairlyActiveMinutes) / Total_Activity_Minutes * 100,
  VeryActive = sum(dailyActivity$VeryActiveMinutes) / Total_Activity_Minutes * 100
)
Activity_Type <- c("Sedentary", "Lightly_Active", "Fairly_Active", "Very_Active")
# Take the percentages from Activity_Percent (same column order) instead of typing them in by hand
Activity_Type_percentage <- round(unlist(Activity_Percent, use.names = FALSE), 2)
Activity_Type_Data <- data.frame(Activity_Type, Activity_Type_percentage)
Activity_Bar_Chart <- ggplot(Activity_Type_Data, aes(x = Activity_Type, y = Activity_Type_percentage, fill = Activity_Type)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 40)) +
  labs(title = "User Activity Percentage", x = "Activity Type", y = "Share of Tracked Minutes (%)")
Activity_Bar_Chart
Day_Week1 <- dailyActivity
Day_Week1$ActivityDate <- weekdays(Day_Week1$ActivityDate)
Activity_chart <- Day_Week1 %>%
group_by(ActivityDate) %>%
summarize(lightly_active = sum(LightlyActiveMinutes), fairly_active = sum(FairlyActiveMinutes), very_active = sum(VeryActiveMinutes)) %>%
pivot_longer(-ActivityDate, names_to = "Activities") %>%
ggplot(aes(ActivityDate, value, fill = Activities)) +
geom_col() +
theme(axis.text.x = element_text(angle = 40)) +
labs(title = "User Activity by Day of the Week",
subtitle = "Tuesday is the busiest day of the week",
y = "Minutes",
x = "Day of the Week")
# Add the annotation on the right-hand side
Activity_chart +
annotate("text", x = Inf, y = Inf, hjust = 1, vjust = 1, label = "Tuesday is the busiest with Sundays & Mondays being the least active.")
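One caveat with summing minutes by weekday: the 31 days from 12 April to 12 May 2016 contain five Tuesdays, Wednesdays and Thursdays but only four of every other weekday, so totals favour those three days before any behaviour is considered. A possible cross-check is to compare means per record and to order the weekdays Monday to Sunday. The sketch below uses made-up values (toy_activity is an assumption) and derives the weekday from format(..., "%u") so that it does not depend on the system locale.

```r
library(dplyr)

# Toy stand-in for the activity data; these values are invented for illustration.
toy_activity <- data.frame(
  ActivityDate = as.Date(c("2016-04-12", "2016-04-13", "2016-04-18", "2016-04-19")),
  VeryActiveMinutes = c(25, 21, 30, 29)
)

weekday_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                    "Friday", "Saturday", "Sunday")

weekday_means <- toy_activity %>%
  # "%u" gives the ISO weekday number (1 = Monday, 7 = Sunday).
  mutate(Weekday = factor(as.integer(format(ActivityDate, "%u")),
                          levels = 1:7, labels = weekday_levels)) %>%
  group_by(Weekday) %>%
  # records shows how many rows feed each mean.
  summarise(records = n(), mean_very_active = mean(VeryActiveMinutes))

print(weekday_means)
```

If the ranking of days changes between the summed and the averaged view, the difference is likely due to the number of records per weekday rather than to user behaviour.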
library(ggplot2)
# Average calories and steps per user
average_activity <- dailyActivity %>%
  group_by(Id) %>%
  summarise(mean_calories = mean(Calories), mean_total_steps = mean(TotalSteps))
ggplot(average_activity, aes(mean_calories, mean_total_steps)) +
  geom_point() +
  geom_smooth() +
  theme_minimal() +
  labs(
    title = "Calories vs Total Steps",
    x = "Average Calories",
    y = "Average Steps")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Finding: Average daily steps and average daily calories burned show a
positive relationship: the more steps users take, the more calories
they tend to burn.
# Relationship between calories and very active minutes (per-user averages)
calories_v_activity <- dailyActivity %>%
  group_by(Id) %>%
  summarise(mean_calories = mean(Calories), mean_very_active = mean(VeryActiveMinutes))
ggplot(calories_v_activity, aes(mean_calories, mean_very_active)) +
  geom_point() +
  geom_smooth(color = "green") +
  theme_minimal() +
  labs(
    title = "Calories vs Very Active Minutes",
    x = "Average Calories",
    y = "Average Very Active Minutes")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Findings: The plot shows a positive relationship between average very active minutes and average calories burned: users who log more very active minutes tend to burn more calories. No correlation coefficient is computed here, so the strength of the relationship is a visual judgement, and the pattern on its own does not establish that one causes the other.
# Relationship between calories and fairly active minutes (per-user averages)
calories_f_activity <- dailyActivity %>%
  group_by(Id) %>%
  summarise(mean_calories = mean(Calories), mean_fairly_active = mean(FairlyActiveMinutes))
ggplot(calories_f_activity, aes(mean_calories, mean_fairly_active)) +
  geom_point() +
  geom_smooth(color = "purple") +
  theme_minimal() +
  labs(
    title = "Calories vs Fairly Active Minutes",
    x = "Average Calories",
    y = "Average Fairly Active Minutes")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Findings: Calorie expenditure falls as the intensity and duration of activity drop. This shows in the relationship between average calories and average fairly active minutes: users who log shorter and less intense activity burn fewer calories.
# Relationship between calories and lightly active minutes (per-user averages)
calories_l_activity <- dailyActivity %>%
  group_by(Id) %>%
  summarise(mean_calories = mean(Calories), mean_lightly_active = mean(LightlyActiveMinutes))
ggplot(calories_l_activity, aes(mean_calories, mean_lightly_active)) +
  geom_point() +
  geom_smooth(color = "red") +
  theme_minimal() +
  labs(
    title = "Calories vs Lightly Active Minutes",
    x = "Average Calories",
    y = "Average Lightly Active Minutes")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Findings: The relationship between average lightly active minutes and average calories burned is no longer positive. Unlike very active and fairly active minutes, more lightly active minutes do not go with more calories burned. This is worth investigating further to understand what drives calorie expenditure during lighter activity.
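The findings above are read off the smoothed scatter plots. One way to put a number on them is to compute a correlation coefficient on the same per-user averages. The sketch below is illustrative only: toy_activity and its values are made up, and Pearson correlation (the default of cor()) is our choice rather than something the analysis above used.

```r
library(dplyr)

# Toy stand-in for the activity data; these values are invented for illustration.
toy_activity <- data.frame(
  Id = rep(c("a", "b", "c", "d"), each = 2),
  Calories = c(1800, 2000, 2400, 2600, 3000, 3200, 2100, 2300),
  VeryActiveMinutes = c(5, 15, 20, 30, 60, 70, 0, 10),
  LightlyActiveMinutes = c(200, 220, 180, 160, 150, 170, 250, 230)
)

# One row per user with the mean of each measure.
user_means <- toy_activity %>%
  group_by(Id) %>%
  summarise(across(c(Calories, VeryActiveMinutes, LightlyActiveMinutes), mean))

# Pearson correlation between per-user averages.
cor_very <- cor(user_means$Calories, user_means$VeryActiveMinutes)
cor_light <- cor(user_means$Calories, user_means$LightlyActiveMinutes)

print(c(very_active = cor_very, lightly_active = cor_light))
```

With only around thirty users, any coefficient computed this way is a rough guide, and a single unusual user can move it noticeably.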
Correlation between calories and sleep
AVG_time_wasted_in_bed <- inner_join(dailyActivity, clean_Sleep, by = "Id")
## Warning in inner_join(dailyActivity, clean_Sleep, by = "Id"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1 of `x` matches multiple rows in `y`.
## ℹ Row 1 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
AVG_time_wasted_in_bed %>%
  group_by(Id) %>%
  summarise(mean_calories = mean(Calories), mean_min_asleep = mean(TotalMinutesAsleep)) %>%
  ggplot(aes(mean_calories, mean_min_asleep)) +
  geom_point() +
  geom_smooth(color = "cyan") +
  labs(
    title = "Calories vs Time Asleep",
    x = "Average Calories",
    y = "Average Minutes Asleep"
  ) + theme_minimal()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Findings: There is very little relationship between the average minutes spent sleeping and the average amount of calories burned.
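The many-to-many warning above arises because the join uses Id alone, so every activity row for a user is paired with every sleep row for that user. The per-user means are unaffected, since each row is repeated equally often within a user, but the joined table cannot be used for day-level comparisons. A possible alternative is to join on both user and date, as in the sketch below (toy_activity, toy_sleep and their values are made up for illustration).

```r
library(dplyr)

# Toy stand-ins for the two datasets; these values are invented for illustration.
toy_activity <- data.frame(
  Id = c(1, 1, 2, 2),
  ActivityDate = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12", "2016-04-13")),
  Calories = c(1985, 1797, 2100, 2300)
)
toy_sleep <- data.frame(
  Id = c(1, 1, 2),
  SleepDay = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  TotalMinutesAsleep = c(327, 384, 412)
)

# Match each activity day to the sleep record for the same user and date.
daily <- inner_join(toy_activity, toy_sleep,
                    by = c("Id", "ActivityDate" = "SleepDay"))

print(daily)
```

Note that this keeps only days with both an activity and a sleep record, so averages computed from it can differ slightly from those based on all activity days.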
# Grouped bar chart showing time spent in bed activities by days of the week.
time_wasted_bar_chart <- as.data.frame(clean_Sleep)
time_wasted_bar_chart$SleepDay <- weekdays(time_wasted_bar_chart$SleepDay)
dodge_chart <- time_wasted_bar_chart %>%
group_by(SleepDay) %>%
summarise(total_minutes = sum(TotalTimeInBed), minutes_asleep = sum(TotalMinutesAsleep), wasted_minutes_in_bed = sum(TotalTimeInBed) - sum(TotalMinutesAsleep)) %>%
pivot_longer(-SleepDay, names_to = "sleep_activities")
ggplot(dodge_chart, aes(fill = sleep_activities, x = SleepDay, y = value)) +
geom_bar(position = "dodge", stat = "identity") +
theme(axis.text.x = element_text(angle = 45)) +
labs(
title = "Time Spent In Bed",
x = "Days of the Week",
y = "Minutes"
)
Findings: The graph reveals an intriguing contrast. Tuesdays and
Wednesdays are the days when users spend the most time in bed sleeping,
while also being the most active days. Surprisingly, on Monday, the day
with the least amount of sleep, users seem to be the least active. This
leaves us wondering what factors contribute to this pattern of sleep
and activity levels, which would benefit from further study.
Letting users create their own personalised activity goals, tailored to their unique interests and passions. Whether it’s yoga, running, or dancing, the app can encourage users to take charge of their fitness journey and achieve greatness on their terms, facilitating consistency and habit formation.
Offering personalised bedtime recommendations and gentle reminders to ensure they get the rest they deserve. No more late-night distractions, just rejuvenating sleep to fuel their most active days in particular!
Creating a reward programme that gives back to your most dedicated users! With this system in place, consistent engagement from users will see them collecting valuable points that can be traded in for great perks - whether it be discounts on products, services or gift cards.
For users who are less active or not active at all, our recommendation would be to offer targeted promotions on the least active days, Sundays and Mondays, when activity tends to dip. These targeted promotions should help to motivate and inspire, transforming sluggish days into triumph!
Data used for analysis: https://www.kaggle.com/datasets/arashnic/fitbit