Founded in 2014, Bellabeat developed one of the first wearables specifically designed for women and has since built a portfolio of digital products for tracking and improving women's health.
The company focuses on innovative health and wellness products for women, and its mission is to empower women to take control of their health through technology-driven solutions that blend design and function.
Identify possible areas for expansion and recommend improvements to Bellabeat’s marketing approach based on usage patterns for smart devices.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
The dataset used for this analysis can be found here.
daily_activity <- read_csv("C:/R Code/Bellabeat/dailyActivity_merged.csv")
hourly_steps <- read_csv("C:/R Code/Bellabeat/hourlySteps_merged.csv")
str(daily_activity)
## spc_tbl_ [940 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ TotalSteps : num [1:940] 13162 10735 10460 9762 12669 ...
## $ TotalDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : num [1:940] 728 776 1218 726 773 ...
## $ Calories : num [1:940] 1985 1797 1776 1745 1863 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. ActivityDate = col_character(),
## .. TotalSteps = col_double(),
## .. TotalDistance = col_double(),
## .. TrackerDistance = col_double(),
## .. LoggedActivitiesDistance = col_double(),
## .. VeryActiveDistance = col_double(),
## .. ModeratelyActiveDistance = col_double(),
## .. LightActiveDistance = col_double(),
## .. SedentaryActiveDistance = col_double(),
## .. VeryActiveMinutes = col_double(),
## .. FairlyActiveMinutes = col_double(),
## .. LightlyActiveMinutes = col_double(),
## .. SedentaryMinutes = col_double(),
## .. Calories = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(hourly_steps)
## spc_tbl_ [22,099 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:22099] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityHour: chr [1:22099] "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
## $ StepTotal : num [1:22099] 373 160 151 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. ActivityHour = col_character(),
## .. StepTotal = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
any(is.na(daily_activity))
## [1] FALSE
any(is.na(hourly_steps))
## [1] FALSE
any(duplicated(daily_activity))
## [1] FALSE
any(duplicated(hourly_steps))
## [1] FALSE
No NA values or duplicates found, great!
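Note that `duplicated()` on a whole data frame only flags rows that are identical in every column, so two records for the same user and day with different values would pass the check above. A minimal sketch of a key-based check, assuming one row per `Id` and `ActivityDate` is expected; the `toy` data frame and its values are made up for illustration:

```r
# Toy data (hypothetical values) standing in for daily_activity
toy <- data.frame(
  Id           = c(1, 1, 2, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016", "4/12/2016"),
  TotalSteps   = c(1000, 2000, 3000, 3500)
)

# Whole-row check: the last two rows differ in TotalSteps, so nothing is flagged
whole_row_dupes <- any(duplicated(toy))

# Key-based check: flags the second record for Id 2 on 4/12/2016
key_dupes <- duplicated(toy[, c("Id", "ActivityDate")])

whole_row_dupes
any(key_dupes)

# Inspect the offending rows
toy[key_dupes, ]
```

The same pattern should apply to `hourly_steps` with `Id` and `ActivityHour` as the key.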
I noticed that the date and time columns were read in as character (chr) vectors.
They need to be converted to proper date and date-time types before any
time-based analysis.
# Change activity date/hour columns to date/time format
daily_activity$ActivityDate <- as.Date(daily_activity$ActivityDate, format="%m/%d/%Y")
hourly_steps$ActivityHour <- as.POSIXct(hourly_steps$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
head(daily_activity)
## # A tibble: 6 × 15
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## <dbl> <date> <dbl> <dbl> <dbl>
## 1 1503960366 2016-04-12 13162 8.5 8.5
## 2 1503960366 2016-04-13 10735 6.97 6.97
## 3 1503960366 2016-04-14 10460 6.74 6.74
## 4 1503960366 2016-04-15 9762 6.28 6.28
## 5 1503960366 2016-04-16 12669 8.16 8.16
## 6 1503960366 2016-04-17 9705 6.48 6.48
## # ℹ 10 more variables: LoggedActivitiesDistance <dbl>,
## # VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## # LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## # VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## # LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>
head(hourly_steps)
## # A tibble: 6 × 3
## Id ActivityHour StepTotal
## <dbl> <dttm> <dbl>
## 1 1503960366 2016-04-12 00:00:00 373
## 2 1503960366 2016-04-12 01:00:00 160
## 3 1503960366 2016-04-12 02:00:00 151
## 4 1503960366 2016-04-12 03:00:00 0
## 5 1503960366 2016-04-12 04:00:00 0
## 6 1503960366 2016-04-12 05:00:00 0
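Both `as.Date()` and `as.POSIXct()` return `NA` for strings that do not match the given format rather than raising an error, so it can be worth re-checking for missing values after the conversion above. The sketch below uses made-up sample strings in the same formats as the CSV columns. It also fixes the time zone to UTC, which is an assumption on my part: the original code uses `Sys.timezone()`, which makes the parsed values depend on the machine running the script.

```r
# Hypothetical sample strings in the same formats as the CSV columns
date_strings <- c("4/12/2016", "4/13/2016", "not a date")
hour_strings <- c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 PM")

# Unparseable strings become NA instead of raising an error
parsed_dates <- as.Date(date_strings, format = "%m/%d/%Y")
n_failed <- sum(is.na(parsed_dates))
n_failed

# A fixed time zone (UTC is an assumption here) keeps results machine-independent
parsed_hours <- as.POSIXct(hour_strings,
                           format = "%m/%d/%Y %I:%M:%S %p",
                           tz = "UTC")

# Hour of day as a two-digit string, to confirm AM/PM was interpreted correctly
hour_of_day <- format(parsed_hours, "%H")
hour_of_day
```

On the real data, `sum(is.na(daily_activity$ActivityDate))` and `sum(is.na(hourly_steps$ActivityHour))` should both be zero after conversion.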
According to this article on Medicine.net, the guidelines under the 10,000-step protocol are as follows:
Sedentary: Less than 5,000 steps daily
Low active: About 5,000 to 7,499 steps daily
Somewhat active: About 7,500 to 9,999 steps daily
Active: More than 10,000 steps daily
Highly active: More than 12,500 steps daily
I will be using this information to categorize users into different activity levels.
# Create a new data frame to categorize users based on their average daily steps
activity_levels <- daily_activity %>%
group_by(Id) %>%
summarise(
AvgSteps = mean(TotalSteps, na.rm = TRUE)
) %>%
mutate(
ActivityLevel = case_when(
AvgSteps < 5000 ~ "Sedentary",
AvgSteps >= 5000 & AvgSteps < 7500 ~ "Low Active",
AvgSteps >= 7500 & AvgSteps < 10000 ~ "Somewhat Active",
AvgSteps >= 10000 & AvgSteps < 12500 ~ "Active",
AvgSteps >= 12500 ~ "Highly Active"
)
)
# Calculate the percentage of users in each activity level
activity_distribution <- activity_levels %>%
group_by(ActivityLevel) %>%
summarise(Count = n()) %>%
mutate(Percentage = floor((Count / sum(Count)) * 100)) %>%
select(ActivityLevel, Percentage) %>%
arrange(desc(Percentage))
# Display the activity distribution
activity_distribution
## # A tibble: 5 × 2
## ActivityLevel Percentage
## <chr> <dbl>
## 1 Low Active 27
## 2 Somewhat Active 27
## 3 Sedentary 24
## 4 Active 15
## 5 Highly Active 6
As we can see, most users are low active, somewhat active, or sedentary (about 78% combined). Far fewer are active or highly active.
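The same thresholds can also be expressed with base R's `cut()`, which returns an ordered factor so that tables and plots sort from least to most active instead of alphabetically. This is only a sketch of an alternative to the `case_when()` approach; the `avg_steps` values are made up for illustration.

```r
# Hypothetical per-user average daily steps
avg_steps <- c(3200, 5000, 7499, 9100, 10000, 13800)

levels_in_order <- c("Sedentary", "Low Active", "Somewhat Active",
                     "Active", "Highly Active")

# right = FALSE makes each interval closed on the left, e.g. [5000, 7500),
# matching the >= / < comparisons used in the case_when() version
activity_level <- cut(
  avg_steps,
  breaks = c(-Inf, 5000, 7500, 10000, 12500, Inf),
  labels = levels_in_order,
  right = FALSE,
  ordered_result = TRUE
)

# Share of users per level, in percent, listed in activity order
round(100 * prop.table(table(activity_level)), 1)
```

Because the result is an ordered factor, `arrange(ActivityLevel)` or a ggplot axis would follow the activity scale without a manual `factor()` call.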
# Create scatter plot of steps taken vs calories burned
ggplot(data = daily_activity, aes(x = TotalSteps, y = Calories)) +
geom_point(size = 3, alpha = 0.6, color = "steelblue") +
geom_smooth(method = "lm", se = FALSE, color = "darkblue") +
labs(title = "Relationship Between Daily Total Steps and Daily Calories Burned",
x = "Total Steps Taken",
y = "Calories Burned") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 14),
axis.title = element_text(size = 12),
legend.position = "none")
## `geom_smooth()` using formula = 'y ~ x'
While this might seem obvious, the plot shows a clear positive relationship: days with more steps tend to be days with more calories burned.
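To put a number on the trend, one option is to compute the Pearson correlation and the slope of the same linear fit that `geom_smooth(method = "lm")` draws. The sketch below runs on a small made-up data frame (`toy`), not on the real data; substituting `daily_activity` should give the actual figures. A correlation alone does not show that steps cause the calorie burn, since total calories also depend on factors this dataset does not capture.

```r
# Hypothetical values standing in for daily_activity$TotalSteps and $Calories
toy <- data.frame(
  TotalSteps = c(2000, 4000, 6000, 8000, 10000, 12000),
  Calories   = c(1500, 1700, 1800, 2100, 2200, 2500)
)

# Pearson correlation: strength of the linear association, between -1 and 1
r <- cor(toy$TotalSteps, toy$Calories)

# Simple linear regression; the slope estimates extra calories per additional step
fit <- lm(Calories ~ TotalSteps, data = toy)
slope <- coef(fit)[["TotalSteps"]]

r
slope
```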
# Filter for hours between 7 am and 9 pm only
hourly_steps_filtered <- hourly_steps %>%
mutate(Hour = lubridate::hour(ActivityHour)) %>%  # ActivityHour is already POSIXct
filter(Hour >= 7 & Hour <= 21) %>%
group_by(Hour) %>%
summarise(
AvgSteps = mean(StepTotal, na.rm = TRUE),
MedianSteps = median(StepTotal, na.rm = TRUE)
) %>%
arrange(AvgSteps)
# Create a line chart of hourly steps
ggplot(hourly_steps_filtered, aes(x = Hour, y = MedianSteps)) +
geom_line(color = "blue", lwd = 1.15) +
geom_point(color = "blue", size = 3.25) +
labs(
title = "Median Steps by Hour (7 AM to 9 PM)",
x = "Hour of Day",
y = "Median Steps"
) +
scale_x_continuous(breaks = seq(7, 21, 1), labels = c("7 AM", "8 AM", "9 AM", "10 AM", "11 AM", "12 PM",
"1 PM", "2 PM", "3 PM", "4 PM", "5 PM",
"6 PM", "7 PM", "8 PM", "9 PM")) +
theme_minimal()
We can see that median steps per hour start to decline after 1 PM and again after 6 PM.
# Create new column to contain weekday name
daily_activity <- daily_activity %>%
mutate(DayOfWeek = weekdays(ActivityDate))  # ActivityDate is already a Date
# Group dataframe by dayofweek and calculate median steps for each day
median_steps <- daily_activity %>%
group_by(DayOfWeek) %>%
summarise(MedianSteps = median(TotalSteps, na.rm = TRUE)) %>%
arrange(match(DayOfWeek, c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")))
# Custom order for ggplot to order from Sun-Sat
median_steps$DayOfWeek <- factor(median_steps$DayOfWeek,
levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
# Create a bar chart of median steps per day
ggplot(median_steps, aes(x = DayOfWeek, y = MedianSteps)) +
geom_col(fill = "blue") +
labs(x = "Day of the Week", y = "Median Steps", title = "Median Steps by Day of the Week") +
theme_minimal()
We can see that users take the fewest steps on Sunday and the most on Tuesday.
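One caveat with the weekday code above: `weekdays()` returns names in the session's locale, so the hard-coded English factor levels would produce `NA` values under a non-English locale. A minimal sketch of a locale-independent alternative, which derives the order from the numeric weekday stored in `POSIXlt`; the `dates` vector is a made-up example week:

```r
# Hypothetical dates covering one week (2016-04-10 was a Sunday)
dates <- as.Date("2016-04-10") + 0:6

# POSIXlt stores the weekday as 0 (Sunday) through 6 (Saturday), regardless of locale
wday_index <- as.POSIXlt(dates)$wday

day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")

# Map the index to a factor whose levels are already in Sunday-to-Saturday order
day_of_week <- factor(day_names[wday_index + 1], levels = day_names)
day_of_week
```

With this approach the separate `arrange(match(...))` and `factor(...)` steps would no longer be needed, since grouping on the factor keeps the Sunday-to-Saturday order.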
The analysis of smart device data reveals key trends in user activity, providing valuable insights into opportunities for Bellabeat to grow its customer base and enhance its marketing approach:
The majority of users are either in the “Low Active” (27%) or “Somewhat Active” (27%) categories, with only a small percentage reaching the “Highly Active” level (6%). Bellabeat could leverage this insight by positioning its products as tools to help users gradually increase their activity levels, potentially through customizable reminders or goal-setting features that encourage incremental improvements. Marketing messaging can focus on how Bellabeat devices support users in building sustainable activity habits, with achievable targets for those in the sedentary to somewhat active ranges.
The positive correlation between steps taken and calories burned presents a clear opportunity to emphasize how Bellabeat products can aid in managing or improving physical health by increasing daily steps. This could be highlighted in campaigns or user stories that focus on fitness and wellness, showing how Bellabeat’s devices support healthy, active lifestyles.
Given that user activity generally increases throughout the morning and peaks around 1 PM, Bellabeat could enhance user engagement by sending encouraging notifications in line with these natural activity patterns. For example, sending reminders around mid-morning could motivate users just as they are beginning to be more active, while late-afternoon reminders could sustain activity levels before they begin to wind down in the evening. This timing strategy aligns with users’ daily rhythms, making notifications feel timely and relevant.
With Sunday showing the lowest median steps, and Tuesday the highest, Bellabeat has an opportunity to encourage users to stay active on weekends. Marketing initiatives might include weekend challenges or social media campaigns that motivate users to maintain consistent activity, even on less active days. This can be reinforced by Bellabeat app features like activity streaks or badges for maintaining daily step goals across the entire week, helping users see weekends as an extension of their wellness routine.
Bellabeat can position its products as accessible, health-enhancing tools that adapt to a wide range of activity levels and lifestyles. By focusing on personalized guidance and promoting the health benefits of small, consistent activity increases, Bellabeat can attract a broader audience. Targeted advertising, informative content on the benefits of daily movement, and time-sensitive notifications could increase engagement, helping Bellabeat expand its market presence among women seeking to improve their wellness habits in sustainable ways.