Bellabeat, a high-tech wellness company founded in 2013 by Urška Sršen and Sandro Mur, focuses on creating health-oriented smart products for women. Combining Sršen’s artistic vision and technology, Bellabeat designs products that empower women by tracking activity, sleep, stress, and reproductive health. By 2016, the company had expanded globally, launched multiple products, and diversified its sales through online retailers and its website.
My task is to analyze smart device fitness data for one of Bellabeat’s health-focused products. Bellabeat, a small but growing company, aims to become a major player in the global smart device market. Cofounder and Chief Creative Officer Urška Sršen believes that insights from this analysis could reveal new growth opportunities. My findings should help shape Bellabeat’s marketing strategy.
Note: Setting up my R environment by loading needed packages.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyr)
library(ggplot2)
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(lubridate)
library(dplyr)
library(Tmisc)
library(rmarkdown)
library(quarto)
The data I will be using is the Fitbit Fitness Tracker Data set (CC0: Public Domain, made available through Mobius). This Kaggle data set contains personal fitness tracker data from thirty Fitbit users who consented to the submission of their tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.
Notes: I have already prepped and merged most of the data in Google Sheets and decided to use only 5 of the 29 files. The data also contains no information on gender, lifestyle, or location, so any recommendations may be affected, since Bellabeat’s target audience is women.
DailyActivity_merged <- read.csv("/cloud/project/BELLABEAT/BELLA BEAT/dailyActivity_merged - dailyActivity_merged.csv")
HourlyCalories_merged <- read.csv("/cloud/project/BELLABEAT/BELLA BEAT/dailyActivity_merged - hourlyCalories_merged (1).csv")
HourlyIntensities_merged <- read.csv("/cloud/project/BELLABEAT/BELLA BEAT/dailyActivity_merged - hourlyIntensities_merged.csv")
HourlySteps_merged <- read.csv("/cloud/project/BELLABEAT/BELLA BEAT/dailyActivity_merged - hourlySteps_merged.csv")
Sleepday_merged <- read.csv("/cloud/project/BELLABEAT/BELLA BEAT/Sleepday_merged - sleepDay_merged.csv")
Notes: Once the data was loaded, I realized the “ActivityHour” column needed to be converted to a date-time and split into separate time and date columns. HourlyIntensities_merged also had a formatting issue in its “TotalIntensity” and “AverageIntensity” columns, which were not read as numeric.
#calories
HourlyCalories_merged$ActivityHour=as.POSIXct(HourlyCalories_merged$ActivityHour, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
HourlyCalories_merged$time <- format(HourlyCalories_merged$ActivityHour, format = "%H:%M:%S")
HourlyCalories_merged$date <- format(HourlyCalories_merged$ActivityHour, format = "%m/%d/%y")
# Drop the original timestamp column now that date and time are split out
HourlyCalories <- HourlyCalories_merged %>%
  select(-ActivityHour)
#dailyactivity
DailyActivity_merged$ActivityDate=as.POSIXct(DailyActivity_merged$ActivityDate, format="%m/%d/%Y", tz=Sys.timezone())
DailyActivity_merged$date <- format(DailyActivity_merged$ActivityDate, format = "%m/%d/%y")
# Drop the original date column; the formatted date column replaces it
DailyActivity <- DailyActivity_merged %>%
  select(-ActivityDate)
#intensity
HourlyIntensities_merged$TotalIntensity <- as.numeric(HourlyIntensities_merged$TotalIntensity)
## Warning: NAs introduced by coercion
HourlyIntensities_merged$AverageIntensity <- as.numeric(HourlyIntensities_merged$AverageIntensity)
## Warning: NAs introduced by coercion
HourlyIntensities_merged$ActivityHour=as.POSIXct(HourlyIntensities_merged$ActivityHour, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
HourlyIntensities_merged$time <- format(HourlyIntensities_merged$ActivityHour, format = "%H:%M:%S")
HourlyIntensities_merged$date <- format(HourlyIntensities_merged$ActivityHour, format = "%m/%d/%y")
# Drop the original timestamp column now that date and time are split out
HourlyItensities <- HourlyIntensities_merged %>%
  select(-ActivityHour)
#steps
HourlySteps_merged$ActivityHour=as.POSIXct(HourlySteps_merged$ActivityHour, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
HourlySteps_merged$time <- format(HourlySteps_merged$ActivityHour, format = "%H:%M:%S")
HourlySteps_merged$date <- format(HourlySteps_merged$ActivityHour, format = "%m/%d/%y")
# Drop the original timestamp column now that date and time are split out
HourlySteps <- HourlySteps_merged %>%
  select(-ActivityHour)
#sleep
Sleepday_merged$SleepDay=as.POSIXct(Sleepday_merged$SleepDay, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
Sleepday_merged$time <- format(Sleepday_merged$SleepDay, format = "%H:%M:%S")
Sleepday_merged$date <- format(Sleepday_merged$SleepDay, format = "%m/%d/%y")
# Sleep records are daily, so the time column is dropped along with the original timestamp
Sleepday <- Sleepday_merged %>%
  select(-SleepDay, -time)
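The three hourly files go through the same parse-and-split steps, so the repetition could be folded into one function. The sketch below is only an illustration: `split_timestamp()` is a hypothetical helper, the toy rows are made up, and it assumes timestamps in the `m/d/Y I:M:S AM/PM` layout and an English locale for the AM/PM marker. Unlike the code above, it keeps `date` as a `Date` rather than a formatted string.

```r
# Hypothetical helper (not part of the original analysis): parse a timestamp
# column, add time and date columns, and drop the original column.
split_timestamp <- function(df, col, tz = "UTC") {
  parsed <- as.POSIXct(df[[col]], format = "%m/%d/%Y %I:%M:%S %p", tz = tz)
  df$time <- format(parsed, "%H:%M:%S")  # 24-hour clock time as text
  df$date <- as.Date(parsed, tz = tz)    # calendar date as a Date object
  df[[col]] <- NULL                      # remove the original timestamp column
  df
}

# Toy rows in the same layout as the hourly files (values are made up)
toy <- data.frame(
  Id = c(1, 1),
  ActivityHour = c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 PM"),
  StepTotal = c(0, 250)
)

toy_clean <- split_timestamp(toy, "ActivityHour")
print(toy_clean)
```

Keeping `date` as a `Date` would mean it does not need to be parsed again before calling `wday()`.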
n_distinct(DailyActivity$Id)
## [1] 36
n_distinct(HourlyCalories$Id)
## [1] 35
n_distinct(HourlyItensities$Id)
## [1] 36
n_distinct(HourlySteps$Id)
## [1] 36
n_distinct(Sleepday$Id)
## [1] 24
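Note: This tells us the number of participants in each data set.
There are 36 participants in the daily activity, hourly steps, and hourly intensities data sets, 35 in the hourly calories data set, and only 24 in the sleep data set. That is more than the thirty users mentioned in the data set description, and the smaller sleep sample means the sleep findings rest on fewer participants.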
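Since the participant counts differ between files, it may help to see which Ids are missing from which file before comparing results across them. This is a minimal sketch with made-up Id vectors standing in for `DailyActivity$Id` and `Sleepday$Id`; the variable names are hypothetical.

```r
# Hypothetical Id vectors standing in for DailyActivity$Id and Sleepday$Id
activity_ids <- c(101, 102, 103, 104)
sleep_ids <- c(101, 103)

# Ids present in the activity data but absent from the sleep data
missing_from_sleep <- setdiff(activity_ids, sleep_ids)
print(missing_from_sleep)

# Share of activity participants who also have sleep records
coverage <- length(intersect(activity_ids, sleep_ids)) / length(unique(activity_ids))
print(coverage)
```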
#activity
DailyActivity %>%
select(TotalSteps,
TotalDistance,
Calories) %>%
summary()
## TotalSteps TotalDistance Calories
## Min. : 0 Min. : 0.000 Min. : 0
## 1st Qu.: 3146 1st Qu.: 2.170 1st Qu.:1799
## Median : 6999 Median : 4.950 Median :2114
## Mean : 7281 Mean : 5.219 Mean :2266
## 3rd Qu.:10544 3rd Qu.: 7.500 3rd Qu.:2770
## Max. :36019 Max. :28.030 Max. :4900
## NA's :1 NA's :1 NA's :1
DailyActivity %>%
select(VeryActiveMinutes,
FairlyActiveMinutes,
SedentaryMinutes, LightlyActiveMinutes) %>%
summary()
## VeryActiveMinutes FairlyActiveMinutes SedentaryMinutes LightlyActiveMinutes
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.: 729.0 1st Qu.:111.0
## Median : 2.00 Median : 6.0 Median :1057.0 Median :195.0
## Mean : 19.68 Mean : 13.4 Mean : 992.5 Mean :185.4
## 3rd Qu.: 30.00 3rd Qu.: 18.0 3rd Qu.:1244.0 3rd Qu.:262.0
## Max. :210.00 Max. :660.0 Max. :1440.0 Max. :720.0
## NA's :1 NA's :1 NA's :1 NA's :1
DailyActivity %>%
select(Id) %>%
count(Id)
# Checking how often participants used their device
TotalUses<- DailyActivity %>%
select(Id) %>%
count(Id)
# Rename the count column to something more descriptive
TotalUses <- TotalUses %>%
  rename("Total" = "n")
head(TotalUses %>%
select(Total) %>%
summary())
## Total
## Min. : 1.00
## 1st Qu.:38.00
## Median :42.50
## Mean :38.83
## 3rd Qu.:43.00
## Max. :63.00
TotalUses2 <- TotalUses %>%
  mutate(User_Type = case_when(
    Total < 20 ~ "Light_User",
    Total >= 20 & Total <= 40 ~ "Average_User", # 20 is included so no user is left unclassified
    Total >= 41 ~ "Active_User"))
view(TotalUses2)
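When usage tiers are built from numeric cut-offs, it is worth checking that boundary values such as exactly 20 days land in a tier instead of coming out as NA. The sketch below shows one way to do that check with base R's `cut()` instead of `case_when()`. The `totals` vector is made up to sit on the boundaries.

```r
# Made-up day counts chosen to sit on and around the tier boundaries
totals <- c(1, 19, 20, 21, 40, 41, 63)

# cut() uses right-closed intervals by default: (-Inf, 19], (19, 40], (40, Inf)
user_type <- cut(
  totals,
  breaks = c(-Inf, 19, 40, Inf),
  labels = c("Light_User", "Average_User", "Active_User")
)

print(data.frame(Total = totals, User_Type = user_type))

# Every value should receive a tier
stopifnot(!anyNA(user_type))
```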
#intensities
HourlyItensities%>%
select(TotalIntensity,
AverageIntensity) %>%
summary()
## TotalIntensity AverageIntensity
## Min. : 0.0 Min. :0.00000
## 1st Qu.: 0.0 1st Qu.:0.00000
## Median : 2.0 Median :0.03333
## Mean : 11.4 Mean :0.19008
## 3rd Qu.: 15.0 3rd Qu.:0.25000
## Max. :180.0 Max. :3.00000
## NA's :1 NA's :1
#steps
HourlySteps %>%
select(StepTotal) %>%
summary()
## StepTotal
## Min. : 0.0
## 1st Qu.: 0.0
## Median : 20.0
## Mean : 302.5
## 3rd Qu.: 322.0
## Max. :10565.0
## NA's :1
#calories
head(HourlyCalories %>%
select(Calories) %>%
summary())
## Calories
## Min. : 42.00
## 1st Qu.: 62.00
## Median : 80.00
## Mean : 95.76
## 3rd Qu.:106.00
## Max. :948.00
#sleep
head(Sleepday %>%
select(TotalTimeInBed,
TotalMinutesAsleep) %>%
summary())
## TotalTimeInBed TotalMinutesAsleep
## Min. : 61.0 Min. : 58.0
## 1st Qu.:403.0 1st Qu.:361.0
## Median :463.0 Median :433.0
## Mean :458.6 Mean :419.5
## 3rd Qu.:526.0 3rd Qu.:490.0
## Max. :961.0 Max. :796.0
Note: This data has shown us that:
*The average is 7281 steps per day, which puts the typical participant in the lightly active lifestyle category according to the PubMed article.
*Sedentary time should be around 9 hours per day according to the BMC Medicine article, but the average sedentary time here is 992.5 minutes (16 hrs 32 mins), so it needs to be lowered.
*Most users wore their device regularly (the median is 42.5 days of records per participant), so device usage was not an issue.
*A WebMD article states that women naturally burn about 1600-1950 calories a day and men about 2000-2450. Since gender is unknown in this data, I used the midpoint of the combined range, about 2025 calories, as the baseline. The average of 2266 calories per day is only slightly above that, which suggests there is not much physical activity adding to the naturally burned calories.
*The CDC recommends 150 minutes a week of moderate-intensity activity. The median day here has only 2 very active and 6 fairly active minutes, so many participants are likely falling short. (The maximum AverageIntensity of 3 is an intensity score, not minutes, so it cannot be compared to the guideline directly.)
*The CDC recommends 7-9 hours of sleep, and the average participant sleeps 419.5 minutes (6 hrs 59 mins), which is just under the minimum, so participants should sleep more.
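The unit conversions behind these comparisons can be reproduced with a few lines of arithmetic. This is a small sketch: `to_hours_minutes()` is a hypothetical helper, and the inputs are the means and medians printed in the summary output above.

```r
# Hypothetical helper: express a number of minutes as hours and minutes
to_hours_minutes <- function(total_minutes) {
  hours <- total_minutes %/% 60              # whole hours
  minutes <- round(total_minutes %% 60, 1)   # leftover minutes
  sprintf("%d h %.1f min", as.integer(hours), minutes)
}

sedentary_mean <- to_hours_minutes(992.5)  # mean SedentaryMinutes per day
sleep_mean <- to_hours_minutes(419.5)      # mean TotalMinutesAsleep per night
print(sedentary_mean)  # "16 h 32.5 min"
print(sleep_mean)      # "6 h 59.5 min"

# Weekly active minutes if every day looked like the median day:
# 2 very active + 6 fairly active minutes, times 7 days
weekly_active_median <- (2 + 6) * 7
print(weekly_active_median)  # 56, against a guideline of 150
```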
# (Activity) Adding day of the week column to data
DailyActivity$date <- as.POSIXct(DailyActivity$date, format = "%m/%d/%y") # date was written with a two-digit year, so parse it with %y
DailyActivity$day_of_week <- wday(DailyActivity$date, label = TRUE)
Weekly_Sedentary <- DailyActivity %>%
group_by(Id,day_of_week) %>%
summarize(SedentaryMinutes = sum(SedentaryMinutes), .groups = "drop")
#Adding all active minutes together and making a new data set with sedentary minutes
TotalActiveMinutes <- DailyActivity %>%
mutate(TotalActiveMinutes = VeryActiveMinutes + FairlyActiveMinutes + LightlyActiveMinutes)
TotalActiveMinutes <- TotalActiveMinutes %>%
select (Id, date, day_of_week, TotalActiveMinutes, SedentaryMinutes)
# (Steps)Adding day of the week column to data
HourlySteps$date <- as.POSIXct(HourlySteps$date, format = "%m/%d/%y") # date was written with a two-digit year, so parse it with %y
HourlySteps$day_of_week <- wday(HourlySteps$date, label = TRUE)
Weekly_Steps <- HourlySteps %>%
group_by(Id,day_of_week,date) %>%
summarize(total_steps = sum(StepTotal), .groups = "drop")
# (Intensity)Adding day of the week column to data
HourlyItensities$date <- as.POSIXct(HourlyItensities$date, format = "%m/%d/%y") # date was written with a two-digit year, so parse it with %y
HourlyItensities$day_of_week <- wday(HourlyItensities$date, label = TRUE)
Weekly_Intensity <- HourlyItensities %>%
  group_by(Id, day_of_week) %>%
  summarize(
    TotalIntensity = sum(TotalIntensity, na.rm = TRUE),
    AverageIntensity = mean(AverageIntensity, na.rm = TRUE), # average the hourly averages rather than summing them
    .groups = "drop"
  )
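# (Sleep) Adding day of the week column to data
Sleepday$date <- as.POSIXct(Sleepday$date, format = "%m/%d/%y") # date was written with a two-digit year, so parse it with %y
Sleepday$day_of_week <- wday(Sleepday$date, label = TRUE)
# Sum each sleep column separately so the two totals are not mixed together
Weekly_Sleep <- Sleepday %>%
  group_by(Id, day_of_week) %>%
  summarize(TotalMinutesAsleep = sum(TotalMinutesAsleep),
            TotalTimeInBed = sum(TotalTimeInBed),
            .groups = "drop")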
ggplot(data = TotalUses2, aes(x = Total, fill = User_Type)) +
geom_bar() +
labs(
title = "Device Usage",
x = "Total Usage",
y = "Count"
) +
theme_minimal()
ggplot(Weekly_Steps, aes(x = factor(day_of_week, levels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")),
y = total_steps, fill = day_of_week)) +
geom_bar(stat = "identity") +
labs(
title = "Total Steps by Day of the Week",
x = "Day of the Week",
y = "Total Steps"
) +
theme_minimal()
ggplot(Weekly_Sedentary, aes(x = factor(day_of_week, levels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")),
y = SedentaryMinutes, fill = day_of_week)) +
geom_bar(stat = "identity") +
labs(
title = "Sedentary Time by Day of the Week",
x = "Day of the Week",
y = "Sedentary Minutes"
) +
theme_minimal()
ggplot(TotalActiveMinutes, aes(x = factor(day_of_week, levels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")),
y = TotalActiveMinutes, fill = day_of_week)) +
geom_bar(stat = "identity") +
labs(
title = "Total Active Minutes by Day of the Week",
x = "Day of the Week",
y = "Total Active Minutes"
) +
theme_minimal()
ggplot(HourlyCalories, aes(x = time, y = Calories)) +
geom_bar(stat = "identity", fill = "lightgreen") +
labs(
title = "Hourly Calories",
x = "Time",
y = "Calories"
) +
theme_minimal()
ggplot(TotalActiveMinutes, aes(x = TotalActiveMinutes, y = SedentaryMinutes)) +
geom_point(color = "lightgreen", size = 1) +
geom_smooth(method = "lm", color = "black", linewidth = 1) +
labs(
title = "Total Active Minutes vs Sedentary Minutes",
x = "Total Active Minutes",
y = "Sedentary Minutes"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
ggplot(Sleepday, aes(x = TotalTimeInBed, y = TotalMinutesAsleep)) +
  geom_point(color = "lightgreen", size = 1) +
  geom_smooth(color = "black", linewidth = 1) +
  labs(
    title = "Time In Bed Vs Total Minutes Asleep",
    x = "Total Time In Bed",
    y = "Total Minutes Asleep"
  ) +
  theme_minimal()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggsave("BellaBeat_CaseStudy.png")
## Saving 7 x 5 in image
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Note: *After visualizing, I notice that the Device Usage graph supports my previous observation: customers are using their devices consistently, so this is not an issue.
*Tuesdays and Saturdays seem to be the most active days in terms of steps and active minutes.
*Tuesday, ironically, also has the highest sedentary minutes. I want to explore this further by looking at the hourly differences.
*Calories burned peak at 12pm, 6pm, and 7pm, which makes me believe that most participants work full time.
*The Total Active Minutes vs Sedentary Minutes graph shows a downward slope, which makes sense: the more time participants spend being active, the less time they spend sedentary.
*The Time In Bed Vs Total Minutes Asleep graph suggests a positive relationship between time in bed and time asleep, but being in bed does not always mean a user is asleep.
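A relationship read off a scatter plot can be backed up with a number. The sketch below computes a correlation coefficient and a fitted slope with base R on made-up values. The vectors are placeholders, not the Fitbit data, so the printed numbers say nothing about the real participants. On the real columns, `cor(..., use = "complete.obs")` would be one way to skip the row with missing values.

```r
# Made-up placeholder values (not taken from the data set)
active_minutes <- c(30, 120, 200, 260, 340)
sedentary_minutes <- c(1300, 1150, 1000, 900, 760)

# Pearson correlation: negative when one variable falls as the other rises
r <- cor(active_minutes, sedentary_minutes)
print(r)

# Slope of the fitted line: change in sedentary minutes per extra active minute
fit <- lm(sedentary_minutes ~ active_minutes)
slope <- coef(fit)[["active_minutes"]]
print(slope)
```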
ggplot(HourlySteps, aes(x = time, y = StepTotal)) +
geom_bar(stat = "identity", fill = "lightgreen") +
labs(
title = "Hourly Steps",
x = "Time",
y = "Steps"
) +
theme_minimal()
ggplot(HourlyItensities, aes(x = time, y = TotalIntensity)) +
geom_bar(stat = "identity", fill = "lightgreen") +
labs(
title = "Hourly Intensities",
x = "Time",
y = "Intensity"
) +
theme_minimal()
Notes: These hourly graphs are consistent with users who work regular hours and may only be able to get active during breaks and after work.
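The hourly bar charts stack every record for an hour, so hours with more recorded rows look taller regardless of how active people were. Plotting the mean per hour is one way to separate the two effects. Here is a minimal sketch with base R's `aggregate()` on made-up rows shaped like `HourlySteps`; the values are hypothetical.

```r
# Made-up rows with the same columns as HourlySteps (values are hypothetical)
toy_steps <- data.frame(
  Id = c(1, 1, 2, 2, 2),
  time = c("12:00:00", "18:00:00", "12:00:00", "18:00:00", "18:00:00"),
  StepTotal = c(400, 900, 200, 1100, 1000)
)

# Mean steps per hour of day, so the number of records does not inflate the bar
mean_by_hour <- aggregate(StepTotal ~ time, data = toy_steps, FUN = mean)
print(mean_by_hour)
```

The resulting data frame could be passed to `ggplot()` with `geom_col()` in place of the raw hourly rows.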
Since Bellabeat’s target audience is women and this data set doesn’t specify gender, I am applying these findings to Bellabeat’s target audience as they are, which may affect their accuracy. The data suggests that these users work regular hours and don’t have much time during the week for intense activities. Some of them don’t seem to have consistent healthy habits, which might be caused by sitting a lot at work and/or not having the best sleep schedule. Combining a busy work life with too little sleep can leave energy levels too low to make healthy decisions.
They still seem to have some healthy habits, but staying consistent seems difficult, and they probably don’t have guidance on how to stay active, burn more calories, lower their sedentary time, and get more sleep. Since the users already use their devices regularly, the membership would be beneficial, and they should find value in a 24/7 personalized guide that is also a supportive partner while navigating busy work schedules and family obligations.
Bellabeat’s marketing team should lean into work/life balance. Many of these women likely sit at a desk for long hours or juggle multiple responsibilities, leaving them with limited time for intense activities. The marketing team should therefore highlight how the membership gives users gentle reminders to stay on track throughout their day.
*Reminders should help users avoid falling below 7,500 daily steps, the somewhat active level, and eventually push them toward 10,000 steps for an active lifestyle (cited from the PubMed article).
*Encourage more intense activity through short bursts, such as taking the stairs instead of the elevator, speed walking, jogging, or running. The CDC guideline could be referenced here.
*The “Time In Bed Vs Total Minutes Asleep” graph shows a positive relationship, but as previously stated, that doesn’t mean everyone falls asleep as soon as they get in bed. To improve sleep habits, and with them the likelihood of exercising and lowering sedentary time, Bellabeat should highlight sleep-tracking insights that identify patterns of insufficient sleep and offer actionable suggestions.
*The “Hourly Intensities” and “Hourly Steps” graphs show that activity peaks around 5-7pm. Bellabeat should highlight that the membership offers easy, convenient activities for that time of day, since users are most likely just getting off work, and trying new activities after a monotonous workday could make them more eager to stay on track with their goals.
*To help users watch their calories, recommend different nutritional options throughout the day, especially on Sundays, since that seems to be users’ one day to relax, when they would have more time to cook and try new recipes.
*Also remind users to practice self-care to further emphasize work/life balance. While this data set has no data on self-care, a Southern New Hampshire University article states that self-care positively affects many aspects of an individual’s life, and since the membership also provides beauty tips, it is well suited to balance and relaxation.