This case study explores the question “How can a wellness company play it smart?”
Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products. Sršen used her background as an artist to develop beautifully designed technology that informs and inspires women around the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits. Since it was founded in 2013, Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for women.
How can Bellabeat find more opportunities to grow by analyzing smart device usage data to understand how users can improve their health and wellness?
The key stakeholders include:
To preview each dataframe I used head(dataframe), str(dataframe) and glimpse(dataframe). For this case I'm working with the following dataframes:
Using Sheets, I removed duplicates; sleepDay_merged.csv had 3 duplicate rows. I also renamed the columns and checked for stray whitespace to make further coding in R easier.
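The duplicate check was done in a spreadsheet, but the same step can be reproduced in R so that it is recorded alongside the rest of the analysis. The sketch below is only an illustration: `sleep_raw` and its values are made up, and the column names are assumptions rather than the real file's headers.

```r
library(dplyr)

# Hypothetical stand-in for the sleep data; the values are invented.
sleep_raw <- tibble(
  id = c(1, 1, 1, 2),
  sleep_day = c("4/12/2016", "4/12/2016", "4/13/2016", "4/12/2016"),
  total_time_in_bed = c(346, 346, 407, 320)
)

# Count rows that are exact copies of an earlier row.
n_duplicates <- sum(duplicated(sleep_raw))

# Keep a single copy of every distinct row.
sleep_dedup <- distinct(sleep_raw)

n_duplicates
nrow(sleep_dedup)
```

Running the count before and after `distinct()` gives a quick confirmation of how many rows were dropped.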
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse) # For data manipulation and visualization
library(janitor) # For data cleaning functions
library(dplyr) # For data manipulation functions
library(lubridate) # For parsing dates
library(ggplot2) # For visualization
library(tidyr) # For cleaning and preparing data
daily_activity <- read_csv("~/Documents/Data Science/Data /bellabeat_casestudy_andressa/data/Copy_dailyActivity_merged.csv") %>% clean_names()
daily_calories <- read_csv("~/Documents/Data Science/Data /bellabeat_casestudy_andressa/data/Copy_dailyCalories_merged.csv") %>% clean_names()
daily_intensitie <- read_csv("~/Documents/Data Science/Data /bellabeat_casestudy_andressa/data/Copy_dailyIntensities_merged.csv") %>% clean_names()
daily_steps <- read_csv("~/Documents/Data Science/Data /bellabeat_casestudy_andressa/data/Copy_dailySteps_merged.csv") %>% clean_names()
daily_sleep <- read_csv("~/Documents/Data Science/Data /bellabeat_casestudy_andressa/data/Copy_sleepDay_merged.csv") %>% clean_names()
n_distinct(daily_activity$id)
## [1] 33
n_distinct(daily_calories$id)
## [1] 33
n_distinct(daily_intensitie$id)
## [1] 33
n_distinct(daily_steps$id)
## [1] 33
n_distinct(daily_sleep$id)
## [1] 24
Not every participant logged daily sleep. The difference in participant counts between the sleep dataset (24 users) and the other datasets (33 users) is significant and worth exploring; possible reasons are covered in the Analyze phase. Let's create a single dataframe to work on, to simplify the coding:
dataset_list <- list(
  daily_activity,
  daily_calories,
  daily_intensitie,
  daily_sleep,
  daily_steps
)
merged_data <- dataset_list %>%
  reduce(full_join, by = c("id", "activitydate"))
glimpse(merged_data)
## Rows: 940
## Columns: 26
## $ id <dbl> 1503960366, 1503960366, 1503960366, 15039…
## $ activitydate <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4…
## $ total_steps <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 1…
## $ total_distance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59,…
## $ tracker_distance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59,…
## $ logged_activities_distance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ very_active_distance.x <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25,…
## $ moderately_active_distance.x <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64,…
## $ light_active_distance.x <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71,…
## $ sedentary_active_distance.x <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ very_active_minutes.x <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 6…
## $ fairly_active_minutes.x <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27…
## $ lightly_active_minutes.x <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 2…
## $ sedentary_minutes.x <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775,…
## $ calories.x <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921,…
## $ calories.y <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921,…
## $ sedentary_minutes.y <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775,…
## $ lightly_active_minutes.y <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 2…
## $ fairly_active_minutes.y <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27…
## $ very_active_minutes.y <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 6…
## $ sedentary_active_distance.y <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ light_active_distance.y <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71,…
## $ moderately_active_distance.y <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64,…
## $ very_active_distance.y <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25,…
## $ total_time_in_bed <dbl> 346, 407, NA, 442, 367, 712, NA, 320, 377…
## $ step_total <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 1…
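The `.x` and `.y` suffixes in the output above appear because several tables carry columns with the same names, and `full_join()` keeps both copies. One possible way to avoid this is sketched below, assuming the repeated columns really hold identical values (worth verifying before dropping anything). The helper `join_new_cols` and the toy tables are hypothetical and not part of the original analysis.

```r
library(dplyr)
library(purrr)

keys <- c("id", "activitydate")

# Hypothetical helper: from `y`, keep only the keys plus columns that
# `x` does not already have, then full-join on the keys.
join_new_cols <- function(x, y) {
  new_cols <- setdiff(names(y), names(x))
  full_join(x, select(y, all_of(c(keys, new_cols))), by = keys)
}

# Toy tables with invented layout; `calories` repeats a column of `activity`.
activity <- tibble(
  id = c(1, 1),
  activitydate = c("4/12/2016", "4/13/2016"),
  total_steps = c(13162, 10735),
  calories = c(1985, 1797)
)
calories <- tibble(
  id = c(1, 1),
  activitydate = c("4/12/2016", "4/13/2016"),
  calories = c(1985, 1797)
)
sleep <- tibble(id = 1, activitydate = "4/12/2016", total_time_in_bed = 346)

merged_small <- reduce(list(activity, calories, sleep), join_new_cols)
names(merged_small)
```

This silently prefers the first table's copy of a repeated column, so it is only appropriate when the copies have been confirmed to match.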
sum(is.na(merged_data$id))
## [1] 0
sum(is.na(merged_data$activitydate))
## [1] 0
merged_clean <- merged_data %>%
  clean_names() %>%
  rename_with(~str_remove_all(., "\\W+")) %>%
  rename(
    id = matches("^id$|^i_d$|participant"),
    activity_date = matches("activitydate|date|^day$")
  )
merged_clean <- merged_clean %>%
  mutate(
    activity_date = parse_date_time(
      activity_date,
      orders = c("ymd", "mdy", "dmy", "Y-m-d", "m/d/Y")
    ) %>% as.Date()
  )
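Because the dates shown in the glimpse output ("4/12/2016", "4/13/2016") are in month/day/year order, a stricter alternative is to parse them with a single explicit order. The sketch below uses made-up sample strings and is only meant to show why the order matters: the same string reads as a different day under `dmy()`.

```r
library(lubridate)

# Sample strings in the same month/day/year layout as the glimpse output.
dates_raw <- c("4/12/2016", "4/13/2016", "5/1/2016")

# mdy() accepts only month-day-year order.
dates_parsed <- mdy(dates_raw)

# The same text read as day-month-year gives a different date.
ambiguous_as_dmy <- dmy("4/12/2016")

dates_parsed
ambiguous_as_dmy
```

Allowing several orders is convenient when the format is unknown, but a single explicit order removes the ambiguity for values such as "4/12/2016".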
New columns added for the analysis: day_of_week, is_weekend, week_of_year and activity_level.
merged_analysis <- merged_clean %>%
  mutate(
    day_of_week = weekdays(activity_date), # day names follow the session locale
    is_weekend = day_of_week %in% c("Saturday", "Sunday"), # assumes English day names
    week_of_year = week(activity_date),
    activity_level = case_when(
      step_total > 10000 ~ "high",
      step_total > 5000 ~ "medium",
      TRUE ~ "low"
    )
  )
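To make the activity_level thresholds concrete, here is a small worked example on invented step counts. It applies the same `case_when()` rules as above; note that the comparisons are strict, so exactly 10000 steps is labelled "medium" and exactly 5000 is labelled "low".

```r
library(dplyr)

# Invented step counts chosen to sit on and around the thresholds.
steps <- tibble(step_total = c(13162, 10000, 5001, 5000, 0))

labelled <- steps %>%
  mutate(
    activity_level = case_when(
      step_total > 10000 ~ "high",   # strictly more than 10000
      step_total > 5000 ~ "medium",  # 5001 to 10000
      TRUE ~ "low"                   # 5000 or fewer
    )
  )

labelled$activity_level
```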
write_csv(merged_analysis, "fitbit_data_merged.csv")
Comparing activity on days with and without a record from daily_sleep.
merged_analysis %>%
  mutate(has_sleep_data = !is.na(total_time_in_bed)) %>%
  group_by(has_sleep_data) %>%
  summarise(
    avg_steps = mean(total_steps, na.rm = TRUE),
    avg_active_mins = mean(very_active_minutes_x, na.rm = TRUE)
  )
## # A tibble: 2 × 3
## has_sleep_data avg_steps avg_active_mins
## <lgl> <dbl> <dbl>
## 1 FALSE 6959. 18.2
## 2 TRUE 8515. 25.0
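The summary above compares days, not people. To see which participants never logged sleep at all, one option is an anti-join between the participant ids and the ids present in the sleep table. The sketch below uses invented ids and is an assumption about how that check could look, not a step from the original analysis.

```r
library(dplyr)

# Invented ids: three participants, of whom two have sleep records.
activity_ids <- tibble(id = c(101, 102, 103))
sleep <- tibble(
  id = c(101, 101, 103),
  total_time_in_bed = c(346, 407, 320)
)

# Keep the participants whose id never appears in the sleep table.
no_sleep <- anti_join(activity_ids, distinct(sleep, id), by = "id")

no_sleep$id
```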
ggplot(merged_analysis, aes(x = total_steps, fill = activity_level)) +
  geom_density(alpha = 0.6) +
  labs(title = "Step Count Distribution by Activity Level",
       x = "Total Daily Steps",
       y = "Density",
       fill = "Activity Level") +
  scale_fill_manual(
    values = c("low" = "#d62828",
               "medium" = "#fcbf49",
               "high" = "#003049"),
    name = "Activity Level"
  ) +
  theme_minimal()
ggplot(heatmap_data, aes(x = day_of_week, y = intensity, fill = minutes)) +
  geom_tile(color = "#c7f9cc") +
  scale_fill_gradient(low = "#fdcc6d", high = "#e75414") +
  labs(title = "Average Activity Intensity by Day of Week",
       x = "",
       y = "Activity Intensity",
       fill = "Minutes") +
  theme_minimal() +
  geom_text(aes(label = round(minutes)), color = "black", size = 3)
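The heatmap above reads from heatmap_data, which is not defined anywhere in this section. One plausible way to build it, assuming it holds the mean minutes per intensity level for each day of the week, is sketched below. The `toy` table and its values are invented; the column names follow the `_x` names that `clean_names()` produces from the merged data.

```r
library(dplyr)
library(tidyr)

# Invented rows using the same column names as merged_analysis.
toy <- tibble(
  day_of_week = c("Monday", "Monday", "Tuesday"),
  very_active_minutes_x = c(20, 40, 10),
  fairly_active_minutes_x = c(10, 30, 5),
  lightly_active_minutes_x = c(200, 220, 180),
  sedentary_minutes_x = c(700, 800, 900)
)

# Reshape to one row per day and intensity, then average the minutes.
heatmap_data <- toy %>%
  pivot_longer(
    cols = ends_with("_minutes_x"),
    names_to = "intensity",
    values_to = "minutes"
  ) %>%
  group_by(day_of_week, intensity) %>%
  summarise(minutes = mean(minutes, na.rm = TRUE), .groups = "drop")

heatmap_data
```

With the real data, `toy` would be replaced by merged_analysis; sedentary minutes are much larger than the other levels, so they may dominate the colour scale.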
ggplot(merged_analysis, aes(x = total_steps, y = total_distance)) +
  geom_point(aes(color = activity_level), alpha = 0.6) +
  geom_smooth(method = "lm", color = "black") +
  labs(title = "Relationship Between Steps and Distance",
       x = "Total Steps",
       y = "Total Distance (miles/km)",
       color = "Activity Level") +
  scale_color_manual(
    values = c("low" = "#d62828",
               "medium" = "#fcbf49",
               "high" = "#003049"),
    name = "Activity Level"
  )
## `geom_smooth()` using formula = 'y ~ x'