Bellabeat is a technology company that specializes in creating health-focused tech products specifically designed for women. Among its products are the Bellabeat app, the Leaf wearable wellness tracker, the Time wellness watch, the Spring smart water bottle, and the Bellabeat membership. For the purpose of this study, the focus will be on the Leaf product, which tracks activity, sleep, and stress levels.
The dataset used for this study is sourced from Kaggle and contains data from the FitBit Fitness Tracker, a wellness tracker similar to Bellabeat’s Leaf. The dataset is listed as public domain, so it is free to use for research purposes.
The goal of this study is to explore the data to identify trends in smart device usage and provide insights on how these trends may be relevant to Bellabeat customers. By analyzing these trends, it is hoped that this study can inform Bellabeat’s marketing strategy, leading to more effective targeting of potential customers and a better understanding of their needs and preferences.
Load the necessary packages (install them first with install.packages() if they are not already available).
#Loading packages
library(tidyverse)
library(janitor)
library(lubridate)
library(ggplot2)
Load the data
data_dir <- "~/R/case_studies/fitabase_data"
# list.files() takes a regular expression, not a glob, so match the extension explicitly
filepaths <- list.files(path = data_dir, pattern = "\\.csv$", full.names = TRUE)
# Drop the last 11 characters (file suffix and extension) to get short data frame names
filenames <- list.files(path = data_dir, pattern = "\\.csv$") %>%
  str_sub(1, -12)
filecontents <- filepaths %>%
  # Using the path supplied by each element in filepaths, read in csvs
  map(read_csv) %>%
  # Rename each element using names from filenames vector
  set_names(filenames)
# Add each list item from filecontents into the Global environment
list2env(filecontents, envir = .GlobalEnv)
## <environment: R_GlobalEnv>
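The `str_sub(1, -12)` call above removes a fixed number of characters, so it only works if every file name ends in the same 11-character suffix. The sketch below strips the suffix by pattern instead. It assumes the files end in `_merged.csv` (inferred from the resulting names such as `dailyActivity`); the helper `strip_suffix` and the sample file names are hypothetical.

```r
# Hypothetical helper: derive a data frame name from a CSV file name
# without relying on a fixed character count.
strip_suffix <- function(files) {
  stem <- tools::file_path_sans_ext(basename(files))  # drop directory and extension
  sub("_merged$", "", stem)                           # drop a trailing "_merged", if any
}

# Made-up file names for illustration
example_files <- c("data/dailyActivity_merged.csv",
                   "data/sleepDay_merged.csv",
                   "data/notes.csv")
example_names <- strip_suffix(example_files)
print(example_names)
```

A file without the suffix keeps its full stem here, whereas a fixed-width cut would truncate it.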
Looping over the data frames and counting the unique ids in each one shows which datasets are missing participants.
for (i in names(filecontents)) {
df <- filecontents[[i]]
unique_ids <- n_distinct(df$Id)
print(paste(i, "has", unique_ids, "unique IDs"))
}
## [1] "dailyActivity has 33 unique IDs"
## [1] "dailyCalories has 33 unique IDs"
## [1] "dailyIntensities has 33 unique IDs"
## [1] "dailySteps has 33 unique IDs"
## [1] "heartrate_seconds has 14 unique IDs"
## [1] "hourlyCalories has 33 unique IDs"
## [1] "hourlyIntensities has 33 unique IDs"
## [1] "hourlySteps has 33 unique IDs"
## [1] "minuteCaloriesNarrow has 33 unique IDs"
## [1] "minuteCaloriesWide has 33 unique IDs"
## [1] "minuteIntensitiesNarrow has 33 unique IDs"
## [1] "minuteIntensitiesWide has 33 unique IDs"
## [1] "minuteMETsNarrow has 33 unique IDs"
## [1] "minuteSleep has 24 unique IDs"
## [1] "minuteStepsNarrow has 33 unique IDs"
## [1] "minuteStepsWide has 33 unique IDs"
## [1] "sleepDay has 24 unique IDs"
## [1] "weightLogInfo has 8 unique IDs"
The data comes from 33 Fitbit users, so every data frame should contain 33 unique ids. The output shows that the heart rate, sleep, and weight data fall short of that. Given this information, we will focus on daily activity, hourly calories, hourly steps, and sleep data. The sleep data has only 24 unique ids, but since Bellabeat has products that also track sleep, it's important that we at least explore it.
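The counts show how many participants a table covers, but not which ones are absent. As a minimal sketch, the id sets can be compared directly. The toy data frames below stand in for `dailyActivity` and `sleepDay`, and the helper `missing_ids` is hypothetical.

```r
# Toy stand-ins for dailyActivity and sleepDay (IDs are made up)
daily_toy <- data.frame(Id = c(101, 101, 102, 103, 104))
sleep_toy <- data.frame(Id = c(101, 103, 103))

# Hypothetical helper: IDs present in the reference table but absent from another table
missing_ids <- function(reference, other) {
  setdiff(unique(reference$Id), unique(other$Id))
}

absent_from_sleep <- missing_ids(daily_toy, sleep_toy)
print(absent_from_sleep)
```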
We will now take a peek at the data frames.
head(dailyActivity)
## # A tibble: 6 × 15
## Id Activ…¹ Total…² Total…³ Track…⁴ Logge…⁵ VeryA…⁶ Moder…⁷ Light…⁸ Seden…⁹
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2… 13162 8.5 8.5 0 1.88 0.550 6.06 0
## 2 1.50e9 4/13/2… 10735 6.97 6.97 0 1.57 0.690 4.71 0
## 3 1.50e9 4/14/2… 10460 6.74 6.74 0 2.44 0.400 3.91 0
## 4 1.50e9 4/15/2… 9762 6.28 6.28 0 2.14 1.26 2.83 0
## 5 1.50e9 4/16/2… 12669 8.16 8.16 0 2.71 0.410 5.04 0
## 6 1.50e9 4/17/2… 9705 6.48 6.48 0 3.19 0.780 2.51 0
## # … with 5 more variables: VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## # LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>, and
## # abbreviated variable names ¹ActivityDate, ²TotalSteps, ³TotalDistance,
## # ⁴TrackerDistance, ⁵LoggedActivitiesDistance, ⁶VeryActiveDistance,
## # ⁷ModeratelyActiveDistance, ⁸LightActiveDistance, ⁹SedentaryActiveDistance
head(hourlyCalories)
## # A tibble: 6 × 3
## Id ActivityHour Calories
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 81
## 2 1503960366 4/12/2016 1:00:00 AM 61
## 3 1503960366 4/12/2016 2:00:00 AM 59
## 4 1503960366 4/12/2016 3:00:00 AM 47
## 5 1503960366 4/12/2016 4:00:00 AM 48
## 6 1503960366 4/12/2016 5:00:00 AM 48
head(hourlySteps)
## # A tibble: 6 × 3
## Id ActivityHour StepTotal
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 373
## 2 1503960366 4/12/2016 1:00:00 AM 160
## 3 1503960366 4/12/2016 2:00:00 AM 151
## 4 1503960366 4/12/2016 3:00:00 AM 0
## 5 1503960366 4/12/2016 4:00:00 AM 0
## 6 1503960366 4/12/2016 5:00:00 AM 0
head(sleepDay)
## # A tibble: 6 × 5
## Id SleepDay TotalSleepRecords TotalMinutesAsleep TotalT…¹
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 1 327 346
## 2 1503960366 4/13/2016 12:00:00 AM 2 384 407
## 3 1503960366 4/15/2016 12:00:00 AM 1 412 442
## 4 1503960366 4/16/2016 12:00:00 AM 2 340 367
## 5 1503960366 4/17/2016 12:00:00 AM 1 700 712
## 6 1503960366 4/19/2016 12:00:00 AM 1 304 320
## # … with abbreviated variable name ¹TotalTimeInBed
A combination of tools from tidyverse and janitor is used to clean, process, filter, and sort the data. The lubridate package is also used to manipulate dates for easier analysis.
Reformatting dates, adding a column for the day of the week, dropping unnecessary columns, and shortening the 10-digit id.
dailyNewActivity<- dailyActivity %>%
clean_names() %>%
mutate(activity_date = mdy(activity_date), day_week = weekdays(activity_date)) %>%
rename(date = activity_date) %>%
select(-c(5:14))
dailyNewActivity$id <- as.numeric(factor(dailyNewActivity$id, levels = unique(dailyNewActivity$id)))
head(dailyNewActivity)
## # A tibble: 6 × 6
## id date total_steps total_distance calories day_week
## <dbl> <date> <dbl> <dbl> <dbl> <chr>
## 1 1 2016-04-12 13162 8.5 1985 Tuesday
## 2 1 2016-04-13 10735 6.97 1797 Wednesday
## 3 1 2016-04-14 10460 6.74 1776 Thursday
## 4 1 2016-04-15 9762 6.28 1745 Friday
## 5 1 2016-04-16 12669 8.16 1863 Saturday
## 6 1 2016-04-17 9705 6.48 1728 Sunday
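The `factor()` recoding above numbers participants in order of first appearance within a single data frame. When the same recoding is applied to each table separately, a short id is only guaranteed to refer to the same participant across tables if every table contains the same participants in the same order, which may not hold for the sleep data with its 24 participants. The sketch below contrasts separate recoding with one shared lookup; the IDs and variable names are made up for illustration.

```r
# Made-up 10-digit IDs; the second table lacks one participant
daily_ids <- c(1503960366, 1503960366, 1624580081, 1644430081)
sleep_ids <- c(1503960366, 1644430081)

# Recoding each table separately numbers IDs by first appearance in that table
separate_sleep <- as.numeric(factor(sleep_ids, levels = unique(sleep_ids)))

# Hypothetical shared lookup built once from the most complete table
id_levels <- unique(daily_ids)
shared_daily <- match(daily_ids, id_levels)
shared_sleep <- match(sleep_ids, id_levels)

print(separate_sleep)  # participant 1644430081 becomes 2 here
print(shared_sleep)    # with the shared lookup it keeps the id it has in the daily table
```

This only matters if the tables are later joined or compared by the short id.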
Merging hourlySteps and hourlyCalories, reformatting dates, creating the weekday column, and shortening ids.
hourlyNewActivity <- hourlySteps %>%
left_join(hourlyCalories, by = c("Id", "ActivityHour")) %>%
clean_names() %>%
mutate(activity_hour = mdy_hms(activity_hour), day_week = weekdays(activity_hour)) %>%
separate(col = activity_hour, into = c("date", "time"), sep = " ") %>%
mutate(date = ymd(date))
hourlyNewActivity$id <- as.numeric(factor(hourlyNewActivity$id, levels = unique(hourlyNewActivity$id)))
head(hourlyNewActivity)
## # A tibble: 6 × 6
## id date time step_total calories day_week
## <dbl> <date> <chr> <dbl> <dbl> <chr>
## 1 1 2016-04-12 00:00:00 373 81 Tuesday
## 2 1 2016-04-12 01:00:00 160 61 Tuesday
## 3 1 2016-04-12 02:00:00 151 59 Tuesday
## 4 1 2016-04-12 03:00:00 0 47 Tuesday
## 5 1 2016-04-12 04:00:00 0 48 Tuesday
## 6 1 2016-04-12 05:00:00 0 48 Tuesday
Reformatting dates, renaming columns, converting minutes into an hours:minutes format, dropping unnecessary columns, and shortening ids.
newSleepDay <- sleepDay %>%
clean_names() %>%
mutate(sleep_day = as_date(mdy_hms(sleep_day))) %>%
rename("sleep_records" = "total_sleep_records",
"minutes_asleep" = "total_minutes_asleep",
"total_bed_minutes" = "total_time_in_bed") %>%
mutate(hours_asleep = floor(minutes_asleep / 60),
minutes_asleep = minutes_asleep %% 60,
total_bed_hours = floor(total_bed_minutes / 60),
bed_minutes = total_bed_minutes %% 60) %>%
mutate(sleep_time = paste0(hours_asleep, ":", sprintf("%02d", minutes_asleep)),
total_bed_time = paste0(total_bed_hours, ":", sprintf("%02d", bed_minutes)))%>%
select(-c("minutes_asleep", "hours_asleep", "total_bed_hours", "bed_minutes"))
newSleepDay$id <- as.numeric(factor(newSleepDay$id, levels = unique(newSleepDay$id)))
head(newSleepDay)
## # A tibble: 6 × 6
## id sleep_day sleep_records total_bed_minutes sleep_time total_bed_time
## <dbl> <date> <dbl> <dbl> <chr> <chr>
## 1 1 2016-04-12 1 346 5:27 5:46
## 2 1 2016-04-13 2 407 6:24 6:47
## 3 1 2016-04-15 1 442 6:52 7:22
## 4 1 2016-04-16 2 367 5:40 6:07
## 5 1 2016-04-17 1 712 11:40 11:52
## 6 1 2016-04-19 1 320 5:04 5:20
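The hour and minute split above relies on integer division (`%/%`) and the modulo operator (`%%`). The base R sketch below applies the same arithmetic to three time-in-bed values from the `sleepDay` preview; the helper `to_h_mm` is hypothetical.

```r
# Hypothetical helper: format a number of whole minutes as "h:mm"
to_h_mm <- function(minutes) {
  sprintf("%d:%02d", as.integer(minutes %/% 60), as.integer(minutes %% 60))
}

# Time-in-bed values (minutes) taken from the first rows of sleepDay
bed_minutes <- c(346, 407, 712)
formatted <- to_h_mm(bed_minutes)
print(formatted)
```

The formatted strings are convenient for display, but they are character values, so any further arithmetic should use the minute columns.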
Creating a new data frame that keeps only the necessary columns and drops duplicate rows and entries with zero steps, distance, or calories. It is later aggregated into a data frame of average steps.
dailyFinal <- dailyNewActivity %>%
  select(id, date, total_steps, total_distance, calories, day_week) %>%
  # Drop exact duplicate rows, then entries with no recorded activity
  distinct() %>%
  filter(total_steps != 0, total_distance != 0, calories != 0) %>%
  mutate(day_week = as.factor(day_week))
head(dailyFinal)
## # A tibble: 6 × 6
## id date total_steps total_distance calories day_week
## <dbl> <date> <dbl> <dbl> <dbl> <fct>
## 1 1 2016-04-12 13162 8.5 1985 Tuesday
## 2 1 2016-04-13 10735 6.97 1797 Wednesday
## 3 1 2016-04-14 10460 6.74 1776 Thursday
## 4 1 2016-04-15 9762 6.28 1745 Friday
## 5 1 2016-04-16 12669 8.16 1863 Saturday
## 6 1 2016-04-17 9705 6.48 1728 Sunday
ggplot(dailyFinal, aes(x = factor(id))) +
  # One bar per participant; treating id as a factor gives a discrete axis
  geom_bar() +
  xlab("ID") +
  ylab("Number of Entries") +
  ggtitle("Number of Entries per ID")
Each bar in the plot counts the daily entries recorded for one participant. 20 of the 33 participants had 30 or more entries, while the remaining participants had around 15-25 entries. Ideally, every participant would have data for every day of the month. Nonetheless, the available data will be explored further, and conclusions will be drawn from the relevant averages.
In the next code chunk, dailyFinal is aggregated to give the average number of steps each participant took on each day of the week. This makes it possible to plot average steps by weekday and to see whether certain days show distinct activity patterns.
dailyFinal_agg <- dailyFinal %>%
group_by(id, day_week) %>%
summarise(average_steps = mean(total_steps))
head(dailyFinal_agg)
## # A tibble: 6 × 3
## # Groups: id [1]
## id day_week average_steps
## <dbl> <fct> <dbl>
## 1 1 Friday 11466.
## 2 1 Monday 13781.
## 3 1 Saturday 13426.
## 4 1 Sunday 10102.
## 5 1 Thursday 11876.
## 6 1 Tuesday 13947.
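The table above has one row per participant and weekday, sorted alphabetically by weekday. For a plot with a single value per weekday, those rows can be collapsed once more, and the weekday factor can be given calendar order. The base R sketch below uses made-up values in the shape of `dailyFinal_agg` and assumes English weekday names.

```r
# Toy per-participant averages in the shape of dailyFinal_agg (values are made up)
agg_toy <- data.frame(
  id            = c(1, 1, 2, 2),
  day_week      = c("Monday", "Friday", "Monday", "Friday"),
  average_steps = c(12000, 10000, 8000, 6000)
)

# Calendar order for the weekday factor (English weekday names assumed)
week_order <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")
agg_toy$day_week <- factor(agg_toy$day_week, levels = week_order)

# Mean of the per-participant averages, so each participant counts equally
weekday_means <- aggregate(average_steps ~ day_week, data = agg_toy, FUN = mean)
print(weekday_means)
```

Averaging the per-participant averages weights every participant equally, regardless of how many days each one logged.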
Creating aggregated data frame for the hourly data.
hourlyNewActivity_agg <- hourlyNewActivity%>%
mutate(hour = as.numeric(format(strptime(paste(date, time), "%Y-%m-%d %H:%M:%S"), "%H"))) %>%
group_by(id, hour) %>%
summarize(step_total = sum(step_total))%>%
group_by(id) %>%
mutate(min_hour = hour[which.min(step_total)],
max_hour = hour[which.max(step_total)])
head(hourlyNewActivity_agg)
## # A tibble: 6 × 5
## # Groups: id [1]
## id hour step_total min_hour max_hour
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 4280 5 18
## 2 1 1 1503 5 18
## 3 1 2 870 5 18
## 4 1 3 355 5 18
## 5 1 4 108 5 18
## 6 1 5 63 5 18
The resulting data frame has up to 24 rows per id, one for each hour of the day. For instance, the first row shows that id 1 recorded a total of 4280 steps during the hour starting at 0 (12:00 AM), summed over the month. The min_hour and max_hour columns give the hours in which each participant took the fewest and the most steps over the month.
To build on these observations, an additional data frame averages those monthly totals across participants for each hour of the day. This gives a broader view of how activity levels vary over the day and helps identify patterns for further analysis.
#Average steps taken at each hour of the day
hourly_average <- hourlyNewActivity_agg %>%
group_by(hour) %>%
summarize(avg_steps = mean(step_total))
head(hourly_average)
## # A tibble: 6 × 2
## hour avg_steps
## <dbl> <dbl>
## 1 0 1194.
## 2 1 653.
## 3 2 484.
## 4 3 182.
## 5 4 359.
## 6 5 1239.
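Because `avg_steps` above is a mean of monthly totals, its scale depends on how many days each participant logged. As a sketch of an alternative, the mean can be taken over the individual hourly records, which gives the typical number of steps in that hour on a single day. The toy data below is made up and assumes the hour has already been extracted into its own column.

```r
# Toy hourly records in the shape of hourlyNewActivity (values are made up)
hourly_toy <- data.frame(
  id         = c(1, 1, 1, 2),
  date       = as.Date(c("2016-04-12", "2016-04-13", "2016-04-14", "2016-04-12")),
  hour       = c(18, 18, 18, 18),
  step_total = c(900, 1100, 1000, 400)
)

# Total per participant and hour, then the mean of those totals (the approach above)
totals <- aggregate(step_total ~ id + hour, data = hourly_toy, FUN = sum)
mean_of_totals <- aggregate(step_total ~ hour, data = totals, FUN = mean)

# Alternative: mean over the individual hourly records
mean_per_record <- aggregate(step_total ~ hour, data = hourly_toy, FUN = mean)

print(mean_of_totals$step_total)   # (3000 + 400) / 2
print(mean_per_record$step_total)  # (900 + 1100 + 1000 + 400) / 4
```

Both versions rank the hours in a similar way when participants log a similar number of days; they differ mainly in how the values should be read.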
As noted earlier, the sleep data frame contains only 24 unique ids, so data is missing for some participants. Nonetheless, because sleep tracking matters for Bellabeat’s health and wellness monitoring, the metric is explored further. For each participant, the sleep records and the minutes in bed are summed, the total minutes are converted into total hours in bed, and that total is divided by the number of sleep records to give the average time in bed per record. These exploratory findings can then be communicated through visualizations.
#Totaling the amount of sleep records and time by id
sleep_record_counts <- newSleepDay %>%
  group_by(id) %>%
  summarize(total_sleep_records = sum(sleep_records),
            total_bed_minutes = sum(total_bed_minutes)) %>%
  # Convert minutes to hours, then average per sleep record (rounded to 2 decimals)
  mutate(total_time = total_bed_minutes / 60,
         average_time_in_bed = round(total_time / total_sleep_records, 2))
head(sleep_record_counts)
## # A tibble: 6 × 5
## id total_sleep_records total_bed_minutes total_time average_time_in_bed
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 27 9580 160. 5.91
## 2 2 4 1384 23.1 5.77
## 3 3 3 2883 48.0 16.0
## 4 4 8 2189 36.5 4.56
## 5 5 28 15054 251. 8.96
## 6 6 1 69 1.15 1.15
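The `average_time_in_bed` column divides by the summed `sleep_records`, so it is an average per sleep record, and a day with two records counts twice in the denominator. If the average per logged day is of interest instead, the number of rows per participant can serve as the denominator. The base R sketch below shows both on made-up values in the shape of `newSleepDay`; `days_logged`, `hours_per_record`, and `hours_per_day` are hypothetical column names.

```r
# Toy rows in the shape of newSleepDay: one row per participant per logged day
sleep_toy <- data.frame(
  id                = c(1, 1, 2),
  sleep_records     = c(1, 2, 1),
  total_bed_minutes = c(360, 420, 480)
)

# Sum minutes and records per participant, and count the logged days (rows)
sums <- aggregate(cbind(total_bed_minutes, sleep_records) ~ id,
                  data = sleep_toy, FUN = sum)
days <- aggregate(total_bed_minutes ~ id, data = sleep_toy, FUN = length)
per_id <- merge(sums, setNames(days, c("id", "days_logged")), by = "id")

# Hours in bed per sleep record (as computed above) and per logged day
per_id$hours_per_record <- round(per_id$total_bed_minutes / 60 / per_id$sleep_records, 2)
per_id$hours_per_day    <- round(per_id$total_bed_minutes / 60 / per_id$days_logged, 2)
print(per_id)
```

For participants with one record per day the two averages coincide; they diverge when naps or split nights add extra records.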
Despite the limitations of the Fitbit user data, Bellabeat can leverage key activity trends to optimize engagement with its users. By analyzing historical averages and predicting future spikes or declines in activity, the company can proactively push notifications to encourage healthy habits and provide useful tips. By establishing a digital relationship with users, Bellabeat can integrate itself into their daily routines and foster greater engagement with the app and device. This, in turn, can generate more self-reported data from users, creating a virtuous cycle of data generation and analysis.
Health data is a valuable resource that can yield many insights. If Bellabeat were to build its own dataset, integrating both automatically tracked and self-reported health data, the company could use it to provide personalized health advice and warnings. By aggregating this data and presenting it to users through compelling visualizations, Bellabeat could create an experience similar to Spotify’s “Wrapped” feature. These visualizations could showcase user progress, goals achieved, and other noteworthy accomplishments, creating an engaging and interactive user experience.