---
title: "Bellabeat Case Study (R)"
output: html_document
---
Bellabeat, a high-tech company that manufactures health-focused smart products, was founded by Urška Sršen and Sando Mur. Sršen used her background as an artist to develop beautifully designed technology that informs and inspires women around the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits. Since it was founded in 2013, Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for women.
Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
Sando Mur: Mathematician and Bellabeat’s cofounder
Bellabeat’s marketing analytics team: A team of data analysts responsible for collecting, analyzing and reporting data that helps guide Bellabeat’s marketing strategy.
The dataset analyzed is FitBit Fitness Tracker Data, which can be found in this link. It is a public dataset (CC0: Public Domain) and contains data collected from 30 eligible Fitbit users.
The dataset consists of 18 CSV files with different information collected by the Fitbit tracker. Some of the files come in both wide and narrow formats, and some of the tracker data is reported at daily, hourly and minute-level granularity.
A quick analysis reveals some limitations of the dataset. It is outdated: collected in 2016, it may not reflect fitness habits in 2023.
It lacks demographic information. Bellabeat’s goal is to provide a health tracker designed for its female users, but the dataset has no information about the gender or age of the users.
The sample size is small. A larger sample would be more representative and less prone to bias.
library(arsenal)
library(readr)
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ stringr 1.5.1
## ✔ forcats 1.0.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ lubridate::is.Date() masks arsenal::is.Date()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(pandoc)
For the case study we’re using all the daily logs from activity, calories, intensities, steps and sleep, and also weightLog.
A quick check of all the datasets shows that dailyActivity already includes the data from dailyCalories, dailyIntensities and dailySteps. To make sure no extra data is missing, we compare the datasets using the “comparedf” function from the arsenal package.
daily_activity <- read_csv("dailyActivity_merged.csv")
daily_calories <- read_csv("dailyCalories_merged.csv")
comparedf(daily_activity, daily_calories)
## Compare Object
##
## Function Call:
## comparedf(x = daily_activity, y = daily_calories)
##
## Shared: 2 non-by variables and 940 observations.
## Not shared: 14 variables and 0 observations.
##
## Differences found in 0/2 variables compared.
## 0 variables compared have non-identical attributes.
As we can see, there is no difference in the columns the two datasets have in common. The same holds for the other daily files, which makes merging them unnecessary. With this in mind, we only add Sleep and Weight.
daily_sleep <- read_csv("sleepDay_merged.csv")
weight <- read_csv("weightLogInfo_merged.csv")
Then we check how many distinct IDs each dataset has (activity, sleep and weight, in that order)
## [1] 33
## [1] 24
## [1] 8
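The code that produced the three counts above is not shown in the report. One way to obtain such counts, sketched here on small made-up data frames (the `toy_*` tables and their values are assumptions for illustration, not the real data), is dplyr’s `n_distinct()`:

```r
library(dplyr)

# Toy stand-ins for the real tables (hypothetical values, for illustration only)
toy_activity <- tibble(Id = c(1, 1, 2, 3), TotalSteps = c(4000, 8000, 12000, 500))
toy_sleep    <- tibble(Id = c(1, 2, 2), TotalMinutesAsleep = c(400, 450, 300))
toy_weight   <- tibble(Id = c(3), WeightKg = c(70))

# Count the distinct user IDs in each table
id_counts <- c(
  activity = n_distinct(toy_activity$Id),
  sleep    = n_distinct(toy_sleep$Id),
  weight   = n_distinct(toy_weight$Id)
)
print(id_counts)
```

Applied to `daily_activity$Id`, `daily_sleep$Id` and `weight$Id`, the same calls would presumably give the counts reported above.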
With only eight unique IDs, Weight doesn’t have enough data to be meaningful in the case study, so we are removing it.
After checking the unique IDs, we can also check for duplicates
sum(duplicated(daily_activity))
## [1] 0
sum(duplicated(daily_sleep))
## [1] 3
Duplicates were found in daily_sleep, so we are removing them
daily_sleep <- unique(daily_sleep)
Before merging the datasets, note that the date fields are stored as the wrong type (character). To avoid problems later, we convert them to the Date type.
daily_activity <- daily_activity %>%
rename(date = ActivityDate) %>%
mutate(date = as.Date(date , format= "%m/%d/%Y"))
daily_sleep <- daily_sleep %>%
rename(date = SleepDay) %>%
mutate(date = as.Date(date , format= "%m/%d/%Y"))
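As a quick check of the conversion above, the sketch below assumes that the sleep file stores a timestamp with a midnight time component while the activity file stores a plain date (both sample strings are hypothetical). `as.Date()` ignores any characters left over after the format has been matched, so both values end up as the same calendar day and can serve as a join key:

```r
# Sample strings in the assumed layouts (hypothetical values)
sleep_day     <- "4/12/2016 12:00:00 AM"
activity_date <- "4/12/2016"

# Trailing characters beyond the format (here the time of day) are ignored
parsed_sleep    <- as.Date(sleep_day, format = "%m/%d/%Y")
parsed_activity <- as.Date(activity_date, format = "%m/%d/%Y")

print(parsed_sleep)
print(identical(parsed_sleep, parsed_activity))
```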
The daily_activity dataset contains some columns we don’t need, so we remove them before merging
daily_activity <- daily_activity %>% select(-c(TrackerDistance, LoggedActivitiesDistance))
And then, finally merge the datasets
tracker_data <- merge(daily_activity, daily_sleep, by = c("Id", "date") )
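With its default arguments, `merge()` performs an inner join, so activity records without a matching sleep record are dropped. A minimal sketch on made-up data (the `toy_*` tables are assumptions for illustration) shows how `anti_join()` from dplyr can list the rows that do not survive the merge:

```r
library(dplyr)

# Toy stand-ins for the cleaned tables (hypothetical values)
toy_activity <- tibble(
  Id = c(1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  TotalSteps = c(4000, 8000, 12000)
)
toy_sleep <- tibble(
  Id = c(1, 2),
  date = as.Date(c("2016-04-12", "2016-04-12")),
  TotalMinutesAsleep = c(400, 450)
)

# merge() with default settings keeps only keys present in both tables
merged <- merge(toy_activity, toy_sleep, by = c("Id", "date"))

# Activity rows with no matching sleep record, i.e. the rows the merge drops
dropped <- anti_join(toy_activity, toy_sleep, by = c("Id", "date"))

print(nrow(merged))
print(dropped)
```

If the unmatched activity days should be kept, `merge(..., all.x = TRUE)` would retain them with missing sleep values.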
Now that our table is clean, properly merged and contains all the data we need, we can continue to the next step.
One way to organize the data for interpretation is to create user profiles: based on some of their data, we can group users to make working with them easier. In this study the important factors are the user’s activity, the number of calories burned and their sleep pattern.
A popular way to measure activity is to count daily steps, an approach used especially by pedometers. For this we’re using the following classification:
# Classify each daily record by step count.
# case_when() uses the first matching condition, so each band is [lower, upper).
tracker_data <- tracker_data %>%
  mutate(UserType = case_when(
    TotalSteps < 5000 ~ "Sedentary",
    TotalSteps < 7500 ~ "Normal",
    TotalSteps < 10000 ~ "Somewhat active",
    TotalSteps >= 10000 ~ "Active"
  ))
Source for the classification can be found in this link
ggplot(data=tracker_data)+geom_bar(mapping=aes(x=UserType, fill=UserType))
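The step bands can also be expressed with base R’s `cut()`, shown here as an alternative sketch rather than the method used in this study. With `right = FALSE` each interval includes its lower bound and excludes its upper bound, so values at the band edges (for example 7499 or 9999) always fall into exactly one category. The step counts below are hypothetical:

```r
# Hypothetical step counts, including values right at the band edges
steps <- c(0, 4999, 5000, 7499, 7500, 9999, 10000, 15000)

user_type <- cut(
  steps,
  breaks = c(-Inf, 5000, 7500, 10000, Inf),
  labels = c("Sedentary", "Normal", "Somewhat active", "Active"),
  right = FALSE  # intervals are [lower, upper)
)

print(data.frame(steps, user_type))
```

A side benefit is that `cut()` returns an ordered set of factor levels, so a bar chart would show the categories from least to most active instead of alphabetically.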
Another common way to measure physical activity is the number of minutes spent at higher intensity. Here we add the very active and fairly active minutes, following the CDC recommendation of at least 150 minutes of moderate-intensity activity. Note that the CDC target is a weekly one, whereas the threshold here is applied to each daily record.
# Flag daily records with at least 150 very or fairly active minutes
tracker_data <- tracker_data %>%
  mutate(ActivityMinutes = case_when(
    (VeryActiveMinutes + FairlyActiveMinutes) < 150 ~ "Less_than",
    (VeryActiveMinutes + FairlyActiveMinutes) >= 150 ~ "More_than"
  ))
ggplot(data=tracker_data)+geom_bar(mapping=aes(x=ActivityMinutes, fill=ActivityMinutes))
As we can see, very few records reach at least 150 minutes of more intense physical activity (only 11 occurrences). Given the importance of higher-intensity activity, however, we’re keeping this information and using it in our final considerations.
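Because the CDC’s 150-minute guideline is a weekly target, it can also be informative to total the active minutes per user per week before applying the threshold. The following is a minimal sketch on made-up data, not part of the original analysis; the toy values and the names `weekly` and `meets_150` are illustrative assumptions. It uses lubridate’s `floor_date()` to assign each date to the Monday that starts its week:

```r
library(dplyr)
library(lubridate)

# Hypothetical daily records for two users across two weeks
toy <- tibble(
  Id = rep(c(1, 2), each = 4),
  date = rep(as.Date(c("2016-04-11", "2016-04-13", "2016-04-18", "2016-04-20")), times = 2),
  VeryActiveMinutes   = c(30, 40, 10, 5, 60, 50, 0, 20),
  FairlyActiveMinutes = c(50, 40, 10, 5, 30, 20, 10, 10)
)

weekly <- toy %>%
  # Map every date to the Monday of its week
  mutate(week = floor_date(date, unit = "week", week_start = 1)) %>%
  group_by(Id, week) %>%
  summarise(
    active_minutes = sum(VeryActiveMinutes + FairlyActiveMinutes),
    .groups = "drop"
  ) %>%
  # Compare the weekly total against the 150-minute guideline
  mutate(meets_150 = active_minutes >= 150)

print(weekly)
```

On real data, weeks with few logged days would need care, since a low weekly total may reflect missing records rather than low activity.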
Finally, we can create a profile based on the sleep data, since good sleep quality is essential for a balanced and healthy lifestyle.
tracker_data <- tracker_data %>%
mutate(SleepQuality = case_when (
TotalMinutesAsleep >= 480 ~ "Fully Rested",
TotalMinutesAsleep >= 420 ~ "Rested",
TotalMinutesAsleep >= 360 ~ "Poorly Rested",
TotalMinutesAsleep < 360 ~ "Not Rested"
) )
ggplot(data=tracker_data)+geom_bar(mapping=aes(x=SleepQuality, fill=SleepQuality))
The ideal amount of sleep varies from person to person. For the classification we’re using average values, where 7-9 hours is the ideal amount of sleep. More information can be found in this link.
With those profiles in mind, we can start exploring the relationships in the data.
ggplot(data=tracker_data)+ geom_point(mapping=aes(x=TotalSteps, y=Calories, color=UserType))
As expected, those who walk more tend to burn more calories, which suggests that even lighter activities contribute to the user’s health and daily calorie expenditure.
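The strength of the relationship between steps and calories can be quantified with a correlation coefficient. This is a minimal sketch on made-up numbers (the `toy` data frame is an assumption for illustration), using base R’s `cor()`:

```r
# Hypothetical values, for illustration only
toy <- data.frame(
  TotalSteps = c(2000, 5000, 8000, 11000, 14000),
  Calories   = c(1700, 1900, 2300, 2500, 3000)
)

# Pearson correlation: values near 1 indicate a strong positive linear relationship
steps_calories_cor <- cor(toy$TotalSteps, toy$Calories)
print(steps_calories_cor)
```

On the merged table the equivalent call would be `cor(tracker_data$TotalSteps, tracker_data$Calories)`.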
Sleep Quality and User Activity
ggplot(data=tracker_data)+ geom_bar(mapping=aes(x=SleepQuality, fill=UserType))
There doesn’t appear to be a clear relationship between sleep quality and activity. One would expect more active users to get more tired and therefore sleep better, but there are clearly more factors involved than physical activity alone.
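Because the user types have different numbers of records, raw counts can be hard to compare across groups. One option, sketched here on made-up data (the `toy` table and the `share` column are illustrative assumptions), is to compute the share of each sleep category within each user type:

```r
library(dplyr)

# Hypothetical classified records
toy <- tibble(
  UserType     = c("Active", "Active", "Active", "Sedentary", "Sedentary"),
  SleepQuality = c("Rested", "Rested", "Not Rested", "Rested", "Not Rested")
)

shares <- toy %>%
  count(UserType, SleepQuality) %>%   # records per combination
  group_by(UserType) %>%
  mutate(share = n / sum(n)) %>%      # proportion within each user type
  ungroup()

print(shares)
```

In a plot, a similar effect can be obtained by mapping `x = UserType` and `fill = SleepQuality` and passing `position = "fill"` to `geom_bar()`.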
ggplot(data=tracker_data)+ geom_point(mapping=aes(x=TotalMinutesAsleep, y=TotalTimeInBed, color=SleepQuality))
There aren’t many oddities in this scatter plot: it is close to linear, with only a few outliers. This suggests that most users don’t take very long to fall asleep once they get to bed.
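The gap between time in bed and time asleep can also be computed directly. The sketch below uses made-up values, and the column name `MinutesAwakeInBed` is an assumption introduced for illustration. The difference covers all time awake in bed, including wake-ups during the night, so it is only a rough proxy for how long users take to fall asleep:

```r
library(dplyr)

# Hypothetical sleep records, in minutes
toy <- tibble(
  TotalMinutesAsleep = c(420, 380, 450, 300),
  TotalTimeInBed     = c(440, 400, 460, 420)
)

# Minutes spent in bed but not asleep
toy <- toy %>%
  mutate(MinutesAwakeInBed = TotalTimeInBed - TotalMinutesAsleep)

print(summary(toy$MinutesAwakeInBed))
```

Records with an unusually large difference correspond to the outliers that sit far from the diagonal in the scatter plot.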