bella_study.knit

1. Ask

1.1 Study Objective

Analyze smart device usage data to gain insights into how consumers use non-Bellabeat smart devices.

1.2 Stakeholders

Urška Sršen: Cofounder and Bellabeat’s CEO
Sando Mur: Cofounder and Bellabeat’s Mathematician
Bellabeat’s marketing analytics team: A team of data analysts responsible for collecting, analyzing and reporting data that helps guide Bellabeat’s marketing strategy.

2. Prepare

The dataset analyzed is FitBit Fitness Tracker Data, which can be found in this link. It is a public dataset (CC0: Public Domain) containing data collected from 30 eligible Fitbit users.

Data Organization

The dataset consists of 18 CSV files, each holding different information collected by the Fitbit tracker. Some files come in both wide and narrow formats, and some of the tracker data is reported at daily, hourly and minute-level granularity.

Data Integrity

A quick analysis reveals some limitations in the dataset. It is outdated: the data was collected in 2016, so it may not reflect fitness habits in 2023.

It lacks demographic information. Bellabeat’s goal is to provide a health tracker designed for its female users, but the dataset has no information about the gender or age of the participants.

The sample size is small; a larger sample would be more representative and less prone to bias.

Used packages

library(arsenal)
library(readr)
library(ggplot2)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ stringr   1.5.1
## ✔ forcats   1.0.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()      masks stats::filter()
## ✖ lubridate::is.Date() masks arsenal::is.Date()
## ✖ dplyr::lag()         masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(pandoc)

3. Process

3.1 Verifying the Data

For the case study we’re using all the daily logs from activity, calories, intensities, steps and sleep, and also weightLog.

A quick check of the datasets shows that dailyActivity already includes the data from dailyCalories, dailyIntensities and dailySteps. To make sure nothing is missing, we compare the datasets using the “comparedf” function from the arsenal package.

daily_activity <- read_csv("dailyActivity_merged.csv")

daily_calories <- read_csv("dailyCalories_merged.csv")

comparedf(daily_activity, daily_calories)

## Compare Object
## 
## Function Call: 
## comparedf(x = daily_activity, y = daily_calories)
## 
## Shared: 2 non-by variables and 940 observations.
## Not shared: 14 variables and 0 observations.
## 
## Differences found in 0/2 variables compared.
## 0 variables compared have non-identical attributes.

As we can see, there is no difference in the columns the two datasets have in common. The same is true for all the others, which makes merging them unnecessary. With this in mind, we only add Sleep and Weight.
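The output shown above covers only dailyCalories. As a rough sketch of the same idea without arsenal, the columns two tables share can be compared row by row in base R. The helper name shared_columns_match and the two small data frames are made up for illustration; in this report the inputs would be daily_activity and each of the other daily files, and the check assumes both tables list their rows in the same order.

```r
# Hypothetical helper: TRUE when the columns present in both tables hold
# the same values, compared row by row in the order given.
shared_columns_match <- function(x, y) {
  shared <- intersect(names(x), names(y))
  isTRUE(all.equal(x[shared], y[shared]))
}

# Illustrative stand-ins for a wide table and a narrower table derived from it.
wide_demo <- data.frame(
  Id = c(1, 1, 2),
  TotalSteps = c(100, 200, 300),
  Calories = c(1800, 1900, 2000)
)
narrow_demo <- data.frame(
  Id = c(1, 1, 2),
  Calories = c(1800, 1900, 2000)
)

# Same table with one value changed, to show a failing comparison.
changed_demo <- narrow_demo
changed_demo$Calories[2] <- 1950

print(shared_columns_match(wide_demo, narrow_demo))
print(shared_columns_match(wide_demo, changed_demo))
```

Wrapping the check in a function makes it easy to repeat for each file instead of copying the comparison several times.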

daily_sleep <- read_csv("sleepDay_merged.csv")
weight <- read_csv("weightLogInfo_merged.csv")

Then we check how many different ids we have

## [1] 33

## [1] 24

## [1] 8
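The code that produced the three counts above is not shown in the rendered report. A minimal sketch of one way to get such counts, assuming they are the number of distinct Id values in each data frame, uses n_distinct() from dplyr. The two small data frames below are made up so the example runs on its own.

```r
library(dplyr)

# Made-up stand-ins for the report's data frames; only the Id column matters here.
activity_demo <- data.frame(Id = c(1, 1, 2, 3), TotalSteps = c(100, 200, 300, 400))
sleep_demo <- data.frame(Id = c(1, 2, 2), TotalMinutesAsleep = c(400, 420, 380))

# n_distinct() counts the unique values in a vector.
n_activity <- n_distinct(activity_demo$Id)
n_sleep <- n_distinct(sleep_demo$Id)

print(n_activity)
print(n_sleep)
```

On the report's data the equivalent calls would be n_distinct(daily_activity$Id), n_distinct(daily_sleep$Id) and n_distinct(weight$Id).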

3.2 Cleaning the data

With only eight unique IDs, Weight doesn’t have enough data to be meaningful in this case study, so we are removing it.

After checking the unique IDs, we can also check for duplicates

sum(duplicated(daily_activity))

## [1] 0

sum(duplicated(daily_sleep))

## [1] 3

Duplicates were found in daily_sleep, so we are removing them

daily_sleep <- unique(daily_sleep)

Before merging the datasets, we can see that the date fields are not stored as dates. To avoid problems later, we convert them to the Date type.

daily_activity <- daily_activity %>% 
  rename(date = ActivityDate) %>%
  mutate(date = as.Date(date , format= "%m/%d/%Y"))
daily_sleep <- daily_sleep %>%
  rename(date = SleepDay) %>%
  mutate(date = as.Date(date , format= "%m/%d/%Y"))
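The same format string is used for both columns. A small sketch of why that can work, assuming the sleep column stores a time of day after the date (the sample strings below are illustrative, not copied from the files): as.Date() reads only as much of the string as the format needs and ignores trailing characters.

```r
# Illustrative strings in the two shapes assumed above.
activity_date <- "4/12/2016"
sleep_day <- "4/12/2016 12:00:00 AM"

parsed_activity <- as.Date(activity_date, format = "%m/%d/%Y")
parsed_sleep <- as.Date(sleep_day, format = "%m/%d/%Y")

# Both results are Date objects for the same calendar day,
# so they can be used together as a merge key.
print(class(parsed_sleep))
print(parsed_activity == parsed_sleep)
```

If the time of day mattered, a date-time parser such as lubridate's mdy_hms() would be the better choice; here only the calendar day is needed.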

The daily_activity dataset has some columns that are not needed for this study, so we remove them before merging.

daily_activity <- daily_activity %>% select(-c(TrackerDistance, LoggedActivitiesDistance))

And then, finally merge the datasets

tracker_data <- merge(daily_activity, daily_sleep, by = c("Id",  "date") )
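By default merge() keeps only the rows whose key appears in both tables (an inner join), so activity days without a sleep record are dropped from tracker_data. The sketch below uses made-up data to show the difference between the default and all.x = TRUE, which keeps every activity row.

```r
# Illustrative data: three activity rows, but only two have a matching sleep row.
activity_demo <- data.frame(
  Id = c(1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  TotalSteps = c(8000, 12000, 3000)
)
sleep_demo <- data.frame(
  Id = c(1, 2),
  date = as.Date(c("2016-04-12", "2016-04-12")),
  TotalMinutesAsleep = c(430, 390)
)

# Default: inner join, unmatched activity rows are dropped.
inner <- merge(activity_demo, sleep_demo, by = c("Id", "date"))

# all.x = TRUE: left join, unmatched activity rows are kept with NA sleep values.
left <- merge(activity_demo, sleep_demo, by = c("Id", "date"), all.x = TRUE)

print(nrow(inner))
print(nrow(left))
print(sum(is.na(left$TotalMinutesAsleep)))
```

The inner join suits this study because every later step needs both activity and sleep values, but it does mean the activity charts describe only the days on which sleep was also recorded.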

4. Analyze

Now that our table is clean, properly merged and contains all the data we need, we can continue to the next step.

4.1 Profiling Data

One way to organize the data for interpretation is to create user profiles: grouping users by some of their data makes them easier to work with. In this study the important factors are the user’s activity, the number of calories burned and their sleep pattern.

Number of steps:

A popular way to measure activity is to count daily steps, an approach used especially by pedometers. We use the following classification:

tracker_data <- tracker_data %>%
  mutate(UserType = case_when(
    # Upper bounds are exclusive so that 7499 and 9999 steps are classified too
    TotalSteps < 5000 ~ "Sedentary",
    TotalSteps >= 5000 & TotalSteps < 7500 ~ "Normal",
    TotalSteps >= 7500 & TotalSteps < 10000 ~ "Somewhat active",
    TotalSteps >= 10000 ~ "Active"))

Source for the classification can be found in this link

ggplot(data=tracker_data)+geom_bar(mapping=aes(x=UserType, fill=UserType))
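A sketch of the same step bands written with cut(), using half-open intervals so that every step count, including boundary values such as 7499 and 9999, lands in exactly one band. The function name classify_steps is hypothetical and not part of the report.

```r
# Hypothetical helper: map daily step counts to the four activity bands.
classify_steps <- function(steps) {
  cut(
    steps,
    breaks = c(-Inf, 5000, 7500, 10000, Inf),
    labels = c("Sedentary", "Normal", "Somewhat active", "Active"),
    right = FALSE  # intervals are [lower, upper), so 5000 falls in "Normal"
  )
}

# Check the values on either side of each boundary.
boundary_steps <- c(0, 4999, 5000, 7499, 7500, 9999, 10000)
result <- data.frame(
  steps = boundary_steps,
  UserType = classify_steps(boundary_steps)
)
print(result)
```

cut() returns a factor whose levels follow the order of the labels, so a bar chart of the result shows the bands from least to most active instead of alphabetically.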

Minutes in high/moderate physical activity:

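Another common way to measure physical activity is the number of minutes spent at a higher intensity. Here we add the minutes of very active and fairly active activity and compare them with the 150 minutes of moderate-intensity activity recommended by the CDC. Note that the CDC target is per week, while the classification below applies 150 minutes to each daily record, which is a much stricter bar.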

tracker_data <- tracker_data %>%
  mutate(ActivityMinutes = case_when(
    # The two conditions share the 150 boundary so no value is left unclassified
    (VeryActiveMinutes + FairlyActiveMinutes) < 150 ~ "Less_than",
    (VeryActiveMinutes + FairlyActiveMinutes) >= 150 ~ "More_than"
  ))

ggplot(data=tracker_data)+geom_bar(mapping=aes(x=ActivityMinutes, fill=ActivityMinutes))

As we can see, very few records reach at least 150 minutes of more intense physical activity (it happens only 11 times). Because higher-intensity activity is important, we keep this information and use it in our final considerations.
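Since the CDC guideline is stated per week, another option is to total the minutes per user and week before comparing them with 150. The sketch below is one way to do that, not the method used in this report; the data is made up, and the column names week, ActiveMinutes, DaysLogged and MeetsTarget are introduced only for illustration.

```r
library(dplyr)
library(lubridate)

# Illustrative data: one user, two weeks of daily intense-activity minutes.
demo <- data.frame(
  Id = 1,
  date = as.Date("2016-04-11") + 0:13,  # 2016-04-11 is a Monday
  VeryActiveMinutes = c(20, 30, 0, 25, 40, 10, 15, 5, 0, 10, 0, 20, 0, 5),
  FairlyActiveMinutes = c(10, 10, 5, 0, 20, 0, 10, 5, 5, 0, 10, 0, 0, 0)
)

weekly <- demo %>%
  # Label each day with the Monday that starts its week.
  mutate(week = floor_date(date, unit = "week", week_start = 1)) %>%
  group_by(Id, week) %>%
  summarise(
    ActiveMinutes = sum(VeryActiveMinutes + FairlyActiveMinutes),
    DaysLogged = n(),
    .groups = "drop"
  ) %>%
  mutate(MeetsTarget = ActiveMinutes >= 150)

print(weekly)
```

DaysLogged is kept because a week with missing days cannot be compared fairly with a full week; in tracker_data, days without a sleep record are absent, so weekly totals built from it may undercount.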

Sleep Quality:

Finally, we can create a profile based on the sleep data, since good sleep quality is essential for a balanced and healthy lifestyle.

tracker_data <- tracker_data %>%
  mutate(SleepQuality = case_when (
    
     TotalMinutesAsleep >= 480 ~ "Fully Rested",
     TotalMinutesAsleep >= 420 ~ "Rested",
     TotalMinutesAsleep >= 360 ~ "Poorly Rested",
     TotalMinutesAsleep < 360  ~ "Not Rested"
  ) )

ggplot(data=tracker_data)+geom_bar(mapping=aes(x=SleepQuality, fill=SleepQuality))

The ideal amount of sleep varies from person to person. For the classification we use the average values, where 7-9 hours is the ideal amount of sleep. More information can be found in this link.

4.2 Data Correlations

With those profiles in mind we can start exploring the relationships between data

Total Steps x Daily Calories

ggplot(data=tracker_data)+ geom_point(mapping=aes(x=TotalSteps, y=Calories, color=UserType))

As expected, users who walk more tend to burn more calories, which suggests that even lighter activities contribute to the user’s health and daily calorie expenditure.
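The strength of this relationship can be summarized with a correlation coefficient in addition to the scatter plot. The sketch below uses made-up numbers, not values from the Fitbit data; on the report's data the call would be cor(tracker_data$TotalSteps, tracker_data$Calories).

```r
# Illustrative values only; they are not taken from the Fitbit dataset.
demo <- data.frame(
  TotalSteps = c(2000, 4000, 6000, 8000, 10000, 12000),
  Calories = c(1700, 1850, 2100, 2050, 2400, 2600)
)

# Pearson correlation: values near +1 indicate a strong increasing linear
# relationship, values near 0 indicate little or no linear relationship.
r <- cor(demo$TotalSteps, demo$Calories, method = "pearson")
print(round(r, 2))
```

A correlation describes how closely the points follow a line; it does not show that walking alone causes the difference in calories burned.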

Sleep Quality and User Activity

ggplot(data=tracker_data)+ geom_bar(mapping=aes(x=SleepQuality, fill=UserType))

There doesn’t appear to be a clear relationship between sleep quality and activity. One might expect more active users to get more tired and therefore sleep better, but there are clearly more factors involved than physical activity alone.
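Because the sleep categories contain different numbers of records, raw bar heights can hide or exaggerate a pattern. One way to compare them on equal terms is to compute the share of each user type within each sleep category. The sketch below uses made-up rows with the same column names as tracker_data; the share column is introduced for illustration.

```r
library(dplyr)

# Illustrative rows only; the real input would be tracker_data.
demo <- data.frame(
  SleepQuality = c("Rested", "Rested", "Rested", "Rested",
                   "Not Rested", "Not Rested"),
  UserType = c("Active", "Active", "Sedentary", "Normal",
               "Active", "Sedentary")
)

shares <- demo %>%
  count(SleepQuality, UserType) %>%   # records per combination
  group_by(SleepQuality) %>%
  mutate(share = n / sum(n)) %>%      # fraction within each sleep category
  ungroup()

print(shares)
```

In ggplot2 the same comparison can be drawn by passing position = "fill" to geom_bar(), which scales every bar to the same height.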

Sleep Quality and Time Spent on the bed

ggplot(data=tracker_data)+ geom_point(mapping=aes(x=TotalMinutesAsleep, y=TotalTimeInBed,  color=SleepQuality))

There aren’t many oddities in this scatter plot: it is roughly linear, with only a few outliers. This suggests that most users don’t take very long to fall asleep once they get to bed.
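The gap between time in bed and time asleep can also be computed directly instead of read off the plot. The sketch below uses made-up values; MinutesAwakeInBed and SleepEfficiency are hypothetical column names, not columns of the dataset.

```r
library(dplyr)

# Illustrative values only; the real input would be tracker_data.
demo <- data.frame(
  TotalMinutesAsleep = c(420, 360, 480, 300),
  TotalTimeInBed = c(440, 375, 500, 420)
)

demo <- demo %>%
  mutate(
    # Minutes spent in bed but not asleep.
    MinutesAwakeInBed = TotalTimeInBed - TotalMinutesAsleep,
    # Fraction of the time in bed that was spent asleep.
    SleepEfficiency = TotalMinutesAsleep / TotalTimeInBed
  )

print(demo)
print(median(demo$MinutesAwakeInBed))
```

The difference covers all time awake in bed, including waking during the night or staying in bed in the morning, so it is an upper bound on the time taken to fall asleep. The median is used because a few long gaps, like the last row here, would pull a mean upward.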

5. Share

Based on the data we found we can start to think of ways to improve the user experience:

Although the number of active entries is comparatively high, it is still low considering that 10,000 steps is the most commonly advised daily target for a healthier lifestyle. This is further reflected in the very low number of records that reach 150 minutes of higher-intensity activity. The app could help users reach these targets by creating goals and weekly achievements, which are known to motivate users. An option to share results would further motivate users by letting them show what they achieved, and would also help others discover the Bellabeat app and products.
The comparison of sleep quality and time spent in bed shows that not many users appear to suffer from insomnia, yet many users don’t sleep enough per day. The app could help by offering notifications before a certain time, either set by the user or chosen automatically based on the sleep data collected by the app.
Unfortunately, the study couldn’t provide insights focused on the target audience of female users. Bellabeat already works in this area and has data on how menstrual cycles affect female users, as well as on preferred activities and diets that may help the target audience. Combining that data with our study of activity and sleep may prove useful to the target audience.
