Google Analytics Capstone Project: Bellabeat Case Study!

2022-03-18

Brandon Stephens

About Bellabeat
Analysis Objectives
Business Task
Environment Setup
- Importing Datasets
Data Clean and Preparation
- Data Summarization
Data Visualization
- Searching for Correlations
Summary of Key Findings
Recommendations
References

About Bellabeat

Here at Bellabeat, women’s health is our passion. Bellabeat is a high-tech company that manufactures health-focused smart products worldwide. Urška Sršen and Sando Mur founded Bellabeat in 2013, with the intent to develop beautifully designed technology that informs and inspires women around the world.

Analysis Objectives

What are some trends in smart device usage?
How could these trends apply to Bellabeat customers?
How could these trends help influence Bellabeat marketing strategy?

Business Task

Use the Fitbit Fitness Tracker dataset to identify potential growth opportunities and make analysis-based recommendations to our marketing operations team.

Environment Setup

install.packages("tidyverse")
install.packages("lubridate")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("tidyr")
install.packages("viridisLite")
install.packages("scales")
install.packages("devtools")
devtools::install_github("hadley/devtools")
remotes::install_github("gadenbuie/cleanrmd")
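Calling install.packages() on every run re-downloads packages that are already present. A minimal sketch of one way to find out which packages are actually missing first; `missing_packages()` is a hypothetical helper introduced here for illustration, not part of the original analysis:

```r
# Hypothetical helper: return the subset of `pkgs` that is not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("tidyverse", "lubridate", "viridisLite", "scales")
to_install <- missing_packages(needed)

# Uncomment to install only what is missing.
# if (length(to_install) > 0) install.packages(to_install)
```

The install call is left commented out so the check itself has no side effects beyond loading namespaces that are already installed.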

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(dplyr)
library(ggplot2)
library(tidyr)
library(viridisLite)
library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

library(devtools)

## Loading required package: usethis

Importing Datasets

For our analysis, only the following CSV files will be necessary: dailyActivity, hourlyCalories, hourlyIntensities, sleepDay, weightLogInfo

activity <- read.csv(file = "Bellabeat Case Study/dailyActivity_merged.csv",header = TRUE, sep = ",")
hourly_calories <- read.csv(file = "Bellabeat Case Study/hourlyCalories_merged.csv",header = TRUE, sep = ",")
hourly_intensities <- read.csv(file = "Bellabeat Case Study/hourlyIntensities_merged.csv",header = TRUE, sep = ",")
sleep <- read.csv(file = "Bellabeat Case Study/sleepDay_merged.csv",header = TRUE, sep = ",")
weightlog <- read.csv(file = "Bellabeat Case Study/weightLogInfo_merged.csv",header = TRUE, sep = ",")

Data Clean and Preparation

Now that our data is loaded, we will check how many distinct users each dataset contains.

n_distinct(activity$Id)

## [1] 33

n_distinct(hourly_calories$Id)

## [1] 33

n_distinct(hourly_intensities$Id)

## [1] 33

n_distinct(sleep$Id)

## [1] 24

n_distinct(weightlog$Id)

## [1] 8

Only 8 of our 33 users logged weight data, so the weightlog dataset is too small a sample to use in this analysis.
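The user counts above can be turned into coverage shares to make the comparison explicit. A small sketch using only the counts reported by n_distinct(); the variable names are chosen here for illustration:

```r
# Distinct-user counts reported above for each dataset.
users <- c(activity = 33, hourly_calories = 33, hourly_intensities = 33,
           sleep = 24, weightlog = 8)

# Share of the 33-user population that each dataset covers.
coverage <- round(users / max(users), 2)
print(coverage)
```

The weight log covers roughly a quarter of the population, while the sleep data covers close to three quarters.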

Now we want to check for duplicates.

sum(duplicated(sleep))

## [1] 3

sum(duplicated(activity))

## [1] 0

sum(duplicated(hourly_intensities))

## [1] 0

sum(duplicated(hourly_calories))

## [1] 0

We see that the sleep data contains duplicates and needs to be cleaned.

sleep <- unique(sleep)
sum(duplicated(sleep))

## [1] 0
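For readers unfamiliar with these two functions, here is a minimal sketch on a toy table. The values are made up for illustration and do not come from the dataset:

```r
# Toy data (hypothetical values) with one exact duplicate row.
toy_sleep <- data.frame(
  id = c(1, 1, 2),
  sleepday = c("4/12/2016", "4/12/2016", "4/12/2016"),
  totalminutesasleep = c(327, 327, 412)
)

n_dupes <- sum(duplicated(toy_sleep))  # rows that repeat an earlier row
toy_sleep <- unique(toy_sleep)         # keep the first copy of each row
nrow(toy_sleep)
```

duplicated() flags only the second and later copies of a row, which is why unique() keeps exactly one copy of each.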

Now that we have cleaned our data, we will standardize the data’s column names.

activity <- rename_with(activity, tolower)
sleep <- rename_with(sleep, tolower)
hourly_calories <- rename_with(hourly_calories, tolower)
hourly_intensities <- rename_with(hourly_intensities, tolower)

Next we want to standardize our date and time format throughout our datasets.

activity <- activity %>% 
  rename(date= activitydate) %>% 
  mutate(date= as_date(date, format= "%m/%d/%Y"))
  
sleep <- sleep %>%
  rename(date= sleepday) %>%
  mutate(date= as_date(date, format= "%m/%d/%Y %I:%M:%S %p"))

hourly_intensities <- hourly_intensities %>% 
  rename(date_time= activityhour) %>% 
  mutate(date_time= as.POSIXct(date_time, format="%m/%d/%Y %I:%M:%S %p", tz= Sys.timezone()))

hourly_calories <- hourly_calories %>% 
  rename(date_time= activityhour) %>% 
  mutate(date_time= as.POSIXct(date_time, format="%m/%d/%Y %I:%M:%S %p", tz= Sys.timezone()))
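To see what the format string does, here is a short base R sketch on two sample strings written in the same layout as the hourly files. The fixed "UTC" time zone is an assumption made for reproducibility; the report itself uses Sys.timezone():

```r
# Sample strings in the month/day/year 12-hour layout used by the hourly files.
raw <- c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 PM")

# %I is the 12-hour clock hour and %p the AM/PM marker.
parsed <- as.POSIXct(raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")
format(parsed, "%Y-%m-%d %H:%M:%S")

# A string that does not match the format becomes NA rather than an error.
bad <- as.POSIXct("2016-04-12 13:00", format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")
is.na(bad)
```

Because mismatched strings silently become NA, counting NA values after conversion is a cheap sanity check.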

Now that our data is consistent, we will merge our data.

hourly_calories_intensities <- merge(x = hourly_calories, y = hourly_intensities, by = c("id","date_time"))
activity_sleep <- merge( x = activity, y = sleep, by = c("id", "date"))
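Note that merge() defaults to all = FALSE, which behaves like an inner join: activity_sleep keeps only the user-days present in both tables, so users without sleep records drop out. A sketch on toy tables (hypothetical values) showing the difference from a left join:

```r
# Toy tables (hypothetical values): user 3 has activity but no sleep record.
toy_activity <- data.frame(id = c(1, 2, 3),
                           date = as.Date("2016-04-12"),
                           totalsteps = c(13162, 10735, 4200))
toy_sleep <- data.frame(id = c(1, 2),
                        date = as.Date("2016-04-12"),
                        totalminutesasleep = c(327, 412))

inner <- merge(toy_activity, toy_sleep, by = c("id", "date"))
left  <- merge(toy_activity, toy_sleep, by = c("id", "date"), all.x = TRUE)

nrow(inner)  # rows with both activity and sleep
nrow(left)   # all activity rows; sleep is NA where missing
```

Which join is appropriate depends on the question: the inner join suits sleep comparisons, while a left join would keep every user for step-only summaries.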

Data Summarization

Next, let us get a better understanding of our data. We will do this by summarizing our datasets.

activity %>% 
  select(totalsteps, totaldistance,calories) %>% 
  summary()

##    totalsteps    totaldistance       calories   
##  Min.   :    0   Min.   : 0.000   Min.   :   0  
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.:1828  
##  Median : 7406   Median : 5.245   Median :2134  
##  Mean   : 7638   Mean   : 5.490   Mean   :2304  
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:2793  
##  Max.   :36019   Max.   :28.030   Max.   :4900

activity %>% 
  select(veryactiveminutes, fairlyactiveminutes, lightlyactiveminutes, sedentaryminutes) %>% 
  summary()

##  veryactiveminutes fairlyactiveminutes lightlyactiveminutes sedentaryminutes
##  Min.   :  0.00    Min.   :  0.00      Min.   :  0.0        Min.   :   0.0  
##  1st Qu.:  0.00    1st Qu.:  0.00      1st Qu.:127.0        1st Qu.: 729.8  
##  Median :  4.00    Median :  6.00      Median :199.0        Median :1057.5  
##  Mean   : 21.16    Mean   : 13.56      Mean   :192.8        Mean   : 991.2  
##  3rd Qu.: 32.00    3rd Qu.: 19.00      3rd Qu.:264.0        3rd Qu.:1229.5  
##  Max.   :210.00    Max.   :143.00      Max.   :518.0        Max.   :1440.0

sleep %>% 
  select(totalminutesasleep) %>% 
  summary()

##  totalminutesasleep
##  Min.   : 58.0     
##  1st Qu.:361.0     
##  Median :432.5     
##  Mean   :419.2     
##  3rd Qu.:490.0     
##  Max.   :796.0

hourly_calories_intensities %>% 
  select(totalintensity, averageintensity, calories) %>% 
  summary()

##  totalintensity   averageintensity    calories     
##  Min.   :  0.00   Min.   :0.0000   Min.   : 42.00  
##  1st Qu.:  0.00   1st Qu.:0.0000   1st Qu.: 63.00  
##  Median :  3.00   Median :0.0500   Median : 83.00  
##  Mean   : 12.04   Mean   :0.2006   Mean   : 97.39  
##  3rd Qu.: 16.00   3rd Qu.:0.2667   3rd Qu.:108.00  
##  Max.   :180.00   Max.   :3.0000   Max.   :948.00

Key Findings:

The population has an average daily step count of 7,638, which is low compared to the CDC-recommended 10,000 steps.
The average daily distance traveled was 5.49 miles.
Users averaged 34.72 very active and fairly active minutes per day (0.58 hours), compared with 1,184 lightly active and sedentary minutes (19.73 hours).
Users slept an average of 419.2 minutes (6.99 hours), just under the CDC-recommended 7 hours for adults aged 18-60.
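The minute-to-hour conversions can be checked directly from the means in the summary() output. A short sketch; the variable names are chosen here for illustration:

```r
# Means taken from the summary() output above.
very_active    <- 21.16
fairly_active  <- 13.56
lightly_active <- 192.8
sedentary      <- 991.2
asleep         <- 419.2

active_minutes   <- very_active + fairly_active   # 34.72
inactive_minutes <- lightly_active + sedentary    # 1184

hours <- round(c(active_hours   = active_minutes / 60,
                 inactive_hours = inactive_minutes / 60,
                 sleep_hours    = asleep / 60), 2)
hours
```

Rounded to two decimals, this gives 0.58 active hours, 19.73 lightly active and sedentary hours, and 6.99 hours of sleep.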

We need to associate the day of the week with our data, so we will add a weekday column to our hourly data.

We will also separate the time from the date column.

hourly_calories_intensities <- hourly_calories_intensities %>% 
  separate(date_time, into= c('date', 'time'), sep= c(' ')) %>% 
  mutate(date= ymd (date))
  
hourly_calories_intensities$weekday <- weekdays(hourly_calories_intensities$date)
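An alternative way to split a date-time is to format each part explicitly, which does not depend on how the timestamp happens to print. This is a base R sketch on sample timestamps, not the method used in the report, and the "UTC" time zone is an assumption:

```r
# Sample timestamps, including midnight.
stamps <- as.POSIXct(c("2016-04-12 00:00:00", "2016-04-12 13:00:00"), tz = "UTC")

parts <- data.frame(
  date = as.Date(stamps, tz = "UTC"),
  time = format(stamps, "%H:%M:%S")  # always HH:MM:SS, including midnight
)
parts$weekday <- weekdays(parts$date)  # day names follow the session locale
parts
```

Since weekdays() returns names in the session's locale, the English level names used later assume an English locale.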

Average the intensity for each weekday and hour.

hourly_calories_intensities_day_time <- (hourly_calories_intensities) %>% 
  group_by(weekday, time)%>% 
  summarize(mean_avg_intensity= mean(averageintensity, na.rm = TRUE))

## `summarise()` has grouped output by 'weekday'. You can override using the
## `.groups` argument.

Organize with Monday as starting weekday.

hourly_calories_intensities_day_time$weekday <- factor(hourly_calories_intensities_day_time$weekday, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
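Without explicit levels, day names sort alphabetically, which scrambles the axis of a weekday plot. A minimal illustration of what setting the factor levels changes, using a few sample day names:

```r
day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")

days <- c("Sunday", "Wednesday", "Monday")

alphabetical <- sort(days)                               # character sort
calendar     <- sort(factor(days, levels = day_levels))  # follows day_levels
as.character(calendar)
```

ggplot2 orders a discrete axis by factor levels, so the same idea is what puts Monday first in the heatmap.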

Data Visualization

Now that our data is cleaned and prepared, we will begin visualizing our data in order to derive correlations and important findings.

We will begin by examining the relationship between the day of the week, time and user intensity output.

ggplot(hourly_calories_intensities_day_time, aes(time, weekday))+
  theme(axis.text.x= element_text(angle = 90))+
  labs(title= "Daily Intensity Output", x = " ", y = " ", fill = "Average Intensity Output", caption = 'Data Source: Fitabase Data 4.12.16-5.12.16')+
  geom_tile(color = "black", aes(fill = mean_avg_intensity))+
  scale_fill_gradient(low= "grey", high= "deeppink4")+
  theme(plot.title = element_text(hjust = 0.5, size = 16))

Key Findings:

The data shows that our population becomes active earlier on weekdays than on weekends. On average, intensity is also higher later in the day than early in the day. Users are most active on Saturday between 11:00am and 2:00pm, and on Wednesday between 5:00pm and 6:00pm.

Searching for Correlations

Next we will analyze user activity level and search for correlations.

We will categorize users' activity levels according to the NIH-sponsored, peer-reviewed article “Physical activity for campus employees: a university worksite wellness program”. The article uses average step count to categorize activity level as follows: sedentary (< 5000 steps/day), low active (5000–7499), somewhat active (7500–9999), active (10,000–12,499), or highly active (≥ 12,500).

Note: Because of our small sample size, we will treat highly active and active users as one group.

# Average each user's daily metrics and assign a step-based activity level.
activity_sleep_grouped <- activity_sleep %>% 
  group_by(id) %>% 
  summarize(average_totalsteps = mean(totalsteps),
            average_totalcalories = mean(calories),
            average_totaldistance = mean(totaldistance),
            average_minutesasleep = mean(totalminutesasleep, na.rm = TRUE)) %>% 
  mutate(user_steps = case_when(
            average_totalsteps >= 10000 ~ "Highly Active/Active",
            average_totalsteps >= 7500 & average_totalsteps < 10000 ~ "Somewhat Active",
            average_totalsteps >= 5000 & average_totalsteps < 7500 ~ "Low Active",
            average_totalsteps < 5000 ~ "Sedentary"))

# Attach each user's averages and activity level to their daily records.
activity_sleep_grouped <- merge(activity_sleep, activity_sleep_grouped, by= c("id"))

# Order the levels from least to most active for plotting.
activity_sleep_grouped$user_steps <- factor(activity_sleep_grouped$user_steps, levels = c("Sedentary", "Low Active", "Somewhat Active", "Highly Active/Active"))
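The same step thresholds can also be expressed with base R's cut(), which makes the boundary handling explicit. A sketch, assuming a hypothetical helper `classify_steps()` that is not part of the original analysis:

```r
# Hypothetical helper: map average daily steps to the four groups used above.
classify_steps <- function(avg_steps) {
  cut(avg_steps,
      breaks = c(-Inf, 5000, 7500, 10000, Inf),
      labels = c("Sedentary", "Low Active", "Somewhat Active",
                 "Highly Active/Active"),
      right = FALSE)  # intervals are [lower, upper), so 7500 is "Somewhat Active"
}

levels_out <- classify_steps(c(4999, 5000, 7638, 10000))
as.character(levels_out)
```

cut() returns a factor whose levels are already in ascending order, so no separate factor() call is needed for plotting.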

Activity level vs daily sleep minutes

ggplot(activity_sleep_grouped, aes(user_steps, totalminutesasleep))+
  geom_boxplot(aes(fill= user_steps))+
  geom_point(alpha = 0.5, aes(size = calories, color = calories))+
  labs(title = "Activity Level vs Daily Sleep Minutes", x = "Activity Level", y = "Daily Sleep Minutes", fill= "Activity Level", color= "Daily Calories Burned", caption= "Data Source: 
Physical activity for campus employees: a university worksite wellness program")+
  coord_flip()+
  scale_fill_brewer(palette="PiYG")+
  scale_color_gradient(low= "grey2", high= "red")+
  theme_bw()+
  theme(plot.title = element_text(hjust = 0.5, size = 16))+
  theme(plot.caption = element_text(hjust = 1.75))+
  guides(size = "none",fill ="none")

Key Insights:

The plot shows no clear relationship between activity level and daily sleep minutes.
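This reading is visual; a rank correlation is one way to put a number on it. A minimal sketch using made-up per-user averages (not values from the dataset), with Spearman's method chosen here as an assumption because the activity groups are ordinal:

```r
# Hypothetical per-user averages; substitute the real columns to reproduce.
avg_steps <- c(3000, 6000, 8000, 11000, 12000)
avg_sleep <- c(420, 400, 450, 410, 430)

rho <- cor(avg_steps, avg_sleep, method = "spearman")
rho  # values near 0 indicate little monotonic association
```

With only a few dozen users, any such coefficient should be read as descriptive rather than as a formal test.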

Activity level vs total daily steps.

ggplot(activity_sleep_grouped, aes(user_steps, totalsteps))+
  geom_boxplot(aes(fill= user_steps))+
  geom_point(alpha = 0.5, aes(size = calories, color = calories))+
  labs(title = "Activity Level vs Daily Steps", x = "Activity Level", y = "Daily Steps", fill= "Activity Level", size= "", color= "Daily Calories Burned", caption= "Data Source: Physical activity for campus employees: a university worksite wellness program")+
  coord_flip()+
  scale_fill_brewer(palette="PiYG")+
  scale_color_gradient(low= "grey2", high= "red")+
  theme_bw()+
  theme(plot.title = element_text(hjust = 0.5, size = 16))+
  theme(plot.caption = element_text(hjust = 1.75))+
  guides(size = "none",fill ="none")

Key Insights:

Because activity level is defined by average steps, the maximum number of daily steps rises with activity level.
Users in the more active groups also show a wider spread of daily steps.
Users in the more active groups burn more calories per step, with somewhat active users showing the highest burn.

Activity level vs total calories burned.

ggplot(activity_sleep_grouped, aes(user_steps, average_totalcalories))+
  geom_boxplot(aes(fill= user_steps))+
  geom_point(alpha = 0.5, aes(size = average_totalcalories, color = average_totalcalories))+
  labs(title = "Activity Level vs Total Calories Lost", x = "Activity Level", y = "Total Calories Lost", fill= "Activity Level", size= "", color= "Total Calories", caption= "Data Source: Physical activity for campus employees: a university worksite wellness program")+
  coord_flip()+
  scale_fill_brewer(palette="PiYG")+
  scale_color_gradient(low= "grey2", high= "red")+
  theme_bw()+
  theme(plot.title = element_text(hjust = 0.5, size = 16))+
  theme(plot.caption = element_text(hjust = 1.75))+
  guides(size = "none",fill ="none")

Activity level vs total distance traveled.

ggplot(activity_sleep_grouped, aes(x= user_steps, y= average_totaldistance))+
  geom_point(alpha = 0.5, aes(size = average_totalcalories, color = average_totalcalories))+
  geom_segment(aes(x= user_steps,
                    xend= user_steps,
                    y= min(average_totaldistance),
                    yend= max(average_totaldistance)),
                linetype= "dashed",
                size= 0.1)+ 
  labs(title = "Activity Level vs Distance traveled", x= "Activity Level", y= "Miles", size= "", color= "Total Calories", caption= 'Data Source: Fitabase Data 4.12.16-5.12.16')+
  coord_flip()+
  scale_color_gradient(low= "grey2", high= "red")+
  theme_classic()+  # style this plot directly; theme_set() changes the global default and returns the previous theme
  theme(plot.title = element_text(hjust = 0.5, size = 16))+
  theme(plot.caption = element_text(hjust = 1.75))+
  guides(size = "none")

Key Insights:

Distance traveled generally increases with activity level.
Based on total calories burned, traveling at least 5 miles per day puts users, on average, at a higher caloric burn rate.

Next we want to find out how much users are wearing their Fitbits. We can pull this information from our users' hourly data.

hourly_usage <- hourly_calories_intensities %>%
    group_by(date) %>%
    summarize(user_usage_hr = n()/33)
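Each row of the hourly table is one user-hour, so dividing a day's row count by the number of users gives the average hours recorded per user. A toy sketch (hypothetical data with two users) showing how one user who stops recording pulls the average down:

```r
# Toy hourly log (hypothetical): user 1 records 24 hours, user 2 only 12.
toy_hours <- data.frame(
  id   = c(rep(1, 24), rep(2, 12)),
  date = as.Date("2016-04-12")
)
n_users <- length(unique(toy_hours$id))

# Rows per day divided by the number of users = average recorded hours per user.
usage <- tapply(toy_hours$id, toy_hours$date, length) / n_users
usage
```

Deriving the denominator from the data, as `n_users` does here, avoids hard-coding the user count.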

User device usage.

ggplot(hourly_usage, aes(date, user_usage_hr, fill= user_usage_hr)) +
    geom_bar(stat= "identity", width= .7) +
    geom_rect(aes(xmin = as.Date('2016-04-29'), ymin = 0, 
                 xmax = as.Date('2016-05-12'), ymax = 22.69),
             fill = "red", alpha= .01)+
  labs(title = "Daily Usage", x= "", y= "Hours", caption= 'Data Source: Fitabase Data 4.12.16-5.12.16')+
  theme_classic()+  # style this plot directly; theme_set() changes the global default and returns the previous theme
  theme(plot.title = element_text(hjust = 0.5, size = 16))+
  theme(legend.position = "none")+
    scale_fill_gradient(low= "lightslategrey", high= "lightslategrey") +
    xlab("") +
  scale_x_date(date_breaks= ("1 day"), 
             labels= date_format ("%b-%d")) +
  scale_y_continuous(limits= c(0,24),
                     breaks= seq(0,max(hourly_usage$user_usage_hr),by=2))+
    theme(axis.text.x= element_text(angle = 60, hjust= 1))

Key Insights:

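Usage holds steady between 23 and 24 hours for the first 17 days. Starting on day 18, usage decreases, and by the end date average daily usage is down to 8 hours. Because each day's total is divided by all 33 users, this decline can reflect users who stopped recording entirely as well as users wearing the device for fewer hours.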

Summary of Key Findings

Users' average daily step count is 7,638, which is low compared to the CDC-recommended 10,000 steps.
Users averaged 34.72 very active and fairly active minutes per day (0.58 hours), compared with 1,184 lightly active and sedentary minutes (19.73 hours).
Users become active earlier on weekdays than on weekends.
Users are most active on Saturday between 11:00am and 2:00pm, and on Wednesday between 5:00pm and 6:00pm.
There is a high correlation between activity level and the maximum number of daily steps users take.
More active users burn more calories per step than less active users.
Traveling a minimum of 5 miles puts users, on average, at a higher caloric burn rate.
After day 18, usage progressively decreases, and by the end date, average daily usage was down to 8 hours.

Recommendations

To encourage users to reach the CDC-recommended 10,000 steps per day, our app should have preset milestones every 2,000 steps, with a notification telling users how many more steps they need to reach 10,000.
Our app should alert users when they have been sedentary for an extended period in one day.
For users detected as sedentary for extended periods over multiple days, our app should prompt them to consider our Bellabeat membership subscription.
To encourage activity, our app should let users opt in to notifications pushed to their phone as well as their wellness watch. These could be sent every hour to encourage them to be active.
Our app should let users set up a sleep schedule that generates notifications when it is time to sleep.
We should let users set up a plan or schedule to be active, and offer preset activity schedules they can select based on their needs.
We should alert users to their activity level:
- Send milestone alerts when users move up a level
- Push recommendations to low active users
We should push encouraging alerts to all users, summarizing their progress every 15 days, to maintain steady levels of usage.
We should alert users whenever an update is planned so they have something to look forward to, encouraging continued use of the device.
After one month, we should prompt users to fill out a survey or name one feature they would like added to the app or to one of the Bellabeat devices.

References

1. Fitabase Data 4.12.16-5.12.16 [link](https://www.kaggle.com/datasets/arashnic/fitbit)

2. Physical activity for campus employees: a university worksite wellness program [link](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4308577/)

3. CDC recommended sleep data [link](https://www.cdc.gov/sleep/about_sleep/how_much_sleep.html)

4. CDC recommended step data [link](https://www.cdc.gov/diabetes/prevention/pdf/postcurriculum_session8.pdf)