Google Analytics Capstone Project: Bellabeat Case Study!
2022-03-18
- About Bellabeat
- Analysis Objectives
- Business Task
- Environment Setup
- Data Clean and Preparation
- Data Visualization
- Summary of Key Findings
- Recommendations
- References
About Bellabeat
Here at Bellabeat, women’s health is our passion. Bellabeat is a high-tech company that manufactures health-focused smart products worldwide. Urška Sršen and Sando Mur founded Bellabeat in 2013, with the intent to develop beautifully designed technology that informs and inspires women around the world.
Analysis Objectives
What are some trends in smart device usage?
How could these trends apply to Bellabeat customers?
How could these trends help influence Bellabeat marketing strategy?
Business Task
Use the Fitbit Fitness Tracker dataset to identify potential growth opportunities and make analysis-based recommendations to our marketing operations team.
Environment Setup
install.packages("tidyverse")
install.packages("lubridate")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("tidyr")
install.packages("viridisLite")
install.packages("scales")
install.packages("devtools")
devtools::install_github("hadley/devtools")
remotes::install_github("gadenbuie/cleanrmd")
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.8
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 2.1.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(dplyr)
library(ggplot2)
library(tidyr)
library(viridisLite)
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(devtools)
## Loading required package: usethis
Importing Datasets
For our analysis, only the following CSV files will be necessary: dailyActivity, hourlyCalories, hourlyIntensities, sleepDay, weightLogInfo
activity <- read.csv(file = "Bellabeat Case Study/dailyActivity_merged.csv", header = TRUE, sep = ",")
hourly_calories <- read.csv(file = "Bellabeat Case Study/hourlyCalories_merged.csv", header = TRUE, sep = ",")
hourly_intensities <- read.csv(file = "Bellabeat Case Study/hourlyIntensities_merged.csv", header = TRUE, sep = ",")
sleep <- read.csv(file = "Bellabeat Case Study/sleepDay_merged.csv", header = TRUE, sep = ",")
weightlog <- read.csv(file = "Bellabeat Case Study/weightLogInfo_merged.csv", header = TRUE, sep = ",")
Data Clean and Preparation
Now that we have our data loaded in, we will check our population per dataset.
n_distinct(activity$Id)
## [1] 33
n_distinct(hourly_calories$Id)
## [1] 33
n_distinct(hourly_intensities$Id)
## [1] 33
n_distinct(sleep$Id)
## [1] 24
n_distinct(weightlog$Id)
## [1] 8
With only 8 of our 33 users represented, the weightlog dataset is too small a sample to be used in this analysis.
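One way to make this sample-size judgement explicit is to express each dataset's user count as a share of the 33 participants. The sketch below hard-codes the counts reported above instead of reading the CSV files, and the `coverage` name is only illustrative.

```r
# Distinct user counts as reported by n_distinct() above
users <- c(activity = 33, hourly_calories = 33,
           hourly_intensities = 33, sleep = 24, weightlog = 8)

# Percentage of the largest user population covered by each dataset
coverage <- round(100 * users / max(users), 1)
print(coverage)
```

A cutoff for "sufficient" coverage is a judgement call; the analysis here simply drops the one dataset that covers well under half of the users.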
Now we want to check for duplicates.
sum(duplicated(sleep))
## [1] 3
sum(duplicated(activity))
## [1] 0
sum(duplicated(hourly_intensities))
## [1] 0
sum(duplicated(hourly_calories))
## [1] 0
We see that the sleep data contains duplicates and needs to be cleaned.
sleep <- unique(sleep)
sum(duplicated(sleep))
## [1] 0
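To see what `duplicated()` and `unique()` do on a data frame, here is a minimal sketch on a made-up sleep log (the values are not from the Fitbit data). A row counts as a duplicate only when every column matches an earlier row.

```r
# Toy sleep log (made-up values): rows 2 and 3 are identical
sleep_toy <- data.frame(
  Id = c(1, 1, 1, 2),
  SleepDay = c("4/12/2016", "4/13/2016", "4/13/2016", "4/12/2016"),
  TotalMinutesAsleep = c(400, 380, 380, 450)
)

n_dupes <- sum(duplicated(sleep_toy))  # rows that repeat an earlier row
sleep_clean <- unique(sleep_toy)       # keeps the first copy of each row

print(n_dupes)
print(nrow(sleep_clean))
```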
Now that we have cleaned our data, we will standardize the data’s column names.
activity <- rename_with(activity, tolower)
sleep <- rename_with(sleep, tolower)
hourly_calories <- rename_with(hourly_calories, tolower)
hourly_intensities <- rename_with(hourly_intensities, tolower)
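For readers without dplyr loaded, the same lowercasing can be sketched in base R by assigning to `names()`. The toy data frame below uses made-up values and only mimics the column naming of the activity file.

```r
# Toy frame with mixed-case column names (made-up values)
toy <- data.frame(Id = 1:2,
                  ActivityDate = c("4/12/2016", "4/13/2016"),
                  TotalSteps = c(8000, 9500))

# Base R equivalent of rename_with(toy, tolower)
names(toy) <- tolower(names(toy))
print(names(toy))
```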
Next we want to standardize our date and time format throughout our datasets.
activity <- activity %>%
  rename(date = activitydate) %>%
  mutate(date = as_date(date, format = "%m/%d/%Y"))
sleep <- sleep %>%
  rename(date = sleepday) %>%
  mutate(date = as_date(date, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone()))
## Warning: `tz` argument is ignored by `as_date()`
hourly_intensities <- hourly_intensities %>%
  rename(date_time = activityhour) %>%
  mutate(date_time = as.POSIXct(date_time, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone()))
hourly_calories <- hourly_calories %>%
  rename(date_time = activityhour) %>%
  mutate(date_time = as.POSIXct(date_time, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone()))
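The format string used above reads a 12-hour clock with an AM/PM marker. The sketch below parses two made-up timestamps in that layout; it fixes the time zone to "UTC" for reproducibility (an assumption, where the analysis uses `Sys.timezone()`), and `%p` parsing assumes an English or C locale.

```r
# Two made-up timestamps in the month/day/year 12-hour layout
raw <- c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 PM")

# %I is the 12-hour clock and %p the AM/PM marker
parsed <- as.POSIXct(raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")
hours <- format(parsed, "%H")  # 24-hour representation of each timestamp

# Daily files carry only a date, so as.Date() with a shorter format is enough
day <- as.Date("4/12/2016", format = "%m/%d/%Y")

print(hours)
print(day)
```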
Now that our data is consistent, we will merge our data.
hourly_calories_intensities <- merge(x = hourly_calories, y = hourly_intensities, by = c("id", "date_time"))
activity_sleep <- merge(x = activity, y = sleep, by = c("id", "date"))
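Note that `merge()` performs an inner join by default, so days without a matching sleep record drop out of `activity_sleep`. A minimal sketch on made-up data shows the difference from a left join (`all.x = TRUE`); the toy frames and their values are illustrative only.

```r
# Made-up activity rows for three users
activity_toy <- data.frame(
  id = c(1, 1, 2, 3),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12", "2016-04-12")),
  totalsteps = c(8000, 9500, 4000, 12000)
)

# Made-up sleep rows: only two of the four user-days have a sleep record
sleep_toy <- data.frame(
  id = c(1, 2),
  date = as.Date(c("2016-04-12", "2016-04-12")),
  totalminutesasleep = c(420, 390)
)

# Default merge keeps only rows present in both frames
inner <- merge(activity_toy, sleep_toy, by = c("id", "date"))

# all.x = TRUE keeps every activity row and fills missing sleep with NA
left <- merge(activity_toy, sleep_toy, by = c("id", "date"), all.x = TRUE)

print(nrow(inner))
print(nrow(left))
```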
Data Summarization
Next, let us get a better understanding of our data. We will do this by summarizing our datasets.
activity %>%
  select(totalsteps, totaldistance, calories) %>%
  summary()
## totalsteps totaldistance calories
## Min. : 0 Min. : 0.000 Min. : 0
## 1st Qu.: 3790 1st Qu.: 2.620 1st Qu.:1828
## Median : 7406 Median : 5.245 Median :2134
## Mean : 7638 Mean : 5.490 Mean :2304
## 3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.:2793
## Max. :36019 Max. :28.030 Max. :4900
activity %>%
  select(veryactiveminutes, fairlyactiveminutes, lightlyactiveminutes, sedentaryminutes) %>%
  summary()
## veryactiveminutes fairlyactiveminutes lightlyactiveminutes sedentaryminutes
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:127.0 1st Qu.: 729.8
## Median : 4.00 Median : 6.00 Median :199.0 Median :1057.5
## Mean : 21.16 Mean : 13.56 Mean :192.8 Mean : 991.2
## 3rd Qu.: 32.00 3rd Qu.: 19.00 3rd Qu.:264.0 3rd Qu.:1229.5
## Max. :210.00 Max. :143.00 Max. :518.0 Max. :1440.0
sleep %>%
  select(totalminutesasleep) %>%
  summary()
## totalminutesasleep
## Min. : 58.0
## 1st Qu.:361.0
## Median :432.5
## Mean :419.2
## 3rd Qu.:490.0
## Max. :796.0
hourly_calories_intensities %>%
  select(totalintensity, averageintensity, calories) %>%
  summary()
## totalintensity averageintensity calories
## Min. : 0.00 Min. :0.0000 Min. : 42.00
## 1st Qu.: 0.00 1st Qu.:0.0000 1st Qu.: 63.00
## Median : 3.00 Median :0.0500 Median : 83.00
## Mean : 12.04 Mean :0.2006 Mean : 97.39
## 3rd Qu.: 16.00 3rd Qu.:0.2667 3rd Qu.:108.00
## Max. :180.00 Max. :3.0000 Max. :948.00
Key Findings:
The population has an average daily step count of 7638. This is low compared to the CDC recommended step count of 10,000.
The average daily distance traveled was 5.49 miles.
Users averaged 34.72 very active and fairly active minutes per day combined (0.58 hours), compared with 1,184 lightly active and sedentary minutes (19.73 hours).
The average sleep minutes for users was 419.2 or 6.99 hours, which is just under the CDC recommended 7 hours for adults 18-60 years old.
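The hour figures in these findings can be re-derived from the means in the `summary()` output. The sketch below copies those means by hand; `to_hours` is a hypothetical helper introduced only for this check.

```r
# Daily means copied from the summary() output above (minutes)
very_active <- 21.16
fairly_active <- 13.56
lightly_active <- 192.8
sedentary <- 991.2
asleep <- 419.2

# Combine the two active and the two inactive categories
active_minutes <- very_active + fairly_active
inactive_minutes <- lightly_active + sedentary

# Hypothetical helper: minutes to hours, rounded to two decimals
to_hours <- function(minutes) round(minutes / 60, 2)

print(c(active = to_hours(active_minutes),
        inactive = to_hours(inactive_minutes),
        asleep = to_hours(asleep)))
```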
It will be necessary for us to associate the day of the week with our data so we will add a weekday column to both of our data sets.
For our hourly data, we will also separate time from the date column.
hourly_calories_intensities <- hourly_calories_intensities %>%
  separate(date_time, into = c('date', 'time'), sep = c(' ')) %>%
  mutate(date = ymd(date))
hourly_calories_intensities$weekday <- weekdays(hourly_calories_intensities$date)
Group weekday by time.
hourly_calories_intensities_day_time <- hourly_calories_intensities %>%
  group_by(weekday, time) %>%
  summarize(mean_avg_intensity = mean(averageintensity, na.rm = TRUE))
## `summarise()` has grouped output by 'weekday'. You can override using the
## `.groups` argument.
Organize with Monday as starting weekday.
hourly_calories_intensities_day_time$weekday <- factor(hourly_calories_intensities_day_time$weekday, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
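The reason for the explicit `levels` argument is that weekday names otherwise sort alphabetically, which scrambles the axis of any plot. A minimal sketch with a few literal day names:

```r
days <- c("Sunday", "Wednesday", "Monday", "Saturday")

# Plain character vectors sort alphabetically
alphabetical <- sort(days)

# A factor with explicit levels sorts (and plots) in calendar order
week_order <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")
days_f <- factor(days, levels = week_order)
calendar <- as.character(sort(days_f))

print(alphabetical)
print(calendar)
```

ggplot2 places discrete axis values in level order, so the heatmap rows follow the order given in `levels`.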
Data Visualization
Now that our data is cleaned and prepared, we will begin visualizing our data in order to derive correlations and important findings.
We will begin by examining the relationship between the day of the week, time and user intensity output.
ggplot(hourly_calories_intensities_day_time, aes(time, weekday))+
theme(axis.text.x= element_text(angle = 90))+
labs(title= "Daily Intensity Output", x = " ", y = " ", fill = "Average Intensity Output", caption = 'Data Source: Fitabase Data 4.1.2.16-5.12.16')+
geom_tile(color = "black", aes(fill = mean_avg_intensity))+
scale_fill_gradient(low= "grey", high= "deeppink4")+
theme(plot.title = element_text(hjust = 0.5, size = 16))
Key findings
- The data shows that our population is active earlier in the day on weekdays than on weekends. We also see that, on average, intensity output is higher later in the day than early in the day. Users are most active on Saturday, between 11:00am and 2:00pm, and Wednesday, between 5:00pm and 6:00pm.
Searching for Correlations
Next we will analyze user activity level and search for correlations.
We will categorize users' activity levels according to the NIH-sponsored, peer-reviewed article, “Physical activity for campus employees: a university worksite wellness program”. The article uses average step count to categorize activity level as follows: sedentary (< 5000 steps/day), low active (5000–7499), somewhat active (7500–9999), active (10,000–12,499), or highly active (≥ 12,500).
Note: Based on our sample size we will reference highly active and active users as one group.
activity_sleep$user_steps <- " "

activity_sleep_grouped <- activity_sleep %>%
  group_by(id) %>%
  summarize(average_totalsteps = mean(totalsteps),
            average_totalcalories = mean(calories),
            average_totaldistance = mean(totaldistance),
            average_minutesasleep = mean(totalminutesasleep, na.rm = TRUE)) %>%
  mutate(user_steps = case_when(
    average_totalsteps >= 10000 ~ "Highly Active/Active",
    average_totalsteps >= 7500 & average_totalsteps < 10000 ~ "Somewhat Active",
    average_totalsteps >= 5000 & average_totalsteps < 7500 ~ "Low Active",
    average_totalsteps < 5000 ~ "Sedentary"))

activity_sleep <- subset(activity_sleep, select = -user_steps)

activity_sleep_grouped <- merge(activity_sleep, activity_sleep_grouped, by = c("id"))

activity_sleep_grouped$user_steps <- factor(activity_sleep_grouped$user_steps, levels = c("Sedentary", "Low Active", "Somewhat Active", "Highly Active/Active"))
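As an alternative sketch of the same binning, base R's `cut()` can assign the step categories and produce an ordered factor in one step. The averages below are made-up boundary cases chosen to show which side of each threshold a value lands on; this is not the method used in the analysis above.

```r
# Made-up average daily step counts, including the boundary values
avg_steps <- c(3200, 5000, 7499, 7500, 9999, 10000, 14000)

# right = FALSE makes each bin include its lower bound: [5000, 7500) and so on
level <- cut(avg_steps,
             breaks = c(-Inf, 5000, 7500, 10000, Inf),
             labels = c("Sedentary", "Low Active",
                        "Somewhat Active", "Highly Active/Active"),
             right = FALSE)

print(data.frame(avg_steps, level))
```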
Activity level vs daily sleep minutes
ggplot(activity_sleep_grouped, aes(user_steps, totalminutesasleep))+
geom_boxplot(aes(fill= user_steps))+
geom_point(alpha = 0.5, aes(size = calories, color = calories))+
labs(title = "Activity Level vs Daily Sleep Minutes", x = "Activity Level", y = "Daily Sleep Minutes", fill= "Activity Level", color= "Daily Calories Burned", caption= "Data Source:
Physical activity for campus employees: a university worksite wellness program")+
coord_flip()+
scale_fill_brewer(palette="PiYG")+
scale_color_gradient(low= "grey2", high= "red")+
theme_bw()+
theme(plot.title = element_text(hjust = 0.5, size = 16))+
theme(plot.caption = element_text(hjust = 1.75))+
guides(size = "none",fill ="none")
Key Insights:
- The boxplots show no clear relationship between activity level and daily sleep minutes.
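A visual read of boxplots can be backed by a numeric check. The sketch below shows how `cor.test()` could be applied to per-user averages; it runs on randomly generated stand-in values (not the Fitbit data), so its output says nothing about the real users. The variable names are illustrative.

```r
set.seed(42)  # reproducible stand-in data

# Random stand-ins for per-user averages (24 users have sleep records)
avg_steps <- runif(24, min = 2000, max = 14000)
avg_sleep <- rnorm(24, mean = 419, sd = 60)

# Pearson correlation with a test of the null hypothesis of zero correlation
result <- cor.test(avg_steps, avg_sleep, method = "pearson")

print(result$estimate)  # correlation coefficient r
print(result$p.value)   # a small p-value would suggest a real association
```

On the real data, the two vectors would come from the `average_totalsteps` and `average_minutesasleep` columns, one value per user.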
Activity level vs total daily steps.
ggplot(activity_sleep_grouped, aes(user_steps, totalsteps))+
geom_boxplot(aes(fill= user_steps))+
geom_point(alpha = 0.5, aes(size = calories, color = calories))+
labs(title = "Activity Level vs Daily Steps", x = "Activity Level", y = "Daily Steps", fill= "Activity Level", size= "", color= "Daily Calories Burned", caption= "Data Source: Physical activity for campus employees: a university worksite wellness program")+
coord_flip()+
scale_fill_brewer(palette="PiYG")+
scale_color_gradient(low= "grey2", high= "red")+
theme_bw()+
theme(plot.title = element_text(hjust = 0.5, size = 16))+
theme(plot.caption = element_text(hjust = 1.75))+
guides(size = "none",fill ="none")
Key Insights
There is a high correlation between activity level and the max amount of daily steps users take.
We also see that users in the more active groups have a wider range of daily steps.
Users in the more active groups burn more calories per step, with somewhat active users showing the highest caloric burn.
Activity level vs total calories burned.
ggplot(activity_sleep_grouped, aes(user_steps, average_totalcalories))+
geom_boxplot(aes(fill= user_steps))+
geom_point(alpha = 0.5, aes(size = average_totalcalories, color = average_totalcalories))+
labs(title = "Activity Level vs Total Calories Lost", x = "Activity Level", y = "Total Calories Lost", fill= "Activity Level", size= "", color= "Total Calories", caption= "Data Source: Physical activity for campus employees: a university worksite wellness program")+
coord_flip()+
scale_fill_brewer(palette="PiYG")+
scale_color_gradient(low= "grey2", high= "red")+
theme_bw()+
theme(plot.title = element_text(hjust = 0.5, size = 16))+
theme(plot.caption = element_text(hjust = 1.75))+
guides(size = "none",fill ="none")
Activity level vs total distance traveled.
ggplot(activity_sleep_grouped, aes(x= user_steps, y= average_totaldistance))+
geom_point(alpha = 0.5, aes(size = average_totalcalories, color = average_totalcalories))+
geom_segment(aes(x= user_steps,
xend= user_steps,
y= min(average_totaldistance),
yend= max(average_totaldistance)),
linetype= "dashed",
size= 0.1)+
labs(title = "Activity Level vs Distance traveled", x= "Activity Level", y= "Miles", size= "", color= "Total Calories", caption= 'Data Source: Fitabase Data 4.1.2.16-5.12.16')+
coord_flip()+
scale_color_gradient(low= "grey2", high= "red")+
theme_classic()+
theme(plot.title = element_text(hjust = 0.5, size = 16))+
theme(plot.caption = element_text(hjust = 1.75))+
guides(size = "none")
Key Insights:
For the most part, average distance traveled increases with activity level.
Based on the total calories burned, we can determine that traveling a minimum of 5 miles puts users, on average, at a higher caloric burn rate.
Next we want to find out how much users are wearing their Fitbits. We can pull this information from our users' hourly data.
hourly_usage <- hourly_calories_intensities %>%
  group_by(date) %>%
  summarize(user_usage_hr = n() / 33)
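The logic behind `n() / 33` is that each row of the hourly table is one user-hour, so the row count for a date divided by the number of users gives the average hours recorded per user that day. A base R sketch on made-up rows, with 3 users standing in for the 33:

```r
# Made-up hourly table: each row is one user-hour with a recorded reading
hourly_toy <- data.frame(
  id = c(1, 1, 1, 2, 2, 3),
  date = as.Date(c("2016-04-12", "2016-04-12", "2016-04-13",
                   "2016-04-12", "2016-04-13", "2016-04-12"))
)

n_users <- length(unique(hourly_toy$id))  # 3 here, 33 in the case study

# Rows per date divided by the user count = average hours per user
rows_per_day <- table(hourly_toy$date)
usage_hr <- as.numeric(rows_per_day) / n_users
names(usage_hr) <- names(rows_per_day)

print(round(usage_hr, 2))
```

Dividing by the full user count means a user with no rows on a given day pulls that day's average down, which is the behavior wanted when measuring drop-off in usage.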
User device usage.
ggplot(hourly_usage, aes(date, user_usage_hr, fill= user_usage_hr)) +
geom_bar(stat= "identity", width= .7) +
geom_rect(aes(xmin = as.Date('2016-04-29'), ymin = 0,
xmax = as.Date('2016-05-12'), ymax = 22.69),
fill = "red", alpha= .01)+
labs(title = "Daily Usage", x= "", y= "Hours", caption= 'Data Source: Fitabase Data 4.1.2.16-5.12.16')+
theme_classic()+
theme(plot.title = element_text(hjust = 0.5, size = 16))+
theme(legend.position = "none")+
scale_fill_gradient(low= "lightslategrey", high= "lightslategrey") +
xlab("") +
scale_x_date(date_breaks= ("1 day"),
labels= date_format ("%b-%d")) +
scale_y_continuous(limits= c(0,24),
breaks= seq(0,max(hourly_usage$user_usage_hr),by=2))+
theme(axis.text.x= element_text(angle = 60, hjust= 1))
Key Insights:
- Usage holds steady between 23 and 24 hours for the first 17 days. Starting on day 18, usage decreases, and by the end date, the average daily usage was down to 8 hours.
Summary of Key Findings
User’s average daily step count is 7638. This is low compared to the CDC recommended step count of 10,000.
Users averaged 34.72 very active and fairly active minutes per day combined (0.58 hours), compared with 1,184 lightly active and sedentary minutes (19.73 hours).
Users are active earlier in the day on weekdays than on weekends.
We see that users are most active on Saturday, between 11:00am - 2:00pm, and Wednesday, between 5:00pm and 6:00pm.
There is a high correlation between activity level and the max amount of daily steps users take.
More active users burn more calories per step than less active users.
Traveling a minimum of 5 miles puts users, on average, at a higher caloric burn rate.
After day 18, usage progressively decreases, and by the end date, the average daily usage was down to 8 hours.
Recommendations
In order to encourage reaching the CDC recommendation of 10,000 steps per day, our app should have preset milestones for users every 2000 steps with a notification alerting them of how many more steps they need to reach 10,000.
Our app should have a process of alerting users if they have been sedentary for an extended period of time in one day.
For users who are sedentary for extended periods on multiple days, our app should point them to our Bellabeat membership subscription.
To encourage activity, our app should let users opt in to notifications pushed to their phone as well as their wellness watch device. These could be sent every hour to encourage them to be active.
Our app should give users the ability to set up a sleep schedule that then triggers notifications when it is time to sleep.
We should give our users the ability to set up a plan or schedule to be active, and also offer preset active schedules users can select from, based on their needs.
We should alert users to their activity level:
- Send milestone alerts when users move up a level
- Push recommendations to low active users
We should push encouraging alerts to all users, summarizing their progress every 15 days to maintain steady levels of usage.
We should alert users whenever we have an update planned so they have something to look forward to, encouraging continued usage of the device.
After one month, we should prompt users to fill out a survey or list one feature they would like added to the app or one of the Bellabeat devices.
References
1. Fitabase Data 4.1.2.16-5.12.16 [link](https://www.kaggle.com/datasets/arashnic/fitbit)
2. Physical activity for campus employees: a university worksite wellness program [link](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4308577/)
3. CDC recommended sleep data [link](https://www.cdc.gov/sleep/about_sleep/how_much_sleep.html)
4. CDC recommended step data [link](https://www.cdc.gov/diabetes/prevention/pdf/postcurriculum_session8.pdf)