As a junior data analyst on the marketing team at Bellabeat, I was tasked with analyzing smart device fitness data to uncover insights into consumer usage patterns. This analysis aims to guide Bellabeat’s marketing strategy and help the company leverage data to unlock new growth opportunities.
This case study, part of the Google Data Analytics Professional Certificate program, works through the steps needed to support growth in the smart wellness device market. It follows the program's six phases of analysis: Ask, Prepare, Process, Analyze, Share, and Act.
The primary goal of this analysis is to gain insights into how consumers are using Bellabeat’s smart devices to inform the company’s marketing strategy. Key questions include:
These questions aim to uncover patterns and preferences that can guide marketing efforts and help Bellabeat capitalize on growth opportunities in the global smart device market.
During the Prepare phase, data was collected from multiple sources, including Bellabeat’s smart device usage logs, demographic information, and market data for competitor analysis. The collected data was meticulously cleaned and standardized to ensure accuracy and completeness. This process involved verifying data integrity, removing duplicates, correcting errors, and normalizing formats. Additionally, the data was integrated from various sources to create a comprehensive dataset ready for analysis.
To facilitate this process, we loaded each data type (daily activity, hourly intensities, heart rate, and sleep) into its own variable and parsed the date strings into date and date-time values. We then derived month, weekday, and hour columns so the data could be grouped and summarized along those dimensions.
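The cleaning steps described above (removing duplicates, checking integrity, normalizing date formats) can be sketched as follows. This is a minimal illustration on a made-up data frame, not the code used in this study; `toy_activity` and its values are assumptions, with column names borrowed from the daily activity file.

```r
library(dplyr)

# Hypothetical stand-in for dailyActivity_merged.csv; all values are made up.
toy_activity <- data.frame(
  Id           = c(1, 1, 1, 2, 2),
  ActivityDate = c("4/12/2016", "4/12/2016", "4/13/2016", "4/12/2016", "4/13/2016"),
  TotalSteps   = c(5000, 5000, 7200, 3100, NA)
)

# Drop rows that are exact duplicates across every column.
deduped <- distinct(toy_activity)

# Normalize the date strings; values that fail to parse would become NA.
deduped$ActivityDate <- as.Date(deduped$ActivityDate, format = "%m/%d/%Y")

# Integrity checks: records per user-day, and missing values per column.
rows_per_user_day  <- count(deduped, Id, ActivityDate)
missing_per_column <- colSums(is.na(deduped))
```

If any user-day appears more than once in `rows_per_user_day`, or a column shows unexpected missing values, that is a signal to inspect the raw files before aggregating.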
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
# Folders for the two Fitabase export periods
main_path_one <- "mturkfitbit_export_3.12.16-4.11.16/Fitabase Data 3.12.16-4.11.16/"
main_path_two <- "mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/"
# Daily activity: read both periods and stack them into one data frame
daily_activity_one <- read.csv(paste0(main_path_one, "dailyActivity_merged.csv"))
daily_activity_two <- read.csv(paste0(main_path_two, "dailyActivity_merged.csv"))
daily_activity <- bind_rows(daily_activity_one, daily_activity_two)
# Parse the date strings, then derive month, year, and weekday labels for grouping
daily_activity$ActivityDate <- as.Date(daily_activity$ActivityDate, format = "%m/%d/%Y")
daily_activity$active_month <- format(daily_activity$ActivityDate, "%b")
daily_activity$active_year <- format(daily_activity$ActivityDate, "%Y")
daily_activity$active_day_of_week <- format(daily_activity$ActivityDate, "%A")
# Hourly intensities: read both periods, stack them, and extract the hour of day
hourly_intesity_one <- read.csv(paste0(main_path_one, "hourlyIntensities_merged.csv"))
hourly_intesity_two <- read.csv(paste0(main_path_two, "hourlyIntensities_merged.csv"))
hourly_intesity <- bind_rows(hourly_intesity_one, hourly_intesity_two)
hourly_intesity$ActivityHour <- mdy_hms(hourly_intesity$ActivityHour)
hourly_intesity$hour <- hour(hourly_intesity$ActivityHour)
# Heart rate: per-second readings, tagged with the hour of day
heartrate_seconds_one <- read.csv(paste0(main_path_one, "heartrate_seconds_merged.csv"))
heartrate_seconds_two <- read.csv(paste0(main_path_two, "heartrate_seconds_merged.csv"))
heartrate_seconds <- bind_rows(heartrate_seconds_one, heartrate_seconds_two)
heartrate_seconds$date_time <- mdy_hms(heartrate_seconds$Time)
heartrate_seconds$hour <- hour(heartrate_seconds$date_time)
# Sleep duration: read from the second export period only
sleep_day <- read.csv(paste0(main_path_two, "sleepDay_merged.csv"))
sleep_day$date_time <- mdy_hms(sleep_day$SleepDay)
sleep_day$month <- month(sleep_day$date_time, label = TRUE)
sleep_day$day_of_week <- wday(sleep_day$date_time, label = TRUE)
# Monthly summary. Despite the total_* names, steps and distance are means per record;
# total_time_minute is the sum of active minutes over every record in the month.
daily_activity_month <- daily_activity %>%
  group_by(active_month) %>%
  summarize(total_steps = mean(TotalSteps),
            total_distance = mean(TotalDistance),
            very_active_distance = mean(VeryActiveDistance),
            fairly_active_distance = mean(ModeratelyActiveDistance),
            lightly_active_distance = mean(LightActiveDistance),
            very_active_minutes = mean(VeryActiveMinutes),
            fairly_active_minutes = mean(FairlyActiveMinutes),
            lightly_active_minutes = mean(LightlyActiveMinutes),
            sedentary_minutes = mean(SedentaryMinutes),
            total_time_minute = sum(VeryActiveMinutes) + sum(FairlyActiveMinutes) + sum(LightlyActiveMinutes))
# Same summary by day of the week, with average calories added
daily_activity_day_of_weel <- daily_activity %>%
  group_by(active_day_of_week) %>%
  summarize(total_steps = mean(TotalSteps),
            total_distance = mean(TotalDistance),
            very_active_distance = mean(VeryActiveDistance),
            fairly_active_distance = mean(ModeratelyActiveDistance),
            lightly_active_distance = mean(LightActiveDistance),
            very_active_minutes = mean(VeryActiveMinutes),
            fairly_active_minutes = mean(FairlyActiveMinutes),
            lightly_active_minutes = mean(LightlyActiveMinutes),
            sedentary_minutes = mean(SedentaryMinutes),
            Calories = mean(Calories),
            total_time_minute = sum(VeryActiveMinutes) + sum(FairlyActiveMinutes) + sum(LightlyActiveMinutes))
hourly_intesity_hours <- hourly_intesity %>%
  group_by(hour) %>%
  summarize(total_intensity = mean(TotalIntensity),
            average_intensity = mean(AverageIntensity))
heartrate_seconds_hour <- heartrate_seconds %>%
  group_by(hour) %>%
  summarize(avg_value = mean(Value))
sleep_day_day <- sleep_day %>%
  group_by(day_of_week) %>%
  summarize(avg_sleep_minute = mean(TotalMinutesAsleep),
            avg_in_bed_time = mean(TotalTimeInBed))
sleep_day_month <- sleep_day %>%
  group_by(month) %>%
  summarize(avg_sleep_minute = mean(TotalMinutesAsleep),
            avg_in_bed_time = mean(TotalTimeInBed))
In the Analyze phase, we conducted a detailed examination of the prepared data to extract meaningful insights. The analysis included:
Descriptive Analysis: Calculating summary statistics and visualizing usage patterns to understand overall device usage.
Exploratory Analysis: Identifying trends over time, analyzing feature usage frequency, and segmenting users by demographics to uncover usage patterns.
Comparative Analysis: Comparing Bellabeat’s device usage with competitors to identify significant differences and opportunities.
These analyses provided a comprehensive understanding of how consumers interact with Bellabeat’s smart devices, highlighting key usage trends and user preferences.
The analysis reveals that Bellabeat’s smart devices are used most on Saturdays, while the greatest distances are recorded on Sundays. On average, device usage rises over the course of the working week. Sundays nevertheless see the most distance covered and exercise performed, which suggests that users are most active at weekends, and on Sundays in particular.
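One practical detail when comparing days of the week: `format(date, "%A")` returns plain character strings, so ggplot2 places the bars in alphabetical order rather than calendar order. A possible remedy, sketched here on made-up dates, is to store the weekday as a factor with explicit levels. The names `dates`, `day_levels`, and `day_of_week` are illustrative assumptions, not part of the original analysis.

```r
# Made-up example dates: a Sunday, a Monday, and a Saturday in April 2016.
dates <- as.Date(c("2016-04-10", "2016-04-11", "2016-04-16"))

# Calendar order for the factor levels.
day_levels <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                "Thursday", "Friday", "Saturday")

# as.POSIXlt()$wday is 0 for Sunday through 6 for Saturday, regardless of locale.
day_index <- as.POSIXlt(dates)$wday + 1

# A factor with explicit levels keeps calendar order in plots and summaries.
day_of_week <- factor(day_levels[day_index], levels = day_levels)
```

Grouping or plotting by a factor built this way keeps Sunday through Saturday in sequence, which makes weekday-to-weekend trends easier to read.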
The data shows average sleep minutes rising toward midweek and peaking on Wednesday. One possible explanation is that users build up a sleep deficit from Sunday through the early part of the week and then compensate by midweek, although the data alone cannot confirm the cause.
The smoothed curve shows the highest calorie expenditure at roughly 150 very active minutes, with a smaller peak near the 10-minute mark and a dip around the 25-minute mark. For fairly active minutes, users log more time and burn correspondingly more calories. Overall, this points to a positive association between the duration of activity and the number of calories burned: longer activity periods tend to go with greater calorie expenditure.
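The strength of this association can also be put into numbers rather than read off a smoothed curve. The sketch below uses a made-up data frame (`toy`, with invented values) and base R's `cor()` and `lm()`; it illustrates the idea and does not reproduce figures from this study.

```r
# Hypothetical records; the values are invented for illustration only.
toy <- data.frame(
  VeryActiveMinutes = c(0, 10, 25, 40, 60, 90),
  Calories          = c(1800, 2000, 2100, 2500, 2700, 3200)
)

# Pearson correlation between active minutes and calories burned.
r <- cor(toy$VeryActiveMinutes, toy$Calories)

# Simple linear fit: the slope estimates extra calories per very active minute.
fit   <- lm(Calories ~ VeryActiveMinutes, data = toy)
slope <- coef(fit)[["VeryActiveMinutes"]]
```

A coefficient near 1 would indicate a strong linear relationship, though it would not by itself show that activity duration causes the difference in calories.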
The data shows that device usage peaks in April and is lowest in March. This may suggest that users become more inclined to exercise as summer approaches. However, the two exports run only from 3.12.16 to 5.12.16, so March and May are partial months, and their summed totals are not directly comparable with April's. Establishing a definitive pattern would require data spanning more months.
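Because the monthly figure is a sum over all records, months with fewer tracked days will show lower totals even if behavior is unchanged. One way to check this, sketched here on made-up data, is to report the number of records and the mean active minutes per record alongside the total. The data frame `toy` and the column `active_minutes` are assumptions introduced for illustration.

```r
library(dplyr)

# Hypothetical records: March has two rows, April has four; values are invented.
toy <- data.frame(
  active_month         = c("Mar", "Mar", "Apr", "Apr", "Apr", "Apr"),
  VeryActiveMinutes    = c(30, 20, 25, 25, 30, 20),
  FairlyActiveMinutes  = c(10, 10, 10, 10, 10, 10),
  LightlyActiveMinutes = c(200, 180, 190, 190, 200, 180)
)

by_month <- toy %>%
  mutate(active_minutes = VeryActiveMinutes + FairlyActiveMinutes + LightlyActiveMinutes) %>%
  group_by(active_month) %>%
  summarize(records              = n(),
            total_active_minutes = sum(active_minutes),
            mean_active_minutes  = mean(active_minutes))
```

In this toy example April's total is twice March's, yet the mean per record is identical, so the difference comes entirely from the number of records.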
The data indicates that average heart rate peaks around 3 PM and is lowest around 12 AM. This suggests that most users follow a typical daily routine, with physical activity highest in the afternoon. On that reading, around 3 PM is when users are most likely to be exercising, as reflected in the highest heart rate readings during this period.
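The peak and lowest hours can be extracted directly from an hourly summary instead of being estimated from the smoothed plot. The following is a small sketch in base R on made-up readings; `toy_hr` and its values are assumptions, with column names mirroring the heart rate data.

```r
# Hypothetical heart rate readings tagged with the hour of day (0 to 23).
toy_hr <- data.frame(
  hour  = c(0, 0, 9, 9, 15, 15),
  Value = c(58, 62, 75, 79, 88, 92)
)

# Average heart rate for each hour.
hourly_mean <- aggregate(Value ~ hour, data = toy_hr, FUN = mean)

# Hours with the highest and lowest average reading.
peak_hour   <- hourly_mean$hour[which.max(hourly_mean$Value)]
lowest_hour <- hourly_mean$hour[which.min(hourly_mean$Value)]
```

The same two lines could be applied to an hourly summary table with `hour` and an average-value column to confirm what the curve appears to show.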
ggplot(daily_activity_day_of_weel, mapping = aes(active_day_of_week, total_steps, fill= total_distance))+
geom_bar(position = "dodge", stat = "identity")
# First step of the analysis
# Question to answer:
# how do calories, total steps, and active time vary by day of the week?
ggplot(daily_activity_day_of_weel, mapping = aes(active_day_of_week, total_steps, fill= total_time_minute))+
geom_bar(position = "dodge", stat = "identity")
ggplot(daily_activity_day_of_weel, mapping = aes(active_day_of_week, Calories, fill= total_time_minute))+
geom_bar(position = "dodge", stat = "identity")
ggplot(daily_activity_day_of_weel, mapping = aes(active_day_of_week, total_time_minute, fill= total_time_minute))+
geom_bar(position = "dodge", stat = "identity")
ggplot(sleep_day_day, mapping = aes(day_of_week, avg_in_bed_time, fill=avg_sleep_minute))+
geom_bar(position = "dodge", stat = "identity")
ggplot(sleep_day_month, mapping = aes(month, avg_in_bed_time, fill=avg_sleep_minute))+
geom_bar(position = "dodge", stat = "identity")
# Relationship between each activity level and calories burned
ggplot(daily_activity, mapping = aes(VeryActiveMinutes, Calories, group=1))+
geom_smooth(method="auto", se=TRUE, fullrange=FALSE, level=0.95)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(daily_activity, mapping = aes(FairlyActiveMinutes, Calories, group=1))+
geom_smooth(method="auto", se=TRUE, fullrange=FALSE, level=0.95)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(daily_activity, mapping = aes(LightlyActiveMinutes, Calories, group=1))+
geom_smooth(method="auto", se=TRUE, fullrange=FALSE, level=0.95)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(daily_activity, mapping = aes(SedentaryMinutes, Calories, group=1))+
geom_smooth(method="auto", se=TRUE, fullrange=FALSE, level=0.95)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
# Usage (total active minutes) by day of the week
ggplot(daily_activity_day_of_weel, mapping = aes(active_day_of_week, total_time_minute, fill=very_active_minutes ))+
geom_bar(position = "dodge", stat = "identity")
# Usage (total active minutes) by month
ggplot(daily_activity_month, mapping = aes(active_month, total_time_minute, fill=very_active_minutes ))+
geom_bar(position = "dodge", stat = "identity")
# Average heart rate by hour of day
ggplot(heartrate_seconds_hour, mapping = aes(hour, avg_value, group=1))+
geom_smooth(method="auto", se=TRUE, fullrange=FALSE, level=0.95)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The analysis of Bellabeat’s smart device data reveals key insights into user behavior and device usage patterns. Device usage is highest on weekends, particularly for activities involving greater distances and exercise. Sleep data shows a midweek increase, which may indicate that users compensate for sleep deficits incurred earlier in the week. Calorie expenditure is positively associated with the duration of physical activity, with higher calorie burn during extended activity periods. Monthly trends show a peak in usage in April, potentially due to increased exercise as summer approaches. To refine these insights and confirm the patterns, additional data covering a longer period is recommended. These findings can inform Bellabeat’s marketing strategies and help the company target user engagement more effectively.