This data analytics report is the second case study of Google professional analytics course. The purpose of the analysis is to show a new growth opportunity for Bellabeat, a high-tech manufacturer of health-focused products for women. Based on the analysis of the existing data of smart health device, new insights are found at the section of “Act” through passing six basic steps, “Ask”, “Prepare”,“Process”, “Analyze”, “Share”, and “Act”.
This section is to clarify the basic requirements and business questions in related with the analysis.
According to Ms.Urška Sršen, co-founder and Chief Creative Officer of
Bellabeat, the business tasks are following:
In order to start analysis, the data of smart device must be prepared. According to Sršen, using public data that explores smart device users’ daily habits is recommended.
For the preparation to analysis, the following actions are taken.
Download the data from the above designated link and store it.
Identify how it’s organized.
Sort and filter the data.
Determine the credibility of the data
To implement the data preparation, the following packages of tools are installed.
library(tidyverse) #helps wrangle data
library(lubridate) #helps wrangle date attributes
library(ggplot2) #helps visualize data
library(cowplot) #helps visualize data
#For data cleaning, following packages are installed.
library(here)
library(janitor)
library(skimr)
library(dplyr)
getwd() #displays your working directory
#Data location: "C:/Users/satos/Documents/project/case-2dataset"
Identify how the data sets are organized. After the download the data sets, We confirmed the data type is csv files which is collected by Fitbit.Those shall be uploaded as below.
activity <- read_csv("dailyActivity_merged.csv")
sleep <- read_csv("sleepDay_merged.csv")
heartrate <- read_csv("heartrate_seconds_merged.csv")
hourly_calories <- read_csv("hourlyCalories_merged.csv")
Preview the each data frame.
glimpse(activity) #Rows: 940, Columns: 15, ActivityDate <chr>
glimpse(sleep) #Rows: 940, Columns: 3, ActivityDay <chr>
glimpse(heartrate) #Rows: 2,483,658, Columns: 3, Time <chr>
glimpse(hourly_calories) #Rows: 22,099, Columns: 3 ActivityHour <chr>
According to previewing the data, activity data has wide format and data of heart rate and hourly calories have long format. In addition, to integrate multiple data in one, the following actions will be required in the next step, “PROCESS”.
Unify the chronological column as “date”.
Rename “StepTotal” to “TotalStep”.
Split the time data to year, month, day, hour, minute, and seconds.
Convert the data type of “date” to as.POSIXct from character.
According to the data with 30 users and only two months of collected data, the credibility of the data is not high. For example, the data was taken in April and May. In general, the outside temperature of the season is relatively comfort to do exercise. The tendencies might be different from other months. In addition, numbers of sleeping data is 14. In terms of the numbers of sampling data, it is limited for the analysis.
For the process the data for analysis, the following tasks will be implemented.
Check the data for errors by Rstudio.
Transform the data effectively.
Document the cleaning process.
Rename some column names such as renaming “date” from “ActivityDate” or “ActivityHour”. In addition, make a consequence name of columns with “clean_names” function.
#Rename to common names.
activity <- activity %>% rename(date = ActivityDate)
heartrate <- heartrate %>% rename(date_time= Time)
heartrate <- heartrate %>% rename(heart_rate = Value)
sleep <- sleep %>% rename(date = SleepDay )
hourly_calories <- hourly_calories %>% rename(id = Id)
hourly_calories <- hourly_calories %>% rename(date_time = ActivityHour)
hourly_calories <- hourly_calories %>% rename(hourly_calories = Calories)
#Use clean name function for readable.
activity <- clean_names(activity)
heartrate <- clean_names(heartrate)
sleep <- clean_names(sleep)
hourly_calories <- clean_names(hourly_calories)
Review if it changed.
colnames(activity)
colnames(heartrate)
colnames(sleep)
colnames(hourly_calories)
# [1] "id" "date" "total_steps"
# [4] "total_distance" "tracker_distance" "logged_activities_distance"
# [7] "very_active_distance" "moderately_active_distance" "light_active_distance"
#[10] "sedentary_active_distance" "very_active_minutes" "fairly_active_minutes"
#[13] "lightly_active_minutes" "sedentary_minutes" "calories"
#[1] "id" "date_time" "heart_rate"
#[1] "id" "date" "total_sleep_records" "total_minutes_asleep"
#[5] "total_time_in_bed"
#[1] "id" "date_time" "hourly_calories"
Before split the time data, confirm the forma on each date frame.
head(activity) #<date>, Year,Month, Day
tail(activity)
head(heartrate) #<S3: POSIXct> Year,Month, Day, hour,minute,second, AM/PM
tail(heartrate)
head(sleep) #<date>, Year,Month, Day
tail(sleep)
head(hourly_calories) #<S3: POSIXct> Year,Month, Day, hour,minute,second, AM/PM
tail(hourly_calories)
The parts should be transformed are as follows:
Transform the character vector to date vector.
Split and add columns of day, month, and year.
#activity
activity <- activity %>%
mutate(date = as_date(date, format = "%m/%d/%Y")) #The default format is yyyy-mm-dd
#Hourly calories
hourly_calories<- hourly_calories %>%
mutate(date_time = as.POSIXct(date_time,format ="%m/%d/%Y %I:%M:%S %p" , tz=Sys.timezone()))
#Split date and time
hourly_calories$time <- format(hourly_calories$date_time, format = "%H:%M:%S")
hourly_calories$date <- format(hourly_calories$date_time, format = "%m/%d/%y")
hourly_calories$date <- as.POSIXct(hourly_calories$date, format = "%m/%d/%y")
hourly_calories$date <- as_date(hourly_calories$date)
#Heart rate
heartrate <- heartrate %>%
mutate(date_time= as.POSIXct(date_time,format ="%m/%d/%Y %I:%M:%S %p" , tz=Sys.timezone()))
#Split date and time
heartrate$time <- format(heartrate$date_time, format = "%H:%M:%S")
heartrate$date <- format(heartrate$date_time, format = "%m/%d/%y")
heartrate$date <- as.POSIXct(heartrate$date, format = "%m/%d/%y")
heartrate$date <- as_date(heartrate$date)
#sleep
sleep <- sleep %>% mutate(date = as_date(date,format ="%m/%d/%Y %I:%M:%S %p")) #The default format is yyyy-mm-dd
head(activity)
## # A tibble: 6 × 15
## id date total_steps total_distance tracker_distance
## <dbl> <date> <dbl> <dbl> <dbl>
## 1 1503960366 2016-04-12 13162 8.5 8.5
## 2 1503960366 2016-04-13 10735 6.97 6.97
## 3 1503960366 2016-04-14 10460 6.74 6.74
## 4 1503960366 2016-04-15 9762 6.28 6.28
## 5 1503960366 2016-04-16 12669 8.16 8.16
## 6 1503960366 2016-04-17 9705 6.48 6.48
## # ℹ 10 more variables: logged_activities_distance <dbl>,
## # very_active_distance <dbl>, moderately_active_distance <dbl>,
## # light_active_distance <dbl>, sedentary_active_distance <dbl>,
## # very_active_minutes <dbl>, fairly_active_minutes <dbl>,
## # lightly_active_minutes <dbl>, sedentary_minutes <dbl>, calories <dbl>
head(heartrate)
## # A tibble: 6 × 5
## id date_time heart_rate time date
## <dbl> <dttm> <dbl> <chr> <date>
## 1 2022484408 2016-04-12 07:21:00 97 07:21:00 2016-04-12
## 2 2022484408 2016-04-12 07:21:05 102 07:21:05 2016-04-12
## 3 2022484408 2016-04-12 07:21:10 105 07:21:10 2016-04-12
## 4 2022484408 2016-04-12 07:21:20 103 07:21:20 2016-04-12
## 5 2022484408 2016-04-12 07:21:25 101 07:21:25 2016-04-12
## 6 2022484408 2016-04-12 07:22:05 95 07:22:05 2016-04-12
head(sleep)
## # A tibble: 6 × 5
## id date total_sleep_records total_minutes_asleep total_time_in_bed
## <dbl> <date> <dbl> <dbl> <dbl>
## 1 1.50e9 2016-04-12 1 327 346
## 2 1.50e9 2016-04-13 2 384 407
## 3 1.50e9 2016-04-15 1 412 442
## 4 1.50e9 2016-04-16 2 340 367
## 5 1.50e9 2016-04-17 1 700 712
## 6 1.50e9 2016-04-19 1 304 320
head(hourly_calories)
## # A tibble: 6 × 5
## id date_time hourly_calories time date
## <dbl> <dttm> <dbl> <chr> <date>
## 1 1503960366 2016-04-12 00:00:00 81 00:00:00 2016-04-12
## 2 1503960366 2016-04-12 01:00:00 61 01:00:00 2016-04-12
## 3 1503960366 2016-04-12 02:00:00 59 02:00:00 2016-04-12
## 4 1503960366 2016-04-12 03:00:00 47 03:00:00 2016-04-12
## 5 1503960366 2016-04-12 04:00:00 48 04:00:00 2016-04-12
## 6 1503960366 2016-04-12 05:00:00 48 05:00:00 2016-04-12
Check if there are any duplicated data.
sum(duplicated(activity)) #0
sum(duplicated(sleep)) #3
sum(duplicated(heartrate)) #0
sum(duplicated(hourly_calories)) #0
We found duplicated data in the sleep file.
We verify the numbers of users on each data set.
n_unique(activity$id) #33
n_unique(sleep$id) #24
n_unique(heartrate$id) #14
n_unique(hourly_calories$id) #33
Accordingly, the data set of “heartrate” contains small numbers of users. To keep user numbers as many as possible, we may not use the data of “heartrate” in this analysis.
#Activity
activity <- activity %>%
distinct() %>%
drop_na()
#Heartrate
heartrate <- heartrate %>%
distinct() %>%
drop_na()
#sleep
sleep <- sleep %>%
distinct() %>%
drop_na()
#hourly steps
hourly_calories <- hourly_calories %>%
distinct() %>%
drop_na()
Verify if all unnecessary data is removed.
## [1] 0
## [1] 0
## [1] 0
## [1] 0
In addition, check cleaned four data.
head(activity)
head(heartrate)
head(sleep)
head(hourly_calories)
Accordingly, we verified that all data is cleaned.
Firstly, we merge two data sets of “activity” and “sleep” to see any correlation between variables by using id and date as their primary keys.
#Merge two data sets of "activity" and "sleep", merged data is named as "activity_sleep"
activity_sleep <- merge(activity, sleep, by= c("id", "date"))
glimpse(activity_sleep)
#Confirm any duplicated data.
sum(duplicated(activity_sleep)) #[1] 0
Now, we confirmed the merged data set is cleaned.
At this part, through organizing and perform calculations, the trends of the FitBit users will be revealed to meet BellaBeat’s marketing strategy. The analysis will be proceeded as following steps:
Knowing the meaning of data variables.
Perform calculations.
Identify trends and relationships.
To do analysis work, knowing the meaning of data variables is important.
colnames(activity_sleep)
#############################################################################################
#[1] "id" "date" "total_steps"
#[4] "total_distance" "tracker_distance" "logged_activities_distance"
#[7] "very_active_distance" "moderately_active_distance" "light_active_distance"
#[10] "sedentary_active_distance" "very_active_minutes" "fairly_active_minutes"
#[13] "lightly_active_minutes" "sedentary_minutes" "calories"
#[16] "total_sleep_records" "total_minutes_asleep" "total_time_in_bed"
#############################################################################################
According to the Centers for Disease Control and Prevention (CDC) defines the following activity levels in terms of steps per day. Reference: Tomwaltersfitness
df_level <- data.frame(
activity_level = c("Very active", "Moderately active", "Fairly active", "Lightly active", "Sedentary active"),
steps_per_day = c(">12000", "7500-12000", "5000-7499", "2500-4999","<2500")
)
print(df_level)
## activity_level steps_per_day
## 1 Very active >12000
## 2 Moderately active 7500-12000
## 3 Fairly active 5000-7499
## 4 Lightly active 2500-4999
## 5 Sedentary active <2500
knitr::kable(df_level, caption = "Table with user type")
activity_level | steps_per_day |
---|---|
Very active | >12000 |
Moderately active | 7500-12000 |
Fairly active | 5000-7499 |
Lightly active | 2500-4999 |
Sedentary active | <2500 |
To get dived into deeply about users of FitBit, calculate average steps, average calories, and average heart rate by users.
We calculate average values of steps, calories, and heart rate.
daily_average <- activity_sleep %>%
group_by(id) %>%
summarise(mean_daily_steps = mean(total_steps), mean_daily_calories = mean(calories), mean_minutes_asleep = mean(total_minutes_asleep))
head(daily_average)
## # A tibble: 6 × 4
## id mean_daily_steps mean_daily_calories mean_minutes_asleep
## <dbl> <dbl> <dbl> <dbl>
## 1 1503960366 12406. 1872. 360.
## 2 1644430081 7968. 2978. 294
## 3 1844505072 3477 1676. 652
## 4 1927972279 1490 2316. 417
## 5 2026352035 5619. 1541. 506.
## 6 2320127002 5079 1804 61
Based on the average data above, users are classified by the daily average steps.
user_type <- daily_average %>%
mutate(user_type = case_when(
mean_daily_steps < 2500 ~ "sedentary",
mean_daily_steps >= 2500 & mean_daily_steps < 5000 ~ "lightly active",
mean_daily_steps >= 5000 & mean_daily_steps < 7500 ~ "fairly active",
mean_daily_steps >= 7500 & mean_daily_steps < 12000 ~ "moderately active",
mean_daily_steps >= 12000 ~ "very active"
))
head(user_type)
## # A tibble: 6 × 5
## id mean_daily_steps mean_daily_calories mean_minutes_asleep user_type
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1503960366 12406. 1872. 360. very acti…
## 2 1644430081 7968. 2978. 294 moderatel…
## 3 1844505072 3477 1676. 652 lightly a…
## 4 1927972279 1490 2316. 417 sedentary
## 5 2026352035 5619. 1541. 506. fairly ac…
## 6 2320127002 5079 1804 61 fairly ac…
sum(is.na(user_type)) # 0
## [1] 0
Then, calculate how many user type exists in the data.
user_type_ratio <- user_type %>%
group_by(user_type) %>%
summarise(total = n()) %>%
mutate(totals = sum(total)) %>%
group_by(user_type) %>%
summarise(total_ratio = total / totals) %>%
mutate(labels = scales::percent(total_ratio))
user_type_ratio$user_type <- factor(user_type_ratio$user_type , levels = c("very active", "moderately active", "fairly active", "lightly active", "sedentary"))
head(user_type_ratio)
## # A tibble: 5 × 3
## user_type total_ratio labels
## <fct> <dbl> <chr>
## 1 fairly active 0.208 20.8%
## 2 lightly active 0.167 16.7%
## 3 moderately active 0.5 50.0%
## 4 sedentary 0.0417 4.2%
## 5 very active 0.0833 8.3%
Visualize the distribution of user type as follows.
user_type_ratio %>%
ggplot(aes(x="",y=total_ratio, fill=user_type)) +
geom_bar(stat = "identity", width = 1)+
coord_polar("y", start=0)+ #Convert the plot to polar coorinates
theme_minimal()+
theme(axis.title.x= element_blank(),
axis.title.y = element_blank(),
panel.border = element_blank(),
panel.grid = element_blank(),
axis.ticks = element_blank(),
axis.text.x = element_blank(),
plot.title = element_text(hjust = 0.5, size=14, face = "bold")) +
scale_fill_manual(values = c("#66CDAA","#8EEBEC", "#ffd480","#ffa07a", "#e55451")) +
geom_text(aes(label = labels),
position = position_stack(vjust = 0.5, reverse = FALSE))+
labs(title="Distribution of users", fill = "User type")
df_asleep <- data.frame(
sleeper_type = c("Long sleeper", "Normal sleeper", "Short sleeper"),
sleeping_minutes = c(">540", "420-539", "<420")
)
knitr::kable(df_asleep, caption = "Table with sleeper type")
sleeper_type | sleeping_minutes |
---|---|
Long sleeper | >540 |
Normal sleeper | 420-539 |
Short sleeper | <420 |
sleeper_type <- daily_average %>%
mutate(sleeper_type = case_when(
mean_minutes_asleep < 420 ~ "short sleeper",
mean_minutes_asleep >= 420 & mean_minutes_asleep < 540 ~ "normal sleeper",
mean_minutes_asleep >= 540 ~ "long sleeper"
))
head(sleeper_type)
## # A tibble: 6 × 5
## id mean_daily_steps mean_daily_calories mean_minutes_asleep sleeper_type
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1.50e9 12406. 1872. 360. short sleep…
## 2 1.64e9 7968. 2978. 294 short sleep…
## 3 1.84e9 3477 1676. 652 long sleeper
## 4 1.93e9 1490 2316. 417 short sleep…
## 5 2.03e9 5619. 1541. 506. normal slee…
## 6 2.32e9 5079 1804 61 short sleep…
Then, calculate how many user type exists in the data.
sleeper_type_ratio <- sleeper_type %>%
group_by(sleeper_type) %>%
summarise(total = n()) %>%
mutate(totals = sum(total)) %>%
group_by(sleeper_type) %>%
summarise(total_ratio = total / totals) %>%
mutate(labels = scales::percent(total_ratio))
sleeper_type_ratio$sleeper_type <- factor(sleeper_type_ratio$sleeper_type , levels = c("short sleeper", "normal sleeper", "long sleeper"))
head(sleeper_type_ratio)
## # A tibble: 3 × 3
## sleeper_type total_ratio labels
## <fct> <dbl> <chr>
## 1 long sleeper 0.0417 4%
## 2 normal sleeper 0.417 42%
## 3 short sleeper 0.542 54%
Visualize the distribution of user type as follows.
sleeper_type_ratio %>%
ggplot(aes(x="",y=total_ratio, fill=sleeper_type)) +
geom_bar(stat = "identity", width = 1)+
coord_polar("y", start=0)+
theme_minimal()+
theme(axis.title.x= element_blank(),
axis.title.y = element_blank(),
panel.border = element_blank(),
panel.grid = element_blank(),
axis.ticks = element_blank(),
axis.text.x = element_blank(),
plot.title = element_text(hjust = 0.5, size=14, face = "bold")) +
scale_fill_manual(values = c("#66CDAA", "#ffd480", "#e55451")) +
geom_text(aes(label = labels),
position = position_stack(vjust = 0.5))+
labs(title="Distribution of sleepers", fill = "Sleeper type")
According to the distribution of users, the most of the fitbit users are moderately active. In addition, half of users are short sleepers. For the next step, we will take a look at the numbers of type of activity users and sleeping users.
#At first, merge two data frame, "sleeper_type" and "user_type", into single data frame as "df_su"
df_su <- merge(sleeper_type, user_type, by = c("id", "mean_daily_steps", "mean_daily_calories", "mean_minutes_asleep"))
glimpse(df_su)
#Then, merge cleaned raw data frame as "activity_sleep" with "df_su" into a single data frame as "df_general".
df_general <- merge(activity_sleep, df_su, by = c("id"))
glimpse(df_general)
#Add add the data of weekdays.
df_general <- df_general%>% mutate(weekday = weekdays(date))
#Order the weekdays.
df_general$weekday <- ordered(df_general$weekday, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
head(df_general)
#1. Calories and steps by sleeper type and activity type
ggplot(data=df_general)+
geom_point(mapping = aes(x=total_steps, y=calories, shape = sleeper_type, color = sleeper_type))+
geom_smooth(mapping = aes(x=total_steps, y=calories)) +
facet_wrap(user_type~sleeper_type)+
labs(title = "Calories and steps", subtitle = "Type of sleeper types and user type", caption = "Data collected by FitBit Fitness Tracker Data", color = "Sleeper type",shape = "Sleeper type", x = "Daily step", y= "calorie")
#Count the number of data for each categorized user above.
df_general %>%
group_by(user_type, sleeper_type) %>%
summarise(total = n())
## # A tibble: 9 × 3
## # Groups: user_type [5]
## user_type sleeper_type total
## <chr> <chr> <int>
## 1 fairly active normal sleeper 54
## 2 fairly active short sleeper 11
## 3 lightly active long sleeper 3
## 4 lightly active normal sleeper 15
## 5 lightly active short sleeper 31
## 6 moderately active normal sleeper 176
## 7 moderately active short sleeper 87
## 8 sedentary short sleeper 5
## 9 very active short sleeper 28
#moderately active+normal sleeper: 176
#moderately active + short sleeper: 87
n_unique(df_general$id) # 24 in total users.
## [1] 24
According to the categorized by sleeper type and user type, “Moderately active with normal sleeper” as 176 data variables are the largest. “Moderately active with short sleeper” as 87 data variables is placed at the second.
The other data group are small numbers of the data distribution, so we skip to state the insights.
In this section, based on the business tasks and analysis works, we describe our recommendations to take actions for a new business growth and value of Bellabeat.
The Bellabeat’s value is “Empowering Women to Unlock Their Full Potential”.
The data sets contains 33 users as maximum. In addition, the remarkable user class, which is “moderately active with normal sleeping time”, contains only 7 users, 21%. It is limited to expand the insights from the results of the data analysis to tips for new marketing strategy of Bllabeat. Based on such fact, our answers to the business questionss as follows.
Majority users take 7 or 8 hours for an average daily asleep and between 7500 and 12000 steps as an average daily steps.
Such users usually do exercise on Tuesday and Saturday. -During weekday, they consume calories in the morning and evening.
During weekend, they do active at noon and afternoon.
The target audience of Bellabeat is female customers. Any gender information are not contained in the data sets, so the following answers may be limited.
Based on that we select the following products according to the section, 1.4.
Bellabeat membership: Based on the personal active habits and sleeping cycle, the membership will receive useful or informative information or tips for their future enriched life style.
.Leaf: As an accessory for tracking life cycle, it is more useful. In the case of watch type, user may leave watch during resting time. This product could track in any situation, such as sportive time, private time, and semi-special moment.
These tendencies encourage Bellabeat to have more engagement with their customers or future customers based on the proper timing of the customers’ demands. For example, setting up more accurate and reasonable timeline for their marketing activities or product release timing.
As a conclusion, we sugest the following ideas for potential business opportunity.
Online live coaching service can be a potential product for Bellabeat.
Informative articles for only subscribed membership users
If the raw data has more varieties for the analysis, such as gender, ages, category of the sports for the activities. We wish Bellabeat could empower woman’s potential.