A new growth opportunity of Bellabeat

Summary

This data analytics report is the second case study of Google professional analytics course. The purpose of the analysis is to show a new growth opportunity for Bellabeat, a high-tech manufacturer of health-focused products for women. Based on the analysis of the existing data of smart health device, new insights are found at the section of “Act” through passing six basic steps, “Ask”, “Prepare”,“Process”, “Analyze”, “Share”, and “Act”.

Objectives

An analytically report
Included:
- Process of an analysis
- Visualization
- Insights

1.ASK

This section is to clarify the basic requirements and business questions in related with the analysis.

1.1.Business tasks

According to Ms.Urška Sršen, co-founder and Chief Creative Officer of Bellabeat, the business tasks are following:

To analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices.
Which Bellabeat product can be applied these insight for the potential business opportunity.

1.2.Question

What are some trends in smart device usage?
How could these trends apply to Bellabeat customers?
How could these trends help influence Bellabeat marketing strategy?

1.3.Stakeholders

Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team
Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy.

1.4.Products

Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.
Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.
Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.
Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.
Bellabeat membership: Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.

2.Prepare

In order to start analysis, the data of smart device must be prepared. According to Sršen, using public data that explores smart device users’ daily habits is recommended.

2.1.Data sets

FitBit Fitness TrackerData (CC0: Public Domain, a data set made available through Mobius): This Kaggle data set contains personal fitness tracker from thirty fitbit users
The data sets about “daily activity”, “sleep day”, “hourly calories”, and “heart rate” that can be used to explore users’ habits.
If necessary, another data will be used to address the limitation for the analysis.

2.2.COLLECT DATA

For the preparation to analysis, the following actions are taken.

Download the data from the above designated link and store it.
Identify how it’s organized.
Sort and filter the data.
Determine the credibility of the data

2.3.Installation packages

To implement the data preparation, the following packages of tools are installed.

library(tidyverse) #helps wrangle data
library(lubridate) #helps wrangle date attributes
library(ggplot2) #helps visualize data
library(cowplot) #helps visualize data

#For data cleaning, following packages are installed.
library(here)
library(janitor)
library(skimr)
library(dplyr)
getwd() #displays your working directory
#Data location: "C:/Users/satos/Documents/project/case-2dataset"

2.4.Identify

Identify how the data sets are organized. After the download the data sets, We confirmed the data type is csv files which is collected by Fitbit.Those shall be uploaded as below.

activity <- read_csv("dailyActivity_merged.csv")
sleep <- read_csv("sleepDay_merged.csv")
heartrate <- read_csv("heartrate_seconds_merged.csv")
hourly_calories <- read_csv("hourlyCalories_merged.csv")

Preview the each data frame.

glimpse(activity) #Rows: 940, Columns: 15, ActivityDate <chr>
glimpse(sleep) #Rows: 940, Columns: 3, ActivityDay <chr>
glimpse(heartrate) #Rows: 2,483,658, Columns: 3, Time <chr>
glimpse(hourly_calories) #Rows: 22,099, Columns: 3 ActivityHour <chr>

According to previewing the data, activity data has wide format and data of heart rate and hourly calories have long format. In addition, to integrate multiple data in one, the following actions will be required in the next step, “PROCESS”.

Unify the chronological column as “date”.
Rename “StepTotal” to “TotalStep”.
Split the time data to year, month, day, hour, minute, and seconds.
Convert the data type of “date” to as.POSIXct from character.

2.5.Credibility and integrity of data

According to the data with 30 users and only two months of collected data, the credibility of the data is not high. For example, the data was taken in April and May. In general, the outside temperature of the season is relatively comfort to do exercise. The tendencies might be different from other months. In addition, numbers of sleeping data is 14. In terms of the numbers of sampling data, it is limited for the analysis.

3.PROCESS

For the process the data for analysis, the following tasks will be implemented.

Check the data for errors by Rstudio.
Transform the data effectively.
Document the cleaning process.

3.1.Unify the same types of data in each data sets

Rename some column names such as renaming “date” from “ActivityDate” or “ActivityHour”. In addition, make a consequence name of columns with “clean_names” function.

#Rename to common names.
activity <- activity %>%  rename(date = ActivityDate)
heartrate <- heartrate %>%  rename(date_time= Time)
heartrate <- heartrate %>%  rename(heart_rate = Value)
sleep <- sleep %>% rename(date = SleepDay )
hourly_calories <- hourly_calories %>% rename(id = Id)
hourly_calories <- hourly_calories %>% rename(date_time = ActivityHour)
hourly_calories <- hourly_calories %>% rename(hourly_calories = Calories)


#Use clean name function for readable.
activity <- clean_names(activity)
heartrate <- clean_names(heartrate)
sleep <- clean_names(sleep)
hourly_calories <- clean_names(hourly_calories)

Review if it changed.

colnames(activity)
colnames(heartrate)
colnames(sleep)
colnames(hourly_calories)
# [1] "id"                         "date"                       "total_steps"               
# [4] "total_distance"             "tracker_distance"           "logged_activities_distance"
# [7] "very_active_distance"       "moderately_active_distance" "light_active_distance"     
#[10] "sedentary_active_distance"  "very_active_minutes"        "fairly_active_minutes"     
#[13] "lightly_active_minutes"     "sedentary_minutes"          "calories"                  
#[1] "id"         "date_time"  "heart_rate"
#[1] "id"                   "date"                 "total_sleep_records"  "total_minutes_asleep"
#[5] "total_time_in_bed"   
#[1] "id"              "date_time"       "hourly_calories"

3.2.Split the time data to year, month, day, hour, minute, and seconds.

Before split the time data, confirm the forma on each date frame.

head(activity) #<date>, Year,Month, Day
tail(activity)
head(heartrate) #<S3: POSIXct>  Year,Month, Day, hour,minute,second, AM/PM
tail(heartrate) 
head(sleep) #<date>, Year,Month, Day
tail(sleep)
head(hourly_calories) #<S3: POSIXct>  Year,Month, Day, hour,minute,second, AM/PM
tail(hourly_calories)

The parts should be transformed are as follows:

Transform the character vector to date vector.
Split and add columns of day, month, and year.

3.2.1.Transform to date vector and split columns

#activity
activity <- activity %>%
  mutate(date = as_date(date, format = "%m/%d/%Y")) #The default format is yyyy-mm-dd

#Hourly calories
hourly_calories<- hourly_calories %>% 
  mutate(date_time = as.POSIXct(date_time,format ="%m/%d/%Y %I:%M:%S %p" , tz=Sys.timezone()))
#Split date and time
hourly_calories$time <- format(hourly_calories$date_time, format = "%H:%M:%S")
hourly_calories$date <- format(hourly_calories$date_time, format = "%m/%d/%y")
hourly_calories$date <- as.POSIXct(hourly_calories$date, format = "%m/%d/%y")
hourly_calories$date <- as_date(hourly_calories$date)

#Heart rate
heartrate <- heartrate %>% 
  mutate(date_time= as.POSIXct(date_time,format ="%m/%d/%Y %I:%M:%S %p" , tz=Sys.timezone()))
#Split date and time
heartrate$time <- format(heartrate$date_time, format = "%H:%M:%S")
heartrate$date <- format(heartrate$date_time, format = "%m/%d/%y")
heartrate$date <- as.POSIXct(heartrate$date, format = "%m/%d/%y")
heartrate$date <- as_date(heartrate$date)

#sleep
sleep <- sleep %>% mutate(date = as_date(date,format ="%m/%d/%Y %I:%M:%S %p")) #The default format is yyyy-mm-dd

head(activity)

## # A tibble: 6 × 15
##           id date       total_steps total_distance tracker_distance
##        <dbl> <date>           <dbl>          <dbl>            <dbl>
## 1 1503960366 2016-04-12       13162           8.5              8.5 
## 2 1503960366 2016-04-13       10735           6.97             6.97
## 3 1503960366 2016-04-14       10460           6.74             6.74
## 4 1503960366 2016-04-15        9762           6.28             6.28
## 5 1503960366 2016-04-16       12669           8.16             8.16
## 6 1503960366 2016-04-17        9705           6.48             6.48
## # ℹ 10 more variables: logged_activities_distance <dbl>,
## #   very_active_distance <dbl>, moderately_active_distance <dbl>,
## #   light_active_distance <dbl>, sedentary_active_distance <dbl>,
## #   very_active_minutes <dbl>, fairly_active_minutes <dbl>,
## #   lightly_active_minutes <dbl>, sedentary_minutes <dbl>, calories <dbl>

head(heartrate)

## # A tibble: 6 × 5
##           id date_time           heart_rate time     date      
##        <dbl> <dttm>                   <dbl> <chr>    <date>    
## 1 2022484408 2016-04-12 07:21:00         97 07:21:00 2016-04-12
## 2 2022484408 2016-04-12 07:21:05        102 07:21:05 2016-04-12
## 3 2022484408 2016-04-12 07:21:10        105 07:21:10 2016-04-12
## 4 2022484408 2016-04-12 07:21:20        103 07:21:20 2016-04-12
## 5 2022484408 2016-04-12 07:21:25        101 07:21:25 2016-04-12
## 6 2022484408 2016-04-12 07:22:05         95 07:22:05 2016-04-12

head(sleep)

## # A tibble: 6 × 5
##         id date       total_sleep_records total_minutes_asleep total_time_in_bed
##      <dbl> <date>                   <dbl>                <dbl>             <dbl>
## 1   1.50e9 2016-04-12                   1                  327               346
## 2   1.50e9 2016-04-13                   2                  384               407
## 3   1.50e9 2016-04-15                   1                  412               442
## 4   1.50e9 2016-04-16                   2                  340               367
## 5   1.50e9 2016-04-17                   1                  700               712
## 6   1.50e9 2016-04-19                   1                  304               320

head(hourly_calories)

## # A tibble: 6 × 5
##           id date_time           hourly_calories time     date      
##        <dbl> <dttm>                        <dbl> <chr>    <date>    
## 1 1503960366 2016-04-12 00:00:00              81 00:00:00 2016-04-12
## 2 1503960366 2016-04-12 01:00:00              61 01:00:00 2016-04-12
## 3 1503960366 2016-04-12 02:00:00              59 02:00:00 2016-04-12
## 4 1503960366 2016-04-12 03:00:00              47 03:00:00 2016-04-12
## 5 1503960366 2016-04-12 04:00:00              48 04:00:00 2016-04-12
## 6 1503960366 2016-04-12 05:00:00              48 05:00:00 2016-04-12

3.3.Eliminate unnecessary data, duplicates and NA.

Check if there are any duplicated data.

sum(duplicated(activity)) #0
sum(duplicated(sleep))  #3
sum(duplicated(heartrate)) #0
sum(duplicated(hourly_calories)) #0

We found duplicated data in the sleep file.

3.4.Numbers of users

We verify the numbers of users on each data set.

n_unique(activity$id) #33
n_unique(sleep$id) #24
n_unique(heartrate$id) #14
n_unique(hourly_calories$id) #33

Accordingly, the data set of “heartrate” contains small numbers of users. To keep user numbers as many as possible, we may not use the data of “heartrate” in this analysis.

3.5.Drop off NA data

#Activity
activity <- activity %>%
  distinct() %>%
  drop_na()

#Heartrate
heartrate <- heartrate %>% 
  distinct() %>%
  drop_na()

#sleep
sleep <- sleep %>%
  distinct() %>%
  drop_na()

#hourly steps
hourly_calories <- hourly_calories %>%
  distinct() %>%
  drop_na()

Verify if all unnecessary data is removed.

## [1] 0

## [1] 0

## [1] 0

## [1] 0

In addition, check cleaned four data.

head(activity)
head(heartrate)
head(sleep)
head(hourly_calories)

Accordingly, we verified that all data is cleaned.

3.6.Merge data sets

Firstly, we merge two data sets of “activity” and “sleep” to see any correlation between variables by using id and date as their primary keys.

#Merge two data sets of "activity" and "sleep", merged data is named as "activity_sleep"
activity_sleep <- merge(activity, sleep, by= c("id", "date"))
glimpse(activity_sleep)
#Confirm any duplicated data.
sum(duplicated(activity_sleep)) #[1] 0

Now, we confirmed the merged data set is cleaned.

4. ANALYZE

At this part, through organizing and perform calculations, the trends of the FitBit users will be revealed to meet BellaBeat’s marketing strategy. The analysis will be proceeded as following steps:

Knowing the meaning of data variables.
Perform calculations.
Identify trends and relationships.

4.1.Knowing the meaning of data variables.

To do analysis work, knowing the meaning of data variables is important.

colnames(activity_sleep)
#############################################################################################
#[1] "id"                         "date"                       "total_steps"               
#[4] "total_distance"             "tracker_distance"           "logged_activities_distance"
#[7] "very_active_distance"       "moderately_active_distance" "light_active_distance"     
#[10] "sedentary_active_distance"  "very_active_minutes"        "fairly_active_minutes"     
#[13] "lightly_active_minutes"     "sedentary_minutes"          "calories"                  
#[16] "total_sleep_records"        "total_minutes_asleep"       "total_time_in_bed"
#############################################################################################

According to the Centers for Disease Control and Prevention (CDC) defines the following activity levels in terms of steps per day. Reference: Tomwaltersfitness

df_level <- data.frame(
  activity_level = c("Very active", "Moderately active", "Fairly active", "Lightly active", "Sedentary active"),
  steps_per_day = c(">12000", "7500-12000", "5000-7499", "2500-4999","<2500")
)
print(df_level)

##      activity_level steps_per_day
## 1       Very active        >12000
## 2 Moderately active    7500-12000
## 3     Fairly active     5000-7499
## 4    Lightly active     2500-4999
## 5  Sedentary active         <2500

knitr::kable(df_level, caption = "Table with user type")

Table with user type
activity_level	steps_per_day
Very active	>12000
Moderately active	7500-12000
Fairly active	5000-7499
Lightly active	2500-4999
Sedentary active	<2500

4.2.Perform calculations

To get dived into deeply about users of FitBit, calculate average steps, average calories, and average heart rate by users.

4.2.1.Average steps, calories, and heart rate

We calculate average values of steps, calories, and heart rate.

daily_average <- activity_sleep %>%
  group_by(id) %>%
  summarise(mean_daily_steps = mean(total_steps), mean_daily_calories = mean(calories), mean_minutes_asleep = mean(total_minutes_asleep))

head(daily_average)

## # A tibble: 6 × 4
##           id mean_daily_steps mean_daily_calories mean_minutes_asleep
##        <dbl>            <dbl>               <dbl>               <dbl>
## 1 1503960366           12406.               1872.                360.
## 2 1644430081            7968.               2978.                294 
## 3 1844505072            3477                1676.                652 
## 4 1927972279            1490                2316.                417 
## 5 2026352035            5619.               1541.                506.
## 6 2320127002            5079                1804                  61

4.2.2 User type (average steps)

Based on the average data above, users are classified by the daily average steps.

user_type <- daily_average %>%
  mutate(user_type = case_when(
    mean_daily_steps < 2500 ~ "sedentary",
    mean_daily_steps >= 2500 & mean_daily_steps < 5000 ~ "lightly active", 
    mean_daily_steps >= 5000 & mean_daily_steps < 7500 ~ "fairly active", 
    mean_daily_steps >= 7500 & mean_daily_steps < 12000 ~ "moderately active", 
    mean_daily_steps >= 12000 ~ "very active"
  ))

head(user_type)

## # A tibble: 6 × 5
##           id mean_daily_steps mean_daily_calories mean_minutes_asleep user_type 
##        <dbl>            <dbl>               <dbl>               <dbl> <chr>     
## 1 1503960366           12406.               1872.                360. very acti…
## 2 1644430081            7968.               2978.                294  moderatel…
## 3 1844505072            3477                1676.                652  lightly a…
## 4 1927972279            1490                2316.                417  sedentary 
## 5 2026352035            5619.               1541.                506. fairly ac…
## 6 2320127002            5079                1804                  61  fairly ac…

sum(is.na(user_type)) # 0

## [1] 0

Then, calculate how many user type exists in the data.

user_type_ratio <- user_type %>%
  group_by(user_type) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(user_type) %>%
  summarise(total_ratio = total / totals) %>%
  mutate(labels = scales::percent(total_ratio))

user_type_ratio$user_type <- factor(user_type_ratio$user_type , levels = c("very active", "moderately active", "fairly active", "lightly active", "sedentary"))

head(user_type_ratio)

## # A tibble: 5 × 3
##   user_type         total_ratio labels
##   <fct>                   <dbl> <chr> 
## 1 fairly active          0.208  20.8% 
## 2 lightly active         0.167  16.7% 
## 3 moderately active      0.5    50.0% 
## 4 sedentary              0.0417 4.2%  
## 5 very active            0.0833 8.3%

4.2.3.Distribution of user type

Visualize the distribution of user type as follows.

user_type_ratio %>%
  ggplot(aes(x="",y=total_ratio, fill=user_type)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+  #Convert the plot to polar coorinates
  theme_minimal()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold")) +
  scale_fill_manual(values = c("#66CDAA","#8EEBEC", "#ffd480","#ffa07a", "#e55451")) +
  geom_text(aes(label = labels),
            position = position_stack(vjust = 0.5, reverse = FALSE))+
  labs(title="Distribution of users", fill = "User type")

4.2.4 Sleeper type Categorize the type of sleeper.

df_asleep <- data.frame(
  sleeper_type = c("Long sleeper", "Normal sleeper", "Short sleeper"),
  sleeping_minutes = c(">540", "420-539", "<420")
)

knitr::kable(df_asleep, caption = "Table with sleeper type")

Table with sleeper type
sleeper_type	sleeping_minutes
Long sleeper	>540
Normal sleeper	420-539
Short sleeper	<420

sleeper_type <- daily_average %>%
  mutate(sleeper_type = case_when(
    mean_minutes_asleep < 420 ~ "short sleeper",
    mean_minutes_asleep >= 420 & mean_minutes_asleep < 540 ~ "normal sleeper", 
    mean_minutes_asleep >= 540 ~ "long sleeper"
  ))

head(sleeper_type)

## # A tibble: 6 × 5
##         id mean_daily_steps mean_daily_calories mean_minutes_asleep sleeper_type
##      <dbl>            <dbl>               <dbl>               <dbl> <chr>       
## 1   1.50e9           12406.               1872.                360. short sleep…
## 2   1.64e9            7968.               2978.                294  short sleep…
## 3   1.84e9            3477                1676.                652  long sleeper
## 4   1.93e9            1490                2316.                417  short sleep…
## 5   2.03e9            5619.               1541.                506. normal slee…
## 6   2.32e9            5079                1804                  61  short sleep…

Then, calculate how many user type exists in the data.

sleeper_type_ratio <- sleeper_type %>%
  group_by(sleeper_type) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(sleeper_type) %>%
  summarise(total_ratio = total / totals) %>%
  mutate(labels = scales::percent(total_ratio))

sleeper_type_ratio$sleeper_type <- factor(sleeper_type_ratio$sleeper_type , levels = c("short sleeper", "normal sleeper", "long sleeper"))
head(sleeper_type_ratio)

## # A tibble: 3 × 3
##   sleeper_type   total_ratio labels
##   <fct>                <dbl> <chr> 
## 1 long sleeper        0.0417 4%    
## 2 normal sleeper      0.417  42%   
## 3 short sleeper       0.542  54%

4.2.5.Distribution of user type.

Visualize the distribution of user type as follows.

sleeper_type_ratio %>%
  ggplot(aes(x="",y=total_ratio, fill=sleeper_type)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+
  theme_minimal()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold")) +
  scale_fill_manual(values = c("#66CDAA", "#ffd480", "#e55451")) +
  geom_text(aes(label = labels),
            position = position_stack(vjust = 0.5))+
  labs(title="Distribution of sleepers", fill = "Sleeper type")

4.3. Identify trends of users’ habits and relationships

According to the distribution of users, the most of the fitbit users are moderately active. In addition, half of users are short sleepers. For the next step, we will take a look at the numbers of type of activity users and sleeping users.

4.3.1.Calories vs steps by sleeper type and user type

#At first, merge two data frame, "sleeper_type" and "user_type", into single data frame as "df_su"
df_su <- merge(sleeper_type, user_type, by = c("id", "mean_daily_steps", "mean_daily_calories", "mean_minutes_asleep"))
glimpse(df_su)

#Then, merge cleaned raw data frame as "activity_sleep" with "df_su" into a single data frame as "df_general".
df_general <- merge(activity_sleep, df_su, by = c("id"))
glimpse(df_general)

#Add add the data of weekdays.
df_general <- df_general%>% mutate(weekday = weekdays(date))
#Order the weekdays.
df_general$weekday <- ordered(df_general$weekday, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
head(df_general)

#1. Calories and steps by sleeper type and activity type
ggplot(data=df_general)+
  geom_point(mapping = aes(x=total_steps, y=calories, shape = sleeper_type, color = sleeper_type))+
  geom_smooth(mapping = aes(x=total_steps, y=calories)) + 
  facet_wrap(user_type~sleeper_type)+
  labs(title = "Calories and steps", subtitle = "Type of sleeper types and user type", caption = "Data collected by FitBit Fitness Tracker Data", color = "Sleeper type",shape = "Sleeper type", x = "Daily step", y= "calorie")

#Count the number of data for each categorized user above.
df_general %>%
  group_by(user_type, sleeper_type) %>%
  summarise(total = n())

## # A tibble: 9 × 3
## # Groups:   user_type [5]
##   user_type         sleeper_type   total
##   <chr>             <chr>          <int>
## 1 fairly active     normal sleeper    54
## 2 fairly active     short sleeper     11
## 3 lightly active    long sleeper       3
## 4 lightly active    normal sleeper    15
## 5 lightly active    short sleeper     31
## 6 moderately active normal sleeper   176
## 7 moderately active short sleeper     87
## 8 sedentary         short sleeper      5
## 9 very active       short sleeper     28

#moderately active+normal sleeper:  176 
#moderately active + short sleeper: 87  
n_unique(df_general$id) # 24 in total users.

## [1] 24

According to the categorized by sleeper type and user type, “Moderately active with normal sleeper” as 176 data variables are the largest. “Moderately active with short sleeper” as 87 data variables is placed at the second.

The other data group are small numbers of the data distribution, so we skip to state the insights.

5.SHARE

In this section, we visualize the detailed trends about habits of calorie consumption among the user with “moderately active” and “normal sleeper”.

5.1.Trends, Calorie and step through the week

We find out the correlation how steps influence to calorie consumption through the week.

# Filter the data of "moderately active" from user type and "normal sleeper" from sleeper type.
df_standard_user <- df_general %>% 
  filter(user_type == "moderately active",sleeper_type == "normal sleeper" )
# Numbers of users of "moderately active"+"normal sleeper".
n_unique(df_standard_user$id) # 7 users, 31.8%

## [1] 7

#Visualize correction between calorie and steps through the week.
ggplot(data=df_standard_user)+
  geom_point(mapping = aes(x=total_steps, y=calories))+
  geom_smooth(mapping = aes(x=total_steps, y=calories)) + 
  facet_wrap(~weekday)+
  labs(title = "Calories and steps", subtitle = "Weekly acitivity of Moderately active and normal sleeping time users.", caption = "Data collected by FitBit Fitness Tracker Data", x = "Daily step", y= "calorie")

From the graph, numbers of steps and calorie consumption are not strongly related.

It is considerable to do activities which are not related with steps, such as running, YOGA, and work out. It is limited to find out the trends because of wide range data variable a each day of the week.

So, we use bar chart to find out the trends of calorie consumption on each day of the week.

5.2.Calories per weekday

#Mean values vs Median values
p4 <- ggplot(data = df_standard_user)+
  geom_bar(mapping = aes(x=weekday, y=calories), stat = "summary", fun = "mean") +
  annotate("rect", xmin = "Monday", xmax= "Sunday", ymin = 2200, ymax = 2400, alpha = .2, fill = "green")+
  labs(title = "Average calories per weekday", x= "Weekday", y = "Calorie") +
  theme(axis.text.x = element_text(angle = 30,vjust = 0.5, hjust = 1))
 

p5 <- ggplot(data = df_standard_user)+
  geom_bar(mapping = aes(x=weekday, y=calories), stat = "summary", fun = "median") +
  annotate("rect", xmin = "Monday", xmax= "Sunday", ymin = 2200, ymax = 2400, alpha = .2, fill = "green")+
  labs(title = "Median calories per weekday", x= "Weekday", y = "Calorie") +
  theme(axis.text.x = element_text(angle = 30,vjust = 0.5, hjust = 1))

plot_grid(p4,p5)

#2200 calories are the minimum values of calorie consumption for men according to the Dietary Guidelines for Americans.
#2400 calories are the maximum values of calorie consumption for women according to the Dietary Guidelines for Americans.

In general the median values are less sensitivity from the outliers. So, we apply the data of median values.

According to the bar graph of median values of calorie, the moderately active user with normal sleeping time tends to active intensively on Tuesday, Wednesday, and Saturday.

On the other hands, users are moderately active on Monday, Thursday, and Sunday.

So, next we wonder how long the moderately active with normal sleeping users take sleep during weekday?

5.3.Minutes asleep per weekday

# Mean minutes asleep vs Median minutes asleep 
p6 <- ggplot(data = df_standard_user)+
  geom_bar(mapping = aes(x=weekday, y=total_minutes_asleep), stat = "summary", fun = "mean") +
  annotate("rect", xmin = "Monday", xmax= "Sunday", ymin = 420, ymax = 539, alpha = .2, fill = "green")+
  labs(title = "Average minutes asleep per weekday", x= "Weekday", y = "Minute") +
  theme(axis.text.x = element_text(angle = 30,vjust = 0.5, hjust = 1))
 

p7 <- ggplot(data = df_standard_user)+
  geom_bar(mapping = aes(x=weekday, y=total_minutes_asleep), stat = "summary", fun = "median") +
  annotate("rect", xmin = "Monday", xmax= "Sunday", ymin = 420, ymax = 539, alpha = .2, fill = "green")+
  labs(title = "Median minutes asleep per weekday", x= "Weekday", y = "Minute") +
  theme(axis.text.x = element_text(angle = 30,vjust = 0.5, hjust = 1))

plot_grid(p6,p7)

#Green range indicates the sleeping time with more than 420 mins(7 hours) but below 540mins(9 hours.)

Same as the graph of calorie consumption, we apply median values for minutes asleep through a week.

According to the graph of the median minutes asleep per weekday, the users take a rest more on Sunday, Saturday, and Wednesday.

The shortest sleeping day of the week is Thursday.

5.4.Trends, calorie consumption and minutes asleep per weekday.

Let’s compare the graphs between calorie consumption and minutes asleep per weekday.

plot_grid(p5,p7)

From the above two graphs, Sunday is the most resting day of the week among the users.

Tuesday is the most active day of the week among the users.
Thursday is the busiest day of the week among the users because of the lowest values on both graphs.
Saturday is the relatively active and resting day of the week.

5.5.Tuesday vs Saturday in terms of behaviors of calrorie consumption.

To get to know about the behaviors of “moderately active with normal sleeping user, we compare the calorie consumption on Tuesday and Saturday.

#Merge the data freame of "hourly_calories" and "df_standard_user.
df_standard_user1 <- merge(hourly_calories, df_standard_user, by= c("id", "date"))
glimpse(df_standard_user1)

## Rows: 4,136
## Columns: 27
## $ id                         <dbl> 2347167796, 2347167796, 2347167796, 2347167…
## $ date                       <date> 2016-04-13, 2016-04-13, 2016-04-13, 2016-0…
## $ date_time                  <dttm> 2016-04-13 18:00:00, 2016-04-13 16:00:00, …
## $ hourly_calories            <dbl> 140, 91, 57, 121, 73, 63, 59, 75, 86, 77, 5…
## $ time                       <chr> "18:00:00", "16:00:00", "05:00:00", "13:00:…
## $ total_steps                <dbl> 10352, 10352, 10352, 10352, 10352, 10352, 1…
## $ total_distance             <dbl> 7.01, 7.01, 7.01, 7.01, 7.01, 7.01, 7.01, 7…
## $ tracker_distance           <dbl> 7.01, 7.01, 7.01, 7.01, 7.01, 7.01, 7.01, 7…
## $ logged_activities_distance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_distance       <dbl> 1.66, 1.66, 1.66, 1.66, 1.66, 1.66, 1.66, 1…
## $ moderately_active_distance <dbl> 1.94, 1.94, 1.94, 1.94, 1.94, 1.94, 1.94, 1…
## $ light_active_distance      <dbl> 3.41, 3.41, 3.41, 3.41, 3.41, 3.41, 3.41, 3…
## $ sedentary_active_distance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_minutes        <dbl> 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19,…
## $ fairly_active_minutes      <dbl> 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,…
## $ lightly_active_minutes     <dbl> 195, 195, 195, 195, 195, 195, 195, 195, 195…
## $ sedentary_minutes          <dbl> 676, 676, 676, 676, 676, 676, 676, 676, 676…
## $ calories                   <dbl> 2038, 2038, 2038, 2038, 2038, 2038, 2038, 2…
## $ total_sleep_records        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ total_minutes_asleep       <dbl> 467, 467, 467, 467, 467, 467, 467, 467, 467…
## $ total_time_in_bed          <dbl> 531, 531, 531, 531, 531, 531, 531, 531, 531…
## $ mean_daily_steps           <dbl> 8533.2, 8533.2, 8533.2, 8533.2, 8533.2, 853…
## $ mean_daily_calories        <dbl> 1971.333, 1971.333, 1971.333, 1971.333, 197…
## $ mean_minutes_asleep        <dbl> 446.8, 446.8, 446.8, 446.8, 446.8, 446.8, 4…
## $ sleeper_type               <chr> "normal sleeper", "normal sleeper", "normal…
## $ user_type                  <chr> "moderately active", "moderately active", "…
## $ weekday                    <ord> Wednesday, Wednesday, Wednesday, Wednesday,…

n_unique(df_standard_user1$id) #7 users

## [1] 7

#Hourly calorie consumption on Tuesday.
df_standard_user1 %>%
  filter(weekday == "Tuesday") %>%
  group_by(time) %>%
  summarize(median_hourly_calories = median(hourly_calories)) %>%
  ggplot() +
  geom_col(mapping = aes(x=time, y = median_hourly_calories, fill = median_hourly_calories)) + 
  labs(title = "Hourly calories throughout Tuesday", x="hour", y="cal", fill = "Calorie") + 
  scale_fill_gradient(low = "green", high = "orange")+
  theme(axis.text.x = element_text(angle = 90))

#Hourly calorie consumption on Saturday.
df_standard_user1 %>%
  filter(weekday == "Saturday") %>%
  group_by(time) %>%
  summarize(median_hourly_calories = median(hourly_calories)) %>%
  ggplot() +
  geom_col(mapping = aes(x=time, y = median_hourly_calories, fill = median_hourly_calories)) + 
  labs(title = "Hourly calories throughout Saturday", x="hour", y="cal", fill = "Calorie") + 
  scale_fill_gradient(low = "green", high = "orange")+
  theme(axis.text.x = element_text(angle = 90))

For the reference, we confirm the tendencies of calorie consumption on each day of the week.

#All days of the week.
df_standard_user1 %>%
  group_by(time, weekday) %>%
  summarize(median_hourly_calories = median(hourly_calories)) %>%
  ggplot() +
  geom_col(mapping = aes(x=time, y = median_hourly_calories, fill = median_hourly_calories)) + 
  facet_wrap(~weekday) +
  labs(title = "Hourly calories throughout Saturday", x="hour", y="cal", fill = "Calorie") + 
  scale_fill_gradient(low = "green", high = "orange")+
  theme(axis.text.x = element_text(angle = 90))

According to the graph, both days of the week are days of the high calorie consumption, but both are different excising timing.

On Tuesday, the users did active at 6:00 and at 16:00.
On Saturday, the users did active at noon. In the morning and evening, they are relatively relaxed.
As the wide point of views, 7 users do exercise in the morning and in the evening. During weekend, they relatively active during at noon and afternoon.

6.Act

In this section, based on the business tasks and analysis works, we describe our recommendations to take actions for a new business growth and value of Bellabeat.

The Bellabeat’s value is “Empowering Women to Unlock Their Full Potential”.

The data sets contains 33 users as maximum. In addition, the remarkable user class, which is “moderately active with normal sleeping time”, contains only 7 users, 21%. It is limited to expand the insights from the results of the data analysis to tips for new marketing strategy of Bllabeat. Based on such fact, our answers to the business questionss as follows.

6.1.What are some trends in smart device usage?

Majority users take 7 or 8 hours for an average daily asleep and between 7500 and 12000 steps as an average daily steps.
Such users usually do exercise on Tuesday and Saturday. -During weekday, they consume calories in the morning and evening.
During weekend, they do active at noon and afternoon.

6.2.How could these trends apply to Bellabeat customers?

The target audience of Bellabeat is female customers. Any gender information are not contained in the data sets, so the following answers may be limited.

Based on that we select the following products according to the section, 1.4.

Bellabeat membership: Based on the personal active habits and sleeping cycle, the membership will receive useful or informative information or tips for their future enriched life style.
.Leaf: As an accessory for tracking life cycle, it is more useful. In the case of watch type, user may leave watch during resting time. This product could track in any situation, such as sportive time, private time, and semi-special moment.

6.3.How could these trends help influence Bellabeat marketing strategy?

These tendencies encourage Bellabeat to have more engagement with their customers or future customers based on the proper timing of the customers’ demands. For example, setting up more accurate and reasonable timeline for their marketing activities or product release timing.

6.4.Suggestion for potential business opportunity.

As a conclusion, we sugest the following ideas for potential business opportunity.

Online live coaching service can be a potential product for Bellabeat.
- Online “Live coaching” service could be set up in the evening on Tuesday and in the afternoon on Saturday according to the results of trends 5.1.3 and 5.1.4. This service could be a high chance to connect interactively Bellabeat and customers.
Informative articles for only subscribed membership users
- New articles on SNS or mailing news can be read if marketing team release on the less active day, such as Sunday, Monday, and Thursday

If the raw data has more varieties for the analysis, such as gender, ages, category of the sports for the activities. We wish Bellabeat could empower woman’s potential.

-Case study 2- Bellabeat data analysis

s.matsumoto

2022-04-23