Bellabeat Case Study - How can a wellness technology company play it smart?

1. Introduction


This project is my attempt to solve the second case study from the Google Data Analytics Professional Certificate Capstone Projects on Coursera that you can be accessed here.

This project will be about understanding how people use smart devices to help with their health and well-being. This knowledge then will be implemented on products of a smart devices company called Bellabeat. This company aims to manufacture smart devices to help women with their overall health and well-being.

If you have any questions regarding this project, don’t hesitate to contact me through email here, or send me a message on LinkedIn here.

2. Background


Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products. Sršen used her background as an artist to develop beautifully designed technology that informs and inspires women around the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their health and habits. Since it was founded in 2013, Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for women.

3. Business Problem


Sršen knows that an analysis of Bellabeat’s available consumer data would reveal more growth opportunities. She has asked the marketing analytics team to focus on a Bellabeat product and analyze smart device usage data to gain insight into how people are already using their smart devices. Then, using this information, she would like high-level recommendations for how these trends can inform Bellabeat marketing strategy.

Sršen asks you to analyze smart device usage data to gain insight into how consumers use non-Bellabeat smart devices. These questions will guide your analysis:

  1. What are some trends in smart device usage?
  2. How could these trends apply to Bellabeat customers?
  3. How could these trends help influence Bellabeat marketing strategy?

4. Analysis


4.1 Data Wrangling

The data we are going to use is the data that contains information about activity intensity and sleep from the users. The activity intensity is coming from the daily level and the minute level data. Now, let’s load the data we need for this analysis.

library(tidyverse) # helps wrangle data
library(lubridate) # helps wrangle the date attributes
library(kableExtra) # helps format table on output
library(scales) # helps with formatting
library(summarytools) # helps with descriptive statistics
theme_set(theme_bw()) # set default ggplot2 theme

# Set default summarytools package options
st_options(descr.silent = TRUE, 
           descr.stats = c("min", "mean", "med","max"),
           descr.transpose = TRUE, 
           headings  = FALSE,
           style = "rmarkdown")

# Okabe-Ito color palette
okabe_ito <- c("#E69F00", "#56B4E9", "#009E73", 
               "#F0E442", "#0072B2", "#D55E00",
               "#CC79A7", "#000000")

# datawrapper color palette 
datawrapper_cp <- c("#ea594e", "#e5b039", "#ede65a", 
                    "#193cbc", "#1473af", "#589acf", 
                    "#89c3ef", "#838383")

4.1.1. Daily level data

Reading the data needed using the read_csv() function.

# Daily level activities data
df_activity_day <-
  read_csv("../data/dailyActivity_merged.csv")

# Daily level sleep data
df_sleep_day <- 
  read_csv("../data/sleepDay_merged.csv")

# Daily level weight log data
df_weight_day <-
  read_csv("../data/weightLogInfo_merged.csv")

The cleaning for these dataframe is started by changing the Id column data type to string. This is to ensure the unique property of this column.

# Changing Id data types
df_activity_day$Id <- as.character(df_activity_day$Id)
df_sleep_day$Id <- as.character(df_sleep_day$Id)
df_weight_day$Id <- as.character(df_weight_day$Id)

Next, rename the Date column from each dataframe to make them uniform and to help with merging the data later.

# Renaming Date column name
df_activity_day <- df_activity_day %>%
  rename(Date = ActivityDate)

df_sleep_day <- df_sleep_day %>%
  rename(Date = SleepDay)

After that, let’s format the data in the Date column to a proper date format. We can achieve this by using the as_date() function from lubridate package.

# Changing the Date data type to date
df_activity_day$Date <- 
  parse_date_time(df_activity_day$Date, "%m/%d/%Y") %>% 
  date()

df_sleep_day$Date <- 
  parse_date_time(df_sleep_day$Date,
                  c("%m/%d/%Y %H:%M:%S",
                    "%m/%d/%Y %I:%M:%S %p")) %>% 
  date()

df_weight_day$Date <- 
  parse_date_time(df_weight_day$Date,
                  c("%m/%d/%Y %H:%M:%S",
                    "%m/%d/%Y %I:%M:%S %p")) %>% 
  date()

Next, let’s add the day of the week column to the daily activity dataframe.

# Add the day of the week column to the activity data
df_activity_day <- df_activity_day %>%
  mutate(Day = wday(
    Date,
    label = TRUE,
    abbr = TRUE,
    week_start = 1
  ))

We will also add the TotalActiveMinute and TotalActiveHour columns to the daily activity dataframe.

# Add TotalActiveMinute and TotalActiveHour column
df_activity_day <- df_activity_day %>%
  mutate(
    TotalActiveMinutes =
      VeryActiveMinutes + FairlyActiveMinutes + LightlyActiveMinutes,
    TotalActiveHour = TotalActiveMinutes / 60
  )

Another column that we need to change its data type is the IsManualReport column from the daily weight log dataframe.

# Converting IsManualReport column to a boolean type
df_weight_day$IsManualReport <- 
  as.logical(df_weight_day$IsManualReport)

Also, we wouldn’t need the LogId column from the weight log data because this shows the unique day each data is collected. The reason to drop this column is that our data is already breaking down for each day, where each day is already unique for each user.

df_weight_day <- subset(df_weight_day, select = -LogId)

The last wrangling is adding a column that measures how many hours each user is sleeping on the daily sleep tracking data.

# Adding TotalHourAsleep column for the sleep data
df_sleep_day <- df_sleep_day %>%
  mutate(TotalHourAsleep = TotalMinutesAsleep / 60)

Finally, we can merge the dataframe above into a single dataframe.

# Merging the data
df_daily <- 
  left_join(df_activity_day, df_sleep_day, 
            by = c("Id", "Date")) %>%
  left_join(df_weight_day, by = c("Id", "Date"))

We haven’t checked for duplicates for each dataframe. Because the dataframe is not that big enough, we can check it from the merged dataframe.

# Find duplicated rows based on Id and Date columns
df_daily <- df_daily %>%
  distinct(Id, Date, .keep_all = TRUE)

We can see the result from the wrangling above.

glimpse(df_daily)
## Rows: 940
## Columns: 27
## $ Id                       <chr> "1503960366", "1503960366", "1503960366", "15…
## $ Date                     <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-04-…
## $ TotalSteps               <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes        <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes      <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes     <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes         <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories                 <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
## $ Day                      <ord> Tue, Wed, Thu, Fri, Sat, Sun, Mon, Tue, Wed, …
## $ TotalActiveMinutes       <dbl> 366, 257, 222, 272, 267, 222, 291, 345, 245, …
## $ TotalActiveHour          <dbl> 6.100000, 4.283333, 3.700000, 4.533333, 4.450…
## $ TotalSleepRecords        <dbl> 1, 2, NA, 1, 2, 1, NA, 1, 1, 1, NA, 1, 1, 1, …
## $ TotalMinutesAsleep       <dbl> 327, 384, NA, 412, 340, 700, NA, 304, 360, 32…
## $ TotalTimeInBed           <dbl> 346, 407, NA, 442, 367, 712, NA, 320, 377, 36…
## $ TotalHourAsleep          <dbl> 5.450000, 6.400000, NA, 6.866667, 5.666667, 1…
## $ WeightKg                 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ WeightPounds             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ Fat                      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ BMI                      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ IsManualReport           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

4.1.2. Minute level data

For this data, we’re going to use the sleep data per minute for each user. First, let’s import the data.

# Minute level steps data
df_steps_minute <- 
  read_csv("../data/minuteStepsNarrow_merged.csv")

# Minute level intensity data
df_intensity_minutes <- 
  read_csv("../data/minuteIntensitiesNarrow_merged.csv")

# Minute level calories data
df_calories_minute <- 
  read_csv("../data/minuteCaloriesNarrow_merged.csv")

# Minute level sleep data
df_sleep_minute <- read_csv("../data/minuteSleep_merged.csv")

The first cleaning we will perform is to format the datetime column in each dataframes to datetime format.

# df_steps_minute$ActivityMinute to datetime
df_steps_minute <- df_steps_minute %>%
  mutate(
    ActivityMinute =
      parse_date_time(
        ActivityMinute,
        c("%m/%d/%Y %H:%M:%S", "%m/%d/%Y %I:%M:%S %p")
      )
  )

# df_intensity_minutes$ActivityMinute to datetime
df_intensity_minutes <- df_intensity_minutes %>%
  mutate(
    ActivityMinute =
      parse_date_time(
        ActivityMinute,
        c("%m/%d/%Y %H:%M:%S", "%m/%d/%Y %I:%M:%S %p")
      )
  )

# df_calories_minute$ActivityMinute to datetime
df_calories_minute <- df_calories_minute %>%
  mutate(
    ActivityMinute =
      parse_date_time(
        ActivityMinute,
        c("%m/%d/%Y %H:%M:%S", "%m/%d/%Y %I:%M:%S %p")
      )
  )

# df_sleep_minute$date to datetime
df_sleep_minute <- df_sleep_minute %>%
  mutate(
    date =
      parse_date_time(
        date,
        c("%m/%d/%Y %H:%M:%S", "%m/%d/%Y %I:%M:%S %p")
      )
  )

Next, we will format the unique identifier columns like Id and logId from each dataframes to character to preserve their unique property.

# df_steps_minute$Id to character
df_steps_minute$Id <- as.character(df_steps_minute$Id)

# df_intensity_minutes$Id to character
df_intensity_minutes$Id <- as.character(df_intensity_minutes$Id)

# df_calories_minute$Id to character
df_calories_minute$Id <- as.character(df_calories_minute$Id)

# df_sleep_minute$Id to character
df_sleep_minute$Id <- as.character(df_sleep_minute$Id)

# df_sleep_minute$logId to character
df_sleep_minute$logId <- as.character(df_sleep_minute$logId)

To properly combine our data later, using only Id and date column would not be sufficient. We would need to make sure each dataframes combined properly. To do that, we will need to match the data until their minute level.

# Get date, day, wday, hour for df_steps_minute
df_steps_minute <- df_steps_minute %>%
  mutate(
    Date = date(ActivityMinute),
    Dow = wday(ActivityMinute, label = TRUE, week_start = 1),
    Hour = hour(ActivityMinute),
    Minute = minute(ActivityMinute)
  )

# Get date, day, wday, hour for df_intensity_minutes
df_intensity_minutes <- df_intensity_minutes %>%
  mutate(
    Date = date(ActivityMinute),
    Dow = wday(ActivityMinute, label = TRUE, week_start = 1),
    Hour = hour(ActivityMinute),
    Minute = minute(ActivityMinute)
  )

# Get date, day, wday, hour for df_calories_minute
df_calories_minute <- df_calories_minute %>%
  mutate(
    Date = date(ActivityMinute),
    Dow = wday(ActivityMinute, label = TRUE, week_start = 1),
    Hour = hour(ActivityMinute),
    Minute = minute(ActivityMinute)
  )

# Get date, day, wday, hour for df_sleep_minute
df_sleep_minute <- df_sleep_minute %>%
  mutate(
    Date = date(date),
    Dow = wday(date, label = TRUE, week_start = 1),
    Hour = as.character(hour(date)),
    Minute = as.character(minute(date))
  )

Now, let’s join the dataframes. We are going to combine all the activities related dataframes into a dataframe, that is the activity intensity, steps, and calories. Because the analysis will be different, we will left the sleep data alone for now.

# Columns to join
col_names_join <- c("Id", "ActivityMinute", "Date", 
                    "Dow", "Hour", "Minute")

# Joining the dataframes
df_activity_minute <- df_intensity_minutes %>%
  inner_join(df_steps_minute, by = col_names_join) %>%
  inner_join(df_calories_minute, by = col_names_join) %>%
  select("Id", "ActivityMinute", "Intensity", 
         "Steps", "Calories", everything())

From this combined dataframes, we can create the data for the hour level data by aggregating them into the hour level.

# The time of day list, morning, afternoon, evening, and night
vector_time <- c(6, 12, 18, 21)

# Weekend list
weekend_list <- c("Sat", "Sun")

df_activity_hour <- df_activity_minute %>% 
  group_by(Id, Date, Dow, Hour) %>%
  summarise(
    StepTotal = sum(Steps),
    TotalIntensity = sum(Intensity > 0),
    CaloriesTotal = round(sum(Calories))
  ) %>%
  ungroup() %>%
  
  # Create date and time relate columns
  # The time of day column
  mutate(TimesDay = case_when(
    # 06:00 - 12:00
    between(Hour, vector_time[1], vector_time[2]) ~ "Morning",

    # 12:00 - 18:00
    between(Hour, vector_time[2], vector_time[3]) ~ "Afternoon",

    # 18:00 - 21:00
    between(Hour, vector_time[3], vector_time[4]) ~ "Evening",

    # 21:00 - 06:00
    (Hour >= vector_time[4] | Hour < vector_time[1]) ~ "Night"),
    
    # Week type column
    Weektype = if_else(Dow %in% weekend_list, "Weekend", "Weekday")
  ) %>% 
  select(Id, Date, Dow, Hour, TimesDay, Weektype, everything())

We can also create the higher level of detail from our combined data into the daily level data.

# Creating daily level activity data from minute level data
df_activity_daily <- df_activity_minute %>% 
  group_by(Id, Date, Dow) %>%
  summarise(
    StepTotal = sum(Steps),
    TotalIntensity = sum(Intensity > 0),
    CaloriesTotal = round(sum(Calories))
  ) %>%
  ungroup() %>%
  
  # Create date and time relate columns
  # Week type column
  mutate(Weektype = if_else(Dow %in% weekend_list, 
                            "Weekend", "Weekday")) %>% 
  select(Id, Date, Dow, Weektype, everything())

4.2. Data Analysis

On this phase, we will perform analysis to answer the business questions for this case study. We’re going to focus on analyzing the sleep and activity level from the data to keep the result aligned with what Bellabeat’s products measure.

4.2.1. Data overview

Let’s see how many users we have in our data.

# How many users are in the data?
user_count = length(unique(df_daily$Id))
print(paste("There are", user_count, 
            "number of unique users in the data."))
## [1] "There are 33 number of unique users in the data."

Also, let’s see what is the interval of time our data is collected.

# Date interval from our data
print(paste(min(df_daily$Date), 'to', max(df_daily$Date)))
## [1] "2016-04-12 to 2016-05-12"

From our daily level data, there are three variables there, that is activity, sleep, and weight log. Let’s see how our 33 users tracked these variables using their devices.

# Activity
track_activity <- df_daily %>%
  filter(!is.na(df_daily$TotalSteps)) %>%
  select(Id) %>%
  unique() %>%
  nrow()

# Sleep
track_sleep <- df_daily %>%
  filter(!is.na(df_daily$TotalMinutesAsleep)) %>%
  select(Id) %>%
  unique() %>%
  nrow()

# Weight
track_weight <- df_daily %>%
  filter(!is.na(df_daily$WeightKg)) %>%
  select(Id) %>%
  unique() %>%
  nrow()

print(paste(track_activity, "of", user_count, 
            "users tracked their activity level at least once."))
## [1] "33 of 33 users tracked their activity level at least once."
print(paste(track_sleep, "of", user_count, 
            "users tracked their sleep at least once."))
## [1] "24 of 33 users tracked their sleep at least once."
print(paste(track_weight, "of", user_count, 
            "users tracked their weight level at least once."))
## [1] "8 of 33 users tracked their weight level at least once."

We can see what is the favorable variable the users in our data is more interested to know about, that is their activity level throughout the day.

We would also like to see some summary statistics from our data. Let’s see how our daily tracking data statistics summary looks like.

cols_vec_day <- c("TotalSteps", "LightlyActiveMinutes", 
                  "SedentaryMinutes", "VeryActiveMinutes",
                  "FairlyActiveMinutes", "TotalActiveMinutes",
                  "TotalActiveHour", "TotalMinutesAsleep",
                  "TotalHourAsleep", "Calories")
descr(df_daily[cols_vec_day])
  Min Mean Median Max
Calories 0.00 2303.61 2134.00 4900.00
FairlyActiveMinutes 0.00 13.56 6.00 143.00
LightlyActiveMinutes 0.00 192.81 199.00 518.00
SedentaryMinutes 0.00 991.21 1057.50 1440.00
TotalActiveHour 0.00 3.79 4.12 9.20
TotalActiveMinutes 0.00 227.54 247.00 552.00
TotalHourAsleep 0.97 6.99 7.21 13.27
TotalMinutesAsleep 58.00 419.17 432.50 796.00
TotalSteps 0.00 7637.91 7405.50 36019.00
VeryActiveMinutes 0.00 21.16 4.00 210.00

From the table above, we can conclude several things:

Before doing any transformation, let’s investigate how is our data distributed along the week. I would like to see if there are days where users use their devices more than other days.

df_daily %>%
  count(Day) %>%
  transmute(Day,
            count = n,
            percentage = percent(
              round((count / sum(count)), 2)),
            is_high = ifelse(percentage >= 15, 'High', 'Low')
            ) %>% 
  kbl()
Day count percentage is_high
Mon 120 13% Low
Tue 152 16% High
Wed 150 16% High
Thu 147 16% High
Fri 126 13% Low
Sat 124 13% Low
Sun 121 13% Low

Let’s visualize this result to understand the result easier.

df_daily %>%
  count(Day) %>%
  mutate(perc = n / sum(n),
         is_high = ifelse(perc > 0.15, 'High', 'Low')) %>%
  
  # Create a column chart
  ggplot(aes(x = Day, y = perc, fill = is_high)) +
  geom_col(width = 0.7) +
  
  # Set the title, subtitle, and axis labels
  labs(
    title = 'Data tracked across the week',
    subtitle = 'How are the data distributed along the week? Are they using their device every day?',
    x = element_blank(),
    y = "Percentage"
  ) +
  
  # Set the display of y axis label
  scale_y_continuous(labels = scales::percent,
                     expand = c(0, 0),
                     limits = c(0, 0.18)) +
  
  # Set color for each conditions
  scale_fill_manual(values = c(datawrapper_cp[1], 
                               datawrapper_cp[8])) +
  
  # Format legend
  theme(legend.position = "None")

We can see that the data are not distributed equally along the week where almost half our data alone are generated on Tuesday, Wednesday, and Thursday. This means, these days are the days where people actively use their devices.

4.2.2. Activity level analysis

Before moving into the analysis, let’s first look at is the users in our data are meeting the recommended physical activity level, which is 150 minutes of moderate activity or 75 minutes of vigorous-intensity per week.

active_users <- df_daily %>% 
  mutate(Week = week(Date)) %>% 
  group_by(Id, Week) %>% 
  summarise(total_moderate = sum(FairlyActiveMinutes),
            total_very_active = sum(VeryActiveMinutes)) %>% 
  filter(total_moderate >= 150 | total_very_active >= 75) %>% 
  pull(Id) %>% unique() %>% length()

print(paste(active_users, "users achieved the recommended activity level per week."))
## [1] "23 users achieved the recommended activity level per week."

Now, let’s look at how the activity duration per day from each users are distributed.

# The distribution of the total activity intensity in minutes
df_daily %>%
  filter(TotalActiveMinutes > 0) %>% 
  ggplot(aes(x = TotalActiveHour)) +
  geom_histogram(binwidth = 0.5, fill = datawrapper_cp[1], 
                 color = "white") +
  
  # Set the title, subtitle, and axis labels
  labs(
    title = "The activity intensity distribution per day",
    subtitle = "The activity intensity duration is combined duration from light, moderate, and very active\nactivity level on each day in minutes",
    x = "Total active hour",
    y = "Count"
  ) +
  
  # Format x-axis
  scale_x_continuous(breaks = seq(0, 9, by = 1)) +
  
  # Format y-axis
  scale_y_continuous(expand = c(0, 0),
                     limits = c(0, 120),
                     breaks = seq(0, 120, by = 30))

We can see that the total activity intensity in minutes per day is normally distributed with average of 227.5 hours per day.

Now, let’s breakdown the activity level of the users to the day-to-day level.

# Activity vector
activity_vector <- 
  c("VeryActiveMinutes",
    "FairlyActiveMinutes",
    "LightlyActiveMinutes")

df_daily %>% 
  select(Day, activity_vector) %>% 
  pivot_longer(cols = activity_vector, names_to = "Activity",
               values_to = "ActivityMinutes") %>%
  transmute(Day,
            activity_level = case_when(
              Activity == "VeryActiveMinutes" ~ "Very Active Activity",
              Activity == "FairlyActiveMinutes" ~ "Moderate Activity",
              Activity == "LightlyActiveMinutes" ~ "Light Activity"),
            ActivityMinutes
            ) %>% 
  group_by(Day, activity_level) %>% 
  summarise(average_activity_minutes = mean(ActivityMinutes)) %>%
  
  # Create a column chart
  ggplot(aes(x = Day, y = average_activity_minutes, 
             fill = factor(activity_level, 
                           levels = c("Very Active Activity",
                                     "Moderate Activity",
                                     "Light Activity")
                           ))
         ) +
  geom_col(width = 0.7) +
  
  # Format y-axis
  scale_y_continuous(expand = c(0, 0),
                     limits = c(0, 260)) +
  
  # Set the title, subtitle, and axis labels
  labs(
    title = "User\'s activity intensity duration throughout the week",
    subtitle = "Showing the activity intensity per minute for each day on average",
    x = element_blank(),
    y = "Minutes"
  ) +
  
  # Format legend
  theme(
    legend.position = "top",
    legend.justification = "left",
    legend.title = element_blank()
  ) +
  
  # Custom color fill
  scale_fill_manual(values = 
                      c(datawrapper_cp[1],
                        datawrapper_cp[2], 
                        datawrapper_cp[6]))

We can see that throughout the week, the activity intensity of our users in the data are pretty much the same with slightly higher activity intensity on Saturday. We can also see that the level of activity for moderate and very active level of activity are constant along the week and the changes on the result are mostly caused by the light activity value.

Next, let’s breakdown this result to the hourly level to look at how the activity intensity are in a day.

# Average activity intensity per hour
average_activity_intensity <- 
  round(mean(df_activity_hour$TotalIntensity))

# Activity level per hour viz
df_activity_hour %>% 
  group_by(Hour) %>% 
  summarise(AverageIntensityMinutes = 
              round(mean(TotalIntensity))) %>%
  
  # Conditional label for our data
  mutate(label = if_else(AverageIntensityMinutes >= 
                          average_activity_intensity,
                        "Above average", "Below average")) %>% 
  
  # Create a column chart
  ggplot(aes(x = reorder(Hour, Hour), y = AverageIntensityMinutes, 
             fill = label)) +
  geom_col(width = 0.7) +

  # Format y-axis
  scale_y_continuous(
    expand = c(0, 0),
    limits = c(0, 18)
  ) +

  # Set the title, subtitle, and axis labels
  labs(
    title = "User\'s activity per hour of day",
    subtitle = "The average activity intensity per hour in minutes",
    x = "Hour of day",
    y = "Minutes"
  ) +
  
  # Add a reference line
  geom_hline(
    yintercept = average_activity_intensity,
    color = "dimgrey",
    linetype = "dashed",
    size = 0.5
  ) +

  # Format the reference line label
  geom_text(
    aes(
      0,
      average_activity_intensity,
      label = paste(
        "Average:",
        average_activity_intensity,
        "minutes"
      ),
      vjust = -0.5
    ),
    nudge_x = 3.5,
    color = "dimgrey",
    size = 4
  ) +
  
  # Set color for each conditions
  scale_fill_manual(values = c(datawrapper_cp[1], datawrapper_cp[8])) +
  
  # Format legend
  theme(
    legend.position = "top",
    legend.justification = "left",
    legend.title = element_blank()
  )

We can see that high activity intensity in a day are started from 8 a.m. until 9 p.m. in the night where users in the data are performing an activity for at least 10 minutes per hour.

Next, let’s breakdown this result in the term of time of day each activity take place.

# Creating order of time_of_day
df_activity_hour$TimesDay <-
  ordered(df_activity_hour$TimesDay,
          levels = c("Morning",
                     "Afternoon",
                     "Evening",
                     "Night"))

# Activity duration per hour in minutes
df_activity_hour %>%
  group_by(TimesDay) %>%
  summarise(average_intensity_minutes = 
              round((mean(TotalIntensity)), 2)
  ) %>% 
  kbl()
TimesDay average_intensity_minutes
Morning 11.79
Afternoon 15.33
Evening 12.85
Night 2.44

Let’s visualize this result.

# Activity level per time of day viz
df_activity_hour %>%
  group_by(TimesDay) %>%
  summarise(average_intensity_minutes = 
              round((mean(TotalIntensity)), 2)
            ) %>%
  
  # Create a column chart
  ggplot(aes(x = TimesDay, y = average_intensity_minutes)) +
  geom_col(width = 0.7, fill = datawrapper_cp[1]) +

  # Format y-axis
  scale_y_continuous(
    expand = c(0, 0),
    limits = c(0, 17)
  ) +

  # Set the title, subtitle, and axis labels
  labs(
    title = "User\'s activity intensity per time of day",
    subtitle = "Highest activity intensity occurred on the afternoon, or from 12 to 6 p.m.",
    x = "Hour of day",
    y = "Minutes"
  )

From the result above, we can see that high activity intensity occurred in the afternoon or from 12 to 6 p.m.

Let’s also check how the data is looks like for the activity intensity each day on weektype level, or for the weekday and the weekend.

df_activity_hour %>%
  group_by(Hour, Weektype) %>%
  summarise(average_intensity_minutes = 
              round((mean(TotalIntensity)), 2)
  ) %>% 
  mutate(HighLow = if_else(
    average_intensity_minutes > average_activity_intensity, 
    "High", "Low")) %>%
  
  # Filter for hour above 3 a.m.
  filter(Hour >= 5) %>% 
  
  # Create a column chart
  ggplot(aes(x = reorder(Hour, Hour), 
             y = average_intensity_minutes, fill = HighLow)) +
  geom_col(width = 0.7) +
  facet_wrap(~ Weektype) +
  
  # Format y-axis
  scale_y_continuous(
    expand = c(0, 0),
    limits = c(0, 20)
  ) +

  # Set the title, subtitle, and axis labels
  labs(
    title = "User\'s activity intensity on weekday vs. weekend",
    subtitle = "The users are starting their day earlier on the weekday but more active on the weekend",
    x = "Hour of day",
    y = "Minutes",
    fill = "Activity intensity"
  ) +
  
  # Set color for each conditions
  scale_fill_manual(values = c(datawrapper_cp[1], datawrapper_cp[8])) +
  
  # Format legend
  theme(
    legend.position = "top",
    legend.justification = "left"
  )

The high intensity interval on the weekday have higher interval range, that is from 7 a.m. to 09 p.m. On the other hand, the range of the high intensity interval on weekend is from 9 a.m. to 9 p.m.

df_activity_hour %>% 
  group_by(Weektype) %>% 
  summarise(activity_intensity_minutes = 
              round(mean(TotalIntensity), 1)) %>% 
  kbl()
Weektype activity_intensity_minutes
Weekday 9.7
Weekend 9.6

The average activity intensity on the weekday is 9.7 minutes per hour, meanwhile it’s 9.5 minutes per hour on the weekend. That means, even thought they are starting their day earlier on the weekday, they are more active on the the weekend.

4.2.3. Sleep analysis

First, let’s see how sleep are distributed within our daily data.

df_daily %>%
  
  # Filter out NA values from the column
  filter(!is.na(df_daily$TotalHourAsleep)) %>%
  
  # Create a histogram
  ggplot(aes(x = TotalHourAsleep, color = I("white"))) +
  geom_histogram(fill = datawrapper_cp[1], binwidth = 0.5) +
  
  # Set the title, subtitle, and axis labels
  labs(
    title = "Distribution of hour asleep among users",
    subtitle = "How is hour asleep each day distrbuted?",
    x = "Hour asleep",
    y = "Count"
  ) +
  
  # Format x-axis
  scale_x_continuous(limits = c(0, 14), 
                     breaks = seq(0, 14, by = 1)) +
  
  # Format y-axis
  scale_y_continuous(expand = c(0, 0), limits = c(0, 65)) +
  
  # Remove legend
  theme(legend.position = "none")

We can see that the the sleep are concentrated between 5.5 - 9, that means the users on our data are sleeping mostly for about 5.5 - 9 hours each day. With the average of NA hours asleep, the users are barely hitting the recommended amount of sleep each per night.

Let’s break down this data to look at what is the average hours they sleeps each day.

df_daily %>% 
  group_by(Day) %>%
  summarise(avg_sleep_hour = 
              round((mean(TotalHourAsleep, na.rm = TRUE)), 1)
            ) %>% 
  kbl()
Day avg_sleep_hour
Mon 7.0
Tue 6.7
Wed 7.2
Thu 6.7
Fri 6.8
Sat 7.0
Sun 7.5

We can see that the result are pretty consistent along the week, which is between 6.7 - 7.5 hours of sleep each day along the week.

Next, let’s analyze how the users in our data are taking nap. We will only include sleep above 6 a.m. and under 6 p.m. to consider it as napping.

df_nap <- df_sleep_minute %>%
  group_by(Id, logId) %>%
  summarise(
    sleep_start = min(as_date(date)),
    sleep_end = max(as_date(date)),
    min_date = min(date),
    max_date = max(date)
  ) %>%
  filter(sleep_start == sleep_end) %>%
  mutate(
    nap_minutes =
      as.numeric(difftime(max_date, min_date, units = "mins")),
    nap_hour = round((nap_minutes / 60), digits = 2),
    Dow = wday(
      min_date,
      label = TRUE,
      abbr = TRUE,
      week_start = 1
    ),
    Hour = hour(min_date)
  ) %>%
  filter(Hour >= 6 & Hour < 18) %>%
  ungroup() 

First, let’s analyze how is the value distributed.

df_nap %>%
  
  # Create a histogram
  ggplot(aes(x = nap_hour, color = I("white"))) +
  geom_histogram(fill = datawrapper_cp[1], binwidth = 0.5) +
  
  # Set the title, subtitle, and axis labels
  labs(
    title = "Distribution of time spent to nap",
    subtitle = "Still including the outliers",
    x = "Hour asleep",
    y = "Count"
  ) +
  
  # Format x-axis and y-axis tics
  scale_x_continuous(limits = c(0, NA), 
                     breaks = seq(0, 14, by = 1)) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 14),
                     breaks = seq(0, 12, by = 4)) +
  
  # Remove legend
  theme(legend.position = "none")

Let’s remove the outliers on the result. The outliers could be caused by the users that took sleep above 6 a.m. in the morning.

df_nap %>%
  filter(nap_hour < 4 & Hour < 18) %>% 
  
  # Create a histogram
  ggplot(aes(x = nap_hour, color = I("white"))) +
  geom_histogram(fill = datawrapper_cp[1], binwidth = 0.5) +
  
  # Set the title, subtitle, and axis labels
  labs(
    title = "Distribution of time spent to nap in hours",
    subtitle = "Removing the outliers",
    x = "Hour nap",
    y = "Count"
  ) +
  
  # Format x-axis and y-axis tics
  scale_x_continuous(breaks = seq(0, 4, by = 0.5)) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 14),
                     breaks = seq(0, 12, by = 4)) +
  
  # Remove legend
  theme(legend.position = "none")

We can see that most of the nap were taken for about 1 to 2 hours.

5. Result

With the analysis performed above, let’s try to answer the questions listed from the business problem part.

  1. What are some trends in smart device usage?

  2. How could these trends apply to Bellabeat customers?

  3. How could these trends help influence Bellabeat marketing strategy?