Case Study

This case study is part of the Google Data Analytics Capstone Project and is my very first data analytics project. The program has been an intensive learning process, and this project serves as a culmination of everything I have learnt throughout it.

Table of Contents

  1. Company Overview
    1.1. Products
    1.2. Current Marketing Efforts
  2. Business Task
  3. Data Overview
    3.1. Data Source
    3.2. License and Accessibility
    3.3. Data Organization
    3.4. Data Credibility
    3.5. Data Integrity Verification
  4. Data Cleaning
    4.1. Loading Packages
    4.2. Importing Datasets
    4.3. Renaming Columns
    4.4. Verifying Distinct Values
    4.5. Removing Duplicates and Nulls
    4.6. Changing the Data Format
  5. Analyzing Trends
    5.1. Users’ habit in using the smart device
    5.2. Metrics to determine users’ activity levels
    5.3. Users’ activity level
    5.4. Daily steps in a day
    5.5. Daily steps in days of the week
  6. Recommendations

1. Company Overview

Bellabeat is a high-tech manufacturer of health-focused smart products specialized for women. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market.

1.1. Products

Bellabeat has several wellness-tracker products.

  • Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.

  • Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.

  • Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.

  • Ivy: A wellness tracker designed as an elegant bracelet that monitors users’ heart rate, respiratory rate, and physical and mental activity. The data is combined and displayed as a wellness score and a readiness score to help users keep track of their daily habits.

  • Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.

1.2. Current Marketing Efforts

By 2016, Bellabeat had opened offices around the world and launched multiple products. Bellabeat products became available through a growing number of online retailers in addition to their own e-commerce channel on their website. The company has been making several marketing efforts, listed as follows.

  • Investing in traditional advertising media, such as radio, out-of-home billboards, print, and television
  • Investing year-round in Google Search
  • Maintaining active Facebook and Instagram pages
  • Engaging consumers consistently on Twitter
  • Running video ads on YouTube
  • Displaying ads on the Google Display Network

2. Business Task

The cofounder and Chief Creative Officer of Bellabeat believes that analyzing smart device usage data from non-Bellabeat products will provide insights applicable to one of Bellabeat’s products.

This report will attempt to answer the following business questions. The data analysis will focus on these tasks:

  1. Identifying users’ habits in tracking daily activities and recording sleep.
  2. Determining metrics to identify users’ activity levels: intensity value vs. total steps.
  3. Identifying trends in users’ activity levels.
  4. Identifying trends in hourly activity across different user types.
  5. Identifying trends in daily activity on days of the week across different user types.

3. Data Overview

3.1. Data Source

The dataset used in this case study is the FitBit Fitness Tracker Data, made available by Möbius on Kaggle. The dataset was generated by respondents to a distributed survey via Amazon Mechanical Turk between 2016-12-03 and 2016-12-05. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Variation between outputs reflects the use of different types of Fitbit trackers and individual tracking behaviors and preferences.

3.2. License and Accessibility

This dataset is listed as CC0: Public Domain. It is available for use publicly without requiring any permission.

3.3. Data Organization

The dataset contains eighteen tables. Depending on the table, the data is stored in either long or wide format. The dataset provides daily, hourly, and minute-level data for calories, intensities, and sleep.
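
To illustrate the difference, here is a minimal sketch (using made-up columns, not the actual files) of how a wide table with one column per minute could be reshaped into a long table with one row per minute using tidyr’s pivot_longer():

library(tidyr)

# Hypothetical wide layout: one column per minute of the hour (illustrative only)
wide_example <- data.frame(
  Id           = c(1503960366, 1503960366),
  ActivityHour = c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM"),
  Steps00      = c(10, 0),
  Steps01      = c(25, 4)
)

# Reshape to long format: one row per (Id, ActivityHour, minute) combination
long_example <- pivot_longer(
  wide_example,
  cols      = starts_with("Steps"),
  names_to  = "minute",
  values_to = "steps"
)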

3.4. Data Credibility

Data credibility determines whether a data source is good or bad. A good data source must be reliable, original, comprehensive, current, and cited. Metadata (data about the data) usually provides this level of information. Credibility issues typically require collecting additional data, completing missing variables, and gathering data from different sources. The following table lists some issues with this dataset and how an analyst could address them.

| Credibility Issue | Solution |
| --- | --- |
| Incomplete demographic data such as gender, weight, and age, which are important variables determining activity behavior | Complete the missing demographic data from the sample users |
| Sample bias: there is no evidence that the data came only from female users | Confirm the gender of the sample users and remove samples that do not match the target consumer population |
| Small sample size of 33 users over 31 days | Collect additional data from more users over a longer period |
| The data was collected by another party and then made public | Second-party data should be verified and confirmed with the data vendor to ensure integrity and completeness |
| The data is not current | Collect and use the most recent data available |

Based on the issues stated above, the data credibility is assessed as poor. However, obtaining data with good credibility takes effort and time, and it is the analyst’s judgment call whether the existing data is sufficiently credible to answer the business task. This limitation should be kept in mind when drawing conclusions from this analysis exercise.

3.5. Data Integrity Verification

Data integrity deals with the accuracy, completeness, consistency, and trustworthiness of data. It encompasses data quality, data security, and the ability of the data to answer the required business questions. To address data integrity, an analyst has an arsenal of tools to verify that the data is complete and free of errors such as null values or multiple data types in a column. Below is a comparison table describing the datasets used:

| Table Name | Description | Total Observations | Variables | Integrity Status |
| --- | --- | --- | --- | --- |
| dailyActivity_merged | Daily activity of 33 users over 31 days | 940 | Number of steps, distance, active minutes, calories | No duplicates, no nulls, barely sufficient data, aligned with objectives |
| sleepDay_merged | Daily sleep records of 24 users over 31 days | 410 | Sleep count, total minutes asleep, total minutes in bed | Duplicates present, no nulls, insufficient data, aligned with objectives |
| hourlySteps_merged | Hourly step counts of 33 users over 31 days | 22,099 | Total steps per hour per day | No duplicates, no nulls, barely sufficient data, aligned with objectives |

The data integrity of this dataset appears good: most of the tables have no duplicates, and duplicate rows can be removed easily in R. The datasets are also compatible with one another, allowing merging and comparison across the tables. This enables a more comprehensive analysis relating users’ activity, intensities, and calories burned.
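
As a quick cross-check, here is a minimal sketch, assuming the three CSV files are in the working directory, of how the observation counts, duplicates, and nulls in the table above could be reproduced (the same checks are performed step by step in Section 4):

library(tidyverse)

# Summarize row count, distinct users, duplicates, and nulls for each table
check_table <- function(path) {
  df <- read_csv(path, show_col_types = FALSE)
  tibble(
    table        = path,
    observations = nrow(df),
    users        = n_distinct(df$Id),
    duplicates   = sum(duplicated(df)),
    nulls        = sum(is.na(df))
  )
}

map_dfr(
  c("dailyActivity_merged.csv", "sleepDay_merged.csv", "hourlySteps_merged.csv"),
  check_table
)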

4. Data Cleaning

From this section onward, the data is processed using R in RStudio. R is an exceptional tool for data exploration, cleaning, analysis, and visualization because its packages provide a wide variety of functions for these tasks. It can handle large amounts of data, record every data manipulation step in a script, and conveniently compile the results into a document (like this one).

4.1. Loading Packages

The following packages contain useful functions to produce insightful trends out of the data.

  • tidyverse: an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. It loads multiple packages at once, with ggplot2, dplyr, and tibble as some of the most popular ones.
  • lubridate: makes it easier to work with dates and times.
  • skimr: provides summary statistics about variables in data frames, tibbles, data tables and vectors.
  • janitor: has simple functions for examining and cleaning dirty data.
  • ggalluvial: a ggplot2 extension for producing alluvial plots.
  • ggpubr: provides some easy-to-use functions for creating and customizing ggplot2-based plots.
install.packages("ggpubr", repos = "http://cran.us.r-project.org")
## 
## The downloaded binary packages are in
##  /var/folders/83/h8ylmntn5xq3lf6j12yhgy440000gn/T//RtmpLKvYu5/downloaded_packages
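
The remaining packages were presumably already installed locally; a minimal sketch for installing any listed package that is still missing before loading:

# Install any required package that is not yet available locally
required_pkgs <- c("tidyverse", "lubridate", "skimr", "janitor", "ggalluvial", "ggpubr")
missing_pkgs  <- setdiff(required_pkgs, rownames(installed.packages()))
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs, repos = "http://cran.us.r-project.org")
}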
library(tidyverse)
library(lubridate)
library(skimr)
library(janitor)
library(ggalluvial)
library(ggpubr)

4.2. Importing Datasets

Datasets are imported and assigned to their respective variable names for easier manipulation. The datasets are:

  • dailyActivity_merged.csv
  • sleepDay_merged.csv
  • hourlySteps_merged.csv

daily_activity_df <- read_csv("dailyActivity_merged.csv", show_col_types = FALSE)
sleep_day_df <- read_csv("sleepDay_merged.csv", show_col_types = FALSE)
hourly_steps_df <- read_csv("hourlySteps_merged.csv", show_col_types = FALSE)
head(daily_activity_df)
## # A tibble: 6 × 15
##       Id Activ…¹ Total…² Total…³ Track…⁴ Logge…⁵ VeryA…⁶ Moder…⁷ Light…⁸ Seden…⁹
##    <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 1.50e9 4/12/2…   13162    8.5     8.5        0    1.88   0.550    6.06       0
## 2 1.50e9 4/13/2…   10735    6.97    6.97       0    1.57   0.690    4.71       0
## 3 1.50e9 4/14/2…   10460    6.74    6.74       0    2.44   0.400    3.91       0
## 4 1.50e9 4/15/2…    9762    6.28    6.28       0    2.14   1.26     2.83       0
## 5 1.50e9 4/16/2…   12669    8.16    8.16       0    2.71   0.410    5.04       0
## 6 1.50e9 4/17/2…    9705    6.48    6.48       0    3.19   0.780    2.51       0
## # … with 5 more variables: VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## #   LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>, and
## #   abbreviated variable names ¹​ActivityDate, ²​TotalSteps, ³​TotalDistance,
## #   ⁴​TrackerDistance, ⁵​LoggedActivitiesDistance, ⁶​VeryActiveDistance,
## #   ⁷​ModeratelyActiveDistance, ⁸​LightActiveDistance, ⁹​SedentaryActiveDistance
head(sleep_day_df)
## # A tibble: 6 × 5
##           Id SleepDay              TotalSleepRecords TotalMinutesAsleep TotalT…¹
##        <dbl> <chr>                             <dbl>              <dbl>    <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327      346
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384      407
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412      442
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340      367
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700      712
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304      320
## # … with abbreviated variable name ¹​TotalTimeInBed
head(hourly_steps_df)
## # A tibble: 6 × 3
##           Id ActivityHour          StepTotal
##        <dbl> <chr>                     <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM       373
## 2 1503960366 4/12/2016 1:00:00 AM        160
## 3 1503960366 4/12/2016 2:00:00 AM        151
## 4 1503960366 4/12/2016 3:00:00 AM          0
## 5 1503960366 4/12/2016 4:00:00 AM          0
## 6 1503960366 4/12/2016 5:00:00 AM          0

4.3. Renaming Columns

According to generic naming conventions in R, it is good practice to use only lowercase letters, numbers, and underscores in column names. The clean_names() function does this particular job efficiently. Compare the column names before and after clean_names() is executed.

Before clean_names() is applied.

colnames(daily_activity_df)
##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"
colnames(sleep_day_df)
## [1] "Id"                 "SleepDay"           "TotalSleepRecords" 
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
colnames(hourly_steps_df)
## [1] "Id"           "ActivityHour" "StepTotal"

After clean_names() is applied

daily_activity_df <- clean_names(daily_activity_df)
sleep_day_df <- clean_names(sleep_day_df)
hourly_steps_df <- clean_names(hourly_steps_df)

colnames(daily_activity_df)
##  [1] "id"                         "activity_date"             
##  [3] "total_steps"                "total_distance"            
##  [5] "tracker_distance"           "logged_activities_distance"
##  [7] "very_active_distance"       "moderately_active_distance"
##  [9] "light_active_distance"      "sedentary_active_distance" 
## [11] "very_active_minutes"        "fairly_active_minutes"     
## [13] "lightly_active_minutes"     "sedentary_minutes"         
## [15] "calories"
colnames(sleep_day_df)
## [1] "id"                   "sleep_day"            "total_sleep_records" 
## [4] "total_minutes_asleep" "total_time_in_bed"
colnames(hourly_steps_df)
## [1] "id"            "activity_hour" "step_total"

4.4. Verifying Distinct Values

As noted in the data overview, daily_activity_df has 33 distinct users and 31 days of activity data, sleep_day_df has 24 users and 31 days, and hourly_steps_df has 33 users and 736 distinct hours of data.

n_distinct(daily_activity_df$id)
## [1] 33
n_distinct(daily_activity_df$activity_date)
## [1] 31
n_distinct(sleep_day_df$id)
## [1] 24
n_distinct(sleep_day_df$sleep_day)
## [1] 31
n_distinct(hourly_steps_df$id)
## [1] 33
n_distinct(hourly_steps_df$activity_hour)
## [1] 736

4.5. Removing Duplicates and Nulls

The first step in any data analysis workflow is to check for duplicates and nulls in the data. The sum(duplicated()) call counts duplicate rows, while colSums(is.na()) counts the nulls in each column of the tables.

sum(duplicated(daily_activity_df))
## [1] 0
sum(duplicated(sleep_day_df))
## [1] 3
sum(duplicated(hourly_steps_df))
## [1] 0
colSums(is.na(daily_activity_df))
##                         id              activity_date 
##                          0                          0 
##                total_steps             total_distance 
##                          0                          0 
##           tracker_distance logged_activities_distance 
##                          0                          0 
##       very_active_distance moderately_active_distance 
##                          0                          0 
##      light_active_distance  sedentary_active_distance 
##                          0                          0 
##        very_active_minutes      fairly_active_minutes 
##                          0                          0 
##     lightly_active_minutes          sedentary_minutes 
##                          0                          0 
##                   calories 
##                          0
colSums(is.na(sleep_day_df))
##                   id            sleep_day  total_sleep_records 
##                    0                    0                    0 
## total_minutes_asleep    total_time_in_bed 
##                    0                    0
colSums(is.na(hourly_steps_df))
##            id activity_hour    step_total 
##             0             0             0

As shown above, only the sleep_day_df table contains duplicate entries, and none of the three tables contains null values. Duplicates can be removed easily using the distinct() function. We can then verify that the duplicates have been removed.

sleep_day_df <- sleep_day_df %>% distinct()
sum(duplicated(sleep_day_df))
## [1] 0

4.6. Changing the Data Format

Notice that in daily_activity_df, id is stored in a number format <dbl> and activity_date in a character format <chr>. Analysis is easier down the road if we convert these to more appropriate formats. Since id acts as an identifier, it can be converted to <chr>, while the activity_date column can be converted to a date format to allow date manipulation and calculation. We will also rename the activity_date column to date for conciseness.

head(daily_activity_df)
## # A tibble: 6 × 15
##       id activ…¹ total…² total…³ track…⁴ logge…⁵ very_…⁶ moder…⁷ light…⁸ seden…⁹
##    <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 1.50e9 4/12/2…   13162    8.5     8.5        0    1.88   0.550    6.06       0
## 2 1.50e9 4/13/2…   10735    6.97    6.97       0    1.57   0.690    4.71       0
## 3 1.50e9 4/14/2…   10460    6.74    6.74       0    2.44   0.400    3.91       0
## 4 1.50e9 4/15/2…    9762    6.28    6.28       0    2.14   1.26     2.83       0
## 5 1.50e9 4/16/2…   12669    8.16    8.16       0    2.71   0.410    5.04       0
## 6 1.50e9 4/17/2…    9705    6.48    6.48       0    3.19   0.780    2.51       0
## # … with 5 more variables: very_active_minutes <dbl>,
## #   fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
## #   sedentary_minutes <dbl>, calories <dbl>, and abbreviated variable names
## #   ¹​activity_date, ²​total_steps, ³​total_distance, ⁴​tracker_distance,
## #   ⁵​logged_activities_distance, ⁶​very_active_distance,
## #   ⁷​moderately_active_distance, ⁸​light_active_distance,
## #   ⁹​sedentary_active_distance
daily_activity_df <- daily_activity_df %>% 
  mutate(id = as.character(id)) %>% 
  rename(date = activity_date) %>% 
  mutate(date = as_date(date, format="%m/%d/%Y"))

head(daily_activity_df)
## # A tibble: 6 × 15
##   id         date       total_…¹ total…² track…³ logge…⁴ very_…⁵ moder…⁶ light…⁷
##   <chr>      <date>        <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 1503960366 2016-04-12    13162    8.5     8.5        0    1.88   0.550    6.06
## 2 1503960366 2016-04-13    10735    6.97    6.97       0    1.57   0.690    4.71
## 3 1503960366 2016-04-14    10460    6.74    6.74       0    2.44   0.400    3.91
## 4 1503960366 2016-04-15     9762    6.28    6.28       0    2.14   1.26     2.83
## 5 1503960366 2016-04-16    12669    8.16    8.16       0    2.71   0.410    5.04
## 6 1503960366 2016-04-17     9705    6.48    6.48       0    3.19   0.780    2.51
## # … with 6 more variables: sedentary_active_distance <dbl>,
## #   very_active_minutes <dbl>, fairly_active_minutes <dbl>,
## #   lightly_active_minutes <dbl>, sedentary_minutes <dbl>, calories <dbl>, and
## #   abbreviated variable names ¹​total_steps, ²​total_distance,
## #   ³​tracker_distance, ⁴​logged_activities_distance, ⁵​very_active_distance,
## #   ⁶​moderately_active_distance, ⁷​light_active_distance

We apply the same transformations to the other two data frames.

sleep_day_df <- sleep_day_df %>% 
  mutate(id = as.character(id)) %>% 
  rename(date = sleep_day) %>% 
  mutate(date = as_date(date, format="%m/%d/%Y"))


head(sleep_day_df)
## # A tibble: 6 × 5
##   id         date       total_sleep_records total_minutes_asleep total_time_in…¹
##   <chr>      <date>                   <dbl>                <dbl>           <dbl>
## 1 1503960366 2016-04-12                   1                  327             346
## 2 1503960366 2016-04-13                   2                  384             407
## 3 1503960366 2016-04-15                   1                  412             442
## 4 1503960366 2016-04-16                   2                  340             367
## 5 1503960366 2016-04-17                   1                  700             712
## 6 1503960366 2016-04-19                   1                  304             320
## # … with abbreviated variable name ¹​total_time_in_bed
hourly_steps_df <- hourly_steps_df %>% 
  mutate(id = as.character(id)) %>% 
  rename(date_time = activity_hour) %>% 
  mutate(date_time = as.POSIXct(date_time, format ="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone()))

head(hourly_steps_df)
## # A tibble: 6 × 3
##   id         date_time           step_total
##   <chr>      <dttm>                   <dbl>
## 1 1503960366 2016-04-12 00:00:00        373
## 2 1503960366 2016-04-12 01:00:00        160
## 3 1503960366 2016-04-12 02:00:00        151
## 4 1503960366 2016-04-12 03:00:00          0
## 5 1503960366 2016-04-12 04:00:00          0
## 6 1503960366 2016-04-12 05:00:00          0

6. Recommendations

| Takeaway | Recommendation |
| --- | --- |
| Almost 88% of users wear their smart devices almost every day | Emphasize advertising for Bellabeat’s Time product as an ideal smart device to track a user’s activity throughout the day. Since the Time device also connects to the Bellabeat app, targeted advertisements could be created to market the other products |
| The majority of users do not use their smart device for sleep tracking | This could be a potential market to exploit. The Bellabeat Leaf bracelet could be a game changer in shifting the consumer mindset about wearing a smart device to sleep, as its bracelet form factor is unobtrusive |
| Over 97% of users use their device for more than half the day | We can tap into this large pool of users to offer more value-added products, as they are connected to their smart devices for most of their waking time |
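
The usage figures above come from the trend analysis. As a minimal sketch, assuming a day counts as device use whenever at least one step was recorded and using illustrative cut-offs of 21 and 11 days, a usage-frequency breakdown behind a figure like the 88% could be computed from the cleaned daily_activity_df as follows:

# Sketch: classify users by how many days they recorded any steps (assumed usage definition)
usage_by_user <- daily_activity_df %>%
  filter(total_steps > 0) %>%
  group_by(id) %>%
  summarise(days_used = n_distinct(date)) %>%
  mutate(usage_level = case_when(
    days_used >= 21 ~ "high use",      # assumed cut-off: 21-31 days
    days_used >= 11 ~ "moderate use",  # assumed cut-off: 11-20 days
    TRUE            ~ "low use"        # 1-10 days
  ))

# Share of users in each usage level
usage_by_user %>%
  count(usage_level) %>%
  mutate(share = n / sum(n))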