Bellabeat Marketing Strategy Recommendations

1. Introduction

1.1 Company Description

Bellabeat is a high-tech manufacturer of health-focused products for women, co-founded by innovative artist Urška Sršen and mathematician Sando Mur. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market.

Their main product, the Bellabeat app, provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions.

1.2 Company Products

Bellabeat’s product line includes the following:

  • Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.
  • Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.
  • Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.
  • Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.
  • Bellabeat membership: Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.

1.3 Analysis Importance and Relevance

While Bellabeat provides their own line of smart wellness products, our analysis will primarily focus on unlocking new growth opportunities for the Bellabeat app (and products) by analyzing consumer data from non-Bellabeat smart devices (namely, the Fitbit).

At the end of this analysis, we will provide high-level recommendations for Bellabeat that will launch the company into the global smart device market.

1.4 Assumptions and Theories

We are assuming that the Fitbit users within the data source being used for our analysis will be similar to Bellabeat users. Further, we are assuming that Fitbit and Bellabeat users represent the average adult who uses wellness applications to reach a better level of holistic health.

We are not expecting these users to be athletes, bodybuilders, or individuals who require specialized wellness plans to achieve extreme levels of fitness.

2. Objective

2.1 Our Task

We have been tasked to identify trends in how consumers use non-Bellabeat smart devices and provide high-level recommendations for the Bellabeat app. The insights we discover will then help guide the marketing strategy for the company, with the goal of turning Bellabeat into a larger player in the global smart device market.

2.2 Our Stakeholders

  • Primary:
    • Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
    • Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team
  • Secondary:
    • Bellabeat marketing analytics team: data analysts team responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy

3. Problems

Let’s begin our analysis!

This part of our analysis is broken into three parts - Data Collection, Data Cleaning and Processing, and Exploratory Data Analysis (EDA). Each part provides in-depth information about and an explanation of our analysis process.

Please use the tabs to switch between the parts.

3.1 Data Collection

3.1.1 Data Source

The data source used for this analysis can be found on Kaggle, a data science competition platform and online community site, and was made available by the Kaggle account “Mobius”. The Kaggle citation indicates that the dataset was retrieved from Zenodo, a general-purpose open repository, and made available by Furberg, Robert; Brinton, Julia; Keating, Michael; Ortiz, Alexa.

3.1.2 Gathering Methodology

Kaggle and Zenodo provide the following statement regarding how the data was gathered:

“These datasets were generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Individual reports can be parsed by export session ID (column A) or timestamp (column B). Variation between output represents use of different types of Fitbit trackers and individual tracking behaviors / preferences.”

3.1.3 Limitations and Credibility

Data limitations and credibility were determined using the ROCCC guide:

  • Reliability: The data source is not as reliable as it could be, as it does not mention any limitations or assumptions. The sample size barely crosses the 30-participant mark, which can introduce bias.
  • Originality: The Kaggle and Zenodo datasets are not the original dataset nor from the original source, which is a survey conducted via Amazon Mechanical Turk. Both data sources are considered second-party.
  • Comprehensiveness: Participants of the survey are only indicated by an ID and do not include any information such as gender or age. We are unable to conclude whether the sample was taken at random, and it may include bias.
  • Current: The dataset was first made available in 2016. Kaggle’s expected update frequency states it should be updated annually. However, Kaggle shows the dataset was last updated three years ago (2020). The data is not current.
  • Cited: The only citation on Kaggle is a Zenodo crowd-source link. However, this source is not the original dataset source, which is Amazon Mechanical Turk. Zenodo cites two more sources, but there is no clear indication of what those sources were used for.

Our recommendations on how to improve this data source can be found in the Appendices (7.2.1 Tables).

3.1.4 Verification and Licensing

Both platforms are open source and provide the following licenses:

Kaggle

  • CC0: Public Domain
    • The person who associated a work with this deed has dedicated the work to the public domain by waiving all of his or her rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law. You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information below.

Zenodo

  • Creative Commons Attribution 4.0 International
    • By exercising the Licensed Rights (defined below), You accept and agree to be bound by the terms and conditions of this Creative Commons Attribution 4.0 International Public License (“Public License”). To the extent this Public License may be interpreted as a contract, You are granted the Licensed Rights in consideration of Your acceptance of these terms and conditions, and the Licensor grants You such rights in consideration of benefits the Licensor receives from making the Licensed Material available under these terms and conditions.

3.1.5 Data Used for Analysis

After our examination of the data source, we picked the following datasets to use for our analysis:

  • Daily Activity (dailyActivity_merged.csv)
    • Overview of daily activity including steps, distance, calories, intensities.
    • 31 days of data for 33 participants.
  • Hourly Steps (hourlySteps_merged.csv)
    • Overview of total steps per hour measured by hour.
    • 737 hours of data for 33 participants.

(Full overview of datasets in next section)

3.1.6 Data Description and Overview

The data source is available as 18 CSV files that can be downloaded from Kaggle and Zenodo. We used the data source located on Kaggle for our analysis. Files were copied into a working directory folder to preserve the original files. We began examining the data within each file using Excel (along with Pivot tables); however, we switched to R after finding that some of the files contained too much data for Excel.

SQL Server was considered as a possible tool for examination and cleaning but was excluded due to the number of files. R was chosen as the best tool considering the various file sizes and the total number of files.
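
To support this decision, a quick file-size check can be run in R. This is a minimal sketch (not part of the original workflow) and assumes the 18 CSV files sit in the working directory:

# Lists every CSV in the working directory together with its size in MB
csv_files <- list.files(pattern = "\\.csv$")
data.frame(file = csv_files, size_mb = round(file.size(csv_files) / 1024^2, 1))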

File Name Original File Type Number of Rows (inc header) Descriptions
dailyActivity_merged CSV File 941 Overview of daily activity including steps, distance, calories, intensities. 31 days of data for 33 participants. Participants identified by 10-digit ID. Activity logged by date. Long data format.
dailyCalories_merged CSV File 941 Overview of daily calories tracked. 31 days of data for 33 participants. Participants identified by 10-digit ID. Activity logged by date. Long data format.
dailyIntensities_merged CSV File 941 Overview of daily intensities (for active level and distance) measured in minutes and distance. 31 days of data for 33 participants. Participants identified by 10-digit ID. Activity logged by date. Long data format. Blanks noticed in ID and ActivityDate columns.
dailySteps_merged CSV File 941 Overview of daily total steps. 31 days of data for 33 participants. Participants identified by 10-digit ID. Activity logged by date. Long data format.
heartrate_seconds_merged** CSV File 2483659 Overview of participants’ heart rate per second. 961274 seconds of data for 14 participants. Participants identified by 10-digit ID. Activity logged by date.
hourlyCalories_merged CSV File 22100 Overview of hourly calories measured by hour. 737 hours of data for 33 participants. Participants identified by 10-digit ID. Activity logged by date and hour. Long data format. Blanks noticed in ID and ActivityHour columns.
hourlyIntensities_merged CSV File 22100 Overview of hourly total and average intensity measured by hour. 737 hours of data for 33 participants. Participants identified by 10-digit ID. Activity logged by date and hour. Long data format. Blanks noticed in ID and ActivityHour columns
hourlySteps_merged CSV File 22100 Overview of total steps per hour measured by hour. 737 hours of data for 33 participants. Participants identified by 10-digit ID. Activity logged by date and hour. Long data format. Blanks noticed in ID and ActivityHour columns.
minuteCaloriesNarrow_merged** CSV File 1325581 Overview of calories burned every minute (calorie data point per row) measured in calories. 729 hours of data for 33 participants. Participants identified by 10-digit ID. Activity logged by date.
minuteCaloriesWide_merged CSV File 21646 Overview of calories burned every minute (calorie data point per column) measured in calories. 729 hours of data for 33 participants. Participants identified by 10-digit ID. Activity logged by date.
minuteIntensitiesNarrow_merged** CSV File 1325581 Overview of intensity reached every minute (intensity count data point per row) measured in intensity count. 729 hours of data for 33 participants. Participants identified by 10-digit ID. Activity logged by date.
minuteIntensitiesWide_merged CSV File 21646 Overview of intensity reached every minute (intensity data point per column) measured in intensity count. 729 hours of data for 33 participants. Participants identified by 10-digit ID. Activity logged by date.
minuteMETsNarrow_merged** CSV File 1325581 Overview of metabolic equivalent (MET) every minute (MET data point per row). 44160 minutes of data for 33 participants. Participants identified by 10-digit ID. Activity logged by date.
minuteSleep_merged CSV File 188522 Overview of minutes slept (minute data point per row) 49773 minutes of data for 24 participants. Participants identified by 10-digit ID. Activity logged by date.
minuteStepsNarrow_merged** CSV File 1325581 Overview of steps taken every minute (step data point per row) measured in number of steps. 729 hours of data for 33 participants. Participants identified by 10-digit ID. Activity logged by date.
minuteStepsWide_merged CSV File 21646 Overview of steps taken every minute (step data point per column) measured in number of steps. 729 hours of data for 33 participants. Participants identified by 10-digit ID. Activity logged by date.
sleepDay_merged CSV File 414 Overview of total sleep measured in records, minutes asleep and total time in bed. 31 days of data for 24 participants. Participants identified by 10-digit ID. Activity logged by date and hour. Long data format. Blanks noticed in ID and SleepDay columns.
weightLogInfo_merged CSV File 68 Overview of weight logging information measured in kg and lbs. 31 days of data for 8 participants. BMI logged along with LogId. Purpose of Fat and IsManualReport columns are unclear. Participants identified by 10-digit ID. Activity logged by date. Blanks noticed in ID and Date columns.

Additional Notes:

  • File “heartrate_seconds_merged” was renamed to “heartrateSeconds_merged” in the working directory folder to align with the naming conventions of all other files.
  • Larger file sizes are indicated by two asterisks at the end of their file names (i.e. heartrate_seconds_merged**).
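
For reproducibility, the rename mentioned above could also be scripted. A minimal sketch, assuming the file sits in the working directory (the rename was done manually in the original workflow):

# Renames the heart rate file to match the naming convention of the other files
file.rename("heartrate_seconds_merged.csv", "heartrateSeconds_merged.csv")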

3.2 Data Cleaning and Processing

3.2.1 Package Installation

Several packages are needed for our analysis, from reading the CSV files to creating custom colors. Please ensure that the below code chunk is run prior to running any other code. It will ensure that no further installation is needed afterwards.


# If needed, please switch your directory to your working dir
# setwd('path')

# Sets the CRAN mirror
options(repos = c(CRAN = "https://cran.r-project.org/"))

# Adds needed packages to a vector and assigns it to a variable to cut down on repeated work
packages <- c("dplyr",
              "tidyverse",
              "ggplot2",
              "skimr",
              "readr",
              "lubridate",
              "janitor",
              "ggpubr",
              "ggrepel",
              "scales",
              "remotes")

# Checks for packages that aren't installed
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]

if(length(new_packages)) install.packages(new_packages)

# Loads all required packages at once
sapply(packages, library, character.only=TRUE)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ ggplot2   3.4.3     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## 
## Attaching package: 'janitor'
## 
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
## 
## 
## 
## Attaching package: 'scales'
## 
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
## $dplyr
## [1] "dplyr"     "stats"     "graphics"  "grDevices" "utils"     "datasets" 
## [7] "methods"   "base"     
## 
## $tidyverse
##  [1] "lubridate" "forcats"   "stringr"   "purrr"     "readr"     "tidyr"    
##  [7] "tibble"    "ggplot2"   "tidyverse" "dplyr"     "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## $ggplot2
##  [1] "lubridate" "forcats"   "stringr"   "purrr"     "readr"     "tidyr"    
##  [7] "tibble"    "ggplot2"   "tidyverse" "dplyr"     "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## $skimr
##  [1] "skimr"     "lubridate" "forcats"   "stringr"   "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "dplyr"     "stats"    
## [13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## $readr
##  [1] "skimr"     "lubridate" "forcats"   "stringr"   "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "dplyr"     "stats"    
## [13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## $lubridate
##  [1] "skimr"     "lubridate" "forcats"   "stringr"   "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "dplyr"     "stats"    
## [13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## $janitor
##  [1] "janitor"   "skimr"     "lubridate" "forcats"   "stringr"   "purrr"    
##  [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "dplyr"    
## [13] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"  
## [19] "base"     
## 
## $ggpubr
##  [1] "ggpubr"    "janitor"   "skimr"     "lubridate" "forcats"   "stringr"  
##  [7] "purrr"     "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse"
## [13] "dplyr"     "stats"     "graphics"  "grDevices" "utils"     "datasets" 
## [19] "methods"   "base"     
## 
## $ggrepel
##  [1] "ggrepel"   "ggpubr"    "janitor"   "skimr"     "lubridate" "forcats"  
##  [7] "stringr"   "purrr"     "readr"     "tidyr"     "tibble"    "ggplot2"  
## [13] "tidyverse" "dplyr"     "stats"     "graphics"  "grDevices" "utils"    
## [19] "datasets"  "methods"   "base"     
## 
## $scales
##  [1] "scales"    "ggrepel"   "ggpubr"    "janitor"   "skimr"     "lubridate"
##  [7] "forcats"   "stringr"   "purrr"     "readr"     "tidyr"     "tibble"   
## [13] "ggplot2"   "tidyverse" "dplyr"     "stats"     "graphics"  "grDevices"
## [19] "utils"     "datasets"  "methods"   "base"     
## 
## $remotes
##  [1] "remotes"   "scales"    "ggrepel"   "ggpubr"    "janitor"   "skimr"    
##  [7] "lubridate" "forcats"   "stringr"   "purrr"     "readr"     "tidyr"    
## [13] "tibble"    "ggplot2"   "tidyverse" "dplyr"     "stats"     "graphics" 
## [19] "grDevices" "utils"     "datasets"  "methods"   "base"
# This remote package is needed for correlation viz
#   (For more info, https://rdrr.io/github/smin95/sesplot/man/sm_statCorr.html)
# remotes::install_github("smin95/sesplot")
library(smplot2)
## Updated tutorial for smplot: smin95.github.io/dataviz/

3.2.2 Data Examination

Data needed to be thoroughly examined prior to performing our analysis. Smaller files were looked at using Excel and Pivot tables. After we hit a large file, we moved all files to R.

Using read_csv from the readr package, we read the CSV files and stored them into dataframe variables:

PLEASE NOTE: Since only some datasets were used for the analysis, the read statements for all other datasets were commented out to improve knitting performance. However, if desired, please uncomment them to store those datasets into dataframe variables.
# Assigns CSV files to dataframe variables for ease
daily_activity <- read_csv("dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hourly_steps <- read_csv("hourlySteps_merged.csv")
## Rows: 22099 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (2): Id, StepTotal
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#daily_calories <- read_csv("dailyCalories_merged.csv")
#daily_intensities <- read_csv("dailyIntensities_merged.csv")
#daily_steps <- read_csv("dailySteps_merged.csv")
#heartrate_seconds <- read_csv("heartrateSeconds_merged.csv")
#hourly_calories <- read_csv("hourlyCalories_merged.csv")
#hourly_intensities <- read_csv("hourlyIntensities_merged.csv")
#minute_calories_long <- read_csv("minuteCaloriesNarrow_merged.csv")
#minute_calories_wide <- read_csv("minuteCaloriesWide_merged.csv")
#minute_intensities_long <- read_csv("minuteIntensitiesNarrow_merged.csv")
#minute_intensities_wide <- read_csv("minuteIntensitiesWide_merged.csv")
#minute_sleep <- read_csv("minuteSleep_merged.csv")
#minute_steps_long <- read_csv("minuteStepsNarrow_merged.csv")
#minute_steps_wide <- read_csv("minuteStepsWide_merged.csv")
#daily_sleep <- read_csv("sleepDay_merged.csv")
#weight_log_info <- read_csv("weightLogInfo_merged.csv")

We separately examined the larger files that could not be opened with Excel using functions such as head and n_unique. n_unique was used to see how many unique identifiers were contained in the dataframes.

Lastly, we manually verified the results from n_unique by selecting only distinct identifiers.


# heartrateSeconds_merged.csv
heartrate_seconds <- read_csv("heartrateSeconds_merged.csv")
## Rows: 2483658 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Time
## dbl (2): Id, Value
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(heartrate_seconds) # provides first 6 rows for quick glance
n_unique(heartrate_seconds$Id) # returns 14 as total unique IDs
## [1] 14
heartrate_seconds %>% 
  select(Id) %>% 
  distinct() %>%
  count() # verifies there are 14 unique IDs
n_unique(heartrate_seconds$Time) # returns 961274 as total unique Times
## [1] 961274
heartrate_seconds %>% 
  select(Time) %>% 
  distinct() %>%
  count() # verifies there are 961274 unique Times
# minuteMETsNarrow_merged.csv
minutes_mets_long <- read_csv("minuteMETsNarrow_merged.csv")
## Rows: 1325580 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityMinute
## dbl (2): Id, METs
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(minutes_mets_long)
n_unique(minutes_mets_long$Id) # returns 33 as total unique IDs
## [1] 33
minutes_mets_long %>% 
  select(Id) %>%
  distinct() %>% 
  count() # verifies there are 33 unique IDs
n_unique(minutes_mets_long$ActivityMinute) # returns 44160 as total unique minutes
## [1] 44160
minutes_mets_long %>% 
  select(ActivityMinute) %>% 
  distinct() %>%
  count() # verifies there are 44160 unique minutes

3.2.3 Data Verification

Once we examined all of the files and picked the datasets that would work best for our analysis, we verified that the datasets truly meet the 30-participant sample size. We again used n_unique, which returned a result of 33 unique IDs for both the daily_activity and hourly_steps dataframes. We manually confirmed this number by selecting all distinct IDs from the dataframes.


n_unique(daily_activity$Id) # returns 33 as total unique IDs
## [1] 33
daily_activity %>% 
  select(Id) %>% 
  distinct() %>%
  count()  # verifies there are 33 unique IDs
n_unique(hourly_steps$Id) # returns 33 as total unique IDs
## [1] 33
hourly_steps %>% 
  select(Id) %>% 
  distinct() %>% 
  count() # verifies there are 33 unique IDs

3.2.4 Duplicates

Moving forward, we conducted a check to see if our dataframes had any duplicates. We started by using the duplicated function to locate duplicates and nested it within a sum function for a count, which returned a result of 0.

We manually verified this result with a subset that scanned through our dataframe from beginning to end. If any duplicates were present, the rows would be stored within the subset for further examination and verification.

Our dataframes did not contain any duplicates. However, if needed, the subset code can be used for other datasets in the future by changing the dataframe variable name (see the optional helper sketch below).
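
If this check becomes routine, the subset logic could also be wrapped in a small helper function. This is an optional sketch and was not part of the original pipeline:

# Returns every duplicated row of a dataframe (from beginning and end) for review
find_duplicates <- function(df) {
  df[duplicated(df) | duplicated(df, fromLast = TRUE), ]
}
# Example usage once another dataset is read in: head(find_duplicates(daily_sleep))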


sum(duplicated(daily_activity)) # returns 0 as total duplicates 
## [1] 0
# returns all duplicates (if any) from beginning to end of dataframe; 
#   if found, row stored in new subset that's assigned to duplicates object
duplicates <- daily_activity[duplicated(daily_activity) | duplicated(daily_activity,fromLast = TRUE), ]
head(duplicates) # displays duplicated rows (if any)
sum(duplicated(hourly_steps)) # returns 0 as total duplicates
## [1] 0
duplicates <- hourly_steps[duplicated(hourly_steps) | duplicated(hourly_steps,fromLast = TRUE), ] # used for verification
head(duplicates) # displays duplicated rows (if any)

3.2.5 NA Removal

Next, we worked to remove any NA values. We first created a holder variable to preserve an original copy of our dataframe. The count of this copy was then checked so that we could use it for comparison. We then selected distinct values from our dataframes and dropped any NAs. We counted our dataframe and matched the resulting number with the count of our preservation copy.

We manually verified this by creating a subset that applies the is.na function across all rows. If any rows returned TRUE, they were stored within the subset for further examination and verification. We can confirm that both dataframes did not contain any NAs.


preservation_daily_activity <- daily_activity # original record preservation
count(preservation_daily_activity) # returns previous count of rows (940) prior to removal
daily_activity <- preservation_daily_activity %>% 
  distinct() %>% 
  drop_na() # drops any missing values
count(daily_activity) # returns count of 940 - no missing values found
# returns any rows that contain NAs by applying is.na function across all rows in daily_activity: 
#  if found, stores them in new subset if marked as TRUE and assigned to missing_rows object

missing_rows <- daily_activity[apply(is.na(daily_activity), 1, any), ] # used for verification
head(missing_rows) # displays missing rows (if any)
preservation_hourly_steps <- hourly_steps
count(preservation_hourly_steps) # returns previous count of rows (22099) prior to removal
hourly_steps <- preservation_hourly_steps %>% 
  distinct() %>% 
  drop_na()

count(hourly_steps) # returns count of 22099 - no missing values found
missing_rows <- hourly_steps[apply(is.na(hourly_steps), 1, any), ] # used for verification
head(missing_rows) # displays missing rows (if any)

3.2.6 Name Standardization

After removing duplicates and NAs, we moved on to standardizing the column names in our dataframes. For this, we used the clean_names function from the janitor package and stored the results back into the dataframe.

We also added a small section that, if desired, displays the columns of the preservation copies. This can be used for comparison or referencing.


# ensures columns are lowered, underscored, unique and contains standardized characters
daily_activity <- clean_names(daily_activity) 
colnames(daily_activity) # displays cleaned column names
##  [1] "id"                         "activity_date"             
##  [3] "total_steps"                "total_distance"            
##  [5] "tracker_distance"           "logged_activities_distance"
##  [7] "very_active_distance"       "moderately_active_distance"
##  [9] "light_active_distance"      "sedentary_active_distance" 
## [11] "very_active_minutes"        "fairly_active_minutes"     
## [13] "lightly_active_minutes"     "sedentary_minutes"         
## [15] "calories"
hourly_steps <- clean_names(hourly_steps)
colnames(hourly_steps)
## [1] "id"            "activity_hour" "step_total"
# PLEASE NOTE: If you would like to see the original column names, please uncomment the below lines. 
#    `preservation_daily_activity` and `preservation_hourly_steps` created in NA Removal step
# colnames(preservation_daily_activity)
# colnames(preservation_hourly_steps)

3.2.7 Date-Time Consistency

The dates and times within our dataframes did not align to a standardized format of ‘YYYY-MM-DD HH:MM:SS AM/PM’. We used a slightly different method for each dataframe. Since daily_activity only had a date column, we simply used as.Date from base R combined with mdy from lubridate.

hourly_steps required a different approach since it contained a date-time column. This dataframe had to be separated and then merged so that the date-time values could be formatted as ‘YYYY-MM-DD HH:MM:SS AM/PM’. We did this because using as.POSIXct prior to splitting and mutating the date-time values would produce NAs, due to the desired format being different from the original format (‘MM/DD/YYYY HH:MM:SS AM/PM’).
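
The issue can be reproduced with a single hypothetical value: applying the target format directly to a string that is still in the original format simply returns NA.

# Minimal illustration (hypothetical timestamp): the target format does not match
# the original 'MM/DD/YYYY HH:MM:SS AM/PM' string, so as.POSIXct returns NA
as.POSIXct("4/12/2016 9:00:00 AM", format = "%Y-%m-%d %I:%M:%S %p", tz = "UTC")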


daily_activity <- daily_activity %>% 
  rename(date = activity_date) %>% 
  mutate(date = as.Date(mdy(date)))# renames `activity_date` column to date and standardizes dates to 'YYYY-MM-DD'
colnames(daily_activity) # displays column names for confirmation of change
##  [1] "id"                         "date"                      
##  [3] "total_steps"                "total_distance"            
##  [5] "tracker_distance"           "logged_activities_distance"
##  [7] "very_active_distance"       "moderately_active_distance"
##  [9] "light_active_distance"      "sedentary_active_distance" 
## [11] "very_active_minutes"        "fairly_active_minutes"     
## [13] "lightly_active_minutes"     "sedentary_minutes"         
## [15] "calories"
hourly_steps <- hourly_steps %>% 
  # separates `activity_hour` column into `date` and `time` columns and extra merges AM/PM back to `time`
  separate(activity_hour, into = c("date", "time"), sep = " ", extra = "merge") %>% 
  mutate(date = as.Date(mdy(date))) %>%  # changes `date` values to 'YYYY-MM-DD' format
  unite(date_time, date, time, sep = " ") %>% # merges `date` and `time` columns into `date_time` column
  # standardizes `date-time` values and sets timezone to UTC for accessibility
  mutate(date_time = as.POSIXct(date_time, format = "%Y-%m-%d %I:%M:%S %p", tz = "UTC")) 
head(hourly_steps) # displays new dttm object with desired formatting

3.2.8 Anomaly Hunting

The last part of our data cleaning and processing is “Anomaly Hunting”, or finding any outliers and peculiarities. Here is an overview of our findings with the use of skim_without_charts from the skimr package:

  • daily_activity anomalies:
    • Column logged_activities_distance doesn’t include a lot of data; will be removed from dataframe and excluded from analysis
    • Column sedentary_active_distance doesn’t include a lot of data but can be useful for analysis; will be kept in dataframe and included in analysis
  • hourly_steps anomalies:
    • The 25th percentile is 0 for step_total, which could imply that the data is skewed.

skim_without_charts(daily_activity) # quick outline and summary of dataframe
Data summary
Name daily_activity
Number of rows 940
Number of columns 15
_______________________
Column type frequency:
Date 1
numeric 14
________________________
Group variables None

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2016-04-12 2016-05-12 2016-04-26 31

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
id 0 1 4.855407e+09 2.424805e+09 1503960366 2.320127e+09 4.445115e+09 6.962181e+09 8.877689e+09
total_steps 0 1 7.637910e+03 5.087150e+03 0 3.789750e+03 7.405500e+03 1.072700e+04 3.601900e+04
total_distance 0 1 5.490000e+00 3.920000e+00 0 2.620000e+00 5.240000e+00 7.710000e+00 2.803000e+01
tracker_distance 0 1 5.480000e+00 3.910000e+00 0 2.620000e+00 5.240000e+00 7.710000e+00 2.803000e+01
logged_activities_distance 0 1 1.100000e-01 6.200000e-01 0 0.000000e+00 0.000000e+00 0.000000e+00 4.940000e+00
very_active_distance 0 1 1.500000e+00 2.660000e+00 0 0.000000e+00 2.100000e-01 2.050000e+00 2.192000e+01
moderately_active_distance 0 1 5.700000e-01 8.800000e-01 0 0.000000e+00 2.400000e-01 8.000000e-01 6.480000e+00
light_active_distance 0 1 3.340000e+00 2.040000e+00 0 1.950000e+00 3.360000e+00 4.780000e+00 1.071000e+01
sedentary_active_distance 0 1 0.000000e+00 1.000000e-02 0 0.000000e+00 0.000000e+00 0.000000e+00 1.100000e-01
very_active_minutes 0 1 2.116000e+01 3.284000e+01 0 0.000000e+00 4.000000e+00 3.200000e+01 2.100000e+02
fairly_active_minutes 0 1 1.356000e+01 1.999000e+01 0 0.000000e+00 6.000000e+00 1.900000e+01 1.430000e+02
lightly_active_minutes 0 1 1.928100e+02 1.091700e+02 0 1.270000e+02 1.990000e+02 2.640000e+02 5.180000e+02
sedentary_minutes 0 1 9.912100e+02 3.012700e+02 0 7.297500e+02 1.057500e+03 1.229500e+03 1.440000e+03
calories 0 1 2.303610e+03 7.181700e+02 0 1.828500e+03 2.134000e+03 2.793250e+03 4.900000e+03
daily_activity <- select(daily_activity, -logged_activities_distance) # removes column `logged_activities_distance`
colnames(daily_activity) # confirms `logged_activities_distance` was removed
##  [1] "id"                         "date"                      
##  [3] "total_steps"                "total_distance"            
##  [5] "tracker_distance"           "very_active_distance"      
##  [7] "moderately_active_distance" "light_active_distance"     
##  [9] "sedentary_active_distance"  "very_active_minutes"       
## [11] "fairly_active_minutes"      "lightly_active_minutes"    
## [13] "sedentary_minutes"          "calories"
skim_without_charts(hourly_steps) # quick outline and summary of dataframe
Data summary
Name hourly_steps
Number of rows 22099
Number of columns 3
_______________________
Column type frequency:
numeric 2
POSIXct 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
id 0 1 4.848235e+09 2.4225e+09 1503960366 2320127002 4445114986 6962181067 8877689391
step_total 0 1 3.201700e+02 6.9038e+02 0 0 40 357 10554

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
date_time 0 1 2016-04-12 2016-05-12 15:00:00 2016-04-26 06:00:00 736

Out of curiosity, we dove deeper into the anomaly for hourly_steps. We first manually produced summary statistics to verify the results from skim_without_charts and then created a histogram for visual representation. However, please note that this visualization is not primary to our analysis - simply a curious side quest :D.
# CURIOUS SIDE QUEST: Checking skewness with histogram and summary statistics
hs_stat <- hourly_steps %>% 
  summarize(mean(step_total), # 320
            median(step_total), # 40
            min(step_total), # 0
            max(step_total)) # 10554
print(hs_stat) # based on summary stats alone, `step_total` is right-skewed since the mean is greater than the median
## # A tibble: 1 × 4
##   `mean(step_total)` `median(step_total)` `min(step_total)` `max(step_total)`
##                <dbl>                <dbl>             <dbl>             <dbl>
## 1               320.                   40                 0             10554
ggplot(hourly_steps, aes(x=step_total)) + 
  geom_histogram(binwidth=1500, fill="blue", color="black") # verifies right-skewness using a histogram
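
As an optional follow-up to the side quest (not part of the original analysis), the asymmetry can also be quantified with Pearson's second skewness coefficient, 3 × (mean − median) / sd; a clearly positive value supports the right-skew reading:

# Pearson's second skewness coefficient for step_total (positive = right-skewed)
with(hourly_steps, 3 * (mean(step_total) - median(step_total)) / sd(step_total))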

3.3 Exploratory Data Analysis (EDA)

After examining, verifying, cleaning and processing our data, we began our analysis. In this section, we display each step of our analysis along with any necessary explanation.

3.3.1 Part 1: Activity Level
Custom Color

Prior to outlining our analysis, the below piece of code needs to be executed. This piece of code creates 1) a custom color with the same HEX value as Bellabeat’s logo (#e38872) and 2) a custom palette using a monochromatic scale based on Bellabeat’s logo HEX. This color and color palette are used for our visualizations. Our reasoning for creating these two is aesthetics and to show our appreciation to our client :)


 # custom color taken from Bellabeat logo
bellabeat_color = "#e38872"

# custom monochromatic palette ("#D65232", "#DF775E", "#E89D8A", "#F1C2B6", "#FAE8E3")
bellabeat_palette <- colorRampPalette(c("#df4c29", "#fae9e5"))(5) 

Categorization System

Now for the fun part! We first wanted a way to categorize the participants, as the dataset did not come with any demographic information such as age, gender, or nationality. We settled on creating activity levels as a categorization system (stored in a column called activity_level, elaborated on in the “Classifying Participants” section).

We based our category structure on the information provided by a Medicine Net article titled “How Many Steps a Day Is Considered Active?”

The category structure is as follows:

Activity Level Requirement
Sedentary Less than 5,000 steps per day
Barely Active Between 5,000 and 7,499 steps per day
Somewhat Active Between 7,500 and 9,999 steps per day
Active Between 10,000 and 12,499 steps per day
Highly Active More than 12,500 steps per day

Averaging Daily Steps

Before creating the categories and assigning them to participants, we calculated the average daily steps for each participant. We grouped daily_activity by ID, summarized each group by its mean, and stored the results in a new dataframe called daily_average_steps.


daily_average_steps <- daily_activity %>% 
  group_by(id) %>% # groups data by participant id
  # averages total steps, round to nearest whole number and stores in new column
  summarize(average_steps = round(mean(total_steps))) 
head(daily_average_steps) # displays average steps per id

Classifying Participants

With the daily average steps, we used case_when to apply our categories to participants and stored the classifications in a new column called activity_level. We also created a new dataframe called participant_activity_level that contains the new category values.


participant_activity_level <- daily_average_steps %>% 
  mutate(
    # classifies average steps into categories and stores in new column
    activity_level = case_when( average_steps < 5000 ~ "Sedentary", 
                              average_steps >= 5000 & average_steps < 7500 ~ "Barely Active",
                              average_steps >= 7500 & average_steps < 10000 ~ "Somewhat Active",
                              average_steps >= 10000 & average_steps < 12500 ~ "Active",
                              average_steps >= 12500 ~ "Highly Active" 
                              )
    )
head(participant_activity_level) # displays average steps and active level per id

Taking Percentages

Now that we have categories attached to our participants, we wanted to see what percentage of participants belonged in each category. We did this using several pipes to weave together summary calculations, column mutations, aggregation, and data display.


activity_level_percentage <- participant_activity_level %>% 
  group_by(activity_level) %>% # groups participants by active level
  summarize(total_participants = n()) %>%  # calculates total number of participants
  # calculates percentage of participants in each activity level and stores in new column
  mutate(active_percentage = (total_participants / sum(total_participants))) %>%
  mutate(percentage_labels = percent(active_percentage)) # changes calculated percentages into labeled percentages

# displays active level by percentage in descending order
activity_level_percentage %>% 
  arrange(desc(active_percentage)) %>% 
  head() 

Re-factoring Levels

To gear us up for our visualization, we re-factored the activity levels so that “Highly Active” is first and “Sedentary” is last. If desired, please uncomment the first two lines in the below code chunk to see the original factor levels.

To re-factor, we used the base R factor function and applied it to our activity_level column:


# Stores old factor levels for preservation
# preservation_activity_level_percentage_factor_levels <- activity_level_percentage$activity_level 
# print(preservation_activity_level_percentage_factor_levels) # displays old factor levels

activity_level_percentage$activity_level <- activity_level_percentage$activity_level %>% 
  factor(levels = c("Highly Active",
                    "Active",
                    "Somewhat Active",
                    "Barely Active",
                    "Sedentary"
                    )
         ) # reorders factor levels for visualization
print(activity_level_percentage$activity_level) # displays new factor levels
## [1] Active          Barely Active   Highly Active   Sedentary      
## [5] Somewhat Active
## Levels: Highly Active Active Somewhat Active Barely Active Sedentary

Viz #1: Pie Chart

With our average daily steps and percentages calculated, along with the re-factoring of our levels, we visualized our data using a pie chart. Within our code chunk, we provided step-by-step insight into each line for context:


# starts the plot of average active level percentages
ggplot(activity_level_percentage, aes(x = "", y = active_percentage, fill = activity_level)) +
  # first part needed for pie chart, bar chat layer and ensuring that pre-calculated percentages are used
  geom_bar(stat = "identity", width = 1) + 
  # second part needed for pie chart, y-values used for slices and rotation starts at 3'clock, 
  #     using 0 instead of pi/2 for aesthetic reasons
  coord_polar(theta = "y", start = 0) + 
  theme_void() + # uses void theme for aesthetic reasons
  # centers title (subtitle) and sets their font sizes to 14 (10)
  theme(plot.title = element_text(hjust = 0.5, size = 14),
        plot.subtitle = element_text(hjust = 0.5, size = 10)) + 
  # applies custom colors to pie slices (Highly Active = Darkest Tint and so forth)
  scale_fill_manual(values = bellabeat_palette) + 
  geom_text(aes(label = percentage_labels), 
            position = position_stack(vjust = 0.5)) + # adds percentage labels on top of pie slices and centers them
  labs(x = NULL,
       y = NULL, # removes x- and y-axes titles
       fill = "Activity Level", # changes legend title
       title = "Participant Distribution",# adds chart title
       subtitle = "% represents proportion participants that fall into each activity level") # adds chart subtitle

Viz #2: Histogram

Once we observed the proportion of participants that fell within each activity level, we explored distribution using a histogram. With our histogram, we wanted to see where participants typically fell based on their activity level.

For our visualization, we created two variables for specific purposes:

  • mean_value: provides the mean for plotting a vertical line on our visualization using the aesthetic xintercept
  • bin_width: We had some difficulty finding the best binwidth for our histogram. After some research, we found that the Freedman-Diaconis rule works best since daily_average_steps has a small number of observations but large average step values.

mean_value <- mean(daily_average_steps$average_steps) # gives mean value for plotting

bin_width <- 2 * (IQR(daily_average_steps$average_steps) / nrow(daily_average_steps)^(1/3)) # Freedman-Diaconis rule

With these two variables, we completed the below code chunk for our visualization:


ggplot(daily_average_steps, aes(x = average_steps)) +
  # creates histogram and bins
  geom_histogram(binwidth = bin_width, fill = bellabeat_color, color = bellabeat_palette[5]) + 
  geom_vline(aes(xintercept = mean_value),
             linetype = "dashed",
             linewidth = 1) + # plots line at mean value
  theme_light() + # desired viz theme
  theme(plot.title = element_text(hjust = 0.5, size = 14)) + # centers and enlarges chart title
  labs(x = "Average Steps per Participant",
       y = "Number of Participants",
       title="Histogram of Average Daily Steps") + # adds text to x-,y-axis and chart title
  annotate("text", 
           x = mean_value, 
           y = 0, 
           label = paste0("Mean = ", round(mean_value)),
           fontface = "bold",
           vjust = -40,
           hjust = -0.4) # adds, bolds, and positions the Mean label near the top of the chart, slightly to the right

Key Insight(s)

At this point, we obtained some important insights from our summary statistics, aggregations and/or visualizations:

  1. Most participants fall within Sedentary (24.2%), Barely Active (27.3%), and Somewhat Active (27.3%), meaning they walk no more than 10,000 steps a day on average. The ideal daily goal is 7,500 steps, which only the Somewhat Active group and above reach on average.
  2. The distribution of average steps appears to be approximately normal, with a slight tail to the right (see the optional check below).
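
If a more formal check of the second insight is desired, a Shapiro-Wilk normality test can be run on the participant averages. This is an optional sketch and was not part of the original analysis:

# Tests the null hypothesis that average daily steps are normally distributed;
# a p-value above 0.05 is consistent with the visual assessment of approximate normality
shapiro.test(daily_average_steps$average_steps)
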
3.3.2 Part 2: Active Days & Hours

To gain some further insights beyond our pie chart and histogram, we pivoted slightly and explored which days and hours participants were most active. We again used participants’ average daily steps, along with their average distance, to plot two bar charts.

Obtaining Active Days

We started with analyzing active weekdays. Prior to creating our visualizations, we needed to do the following:

  1. Create a new dataframe called weekday_steps_distance that will contain modified daily_activity values
  2. Create a new column (weekday) that would store each weekday within daily_activity
  3. Re-factor weekday so that the levels for weekday_steps_distance follow the standard 7-day week
  4. Lastly, group weekday_steps_distance by each weekday and summarize the average daily steps and distance for each day.

# adds new column to daily activity data that only includes weekday
weekday_steps_distance <- daily_activity %>% 
  mutate(weekday = weekdays(date)) 

# sets levels: Monday < Tuesday < Wednesday < Thursday < Friday < Saturday < Sunday
weekday_steps_distance$weekday <- weekday_steps_distance$weekday %>% 
  ordered(levels = c("Monday",
                     "Tuesday",
                     "Wednesday",
                     "Thursday",
                     "Friday",
                     "Saturday",
                     "Sunday")) 

# calculates average steps, distance and groups by weekday
weekday_steps_distance <- weekday_steps_distance %>% 
  group_by(weekday) %>% 
  summarize(weekday_average_steps = mean(total_steps),
            weekday_average_distance = mean(total_distance)) # calculates average steps, distance and groups by weekday

Viz #3: Bar Chart

With these steps completed, we were able to complete our charts and arrange them side-by-side using ggarrange:


ggarrange( # arranges both bar charts into one plot
  ggplot(weekday_steps_distance, aes(weekday, weekday_average_steps)) + # charts for average steps
    # conditional used for coloring only bars that meet ideal levels
    geom_col(stat = "identity", aes(fill = ifelse(weekday_average_steps > 7500, "Ideal", "Below Ideal"))) + 
    geom_hline(aes(yintercept = 7500),
               linewidth = 1) + # adds a reference line at the recommended daily steps
    theme_light() + # desired viz theme
    theme(axis.text.x = element_text(angle = 45, vjust = 0.5), # tilts x-axis text
          plot.title = element_text(hjust = 0.5, size = 14), # centers and enlarges chart title
          plot.subtitle = element_text(hjust = 0.5, size = 10), # centers and shrinks chart subtitle
          legend.position = "none") + # removes legend
    scale_fill_manual(values = c("Ideal" = bellabeat_palette[1], "Below Ideal" = bellabeat_palette[5])) + # supplies fill color for conditional
    labs(x = NULL,
         y = NULL,
         title = "Average Steps per Weekday",
         subtitle = "Recommended daily steps is 7500"), # adds labels
  ggplot(weekday_steps_distance, aes(weekday, weekday_average_distance)) +  # charts for average distance
    # conditional used for coloring only bars that meet ideal levels
    geom_col(stat = "identity", aes(fill = ifelse(weekday_average_distance > 5, "Ideal", "Below Ideal"))) + 
    geom_hline(aes(yintercept = 5),
               linewidth = 1) +
    theme_light() + # desired viz theme
    theme(axis.text.x = element_text(angle = 45, vjust = 0.5),
          plot.title = element_text(hjust = 0.5, size = 14),
          plot.subtitle = element_text(hjust = 0.5, size = 10),
          legend.position = "none") + 
    scale_fill_manual(values = c("Ideal" = bellabeat_palette[1], "Below Ideal" = bellabeat_palette[5])) +
    labs(x = NULL,
         y = NULL,
         title = "Average Distance per Weekday",
         subtitle = "Recommended daily distance is 5 miles")
)

Obtaining Active Hours

Next, we started analyzing active hours. Similarly to weekdays, we needed to do some prep work prior to creating our visualization:

  1. Separated the time from the dttm object (date_time) in hourly_steps into two columns - date and time.
    • Please note: Only the hour was taken from date_time
  2. Grouped hourly_steps data by each hour and aggregated the average steps

hourly_steps <-  hourly_steps %>% 
  mutate(date = as.Date(date_time),
         time = format(date_time, "%H")) # separates time from date_time column for aggregating and plotting

hours <- hourly_steps %>% 
  group_by(time) %>% 
  summarize(average_hour_steps = mean(step_total)) # aggregates average steps by hour

Viz #4: Bar Chart

With the prep work completed, we created our visualization:


ggplot(hours, aes(time, average_hour_steps)) + # plots average steps and time
  geom_col(stat = "identity", aes(fill = average_hour_steps)) + # fills using average steps
  theme_light() + # desired viz theme
  theme(plot.title = element_text(hjust = 0.5, size = 14)) + # centers and enlarges chart title
  # colors plot with custom palette and green color for contrast
  scale_fill_gradient(low = "#0dfbb2", high = bellabeat_palette[1]) + 
  labs(x = "Hours",
       y = "Average Steps",
       fill = "Average Hourly Steps", # changes legend title
       title = "Average Steps per Hour") # chart labels

Key Insight(s)

At this point, we obtained some important insights from our summary statistics, aggregations and/or visualizations:

  1. Participants were most active on Monday, Tuesday, Wednesday, and Saturday based on steps, and achieved the ideal distance on those days. Sunday was the least active day.
  2. Users are mostly active between the hours of 8 AM and 7 PM, with most walking occurring 12-2 PM and 5-7 PM (most likely after lunch and dinner).

3.3.3 Part 3: Correlation
Correlation Testing

Pivoting again, we wanted to see if there was any correlation between participants’ daily total steps and calories logged. We used a visualization to explore this and confirmed our r- and p-values using cor.test.

sm_statCorr comes from a package (smplot2) that had to be installed from a remote GitHub repository since it is not located within CRAN at the moment. For more information about this function, please visit https://rdrr.io/github/smin95/sesplot/man/sm_statCorr.html.

We chose the Pearson correlation method because we are using two continuous variables and want to see the linear relationship of these two variables.


ggplot(daily_activity, aes(total_steps, calories)) + # plots correlation chart
  geom_jitter(stat = "identity") + # ensures points are not overlaying
  sm_statCorr(color = bellabeat_palette[1],
              corr_method = "pearson") + # adds corr coefficient, stat significance and corr line
  theme_light() + # desired viz theme
  theme(plot.title = element_text(hjust = 0.5, size = 14)) + # centers and enlarges chart title
  labs(x = "Total Steps",
       y = "Calories",
       title = "Total Steps vs Calories") # chart labels
## `geom_smooth()` using formula = 'y ~ x'

# verifies R- and p- values along with other statistical info
cor.test(daily_activity$total_steps, daily_activity$calories, method = "pearson") 
## 
##  Pearson's product-moment correlation
## 
## data:  daily_activity$total_steps and daily_activity$calories
## t = 22.472, df = 938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5483688 0.6316184
## sample estimates:
##       cor 
## 0.5915681

Key Insight(s)

At this point, we obtained some important insights from our summary statistics, aggregations and/or visualizations:

  1. We are seeing a correlation coefficient of .59 for daily steps vs calories, indicating a moderately positive relationship.

3.3.4 Part 4: Usage and Wear
Defining Usage Levels

After analyzing participants’ activity levels, we first explored how frequently participants used their devices during the 31-day survey period. We then explored how long participants wore their devices on a daily basis.

To begin our analysis of participants’ daily usage of their devices, we created a separate set of categories based on how many days participants logged their activity. The categories are as follows:

Usage Level Requirement
No Usage 0 days logged
Minimal Usage 1 - 9 days logged
Moderate Usage 10 - 19 days logged
Frequent Usage 20+ days logged

Summing Days Logged

Now that we have created the categories, we grouped daily_activity by ID, summarized each group by the number of rows for each ID, assigned a usage level based on this sum, and stored the results in a new dataframe called daily_usage.


daily_usage <- daily_activity %>% 
  group_by(id) %>% # groups rows by id
  summarize(num_days_logged = sum(n())) %>% # sums each participant's row
  # determines participants usage level based on summation
  mutate(usage_level = case_when( num_days_logged == 0 ~ "No Usage", 
                            num_days_logged >= 1 & num_days_logged < 10 ~ "Minimal Usage",
                            num_days_logged >= 10 & num_days_logged < 20 ~ "Moderate Usage",
                            num_days_logged >= 20 ~ "Frequent Usage")) 

Viz #6: Bar Chart

Now that we have assigned usage levels to our participants, we moved on to creating a visualization to understand their usage trends further. Prior to creating the visualization, we needed to complete some prep work:

  1. Group participants by usage_level, count how many participants fall in each level, and then arrange the levels in ascending order.
  2. Re-order the levels of usage_level so that they match our category system outlined above.

participants_usage_level <- daily_usage %>% 
  group_by(usage_level) %>% # groups by usage level
  summarize(num_participants = sum(n())) %>% # counts the number of participants in each usage level
  arrange(num_participants) # sorts by number of participants

head(participants_usage_level)
participants_usage_level$usage_level <- participants_usage_level$usage_level %>% 
  ordered(levels = c("No Usage",
                     "Minimal Usage",
                     "Moderate Usage",
                     "Frequent Usage")) # re-orders the usage levels for plot viz

With this prep done, we completed our visualization and provided comments for context:


ggplot(participants_usage_level, aes(x=usage_level, y=num_participants)) + # begins plot
  geom_bar(stat="identity", color = bellabeat_palette[1], fill = bellabeat_palette[3]) + # creates, colors and fills plot
  geom_text(aes(label = num_participants), 
            vjust = -.5, 
            fontface = "bold",
            color = bellabeat_palette[1]) + # adds, adjusts, bolds, and colors the participant-count labels above the bars
  theme_light() + # desired viz theme
  theme(plot.title = element_text(hjust = 0.5, size = 14)) + # centers and enlarges chart title
  labs(x = "Usage Level",
       y = "Number of Participants",
       title = "Distribution of Usage Levels") # chart labels

Defining Wear Levels

We have explored how frequently participants logged their activity and now want to explore how often participants wore their devices throughout the day. In order to do this, we needed to merge daily_activity and daily_usage to have a comprehensive dataframe.

Next, we examined participants’ wear levels by percentages, which were derived from several calculations. The results of these calculations were then used to create device wear category levels.

  1. First, we combined all minutes columns that came from daily_activity and stored this total in a new column called total_minutes_worn.
  2. Next, we obtained the percentage of minutes the device was worn each day across the 31-day survey
  3. Lastly, we assigned a wear level to participants based on their percentage

Wear levels are as follows:

Wear Level Requirement
Barely Worn device worn < 25 % of the day
Somewhat Worn device worn 25 - 49% of the day
Mostly Worn device worn 50 - 74% of the day
Frequently Worn device worn 75 - 99% of the day
Constantly Worn device worn == 100 % of the day

daily_activity_usage <- merge(daily_activity, daily_usage, by = c("id")) # merges daily activity and daily usage together
head(daily_activity_usage)
minutes_device_worn <- daily_activity_usage %>% 
  mutate(total_minutes_worn = very_active_minutes + fairly_active_minutes + lightly_active_minutes + sedentary_minutes) %>% # sums all activity-minute columns
  mutate(device_worn_percentage = round((total_minutes_worn / 1440) * 100)) %>% # percentage of the day worn (1,440 minutes in a day)
  mutate(device_worn_level = case_when(device_worn_percentage < 25 ~ "Barely Worn (< 25%)",
                                       device_worn_percentage >= 25 & device_worn_percentage < 50 ~ "Somewhat Worn (25-50%)",
                                       device_worn_percentage >= 50 & device_worn_percentage < 75 ~ "Mostly Worn (50-75%)",
                                       device_worn_percentage >= 75 & device_worn_percentage < 100 ~ "Frequently Worn (75-99%)",
                                       device_worn_percentage == 100 ~ "Constantly Worn (100%)"))
head(minutes_device_worn)

Obtaining Percentages

Next, we performed several piped calculations to obtain percentages representing how often participants of all activity levels wore their devices throughout the day. We also created labels to represent these percentages for plotting.

Please note: We did consider analyzing device wear for each activity level (such as “Highly Active”, “Barely Active”, etc.), but this seemed like overkill. We decided to focus on all participants regardless of activity level.


minutes_device_worn_all_usage_percentage <- minutes_device_worn %>% 
  group_by(device_worn_level) %>% # groups by device wear level
  summarize(minutes = n()) %>% # counts the number of daily records at each wear level
  mutate(total_minutes = sum(minutes)) %>% # total number of daily records
  group_by(device_worn_level) %>% # groups again by device wear level
  summarize(total_percentage = minutes / total_minutes) %>% # share of records at each wear level
  mutate(percentage_labels = scales::percent(total_percentage)) # creates labels for plotting

Re-factoring

We re-factored the levels for device_worn_level to represent our categories:


minutes_device_worn_all_usage_percentage$device_worn_level <- minutes_device_worn_all_usage_percentage$device_worn_level %>% 
  factor(levels = c("Barely Worn (< 25%)",
                    "Somewhat Worn (25-50%)",
                    "Mostly Worn (50-75%)",
                    "Frequently Worn (75-99%)",
                    "Constantly Worn (100%)"))

Viz #7: Pie Chart

This code chunk was created because the percentage labels for ‘Barely Worn’ and ‘Somewhat Worn’ were overlapping. To remedy this, we first obtained the cumulative sum of the percentages, then determined the final y position of each label by calculating the midpoint of each slice of our pie chart. These two steps ensure that our labels do not overlap.


# cumulative sum of percentages (running position around the pie)
minutes_device_worn_all_usage_percentage$cumsum <- cumsum(minutes_device_worn_all_usage_percentage$total_percentage) 
# midpoint of each slice, used as the y position of its label
minutes_device_worn_all_usage_percentage$midpoint <- minutes_device_worn_all_usage_percentage$cumsum - (minutes_device_worn_all_usage_percentage$total_percentage / 2)

Once we had everything we needed, we plotted our final visualization:


ggplot(minutes_device_worn_all_usage_percentage, aes(x = "", y = total_percentage, fill = device_worn_level)) +
  # first part needed for pie chart, bar chart layer and ensuring that pre-calculated percentages are used
  geom_bar(stat = "identity", width = 1) + 
  coord_polar(theta = "y") + # second part needed for pie chart, y-values used for slices
  theme_void() + # uses void theme for aesthetic reasons
  theme(plot.title = element_text(hjust = 0.5, size = 14), 
        plot.subtitle = element_text(hjust = 0.5, size = 10)) + # centers title (subtitle) and sets font size to 14 (10)
  # reverses and applies custom colors to pie slices (Constantly Worn = Darkest Tint and so forth)
  scale_fill_manual(values = rev(bellabeat_palette)) + 
  geom_label_repel(aes(y = midpoint,label = percentage_labels), # handles overlapping percentage labels
                   segment.color = NA, # removes label leader lines
                   show.legend = FALSE) + # removes repel's legend (thus, removing the random "a" on legend)
  labs(x = NULL,
       y = NULL, # removes x- and y-axes titles
       fill = "Device Worn", # changes legend title
       title = "Time Worn per Day",# adds chart title
       subtitle = "% represents how often participants wore their device throughout the day") # adds chart subtitle

Key Insight(s)

At this point, we obtained some important insights from our summary statistics, aggregations and/or visualizations:

  1. 30 of the 33 participants used their devices for most of the 31 days
  2. 51% of participants wore their device constantly (100% of the time) and 35% of participants wore it for most of the day (50-75%)
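
For readers reproducing this work in R, both figures can be re-derived from the summary dataframes built earlier. Below is a minimal sketch, assuming daily_usage and minutes_device_worn_all_usage_percentage are still in the environment and that "most of the 31 days" corresponds to our "Frequent Usage" category:


sum(daily_usage$usage_level == "Frequent Usage") # participants who logged 20 or more of the 31 days

minutes_device_worn_all_usage_percentage %>% 
  select(device_worn_level, percentage_labels) # share of daily records at each wear level, as plotted in Viz #7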

4. Solutions

We are nearing the end of our analysis!

From our analysis, we compiled an in-depth overview of our key insights along with the questions we looked to answer. From these questions, we generated recommendations that not only answer our questions but also address our main task.

Our analysis provided us with 7 key insights that were evaluated, resulting in 4 recommendations for Bellabeat’s marketing team. The best of these solutions is detailed piece by piece below.


4.1 Best Solution

As a recap, our objective for this analysis was to identify trends in how consumers use non-Bellabeat smart devices and to provide high-level recommendations for the Bellabeat app. The business task for Bellabeat is to unlock new growth opportunities for the Bellabeat app (and products) and to launch the company into the global smart device market.

We identified several trends within our analysis, and the best solution is to implement n-Day, Week-Long, and Month-Long Step Challenges with Tiered Rewards as an Incentive. Our suggested breakdown of this incentive system is as follows (a brief R sketch of how such a challenge could be checked against the trial data appears after the breakdown):

n-day Step Challenge Information

  • Length: n days, but fewer than a week
  • Occurrence: No limit
  • Requirement & Incentive: Consumers must complete the challenge 100% (without missing a day) to receive a small reward, such as an account badge, or to accumulate points toward real-life rewards such as small-value gift cards
  • Idea: 3-day 10k per Day Challenge for “Top Stepper” Badge

Week-Long Step Challenge Information

  • Length: One week
  • Occurrence: 1-2 times in a month
  • Requirement & Incentive: Consumers must complete the challenge, but are afforded one grace day, to receive a higher-level reward such as a digital gift card valued at $25 or a small discount on Bellabeat products.
  • Idea: Week-long 15k per Day Challenge for $25 gift card or 5% off your favorite Bellabeat product

Month-Long Step Challenge Information

  • Length: One Month
  • Occurrence: Every other month, or every 2-3 months
  • Requirement & Incentive: Consumers must complete the challenge, but are afforded one to two grace days, to receive a free Bellabeat Spring bottle or a larger gift card of no more than $100
  • Idea: Month-long 10k per Day Challenge for $100 gift card or a free Bellabeat Spring bottle
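
As referenced above, the following is an illustrative sketch of how a challenge of this kind could be checked against the trial data. It reuses daily_activity and assumes total_steps and activity_date columns (with activity_date convertible to a date); the window dates, step target, and challenge_completed flag are hypothetical and not part of the analysis itself:


challenge_days <- as.Date(c("2016-04-12", "2016-04-13", "2016-04-14")) # hypothetical 3-day challenge window
step_target <- 10000 # hypothetical daily step target

challenge_results <- daily_activity %>% 
  filter(as.Date(activity_date) %in% challenge_days) %>% # keeps only the challenge window
  group_by(id) %>% 
  summarize(days_hit_target = sum(total_steps >= step_target)) %>% # days the target was met
  mutate(challenge_completed = days_hit_target == length(challenge_days)) # no missed days allowed

head(challenge_results)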

4.2 Additional Notes

Our best solution was achieved by considering the key insights found during our analysis and any questions that arose from those insights. The sections below list the questions we considered along with the insights that prompted them:

Question 1

How can we get users to wear devices throughout the day?

(Why We Ask)

Of the 33 participants, we found that 30 of them logged their information for most of the 31-day trial. Of these, 51% wore their device constantly (all day). This provides a good indication that users are willing and able to consistently use fitness-related devices and applications.

While this is a great start, we do not want to take this for granted, as only 51% of participants wore their devices all day. Another 35% wore their devices for most of the day (anywhere between 50% and 75% of the day), which leaves an open segment of participants who could be converted into all-day users.

Question 2

How can we use these active windows (lunch and dinner) to our advantage?

(Why We Ask)

Now, let’s turn to the participants’ activity during the 31-day trial. We see a moderate positive correlation between the number of daily steps a participant took and the calories they logged. This could indicate that the more a participant walked, the more calories they burned.

Unfortunately, because the data source did not include demographic information, such as gender, this can only be an assumption. What isn’t an assumption is the days and hours when participants were most active. Based on participants’ logged steps, we see that they were most active on Mondays, Tuesdays, Wednesdays, and Saturdays, between the hours of 8 AM and 7 PM, with the highest hourly steps occurring between 12 PM and 2 PM as well as between 5 PM and 7 PM.

It is reasonable to infer that much of this walking occurred during lunch and dinner hours.
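
As a reference point, the correlation mentioned above can be recomputed in a single line. This is a minimal sketch assuming daily_activity contains total_steps and calories columns with those exact names; a coefficient in roughly the 0.4-0.7 range is conventionally read as a moderate positive correlation:


cor(daily_activity$total_steps, daily_activity$calories, use = "complete.obs") # Pearson correlation between daily steps and logged calories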

Question 3

How can we move participants to higher levels of activity (namely “Active” and “Highly Active”)?

(Why We Ask)

If we dig a bit deeper into the participants’ activity, we can determine which activity level they mostly fall into based on their average daily steps. The top two levels were Barely Active (27.3%) and Somewhat Active (27.3%), meaning most participants logged a daily average of 5,000-10,000 steps, with the highest average across these two levels being 9,795 steps.

Outside these two levels, 24.2% of participants fell into the Sedentary level, meaning they walked, on average, fewer than 5,000 steps per day. The highest average step count logged in this group was 4,797, just shy of the 5,000 mark.
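
The activity levels referenced here were assigned earlier in the analysis from each participant's average daily steps. The sketch below shows how such an assignment could be condensed; the cut-offs are illustrative (based on commonly used step-count categories), and the definitions used earlier in this report take precedence:


participant_activity_levels <- daily_activity %>% 
  group_by(id) %>% 
  summarize(avg_daily_steps = mean(total_steps)) %>% # average daily steps per participant
  mutate(activity_level = case_when(avg_daily_steps < 5000 ~ "Sedentary",
                                    avg_daily_steps < 7500 ~ "Barely Active",
                                    avg_daily_steps < 10000 ~ "Somewhat Active",
                                    avg_daily_steps < 12500 ~ "Active",
                                    TRUE ~ "Highly Active")) # illustrative cut-offs only

head(participant_activity_levels)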

Question 4

How can we motivate users to take more steps (12,000+)?

(Why We Ask)

According to the CDC and NIH, higher step counts lower mortality risk from all causes. Based on a 2020 study, “compared with taking 4,000 steps per day, a number considered to be low for adults, taking 8,000 steps per day was associated with a 51% lower risk for all-cause mortality (or death from all causes).” The study also showed that with increased steps, the risk was lowered even more.

It stated, “taking 12,000 steps per day was associated with a 65% lower risk compared with taking 4,000 steps.” Based on our analysis, only 6.1% of participants took more than an average of 12,000 steps daily, which is only 2 participants out of 33; 11 of the 33 participants averaged between 8,000 and 12,000 steps per day.
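
The 6.1% and 11-of-33 figures can be reproduced with a short summary over average daily steps. This is a minimal sketch, again assuming a total_steps column in daily_activity:


daily_activity %>% 
  group_by(id) %>% 
  summarize(avg_daily_steps = mean(total_steps)) %>% # average daily steps per participant
  summarize(over_12k = sum(avg_daily_steps > 12000), # participants averaging more than 12,000 steps
            between_8k_12k = sum(avg_daily_steps >= 8000 & avg_daily_steps <= 12000)) # participants averaging 8,000-12,000 steps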

5. Appendices

5.1 Tables

Recommendations on how to improve the data source used for this analysis:

Problem     | Remedy
Unreliable  | Use a larger sample size and note limitations/assumptions. A larger sample size can be obtained through already established social media engagement.
Unoriginal  | Conduct a survey via social media networks, as Bellabeat already has an established relationship with consumers.
Incomplete  | Conduct a survey via social media networks, as Bellabeat already has an established relationship with consumers.
Out-of-Date | Conduct a 6-month survey to gather current information.
Uncited     | Use data collected from social media networks, Google search investment, advertisements, and customer engagement.