Introduction

How Can a Wellness Technology Company Play It Smart?

In this capstone project, we delve into the real-world scenario of Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company with the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities. As a junior data analyst on the marketing analytics team, I have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers use their smart devices. The objective is to provide actionable recommendations that will guide Bellabeat’s marketing strategy.

Objective

The primary objective of this capstone project is to analyze smart device data to understand how consumers use their devices, and to apply those findings to one of Bellabeat’s products. By examining user behavior, engagement patterns, and other relevant metrics, we aim to uncover trends that can inform Bellabeat’s marketing strategy. The insights derived from this analysis will serve as a foundation for developing targeted marketing campaigns, improving customer experiences, and driving innovation within the company.

Phase 1: Ask

Key Stakeholders

  • Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
  • Sando Mur: Mathematician and Bellabeat’s cofounder
  • The Bellabeat Marketing Analytics Team: a team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy.

Bellabeat Products

  • Bellabeat App
    • The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.
  • Leaf
    • Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.
  • Time
    • This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.
  • Spring
    • This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.
  • Bellabeat Membership
    • Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.

Business Task

Analyze non-Bellabeat smart device data alongside a specific Bellabeat product to extract insights, identify growth opportunities, and provide recommendations for improving Bellabeat’s marketing strategy by leveraging trends in smart device usage.

Key Questions

  • What are some trends in smart device usage?
  • How could these trends apply to Bellabeat customers?
  • How could these trends help influence Bellabeat marketing strategy?

Phase 2: Prepare

Dataset Source and Information

The FitBit Fitness Tracker Data is available on Kaggle, where it was made accessible by the user Möbius. The dataset consists of 18 CSV files containing smart health data from personal fitness trackers, described by its source as covering thirty Fitbit users. The data was collected via a distributed survey on Amazon Mechanical Turk between March 12, 2016, and May 12, 2016, in which eligible users consented to submit their personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.

The dataset covers daily activity, steps, and heart rate. As of May 2023, it had last been updated two years earlier.

The variation observed in the data stems from the use of different types of Fitbit trackers and individual tracking behaviors and preferences. With this rich dataset, it is possible to explore and analyze the impact of various factors on fitness and health-related metrics captured by FitBit devices.

Accessibility and Privacy of Data

Checking the dataset’s metadata, we can confirm that it is open data: the owner has dedicated the work to the public domain by waiving all of their rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law. Anyone may copy, modify, distribute, and perform the work, even for commercial purposes, without asking permission.

Organization of Data and Verification

The dataset consists of 18 CSV files containing quantitative data tracked by Fitbit devices. Most tables are in long format: each row represents one time point for one subject, so each subject, identified by a unique ID, has multiple rows recorded at daily, hourly, or minute intervals.

Given the relatively small sample size, I employed sorting and filtering techniques in Google Sheets to organize the data. By creating Pivot Tables, I could examine the attributes and observations in each table, as well as establish relationships between them. Additionally, I performed a count of the sample size (number of users) in each table and verified that the analysis spanned a period of 31 days.

  • dailyActivity_merged
    • Activity over 31 days of 33 users.
    • Tracking daily: Steps, Distance, Intensities, Calories
  • dailyCalories_merged
    • Daily Calories over 31 days of 33 users
  • dailyIntensities_merged
    • Daily Intensity over 31 days of 33 users.
    • Measured in Minutes and Distance, divided into 4 categories: Sedentary, Lightly Active, Fairly Active, Very Active
  • dailySteps_merged
    • Daily Steps over 31 days of 33 users
  • heartrate_seconds_merged
    • Exact day and time heartrate logs for just 7 users
  • hourlyCalories_merged
    • Hourly Calories burned over 31 days of 33 users
  • hourlyIntensities_merged
    • Hourly total and average intensity over 31 days of 33 users
  • hourlySteps_merged
    • Hourly Steps over 31 days of 33 users
  • minuteCaloriesNarrow_merged
    • Calories burned every minute over 31 days of 33 users (Every minute in single row)
  • minuteCaloriesWide_merged
    • Calories burned every minute over 31 days of 33 users (Every minute in single column)
  • minuteIntensitiesNarrow_merged
    • Intensity counted by minute over 31 days of 33 users (Every minute in single row)
  • minuteIntensitiesWide_merged
    • Intensity counted by minute over 31 days of 33 users (Every minute in single column)
  • minuteMETsNarrow_merged
    • Ratio of the energy you are using in a physical activity compared to the energy you would use at rest.
    • Counted in minutes
  • minuteSleep_merged
    • Log Sleep by Minute for 24 users over 31 days.
    • Value column not specified
  • minuteStepsNarrow_merged
    • Steps tracked every minute over 31 days of 33 users (Every minute in single row)
  • minuteStepsWide_merged
    • Steps tracked every minute over 31 days of 33 users (Every minute in single column)
  • sleepDay_merged
    • Daily sleep logs, tracked by: number of sleep records per day, Total Minutes Asleep, Total Time in Bed
  • weightLogInfo_merged
    • Weight tracked by day in Kg and Pounds over 30 days.
    • Calculation of BMI.
    • 5 users report weight manually; 3 users do not. In total there are 8 users.

For a more thorough look at the data, see the Fitabase Data Dictionary.
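
The per-table checks described above were done in Google Sheets. As a rough sketch of how the same check might look in R, the snippet below counts distinct users and the number of calendar days spanned in one table. The `toy_activity` data frame and its values are hypothetical stand-ins for a file such as dailyActivity_merged; only the column names `Id` and `ActivityDate` come from the dataset.

```r
library(dplyr)

# Hypothetical stand-in for one table; real data would come from the CSV files.
toy_activity <- data.frame(
  Id = c(1, 1, 2, 2, 3),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016", "5/12/2016", "4/20/2016")
)

coverage <- toy_activity %>%
  mutate(date = as.Date(ActivityDate, format = "%m/%d/%Y")) %>%
  summarise(
    users = n_distinct(Id),          # sample size in this table
    first_day = min(date),
    last_day = max(date),
    # Inclusive count of calendar days between first and last record
    days_spanned = as.integer(last_day - first_day) + 1L
  )

print(coverage)
```

Running the same summary on each imported table would give one row per table, which is one way to reproduce the user counts and the 31-day span listed above.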

Data Credibility and Integrity

Because the sample is small (about 30 users) and includes no demographic information, sampling bias is possible, and we cannot assume the sample is representative of the broader population. In addition, the data is dated (collected in 2016) and the survey covers only two months, which limits how far the findings can be generalized.

To address these limitations, we will adopt an operational approach for our case study. By focusing on actionable insights and practical implications, we aim to derive meaningful conclusions and recommendations despite the inherent constraints. This approach allows us to leverage the available data effectively and provide valuable insights within the given context.

Phase 3: Process

I have made the decision to utilize R as the primary tool for my analysis due to its accessibility, data processing capabilities, and visualization features. R is a widely adopted open-source programming language that offers a multitude of packages and functions specifically designed for data analysis and statistical tasks. By leveraging R’s extensive ecosystem, I can take advantage of its robust functionality to efficiently handle and manipulate the dataset.

Although the number of users is small, the hourly and minute-level tables are long (hourlySteps_merged alone has 22,099 rows), which makes them awkward to handle in a spreadsheet but straightforward in R. R’s data processing capabilities allow me to clean, transform, and summarize the dataset efficiently, enabling me to uncover patterns and trends.

Furthermore, R’s powerful visualization libraries, including ggplot2 and plotly, empower me to create visually appealing and informative data visualizations. These visualizations serve as powerful tools for communicating the analysis results to stakeholders in a clear and concise manner. By presenting the findings through engaging and intuitive visual representations, I can enhance understanding, facilitate decision-making, and effectively convey the key insights derived from the dataset.

By leveraging the accessibility, data processing capabilities, and visualization features of R, I am confident that I can conduct a comprehensive analysis that not only uncovers valuable insights but also effectively communicates the findings to stakeholders.

Loading Packages and Libraries

In our R programming workflow, we will carefully select and load the essential packages to enhance our analysis capabilities. The following R packages have been curated specifically for our analysis:

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(dplyr)
library(ggplot2)
library(tidyr)
library(ggpubr)
library(here)
## here() starts at /cloud/project
library(skimr)
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(ggrepel)
library(hms)
## 
## Attaching package: 'hms'
## 
## The following object is masked from 'package:lubridate':
## 
##     hms
library(scales)
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
library(formatR)
library(glue)
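
Before loading the packages on a fresh machine, it can help to check which ones are installed. The following is a minimal sketch using base R only; the package list mirrors the `library()` calls above, and the variable names `pkgs`, `is_installed`, and `missing_pkgs` are illustrative.

```r
# Packages used in this analysis. dplyr, ggplot2, lubridate and tidyr are
# omitted because the tidyverse startup message above shows them attached.
pkgs <- c("tidyverse", "ggpubr", "here", "skimr", "janitor",
          "ggrepel", "hms", "scales", "formatR", "glue")

# requireNamespace() returns FALSE (instead of erroring) when a package is absent.
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
}
```

Note that `requireNamespace()` loads a package's namespace without attaching it, so the `library()` calls are still needed afterwards.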

Importing Datasets

Having reviewed the available datasets, we will import those that help us answer our business task. In our analysis we will focus on the following datasets:

  • Daily_activity
  • Daily_sleep
  • Hourly_steps

Because of their small samples, we will not consider the weight (8 users) and heart rate (7 users) datasets in this analysis.

daily_activity <- read_csv(file= "/cloud/project/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
daily_sleep <- read_csv(file= "/cloud/project/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hourly_steps <- read_csv("/cloud/project/Fitabase Data 4.12.16-5.12.16/hourlySteps_merged.csv")
## Rows: 22099 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (2): Id, StepTotal
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
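
The import messages above suggest specifying column types explicitly. As a hedged sketch of that option (not what this analysis does), the example below writes a tiny hypothetical CSV to a temporary file and reads it with `col_types`, treating `Id` as text so that it is never shown in scientific notation. The file contents are made up; only the column names follow dailyActivity_merged.

```r
library(readr)

# Write a tiny hypothetical CSV so the example runs without the real files.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Id,ActivityDate,TotalSteps",
  "1503960366,4/12/2016,13162",
  "1503960366,4/13/2016,10735"
), tmp)

toy <- read_csv(
  tmp,
  col_types = cols(
    Id = col_character(),           # IDs are labels, not quantities
    ActivityDate = col_character(), # left as text; converted to Date later
    TotalSteps = col_double()
  )
)

str(toy$Id)
```

Supplying `col_types` also silences the column specification message. If `Id` were read as character in one table, it would need to be read the same way in every table that is later merged on it.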

Previewing Datasets

To gain an initial understanding of our selected data frames and their contents, we will preview them and examine the summary statistics for each column. This process will provide us with insights into the structure, variables, and distribution of the data. By doing so, we can gather valuable information that will aid us in subsequent analysis and decision-making.

head(daily_activity)
## # A tibble: 6 × 15
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
##        <dbl> <chr>             <dbl>         <dbl>           <dbl>
## 1 1503960366 4/12/2016         13162          8.5             8.5 
## 2 1503960366 4/13/2016         10735          6.97            6.97
## 3 1503960366 4/14/2016         10460          6.74            6.74
## 4 1503960366 4/15/2016          9762          6.28            6.28
## 5 1503960366 4/16/2016         12669          8.16            8.16
## 6 1503960366 4/17/2016          9705          6.48            6.48
## # ℹ 10 more variables: LoggedActivitiesDistance <dbl>,
## #   VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## #   LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## #   VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## #   LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>
str(daily_activity)
## spc_tbl_ [940 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id                      : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ TotalSteps              : num [1:940] 13162 10735 10460 9762 12669 ...
##  $ TotalDistance           : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : num [1:940] 728 776 1218 726 773 ...
##  $ Calories                : num [1:940] 1985 1797 1776 1745 1863 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   ActivityDate = col_character(),
##   ..   TotalSteps = col_double(),
##   ..   TotalDistance = col_double(),
##   ..   TrackerDistance = col_double(),
##   ..   LoggedActivitiesDistance = col_double(),
##   ..   VeryActiveDistance = col_double(),
##   ..   ModeratelyActiveDistance = col_double(),
##   ..   LightActiveDistance = col_double(),
##   ..   SedentaryActiveDistance = col_double(),
##   ..   VeryActiveMinutes = col_double(),
##   ..   FairlyActiveMinutes = col_double(),
##   ..   LightlyActiveMinutes = col_double(),
##   ..   SedentaryMinutes = col_double(),
##   ..   Calories = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
head(daily_sleep)
## # A tibble: 6 × 5
##           Id SleepDay        TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
##        <dbl> <chr>                       <dbl>              <dbl>          <dbl>
## 1 1503960366 4/12/2016 12:0…                 1                327            346
## 2 1503960366 4/13/2016 12:0…                 2                384            407
## 3 1503960366 4/15/2016 12:0…                 1                412            442
## 4 1503960366 4/16/2016 12:0…                 2                340            367
## 5 1503960366 4/17/2016 12:0…                 1                700            712
## 6 1503960366 4/19/2016 12:0…                 1                304            320
str(daily_sleep)
## spc_tbl_ [413 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id                : num [1:413] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ SleepDay          : chr [1:413] "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
##  $ TotalSleepRecords : num [1:413] 1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: num [1:413] 327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : num [1:413] 346 407 442 367 712 320 377 364 384 449 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   SleepDay = col_character(),
##   ..   TotalSleepRecords = col_double(),
##   ..   TotalMinutesAsleep = col_double(),
##   ..   TotalTimeInBed = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
head(hourly_steps)
## # A tibble: 6 × 3
##           Id ActivityHour          StepTotal
##        <dbl> <chr>                     <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM       373
## 2 1503960366 4/12/2016 1:00:00 AM        160
## 3 1503960366 4/12/2016 2:00:00 AM        151
## 4 1503960366 4/12/2016 3:00:00 AM          0
## 5 1503960366 4/12/2016 4:00:00 AM          0
## 6 1503960366 4/12/2016 5:00:00 AM          0
str(hourly_steps)
## spc_tbl_ [22,099 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id          : num [1:22099] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityHour: chr [1:22099] "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
##  $ StepTotal   : num [1:22099] 373 160 151 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   ActivityHour = col_character(),
##   ..   StepTotal = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Summarizing Datasets

daily_activity %>%  
  select(TotalSteps,
         TotalDistance,
         SedentaryMinutes, Calories) %>%
  summary()
##    TotalSteps    TotalDistance    SedentaryMinutes    Calories   
##  Min.   :    0   Min.   : 0.000   Min.   :   0.0   Min.   :   0  
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 729.8   1st Qu.:1828  
##  Median : 7406   Median : 5.245   Median :1057.5   Median :2134  
##  Mean   : 7638   Mean   : 5.490   Mean   : 991.2   Mean   :2304  
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:1229.5   3rd Qu.:2793  
##  Max.   :36019   Max.   :28.030   Max.   :1440.0   Max.   :4900
daily_sleep %>%
  select(TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed) %>%
  summary()
##  TotalSleepRecords TotalMinutesAsleep TotalTimeInBed 
##  Min.   :1.000     Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:1.000     1st Qu.:361.0      1st Qu.:403.0  
##  Median :1.000     Median :433.0      Median :463.0  
##  Mean   :1.119     Mean   :419.5      Mean   :458.6  
##  3rd Qu.:1.000     3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :3.000     Max.   :796.0      Max.   :961.0
hourly_steps %>%  
  select(StepTotal) %>%
  summary()
##    StepTotal      
##  Min.   :    0.0  
##  1st Qu.:    0.0  
##  Median :   40.0  
##  Mean   :  320.2  
##  3rd Qu.:  357.0  
##  Max.   :10554.0

Based on the available data, several key observations can be made:

  1. Average sedentary time: The average sedentary time is 991 minutes, or approximately 16.5 hours, per day. This is a large share of the day, and it suggests that users could benefit from reducing sedentary behavior for better overall health.

  2. Activity level: Most participants appear to be lightly active, engaging mainly in low-intensity physical activity throughout the day. While some activity is present, users may benefit from incorporating more moderate or vigorous activity into their routines.

  3. Sleep duration: On average, participants sleep once a day for approximately 7 hours (mean 419.5 minutes, median 433 minutes). This is at the lower end of the commonly recommended sleep duration for adults. Most days show a single sleep record (median 1, maximum 3), although the summary statistics alone cannot tell us anything about sleep disorders or disruptions.

  4. Average daily steps: The average total steps per day is 7,638. While this reflects some physical activity, it falls short of the commonly cited goal of 10,000 steps per day. Notably, research by the CDC suggests that, compared with taking 4,000 steps per day, taking 8,000 steps per day is associated with a 51% lower risk of all-cause mortality, and taking 12,000 steps per day with a 65% lower risk. There is therefore room to increase daily step counts and gain further health benefits.

From these initial findings, the average tracker user in this sample shows a baseline level of activity and sleeps for an adequate duration, but may benefit from more physical activity in their daily routine. These observations give a first picture of user behavior and can inform strategies for promoting healthier lifestyles.
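
The minute-to-hour conversions behind these observations can be checked with a few lines of arithmetic. The means below are copied from the `summary()` output above; the ratio of time asleep to time in bed is an illustrative extra, not a figure reported by the source.

```r
# Means copied from the summary() output above.
mean_sedentary_min <- 991.2
mean_asleep_min    <- 419.5
mean_in_bed_min    <- 458.6

sedentary_hours <- mean_sedentary_min / 60
asleep_hours    <- mean_asleep_min / 60
# Share of time in bed spent asleep (a ratio of means, for illustration only).
asleep_share    <- mean_asleep_min / mean_in_bed_min

round(c(sedentary_hours = sedentary_hours,
        asleep_hours    = asleep_hours,
        asleep_share    = asleep_share), 2)
```

This gives roughly 16.5 sedentary hours, just under 7 hours asleep, and about 91% of time in bed spent asleep.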

Cleaning and Formatting

Having familiarized ourselves with the data structures, our next step is to process them in order to identify and rectify any errors or inconsistencies. By carefully examining the data, we can detect missing values, outliers, and other data quality issues that may impact the integrity and accuracy of our analysis. Through data processing techniques such as data cleaning, validation, and transformation, we aim to ensure that the data is reliable and suitable for further analysis. This meticulous approach will enhance the reliability and validity of our findings, enabling us to draw meaningful insights and make informed decisions based on the processed data.

Verifying Respondent Data

Before cleaning the data, we determine the number of unique users in each data frame. Although the sample is small, and the sleep dataset may cover fewer users than the others, we will still retain it for the purpose of practicing data cleaning techniques. Counting unique users per data frame shows how well each dataset covers the sample.

n_unique(daily_activity$Id)
## [1] 33
n_unique(daily_sleep$Id)
## [1] 24
n_unique(hourly_steps$Id)
## [1] 33
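
Knowing that the counts differ, a natural follow-up is to ask which users are missing from the smaller table. The sketch below does this with base R set operations; the two ID vectors are hypothetical stand-ins for `daily_activity$Id` and `daily_sleep$Id`.

```r
# Hypothetical ID vectors standing in for daily_activity$Id and daily_sleep$Id.
activity_ids <- c(101, 101, 102, 103, 104)
sleep_ids    <- c(101, 103, 103)

# Users present in the activity table but absent from the sleep table.
no_sleep_data <- setdiff(unique(activity_ids), unique(sleep_ids))
no_sleep_data

# Share of activity users who also logged sleep at least once.
sleep_coverage <- length(intersect(activity_ids, sleep_ids)) /
  length(unique(activity_ids))
sleep_coverage
```

With the counts reported above (24 and 33), and assuming every sleep user also appears in the activity table, the coverage would be about 73%.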

Addressing Duplicates and Null Values

To ensure data integrity, our next step is to identify and remove duplicate entries and rows with missing values. We first count fully duplicated rows in each dataset; any duplicates are then removed with distinct(), and incomplete rows with drop_na(). Eliminating duplicates prevents the same observation from being counted twice in later summaries.

sum(duplicated(daily_activity))
## [1] 0
sum(duplicated(daily_sleep))
## [1] 3
sum(duplicated(hourly_steps))
## [1] 0
daily_activity <- daily_activity %>%
  distinct() %>%
  drop_na()

daily_sleep <- daily_sleep %>%
  distinct() %>%
  drop_na()

hourly_steps <- hourly_steps %>%
  distinct() %>%
  drop_na()

After removing the duplicates from the daily_sleep dataset, we will now verify that the duplicates have been successfully eliminated. This verification step is crucial to ensure the accuracy and integrity of our data. By confirming the absence of duplicates, we can be confident that our dataset is now free from redundant entries and ready for further analysis. This verification process adds an extra layer of quality control and allows us to proceed with our analysis using a reliable and clean dataset.

sum(duplicated(daily_sleep))
## [1] 0
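
To make the effect of each cleaning step visible, the following sketch applies the same `distinct()` and `drop_na()` pipeline to a small hypothetical table and reports row counts before and after. The values in `toy_sleep` are invented for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical sleep-like table: row 2 duplicates row 1, row 4 has a missing value.
toy_sleep <- data.frame(
  id = c(1, 1, 2, 3),
  totalminutesasleep = c(327, 327, 384, NA)
)

n_duplicates <- sum(duplicated(toy_sleep))       # fully identical rows
n_incomplete <- sum(!complete.cases(toy_sleep))  # rows with at least one NA

cleaned <- toy_sleep %>%
  distinct() %>%  # drop repeated rows, keeping the first occurrence
  drop_na()       # drop rows containing any missing value

c(before = nrow(toy_sleep), after = nrow(cleaned),
  duplicates = n_duplicates, incomplete = n_incomplete)
```

In the real data, daily_sleep goes from 413 rows to 410 after removing 3 duplicates, which suggests that `drop_na()` removed no further rows from that table.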

Clean and Rename Appropriate Columns

To ensure consistency among the datasets for future merging, we will standardize the column names. We first preview the snake_case names that janitor’s clean_names() would produce; because that result is not assigned, the data frames themselves are unchanged by the preview. We then convert all column names to lowercase with rename_with(tolower), so the names used from here on are lowercase without underscores (for example, activitydate rather than activity_date).

clean_names(daily_activity)
## # A tibble: 940 × 15
##            id activity_date total_steps total_distance tracker_distance
##         <dbl> <chr>               <dbl>          <dbl>            <dbl>
##  1 1503960366 4/12/2016           13162           8.5              8.5 
##  2 1503960366 4/13/2016           10735           6.97             6.97
##  3 1503960366 4/14/2016           10460           6.74             6.74
##  4 1503960366 4/15/2016            9762           6.28             6.28
##  5 1503960366 4/16/2016           12669           8.16             8.16
##  6 1503960366 4/17/2016            9705           6.48             6.48
##  7 1503960366 4/18/2016           13019           8.59             8.59
##  8 1503960366 4/19/2016           15506           9.88             9.88
##  9 1503960366 4/20/2016           10544           6.68             6.68
## 10 1503960366 4/21/2016            9819           6.34             6.34
## # ℹ 930 more rows
## # ℹ 10 more variables: logged_activities_distance <dbl>,
## #   very_active_distance <dbl>, moderately_active_distance <dbl>,
## #   light_active_distance <dbl>, sedentary_active_distance <dbl>,
## #   very_active_minutes <dbl>, fairly_active_minutes <dbl>,
## #   lightly_active_minutes <dbl>, sedentary_minutes <dbl>, calories <dbl>
daily_activity<- rename_with(daily_activity, tolower)
clean_names(daily_sleep)
## # A tibble: 410 × 5
##          id sleep_day total_sleep_records total_minutes_asleep total_time_in_bed
##       <dbl> <chr>                   <dbl>                <dbl>             <dbl>
##  1   1.50e9 4/12/201…                   1                  327               346
##  2   1.50e9 4/13/201…                   2                  384               407
##  3   1.50e9 4/15/201…                   1                  412               442
##  4   1.50e9 4/16/201…                   2                  340               367
##  5   1.50e9 4/17/201…                   1                  700               712
##  6   1.50e9 4/19/201…                   1                  304               320
##  7   1.50e9 4/20/201…                   1                  360               377
##  8   1.50e9 4/21/201…                   1                  325               364
##  9   1.50e9 4/23/201…                   1                  361               384
## 10   1.50e9 4/24/201…                   1                  430               449
## # ℹ 400 more rows
daily_sleep <- rename_with(daily_sleep, tolower)
clean_names(hourly_steps)
## # A tibble: 22,099 × 3
##            id activity_hour         step_total
##         <dbl> <chr>                      <dbl>
##  1 1503960366 4/12/2016 12:00:00 AM        373
##  2 1503960366 4/12/2016 1:00:00 AM         160
##  3 1503960366 4/12/2016 2:00:00 AM         151
##  4 1503960366 4/12/2016 3:00:00 AM           0
##  5 1503960366 4/12/2016 4:00:00 AM           0
##  6 1503960366 4/12/2016 5:00:00 AM           0
##  7 1503960366 4/12/2016 6:00:00 AM           0
##  8 1503960366 4/12/2016 7:00:00 AM           0
##  9 1503960366 4/12/2016 8:00:00 AM         250
## 10 1503960366 4/12/2016 9:00:00 AM        1864
## # ℹ 22,089 more rows
hourly_steps <- rename_with(hourly_steps, tolower)
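
The difference between the two renaming approaches is easy to miss in the long output above, so here it is on a one-row example. The `toy` data frame is hypothetical; its column names follow daily_activity.

```r
library(dplyr)
library(janitor)

toy <- data.frame(Id = 1, ActivityDate = "4/12/2016", TotalSteps = 13162)

names(clean_names(toy))           # snake_case: id, activity_date, total_steps
names(rename_with(toy, tolower))  # lowercase only: id, activitydate, totalsteps
names(toy)                        # unchanged: neither call modifies toy in place
```

Either convention works for merging as long as it is applied to every table and the result is assigned back; the later steps in this analysis rely on the lowercase names.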

Ensuring Date and Time Consistency

With the column names standardized and converted to lowercase, our attention now shifts to cleaning the date-time format in the daily_activity and daily_sleep data frames, as we intend to merge these two datasets. Given that the time component in the daily_sleep data frame can be disregarded for our analysis, we will utilize the as_date function instead of as_datetime to convert the date-time values in both data frames to date format only. This step will ensure consistency in the date representation across the datasets and facilitate the merging process. By harmonizing the date formats, we can effectively combine the relevant information from both data frames and proceed with the subsequent stages of our analysis.

daily_activity <- daily_activity %>%
  rename(date = activitydate) %>%
  mutate(date = as_date(date, format = "%m/%d/%Y"))

daily_sleep <- daily_sleep %>%
  rename(date = sleepday) %>%
  mutate(date = as_date(date,format ="%m/%d/%Y %I:%M:%S %p"))

Check Cleaned Datasets

head(daily_activity)
## # A tibble: 6 × 15
##           id date       totalsteps totaldistance trackerdistance
##        <dbl> <date>          <dbl>         <dbl>           <dbl>
## 1 1503960366 2016-04-12      13162          8.5             8.5 
## 2 1503960366 2016-04-13      10735          6.97            6.97
## 3 1503960366 2016-04-14      10460          6.74            6.74
## 4 1503960366 2016-04-15       9762          6.28            6.28
## 5 1503960366 2016-04-16      12669          8.16            8.16
## 6 1503960366 2016-04-17       9705          6.48            6.48
## # ℹ 10 more variables: loggedactivitiesdistance <dbl>,
## #   veryactivedistance <dbl>, moderatelyactivedistance <dbl>,
## #   lightactivedistance <dbl>, sedentaryactivedistance <dbl>,
## #   veryactiveminutes <dbl>, fairlyactiveminutes <dbl>,
## #   lightlyactiveminutes <dbl>, sedentaryminutes <dbl>, calories <dbl>
head(daily_sleep)
## # A tibble: 6 × 5
##           id date       totalsleeprecords totalminutesasleep totaltimeinbed
##        <dbl> <date>                 <dbl>              <dbl>          <dbl>
## 1 1503960366 2016-04-12                 1                327            346
## 2 1503960366 2016-04-13                 2                384            407
## 3 1503960366 2016-04-15                 1                412            442
## 4 1503960366 2016-04-16                 2                340            367
## 5 1503960366 2016-04-17                 1                700            712
## 6 1503960366 2016-04-19                 1                304            320
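
The reason both tables can share a date key is that the two source formats collapse to the same calendar date once the time of day is dropped. The sketch below shows this with base R's `as.Date()`, used here in place of lubridate's `as_date()` to keep the example dependency-free. It assumes an English or C locale, so that `%p` recognises "AM" and "PM".

```r
activity_date <- "4/12/2016"              # format used in dailyActivity_merged
sleep_day     <- "4/12/2016 12:00:00 AM"  # format used in sleepDay_merged

# Assumes an English/C locale so that %p recognises "AM"/"PM".
d1 <- as.Date(activity_date, format = "%m/%d/%Y")
d2 <- as.Date(sleep_day, format = "%m/%d/%Y %I:%M:%S %p")

d1 == d2  # both strings map to the same calendar date

# A string that does not match the format becomes NA without an error.
bad <- as.Date("2016-04-12", format = "%m/%d/%Y")
is.na(bad)
```

Because mismatched formats fail silently, counting missing values in the converted column (for example with `sum(is.na(...))`) is a cheap check after this step.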

Next, we convert the date-time strings in the hourly_steps dataset into a proper date-time type. The activityhour column is renamed to date_time and parsed with as.POSIXct, using the 12-hour clock format of the source data. This represents dates and times in a consistent manner and makes it possible to analyze and compare activity by time of day.

hourly_steps<- hourly_steps %>% 
  rename(date_time = activityhour) %>% 
  mutate(date_time = as.POSIXct(date_time,format ="%m/%d/%Y %I:%M:%S %p" , tz=Sys.timezone()))

head(hourly_steps)
## # A tibble: 6 × 3
##           id date_time           steptotal
##        <dbl> <dttm>                  <dbl>
## 1 1503960366 2016-04-12 00:00:00       373
## 2 1503960366 2016-04-12 01:00:00       160
## 3 1503960366 2016-04-12 02:00:00       151
## 4 1503960366 2016-04-12 03:00:00         0
## 5 1503960366 2016-04-12 04:00:00         0
## 6 1503960366 2016-04-12 05:00:00         0
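
As a quick sanity check of the format string, the sketch below parses two made-up timestamps written in the same 12-hour style as the raw data. The sample strings and the fixed "UTC" time zone are assumptions chosen for reproducibility; the analysis itself uses Sys.timezone().

```r
# Made-up timestamps in the raw "month/day/year hour:minute:second AM/PM" style
raw_times <- c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 PM")

# %I reads the 12-hour clock and %p reads the AM/PM marker
parsed_times <- as.POSIXct(raw_times,
                           format = "%m/%d/%Y %I:%M:%S %p",
                           tz = "UTC")

# Show the result on a 24-hour clock
format(parsed_times, "%Y-%m-%d %H:%M")
```

A string that does not match the format comes back as NA, so counting the NA values after the conversion is a simple way to spot parsing problems.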

Merging Datasets

To explore potential correlations between variables, we merge the daily_activity and daily_sleep datasets, using id and date together as the join key. Because merge() performs an inner join by default, the result keeps only the user-days that appear in both datasets.

daily_activity_sleep <- merge(daily_activity, daily_sleep, by=c ("id", "date"))
glimpse(daily_activity_sleep)
## Rows: 410
## Columns: 18
## $ id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ date                     <date> 2016-04-12, 2016-04-13, 2016-04-15, 2016-04-…
## $ totalsteps               <dbl> 13162, 10735, 9762, 12669, 9705, 15506, 10544…
## $ totaldistance            <dbl> 8.50, 6.97, 6.28, 8.16, 6.48, 9.88, 6.68, 6.3…
## $ trackerdistance          <dbl> 8.50, 6.97, 6.28, 8.16, 6.48, 9.88, 6.68, 6.3…
## $ loggedactivitiesdistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ veryactivedistance       <dbl> 1.88, 1.57, 2.14, 2.71, 3.19, 3.53, 1.96, 1.3…
## $ moderatelyactivedistance <dbl> 0.55, 0.69, 1.26, 0.41, 0.78, 1.32, 0.48, 0.3…
## $ lightactivedistance      <dbl> 6.06, 4.71, 2.83, 5.04, 2.51, 5.03, 4.24, 4.6…
## $ sedentaryactivedistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ veryactiveminutes        <dbl> 25, 21, 29, 36, 38, 50, 28, 19, 41, 39, 73, 3…
## $ fairlyactiveminutes      <dbl> 13, 19, 34, 10, 20, 31, 12, 8, 21, 5, 14, 23,…
## $ lightlyactiveminutes     <dbl> 328, 217, 209, 221, 164, 264, 205, 211, 262, …
## $ sedentaryminutes         <dbl> 728, 776, 726, 773, 539, 775, 818, 838, 732, …
## $ calories                 <dbl> 1985, 1797, 1745, 1863, 1728, 2035, 1786, 177…
## $ totalsleeprecords        <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ totalminutesasleep       <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, …
## $ totaltimeinbed           <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, …
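
To illustrate what the merge keeps and drops, here is a minimal sketch on two hypothetical miniature tables. The data values are made up; only the merge() call mirrors the one used above.

```r
# Hypothetical miniature versions of the two daily tables
activity <- data.frame(
  id = c(1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  totalsteps = c(13162, 10735, 4000)
)
sleep <- data.frame(
  id = c(1, 2, 2),
  date = as.Date(c("2016-04-12", "2016-04-12", "2016-04-14")),
  totalminutesasleep = c(327, 400, 380)
)

# Inner join (the default): keeps only id/date pairs present in both tables
inner <- merge(activity, sleep, by = c("id", "date"))

# Left join: keeps every activity row and fills missing sleep values with NA
left <- merge(activity, sleep, by = c("id", "date"), all.x = TRUE)

nrow(inner)
nrow(left)
```

With the default inner join, days without a sleep record are dropped. That is appropriate when comparing steps with sleep, but it also means the merged table contains fewer days than daily_activity.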

Phase 4: Analyze & Phase 5: Share

Combined Section

Organization and Identification + Data Visualization

To support Bellabeat’s marketing strategy, we analyze the Fitbit data for trends in user engagement, activity levels, sleep patterns, and related metrics. These trends can help Bellabeat tailor its marketing to the needs and preferences of its target audience, with the aim of improving customer satisfaction and driving business growth.

Activity Level

Although demographic variables are not available in our dataset, we can still characterize users by their activity level, using their average daily step counts. We categorize users into four groups:

  • Sedentary: fewer than 5,000 steps per day.
  • Lightly active: 5,000 to 7,499 steps per day.
  • Fairly active: 7,500 to 9,999 steps per day.
  • Very active: 10,000 or more steps per day.

This classification is based on an article from the 10,000 Steps organization, which provides guidelines for step counts and activity levels (https://www.10000steps.org.au/articles/counting-steps/). Categorizing users in this way shows how activity levels are distributed among the Fitbit users in our sample and helps identify user segments that Bellabeat’s marketing can target.
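
The thresholds above can be sketched with base R's cut(), shown here as an alternative to a case_when() chain. The step values are made up, and the sketch assumes each range is closed on its lower bound so that fractional averages such as 7499.5 still receive a category.

```r
# Made-up average daily step counts, including values at each boundary
avg_steps <- c(4999, 5000, 7499.5, 7500, 9999.5, 10000)

# right = FALSE makes each interval closed on the left: [lower, upper)
user_type <- cut(avg_steps,
                 breaks = c(-Inf, 5000, 7500, 10000, Inf),
                 labels = c("sedentary", "lightly active",
                            "fairly active", "very active"),
                 right = FALSE)

data.frame(avg_steps, user_type)
```

Because averages are rarely whole numbers, defining the ranges as contiguous intervals avoids gaps in which a user would be left unclassified.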

Daily Data by User
daily_average <- daily_activity_sleep %>%
  group_by(id) %>%
  summarise (mean_daily_steps = mean(totalsteps), mean_daily_calories = mean(calories), mean_daily_sleep = mean(totalminutesasleep))

head(daily_average)
## # A tibble: 6 × 4
##           id mean_daily_steps mean_daily_calories mean_daily_sleep
##        <dbl>            <dbl>               <dbl>            <dbl>
## 1 1503960366           12406.               1872.             360.
## 2 1644430081            7968.               2978.             294 
## 3 1844505072            3477                1676.             652 
## 4 1927972279            1490                2316.             417 
## 5 2026352035            5619.               1541.             506.
## 6 2320127002            5079                1804               61
Classification
daily_average <- daily_activity_sleep %>% 
  group_by(id) %>% 
  summarise(avg_daily_steps = mean(totalsteps), 
            avg_daily_cal = mean(calories), 
            avg_daily_sleep = mean(totalminutesasleep, na.rm = TRUE)) %>% 
  # Each range is closed on its lower bound, so fractional averages
  # (for example 7499.5 or 9999.5) always fall into exactly one category.
  mutate(user_type = case_when(
    avg_daily_steps < 5000 ~ "sedentary",
    avg_daily_steps >= 5000 & avg_daily_steps < 7500 ~ "lightly active",
    avg_daily_steps >= 7500 & avg_daily_steps < 10000 ~ "fairly active",
    avg_daily_steps >= 10000 ~ "very active"
  ))

head(daily_average)
## # A tibble: 6 × 5
##           id avg_daily_steps avg_daily_cal avg_daily_sleep user_type     
##        <dbl>           <dbl>         <dbl>           <dbl> <chr>         
## 1 1503960366          12406.         1872.            360. very active   
## 2 1644430081           7968.         2978.            294  fairly active 
## 3 1844505072           3477          1676.            652  sedentary     
## 4 1927972279           1490          2316.            417  sedentary     
## 5 2026352035           5619.         1541.            506. lightly active
## 6 2320127002           5079          1804              61  lightly active
Data Frame Creation
user_type_percent <- daily_average %>%
  group_by(user_type) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(user_type) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales::percent(total_percent))

user_type_percent$user_type <- factor(user_type_percent$user_type , levels = c("very active", "fairly active", "lightly active", "sedentary"))


head(user_type_percent)
## # A tibble: 4 × 3
##   user_type      total_percent labels
##   <fct>                  <dbl> <chr> 
## 1 fairly active          0.375 38%   
## 2 lightly active         0.208 21%   
## 3 sedentary              0.208 21%   
## 4 very active            0.208 21%
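
The same kind of share calculation can be sketched in base R with table() and prop.table(). The user types below are made up for eight hypothetical users and are not taken from the dataset.

```r
# Made-up user types for eight hypothetical users
user_type <- c("sedentary", "very active", "fairly active", "fairly active",
               "lightly active", "fairly active", "sedentary", "very active")

# table() counts users per category; prop.table() turns counts into shares
shares <- prop.table(table(user_type))
shares

# Format the shares as percentage labels for plotting
labels <- paste0(round(100 * shares), "%")
names(labels) <- names(shares)
labels
```
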
User Distribution
user_type_percent %>%
  ggplot(aes(x="",y=total_percent, fill=user_type)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+
  theme_minimal()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold")) +
  scale_fill_manual(values = c("#85e085","#FDFD96", "#FFB347", "#ff8080")) +
  geom_text(aes(label = labels),
            position = position_stack(vjust = 0.5))+
  labs(title="User Type Distribution")

Steps + Minutes Asleep by Day of Week

Our objective is to determine the days of the week on which users are most active and on which they sleep the most. We also assess whether users meet the recommended levels of daily steps and sleep duration.

To do this, we extract the day of the week from the date column and then calculate the average number of steps walked and minutes slept for each day.

weekday_steps_sleep <- daily_activity_sleep %>%
  mutate(weekday = weekdays(date))

weekday_steps_sleep$weekday <-ordered(weekday_steps_sleep$weekday, levels=c("Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday"))

 weekday_steps_sleep <-weekday_steps_sleep%>%
  group_by(weekday) %>%
  summarize (daily_steps = mean(totalsteps), daily_sleep = mean(totalminutesasleep))

head(weekday_steps_sleep)
## # A tibble: 6 × 3
##   weekday   daily_steps daily_sleep
##   <ord>           <dbl>       <dbl>
## 1 Monday          9273.        420.
## 2 Tuesday         9183.        405.
## 3 Wednesday       8023.        435.
## 4 Thursday        8184.        401.
## 5 Friday          7901.        405.
## 6 Saturday        9871.        419.
ggarrange(
    ggplot(weekday_steps_sleep) +
      geom_col(aes(weekday, daily_steps), fill = "#FFB347") +
      geom_hline(yintercept = 10000) +
      labs(title = "Daily Steps Per Weekday", x= "", y = "") +
      theme(axis.text.x = element_text(angle = 45,vjust = 0.5, hjust = 1)),
    ggplot(weekday_steps_sleep, aes(weekday, daily_sleep)) +
      geom_col(fill = "#FDFD96") +
      geom_hline(yintercept = 480) +
      labs(title = "Minutes Asleep Per Weekday", x= "", y = "") +
      theme(axis.text.x = element_text(angle = 45,vjust = 0.5, hjust = 1))
  )

In the graphs above, we can observe the following:

  • On average, users fall short of the recommended 10,000 steps per day.
  • On average, users do not reach the recommended 8 hours (480 minutes) of sleep on any day of the week.
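
These observations can also be checked numerically. The sketch below uses the rounded averages from the six rows printed above (Sunday is not shown in that output and is therefore left out) and compares them with the 10,000-step and 480-minute targets. The column names meets_steps, meets_sleep, steps_gap, and sleep_gap are introduced here for illustration only.

```r
# Rounded averages copied from the six rows printed above
weekday_avg <- data.frame(
  weekday = c("Monday", "Tuesday", "Wednesday",
              "Thursday", "Friday", "Saturday"),
  daily_steps = c(9273, 9183, 8023, 8184, 7901, 9871),
  daily_sleep = c(420, 405, 435, 401, 405, 419)
)

# Flag the days on which each average reaches the recommended level
weekday_avg$meets_steps <- weekday_avg$daily_steps >= 10000
weekday_avg$meets_sleep <- weekday_avg$daily_sleep >= 480  # 8 hours = 480 minutes

# Shortfall relative to each target
weekday_avg$steps_gap <- 10000 - weekday_avg$daily_steps
weekday_avg$sleep_gap <- 480 - weekday_avg$daily_sleep

weekday_avg
```

For the six days shown, neither target is reached; Saturday comes closest on steps and Wednesday comes closest on sleep.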

Activity Level by Hour of Day

Hourly Step Breakdown

To further analyze user activity patterns throughout the day, we will focus on the hourly_steps data frame. We will extract the time component from the date_time column.

library(hms)

hourly_steps <- hourly_steps %>%
  mutate(time = as_hms(date_time))

head(hourly_steps)
## # A tibble: 6 × 4
##           id date_time           steptotal time  
##        <dbl> <dttm>                  <dbl> <time>
## 1 1503960366 2016-04-12 00:00:00       373 00:00 
## 2 1503960366 2016-04-12 01:00:00       160 01:00 
## 3 1503960366 2016-04-12 02:00:00       151 02:00 
## 4 1503960366 2016-04-12 03:00:00         0 03:00 
## 5 1503960366 2016-04-12 04:00:00         0 04:00 
## 6 1503960366 2016-04-12 05:00:00         0 05:00
hourly_steps %>%
  group_by(time) %>%
  summarize(average_steps = mean(steptotal)) %>%
  ggplot() +
  geom_col(mapping = aes(x=time, y = average_steps, fill = average_steps)) + 
  labs(title = "Hourly Steps Throughout The Day", x="", y="") + 
  scale_fill_gradient(low = "#ff8080", high = "#85e085") +
  theme(axis.text.x = element_text(angle = 90)) +
  scale_y_continuous(labels = function(x) paste0(x, " Steps"))

Users are most active between 8 am and 7 pm. Within that window, activity peaks around lunchtime (12 pm to 2 pm) and again in the early evening (5 pm to 7 pm).
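
The hourly averaging behind this chart can be sketched in base R as follows. The records are made up for two hypothetical users, and format() with tapply() stands in for the as_hms() and group_by() steps used above.

```r
# Made-up hourly records for two hypothetical users
hourly <- data.frame(
  id = c(1, 1, 1, 2, 2, 2),
  date_time = as.POSIXct(c("2016-04-12 08:00:00", "2016-04-12 12:00:00",
                           "2016-04-12 18:00:00", "2016-04-12 08:00:00",
                           "2016-04-12 12:00:00", "2016-04-12 18:00:00"),
                         tz = "UTC"),
  steptotal = c(300, 900, 700, 500, 1100, 900)
)

# Extract the hour of day as a two-digit string
hourly$hour <- format(hourly$date_time, "%H")

# Average the step counts across users for each hour
avg_by_hour <- tapply(hourly$steptotal, hourly$hour, mean)
avg_by_hour

# Hour with the highest average
names(which.max(avg_by_hour))
```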

Activity Level by Date of Month

hourly_steps <- hourly_steps %>%
  mutate(date = as.Date(date_time))

head(hourly_steps)
## # A tibble: 6 × 5
##           id date_time           steptotal time   date      
##        <dbl> <dttm>                  <dbl> <time> <date>    
## 1 1503960366 2016-04-12 00:00:00       373 00:00  2016-04-12
## 2 1503960366 2016-04-12 01:00:00       160 01:00  2016-04-12
## 3 1503960366 2016-04-12 02:00:00       151 02:00  2016-04-12
## 4 1503960366 2016-04-12 03:00:00         0 03:00  2016-04-12
## 5 1503960366 2016-04-12 04:00:00         0 04:00  2016-04-12
## 6 1503960366 2016-04-12 05:00:00         0 05:00  2016-04-12
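# n() counts the hourly records logged on each date; dividing by 33
# (the number of users assumed for this sample) gives the average number
# of hours recorded per user on that date.
hourly_step_trend <- hourly_steps %>%
  group_by(date) %>%
  summarise(average_hr = n() / 33)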

head(hourly_step_trend)
## # A tibble: 6 × 2
##   date       average_hr
##   <date>          <dbl>
## 1 2016-04-12       24  
## 2 2016-04-13       24  
## 3 2016-04-14       24  
## 4 2016-04-15       23.8
## 5 2016-04-16       23.3
## 6 2016-04-17       23.3
ggplot(hourly_step_trend, aes(x = date, y = average_hr)) +
  # linewidth replaces size for lines as of ggplot2 3.4.0
  geom_line(color = "#006699", linewidth = 1) +
  scale_x_date(breaks = date_breaks("1 day"), 
               labels = date_format("%b-%d"), 
               limits = c(min(hourly_step_trend$date), max(hourly_step_trend$date)),
               expand = c(0.02, 0.02)) +
  scale_y_continuous(limits = c(0, 25),
                     breaks = seq(0, max(hourly_step_trend$average_hr), by = 4),
                     expand = c(0, 0.7)) +
  labs(title = "Daily Usage in a Month", 
       x = "Date", y = "Worn Hours per Day",
       caption = "Fitabase Data 4.12.16-5.12.16") +
  scale_fill_brewer(palette = "BuPu") + 
  annotate("rect", xmin = as.Date("2016-04-28"), 
           xmax = as.Date("2016-05-11"),
           ymin = -Inf, ymax = Inf, 
           fill = "#E0ECF4", alpha = 0.4) + 
  theme(axis.text.x = element_text(angle = 90), 
        plot.title = element_text(size = 16),
        panel.grid.major.x = element_line(color = "grey60",
                                          linetype = "solid", linewidth = 0.1),
        panel.background = element_blank())

The line chart shows a downward trend. At the start of the period, users wear the tracker almost all day, averaging roughly 23 to 24 hours per day. From April 28th onward, a little over two weeks into the period, the wearing time gradually decreases.
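
The "worn hours per day" figure is the number of hourly records on a date divided by the number of users. A minimal sketch of that arithmetic, using three hypothetical users instead of the full sample:

```r
# Made-up hourly records: users 1 and 2 log all 24 hours on both days,
# user 3 logs 24 hours on the first day but only 12 on the second
records <- rbind(
  data.frame(id = rep(1:3, each = 24), date = "2016-04-12"),
  data.frame(id = rep(1:2, each = 24), date = "2016-04-13"),
  data.frame(id = rep(3, 12), date = "2016-04-13")
)

# Number of distinct users in the toy sample
n_users <- length(unique(records$id))

# Hourly records per date divided by the number of users
avg_hours <- sapply(split(records$id, records$date), length) / n_users
avg_hours
```

Note that a user who logs nothing on a date still counts in the denominator, so the average falls both when users wear the device for fewer hours and when they skip a day entirely.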

Correlations

Next, we will investigate whether there is a relationship between different variables by examining their correlation:

  1. We will analyze the correlation between daily steps and daily sleep.
  2. We will examine the correlation between daily steps and calories burned.

By exploring these correlations, we can gain insights into the potential connections between the variables and uncover any patterns or dependencies that may exist.

ggarrange(
ggplot(daily_activity_sleep, aes(x=totalsteps, y=totalminutesasleep))+
  geom_jitter() +
  geom_smooth(color = "#ff8080") + 
  labs(title = "Daily Steps vs Minutes Asleep", x = "Daily Steps", y= "Minutes Asleep") +
   theme(panel.background = element_blank(),
        plot.title = element_text( size=14)), 
ggplot(daily_activity_sleep, aes(x=totalsteps, y=calories))+
  geom_jitter() +
  geom_smooth(color = "#ff8080") + 
  labs(title = "Daily Steps vs Calories Burned", x = "Daily Steps", y= "Calories Burned") +
   theme(panel.background = element_blank(),
        plot.title = element_text( size=14))
)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Based on our plots and analysis:

  1. The plot shows no clear relationship between daily steps and minutes asleep. In this sample, taking more steps on a given day is not associated with sleeping noticeably more or less.

  2. On the other hand, we observe a positive correlation between the number of steps walked and the calories burned. As expected, as users take more steps, they tend to burn more calories. This finding aligns with the assumption that increased physical activity leads to a higher calorie expenditure.

These insights provide valuable information about the relationships between these variables and can help inform our understanding of user behavior and the impact of physical activity on health and calorie expenditure.
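
A correlation coefficient can put a number on these visual impressions. The sketch below applies cor(), which computes the Pearson coefficient by default, to made-up values constructed so that calories rise exactly with steps while sleep is unrelated to them. On the real data the equivalent call would be cor(daily_activity_sleep$totalsteps, daily_activity_sleep$totalminutesasleep); its value is not reported here.

```r
# Made-up daily values for five hypothetical user-days
steps    <- c(2000, 4000, 6000, 8000, 10000)
calories <- 1500 + 0.1 * steps          # rises linearly with steps
sleep    <- c(420, 380, 400, 380, 420)  # no linear relation to steps

# Pearson correlation: near 1 means strongly positive, near 0 means none
cor(steps, calories)
cor(steps, sleep)
```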

Accounting for Smart Devices

After analyzing trends in activity, sleep, and calories burned, our focus now shifts to understanding the frequency of device usage among users in our sample. This information will help us shape our marketing strategy and identify features that can enhance the user experience of smart devices.

To accomplish this, we count the number of days each user used their device during the 31-day interval and classify users into three categories:

  • High Use: Users who use their device on 21 to 31 days.
  • Moderate Use: Users who use their device on 11 to 20 days.
  • Low Use: Users who use their device on 1 to 10 days.

To begin, we create a new data frame by grouping the data by user id and counting the number of days recorded for each user. We then add a column that assigns each user a usage category based on that count. Because the count is taken from the merged daily_activity_sleep data, a day is counted only if it has both an activity record and a sleep record.

By performing this analysis, we will gain insights into the usage patterns of our sample users and be able to tailor our marketing strategy and device features to meet their needs effectively.

daily_use <- daily_activity_sleep %>%
  group_by(id) %>%
  # n() counts the rows (one per recorded day) for each user
  summarize(days_used = n()) %>%
  mutate(usage = case_when(
    days_used >= 1 & days_used <= 10 ~ "low use",
    days_used >= 11 & days_used <= 20 ~ "moderate use", 
    days_used >= 21 & days_used <= 31 ~ "high use"
  ))
  
head(daily_use)
## # A tibble: 6 × 3
##           id days_used usage   
##        <dbl>     <int> <chr>   
## 1 1503960366        25 high use
## 2 1644430081         4 low use 
## 3 1844505072         3 low use 
## 4 1927972279         5 low use 
## 5 2026352035        28 high use
## 6 2320127002         1 low use
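
The three usage bands can also be expressed with base R's cut(), which makes the boundaries explicit. This is a sketch on made-up day counts placed at the edges of each band, not a replacement for the code above.

```r
# Made-up day counts at the edges of each usage band
days_used <- c(1, 10, 11, 20, 21, 31)

# Default right = TRUE gives the intervals (0, 10], (10, 20], (20, 31]
usage <- cut(days_used,
             breaks = c(0, 10, 20, 31),
             labels = c("low use", "moderate use", "high use"))

data.frame(days_used, usage)
```
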
Percentage Data
daily_use_percent <- daily_use %>%
  group_by(usage) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(usage) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales::percent(total_percent))

daily_use_percent$usage <- factor(daily_use_percent$usage, levels = c("high use", "moderate use", "low use"))

head(daily_use_percent)
## # A tibble: 3 × 3
##   usage        total_percent labels
##   <fct>                <dbl> <chr> 
## 1 high use             0.5   50%   
## 2 low use              0.375 38%   
## 3 moderate use         0.125 12%
Utilization of Dataframe
daily_use_percent %>%
  ggplot(aes(x="",y=total_percent, fill=usage)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+
  theme_minimal()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold")) +
  geom_text(aes(label = labels),
            position = position_stack(vjust = 0.5))+
  scale_fill_manual(values = c("#b784ff","#c6a1ff","#d4baff"),
                    labels = c("High Use - 21 to 31 days",
                                 "Moderate Use - 11 to 20 days",
                                 "Low Use - 1 to 10 days"))+
  labs(title="Daily Use of Smart Device")

Analyzing our results, we can observe the following:

  • Approximately 50% of the users in our sample use their device frequently, using it between 21 and 31 days within the given date interval.
  • Around 12% of the users use their device moderately, using it between 11 and 20 days.
  • The remaining 38% of the sample uses their device rarely, with usage ranging from 1 to 10 days.
Time Usage of Smart Device

To gain more precise insights, we aim to determine the duration in minutes for which users wear their devices each day. To achieve this, we will merge the previously created daily_use data frame with the daily_activity data frame. This merging process will enable us to filter and analyze the results based on the daily usage of the device.

daily_use_merged <- merge(daily_activity, daily_use, by=c ("id"))
head(daily_use_merged)
##           id       date totalsteps totaldistance trackerdistance
## 1 1503960366 2016-05-07      11992          7.71            7.71
## 2 1503960366 2016-05-06      12159          8.03            8.03
## 3 1503960366 2016-05-01      10602          6.81            6.81
## 4 1503960366 2016-04-30      14673          9.25            9.25
## 5 1503960366 2016-04-12      13162          8.50            8.50
## 6 1503960366 2016-04-13      10735          6.97            6.97
##   loggedactivitiesdistance veryactivedistance moderatelyactivedistance
## 1                        0               2.46                     2.12
## 2                        0               1.97                     0.25
## 3                        0               2.29                     1.60
## 4                        0               3.56                     1.42
## 5                        0               1.88                     0.55
## 6                        0               1.57                     0.69
##   lightactivedistance sedentaryactivedistance veryactiveminutes
## 1                3.13                       0                37
## 2                5.81                       0                24
## 3                2.92                       0                33
## 4                4.27                       0                52
## 5                6.06                       0                25
## 6                4.71                       0                21
##   fairlyactiveminutes lightlyactiveminutes sedentaryminutes calories days_used
## 1                  46                  175              833     1821        25
## 2                   6                  289              754     1896        25
## 3                  35                  246              730     1820        25
## 4                  34                  217              712     1947        25
## 5                  13                  328              728     1985        25
## 6                  19                  217              776     1797        25
##      usage
## 1 high use
## 2 high use
## 3 high use
## 4 high use
## 5 high use
## 6 high use

To further analyze the usage patterns of the device, we will create a new data frame that calculates the total number of minutes users wore their device each day. We will then categorize the usage into three distinct categories:

  1. “All day” - indicating that the device was worn for the entire day.
  2. “More than half day” - indicating that the device was worn for more than half of the day.
  3. “Less than half day” - indicating that the device was worn for less than half of the day.

This categorization will provide insights into the extent of device usage by the users on a daily basis.

minutes_worn <- daily_use_merged %>% 
  mutate(total_minutes_worn = veryactiveminutes+fairlyactiveminutes+lightlyactiveminutes+sedentaryminutes)%>%
  mutate (percent_minutes_worn = (total_minutes_worn/1440)*100) %>%
  mutate (worn = case_when(
    percent_minutes_worn == 100 ~ "All day",
    percent_minutes_worn < 100 & percent_minutes_worn >= 50~ "More than half day", 
    percent_minutes_worn < 50 & percent_minutes_worn > 0 ~ "Less than half day"
  ))

head(minutes_worn)
##           id       date totalsteps totaldistance trackerdistance
## 1 1503960366 2016-05-07      11992          7.71            7.71
## 2 1503960366 2016-05-06      12159          8.03            8.03
## 3 1503960366 2016-05-01      10602          6.81            6.81
## 4 1503960366 2016-04-30      14673          9.25            9.25
## 5 1503960366 2016-04-12      13162          8.50            8.50
## 6 1503960366 2016-04-13      10735          6.97            6.97
##   loggedactivitiesdistance veryactivedistance moderatelyactivedistance
## 1                        0               2.46                     2.12
## 2                        0               1.97                     0.25
## 3                        0               2.29                     1.60
## 4                        0               3.56                     1.42
## 5                        0               1.88                     0.55
## 6                        0               1.57                     0.69
##   lightactivedistance sedentaryactivedistance veryactiveminutes
## 1                3.13                       0                37
## 2                5.81                       0                24
## 3                2.92                       0                33
## 4                4.27                       0                52
## 5                6.06                       0                25
## 6                4.71                       0                21
##   fairlyactiveminutes lightlyactiveminutes sedentaryminutes calories days_used
## 1                  46                  175              833     1821        25
## 2                   6                  289              754     1896        25
## 3                  35                  246              730     1820        25
## 4                  34                  217              712     1947        25
## 5                  13                  328              728     1985        25
## 6                  19                  217              776     1797        25
##      usage total_minutes_worn percent_minutes_worn               worn
## 1 high use               1091             75.76389 More than half day
## 2 high use               1073             74.51389 More than half day
## 3 high use               1044             72.50000 More than half day
## 4 high use               1015             70.48611 More than half day
## 5 high use               1094             75.97222 More than half day
## 6 high use               1033             71.73611 More than half day
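
To make the category rules concrete, the sketch below wraps them in a small helper and applies it to a few minute totals. The function name classify_wear is hypothetical, and the inputs are illustrative (1091 is the first total shown above).

```r
# Hypothetical helper that mirrors the three wear categories
classify_wear <- function(total_minutes) {
  # Share of the 1440 minutes in a day during which the device logged data
  percent_worn <- total_minutes / 1440 * 100
  ifelse(percent_worn == 100, "All day",
    ifelse(percent_worn < 100 & percent_worn >= 50, "More than half day",
      ifelse(percent_worn < 50 & percent_worn > 0, "Less than half day",
             NA_character_)))
}

# A full day, about 76% of a day, under half a day, and no wear at all
worn_example <- classify_wear(c(1440, 1091, 600, 0))
worn_example
```

A day with zero logged minutes matches none of the three rules and is returned as NA, so such days would need separate handling if they occur.
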
Introduction of Additional Dataframes

To enhance the visualization of our findings, we will generate four distinct data frames. These data frames will allow us to present the results in a comprehensive manner.

The first data frame gives, across all recorded user-days, the share of days that fall into each of the three wear categories defined above.

The remaining three data frames will be filtered based on the daily use categories, enabling us to observe the variations in both the frequency of device usage and the duration of usage.

By organizing the data in this manner, we can gain a deeper understanding of the relationship between daily use patterns and the amount of time users wear the device.

minutes_worn_percent<- minutes_worn%>%
  group_by(worn) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(worn) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales::percent(total_percent))


minutes_worn_highuse <- minutes_worn%>%
  filter (usage == "high use")%>%
  group_by(worn) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(worn) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales::percent(total_percent))

minutes_worn_moduse <- minutes_worn%>%
  filter(usage == "moderate use") %>%
  group_by(worn) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(worn) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales::percent(total_percent))

minutes_worn_lowuse <- minutes_worn%>%
  filter (usage == "low use") %>%
  group_by(worn) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(worn) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales::percent(total_percent))

minutes_worn_percent$worn <- factor(minutes_worn_percent$worn, levels = c("All day", "More than half day", "Less than half day"))
minutes_worn_highuse$worn <- factor(minutes_worn_highuse$worn, levels = c("All day", "More than half day", "Less than half day"))
minutes_worn_moduse$worn <- factor(minutes_worn_moduse$worn, levels = c("All day", "More than half day", "Less than half day"))
minutes_worn_lowuse$worn <- factor(minutes_worn_lowuse$worn, levels = c("All day", "More than half day", "Less than half day"))

head(minutes_worn_percent)
## # A tibble: 3 × 3
##   worn               total_percent labels
##   <fct>                      <dbl> <chr> 
## 1 All day                   0.365  36%   
## 2 Less than half day        0.0351 4%    
## 3 More than half day        0.600  60%
head(minutes_worn_highuse)
## # A tibble: 3 × 3
##   worn               total_percent labels
##   <fct>                      <dbl> <chr> 
## 1 All day                   0.0676 6.8%  
## 2 Less than half day        0.0432 4.3%  
## 3 More than half day        0.889  88.9%
head(minutes_worn_moduse)
## # A tibble: 3 × 3
##   worn               total_percent labels
##   <fct>                      <dbl> <chr> 
## 1 All day                    0.267 27%   
## 2 Less than half day         0.04  4%    
## 3 More than half day         0.693 69%
head(minutes_worn_lowuse)
## # A tibble: 3 × 3
##   worn               total_percent labels
##   <fct>                      <dbl> <chr> 
## 1 All day                   0.802  80%   
## 2 Less than half day        0.0224 2%    
## 3 More than half day        0.175  18%
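
Since the four data frames are built with the same steps and differ only in the filter, the calculation could be wrapped in one helper. The sketch below is a base R illustration on made-up rows; the function name share_by_wear and its usage_level argument are hypothetical and are not part of the analysis above.

```r
# Hypothetical helper: share of days per wear category,
# optionally restricted to one usage group
share_by_wear <- function(data, usage_level = NULL) {
  if (!is.null(usage_level)) {
    data <- data[data$usage == usage_level, ]
  }
  shares <- prop.table(table(data$worn))
  data.frame(worn = names(shares),
             total_percent = as.numeric(shares))
}

# Made-up user-days
toy <- data.frame(
  usage = c("high use", "high use", "high use", "low use", "low use"),
  worn  = c("All day", "More than half day", "More than half day",
            "All day", "All day")
)

share_by_wear(toy)              # all user-days
share_by_wear(toy, "high use")  # only high-use users
```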

After creating the four data frames and organizing the worn level categories, we can now visualize our results through a set of plots. These plots have been carefully arranged together to enhance the visual representation and facilitate comparisons between different aspects of our analysis.

Multi-Visualization
ggarrange(
  ggplot(minutes_worn_percent, aes(x="",y=total_percent, fill=worn)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+
  theme_minimal()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5)) +
    scale_fill_manual(values = c("#FF6FA8", "#FF9AA2", "#FFC0CB"))+
  geom_text(aes(label = labels),
            position = position_stack(vjust = 0.5), size = 3.5)+
  labs(title="Time Worn Per Day", subtitle = "Total Users"),
  ggarrange(
  ggplot(minutes_worn_highuse, aes(x="",y=total_percent, fill=worn)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+
  theme_minimal()+
    theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5), 
        legend.position = "none")+
    scale_fill_manual(values = c("#FF6FA8", "#FF9AA2", "#FFC0CB"))+
  geom_text_repel(aes(label = labels),
            position = position_stack(vjust = 0.5), size = 3)+
  labs(title="", subtitle = "High Use Users"), 
  ggplot(minutes_worn_moduse, aes(x="",y=total_percent, fill=worn)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+
  theme_minimal()+
    theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold"), 
        plot.subtitle = element_text(hjust = 0.5),
        legend.position = "none") +
    scale_fill_manual(values = c("#FF6FA8", "#FF9AA2", "#FFC0CB"))+
  geom_text(aes(label = labels),
            position = position_stack(vjust = 0.5), size = 3)+
  labs(title="", subtitle = "Moderate Use Users"), 
  ggplot(minutes_worn_lowuse, aes(x="",y=total_percent, fill=worn)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+
  theme_minimal()+
    theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold"), 
        plot.subtitle = element_text(hjust = 0.5),
        legend.position = "none") +
    scale_fill_manual(values = c("#FF6FA8", "#FF9AA2", "#FFC0CB"))+
  geom_text(aes(label = labels),
            position = position_stack(vjust = 0.5), size = 3)+
  labs(title="", subtitle = "Low Use Users"), 
  ncol = 3), 
  nrow = 2)

Based on our plots, we observe that 36% of the total users wear the device throughout the entire day, 60% wear it for more than half of the day, and only 4% wear it for less than half of the day.

When we consider the usage patterns based on the number of days the device is used, we find the following results:

  • High users: Among the users who use their device for 21 to 31 days, only 6.8% wear it all day. However, 88.9% wear the device for more than half of the day, but not the entire day.
  • Moderate users: This category of users tends to wear the device for a shorter duration on a daily basis.
  • Low users: Interestingly, users who fall into the low usage category tend to wear their device for a longer duration on the days they use it.

These findings highlight the varying usage patterns among different user groups, suggesting that marketing strategies and features can be tailored accordingly to better serve their needs.

Activities of Each User Type: Steps, Calories, Distance and Sleep

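# Attach each user's activity classification (user_type) to the daily records
activity_sleep_final <- merge(daily_activity_sleep, daily_average[c("id", "user_type")], by = "id")

# Order the levels from least to most active so the plots follow that order
activity_sleep_final$user_type <- ordered(activity_sleep_final$user_type, levels = c("sedentary", "lightly active", "fairly active", "very active"))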

head(activity_sleep_final)
##           id       date totalsteps totaldistance trackerdistance
## 1 1503960366 2016-04-12      13162          8.50            8.50
## 2 1503960366 2016-04-13      10735          6.97            6.97
## 3 1503960366 2016-04-15       9762          6.28            6.28
## 4 1503960366 2016-04-16      12669          8.16            8.16
## 5 1503960366 2016-04-17       9705          6.48            6.48
## 6 1503960366 2016-04-19      15506          9.88            9.88
##   loggedactivitiesdistance veryactivedistance moderatelyactivedistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.14                     1.26
## 4                        0               2.71                     0.41
## 5                        0               3.19                     0.78
## 6                        0               3.53                     1.32
##   lightactivedistance sedentaryactivedistance veryactiveminutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                2.83                       0                29
## 4                5.04                       0                36
## 5                2.51                       0                38
## 6                5.03                       0                50
##   fairlyactiveminutes lightlyactiveminutes sedentaryminutes calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  34                  209              726     1745
## 4                  10                  221              773     1863
## 5                  20                  164              539     1728
## 6                  31                  264              775     2035
##   totalsleeprecords totalminutesasleep totaltimeinbed   user_type
## 1                 1                327            346 very active
## 2                 2                384            407 very active
## 3                 1                412            442 very active
## 4                 2                340            367 very active
## 5                 1                700            712 very active
## 6                 1                304            320 very active

Steps vs Types

ggplot(activity_sleep_final[which(activity_sleep_final$totalsteps > 0), ], 
       aes(user_type, totalsteps, fill = user_type)) +
  geom_boxplot() +
  # White diamond marks the mean of each group
  stat_summary(fun = "mean", geom = "point", 
               shape = 23, size = 2, fill = "white") +
  labs(title = "Daily Steps by User Type", 
       x = " ", y = "total steps") +
  scale_fill_brewer(palette = "BuPu") +
  theme(plot.title = element_text(hjust = 0.5, vjust = 0.8, size = 16),
        legend.position = "none")

The box plot above shows that users classified as “very active” typically take more than 10,000 steps per day, with some outliers exceeding 20,000 steps. This wide range indicates considerable day-to-day variation in activity within this user type.
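
To put numbers behind a box plot like this, the mean and median per user type can be tabulated. The sketch below is a minimal base R illustration. It uses a small made-up data frame (`toy_activity`, a hypothetical name with invented values) in place of `activity_sleep_final`, so the figures it prints are not results from this study.

```r
# Hypothetical toy data standing in for activity_sleep_final (values are made up)
toy_activity <- data.frame(
  user_type  = c("sedentary", "sedentary", "lightly active", "lightly active",
                 "fairly active", "fairly active", "very active", "very active"),
  totalsteps = c(3000, 4000, 6000, 7000, 9000, 9500, 12000, 14000)
)

# Same level ordering as in the analysis above
toy_activity$user_type <- ordered(
  toy_activity$user_type,
  levels = c("sedentary", "lightly active", "fairly active", "very active")
)

# Mean and median daily steps for each user type
mean_steps   <- tapply(toy_activity$totalsteps, toy_activity$user_type, mean)
median_steps <- tapply(toy_activity$totalsteps, toy_activity$user_type, median)

steps_summary <- data.frame(
  user_type    = names(mean_steps),
  mean_steps   = as.vector(mean_steps),
  median_steps = as.vector(median_steps)
)
print(steps_summary)
```

Running the same two `tapply()` calls on `activity_sleep_final` (after filtering out days with zero steps, as the plot does) would give the values that the white diamonds and the box midlines represent.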

Calories vs Types

ggplot(activity_sleep_final[which(activity_sleep_final$calories > 0), ], 
       aes(user_type, calories, fill = user_type)) +
  geom_boxplot() +
  # White diamond marks the mean of each group
  stat_summary(fun = "mean", geom = "point", 
               shape = 23, size = 2, fill = "white") +
  labs(title = "Daily Calories Burnt by User Type", 
       x = " ", y = "calories burnt") +
  scale_fill_brewer(palette = "BuPu") +
  theme(plot.title = element_text(hjust = 0.5, vjust = 0.8, size = 16),
        legend.position = "none")

The average calories burnt by each user type align with their daily activity level. Among the user types, “fairly active” has only one outlier and a mean value that is very close to the median. This suggests that, during the observed time frame, individuals in this category exhibit a more consistent trend in calories burnt than the other types.
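
The two signals used in this reading (the number of outliers and the gap between mean and median) can also be computed directly. The sketch below is a minimal illustration that applies the default 1.5 × IQR whisker rule to each group. The data frame `toy_calories` and the helper `count_outliers()` are hypothetical names introduced here, and the values are made up rather than taken from the Fitbit data.

```r
# Hypothetical toy data (made-up values, not taken from the Fitbit dataset)
toy_calories <- data.frame(
  user_type = rep(c("lightly active", "fairly active"), each = 5),
  calories  = c(1800, 1900, 2000, 2100, 2200,   # lightly active
                2000, 2100, 2200, 2300, 4000)   # fairly active, one extreme day
)

# Count the points that fall outside the 1.5 * IQR whiskers
count_outliers <- function(x) {
  q   <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr, na.rm = TRUE)
}

# Outlier count and mean-minus-median gap for each user type
outliers_by_type <- tapply(toy_calories$calories, toy_calories$user_type,
                           count_outliers)
mean_median_gap  <- tapply(toy_calories$calories, toy_calories$user_type,
                           function(x) mean(x) - median(x))

print(outliers_by_type)
print(mean_median_gap)
```

A group with few outliers and a gap close to zero is roughly symmetric and consistent, whereas a large positive gap points to a handful of unusually high days pulling the mean upward.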

Distance vs Types

ggplot(activity_sleep_final[which(activity_sleep_final$totaldistance > 0), ], 
       aes(user_type, totaldistance, fill = user_type)) +
  geom_boxplot() +
  # White diamond marks the mean of each group
  stat_summary(fun = "mean", geom = "point", 
               shape = 23, size = 2, fill = "white") +
  labs(title = "Daily Distance by User Type", 
       x = " ", y = "total distance (miles)") +
  scale_fill_brewer(palette = "BuPu") +
  theme(plot.title = element_text(hjust = 0.5, vjust = 0.8, size = 16),
        legend.position = "none")

The average distance covered by each user type corresponds to their daily step count: users who take more steps also cover a longer distance. Among the user types, “very active” exhibits the most outliers, indicating that their activity pattern is less consistent than that of the other types.

Sleep vs Types

# Days without a sleep record are dropped before plotting
ggplot(subset(activity_sleep_final, !is.na(totalminutesasleep)),
       aes(user_type, totalminutesasleep, fill = user_type)) +
  geom_boxplot() +
  # White diamond marks the mean of each group
  stat_summary(fun = "mean", geom = "point", 
               shape = 23, size = 2, fill = "white") +
  labs(title = "Sleep by User Type", 
       x = " ", y = "minutes asleep") +
  scale_fill_brewer(palette = "BuPu") +
  theme(plot.title = element_text(hjust = 0.5, vjust = 0.8, size = 16),
        legend.position = "none")

The numerous outliers in the data indicate considerable variation in sleep duration within each user type. On average, the “lightly active” type has the longest sleep duration, while the “very active” type has the shortest.

Phase 6: Act

Brief Summary

Bellabeat, a tech-driven wellness company for women, has leveraged data on activity, sleep, stress, and reproductive health to empower women with knowledge about their own health and habits. With its rapid growth since its establishment in 2013, Bellabeat aims to shape its marketing strategy based on insights derived from analyzing FitBit Fitness Tracker Data.

The target audience for Bellabeat includes women who work full-time jobs and spend a significant amount of time engaged in sedentary activities such as computer work or meetings. Although these women engage in light activity to maintain their health, they need to improve their everyday activity levels to reap the full benefits of a healthy lifestyle. Providing knowledge about developing healthy habits and motivation to sustain their efforts could greatly benefit this audience.

Key Findings

  • Users take an average of 7,638 steps per day, which is lower than the CDC’s recommended daily goal of 10,000 steps. Additionally, users spend approximately 70% of their time each day being sedentary or inactive.
  • There is a positive correlation between the number of daily steps taken and the number of calories burned. However, there is no correlation between the number of steps taken and the amount of time users sleep per day.
  • Among different activity types, the lightly active type has the longest average sleep duration, while the very active type has the shortest average sleep duration.
  • Users exhibit the highest activity levels during the time frame of 11 am to 1 pm on Saturdays, and from 5 pm to 6 pm on Wednesdays.
  • Average daily tracker wear time declines gradually after the first two weeks of the observation period, from April 28th onward.
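
The correlation findings above can be checked with Pearson's coefficient via `cor()`. The sketch below is a minimal illustration, assuming one row per user-day and the same column names as `activity_sleep_final`. The data frame `toy_daily` is hypothetical and its values are made up, so the coefficients it prints are not the study's results.

```r
# Hypothetical toy data (made-up values); NA marks a day without a sleep record
toy_daily <- data.frame(
  totalsteps         = c(2000, 4000, 6000, 8000, 10000, 12000),
  calories           = c(1500, 1700, 1800, 2100, 2300, 2500),
  totalminutesasleep = c(420, NA, 400, 430, 410, 425)
)

# Pearson correlation between daily steps and calories burned
steps_vs_calories <- cor(toy_daily$totalsteps, toy_daily$calories)

# Sleep has missing days, so only complete pairs are used
steps_vs_sleep <- cor(toy_daily$totalsteps, toy_daily$totalminutesasleep,
                      use = "complete.obs")

print(steps_vs_calories)
print(steps_vs_sleep)
```

A coefficient near 1 indicates a strong positive linear relationship, while a value near 0 indicates little or no linear relationship. `cor.test()` can be used in the same way when a p-value is also wanted.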

Based on the findings, here are some recommendations.

Personalized Daily Step Targets

The app should provide personalized daily step targets based on the user’s profile, lifestyle, and goals. Sending reminders to users who are falling behind their targets can help motivate them to stay active. Additionally, incorporating features like mini-games or wellness trivia can create a sense of reward and increase user engagement and retention.

Sedentary Alerts

The app can send alerts to users who remain seated or inactive for an extended period of time. This feature would be particularly useful for users who work from home and may forget to take breaks. Encouraging regular activity breaks can help improve overall health and combat sedentary behavior.

Social Networking and Team Goal Setting

Incorporating social networking features such as in-app chats or team goal setting can enhance user engagement and promote exercise habits. Studies have shown that social support interventions increase physical activity among adults. By fostering a sense of community and accountability, users can motivate and inspire each other to stay active.

Sleep Improvement Features

For users looking to improve their sleep, the app can recommend light activities before bedtime and alert users if their activity level is too intense based on their profile. Additionally, incorporating features to assist with meditation or relaxation techniques can help users wind down and prepare for better sleep quality.

Investigate Wear Time Decline

Further investigation is needed to understand why the average wear time of the tracker decreases over time. Analyzing user feedback and conducting user surveys can provide insights into potential reasons for the decline. Additionally, considering features such as waterproof design, minimalist aesthetics, long battery life, and comfortable wear can help encourage users to wear the tracker consistently throughout the day.
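
One simple starting point for this investigation is to quantify the decline before looking for its causes. The sketch below fits a straight line to average daily wear time and reads off the slope in minutes per day. It is a minimal illustration only: `toy_wear`, `mean_minutes_worn`, and `day_index` are hypothetical names, and the values are made up rather than taken from the dataset.

```r
# Hypothetical toy data: average minutes worn per calendar day (made-up values)
toy_wear <- data.frame(
  date              = as.Date("2016-04-12") + 0:9,
  mean_minutes_worn = c(1200, 1210, 1190, 1205, 1195,
                        1180, 1150, 1120, 1100, 1080)
)

# Days elapsed since the first observation
toy_wear$day_index <- as.numeric(toy_wear$date - min(toy_wear$date))

# Linear trend: the slope is the average change in wear time per day
trend_fit     <- lm(mean_minutes_worn ~ day_index, data = toy_wear)
slope_per_day <- coef(trend_fit)[["day_index"]]

print(slope_per_day)
```

A clearly negative slope would confirm the decline and give a baseline against which any later product or app changes could be compared.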

By implementing these ideas, Bellabeat can further empower women to lead healthier lifestyles and achieve a balance between their personal and professional lives with the support of their app.

Thank you for taking the time to view my Bellabeat Case Study in R!