Bellabeat Market Analysis

#1. About

Bellabeat is a high-tech company that manufactures health-focused smart products. It offers smart devices that collect data on activity, sleep, stress, and reproductive health to empower women with knowledge about their own health and habits.

The main focus of this case is to analyze smart device fitness data and determine how it could help unlock new growth opportunities for Bellabeat. We will focus on one of Bellabeat’s products: the Bellabeat app.

The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to Bellabeat’s line of smart wellness products.

#2. Ask Phase

##2.1 Business Task

Identify trends in how consumers use non-Bellabeat smart devices and apply these insights to Bellabeat’s marketing strategy.

Stakeholders:

* Urška Sršen - Bellabeat cofounder and Chief Creative Officer
* Sando Mur - Bellabeat cofounder and key member of the Bellabeat executive team
* Bellabeat Marketing Analytics team

#3. Prepare Phase

##3.1 Dataset used

The data source for this case study is the FitBit Fitness Tracker Data. The dataset is hosted on Kaggle and was made available through Mobius.

##3.2 Accessibility and privacy of data

The dataset's metadata confirms that it is open-source. The owner has dedicated the work to the public domain by waiving all of his or her rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law. You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.

##3.3 Information about our dataset

These datasets were generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Variation between output represents use of different types of Fitbit trackers and individual tracking behaviors / preferences.

##3.4 Data Organization and verification

There are 18 CSV documents available. Each document holds a different kind of quantitative data tracked by Fitbit. The data is in long format: each row is one time point per subject, so each subject has data in multiple rows. Every user has a unique ID and several rows, since data is tracked by day and time.

Because the sample is small, I sorted and filtered the tables using Pivot Tables in Excel. This let me verify the attributes and observations of each table and the relations between tables. I also counted the sample size (users) of each table and verified the time span covered by the analysis: 31 days.

##3.5 Data Credibility and Integrity

Because of the small sample (30 users) and the lack of demographic information, we could encounter sampling bias. We cannot be sure that the sample is representative of the population as a whole.

Another problem is that the dataset is not current, and the survey covered a limited period (2 months). That is why we will give our case study an operational approach.
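
The checks described in 3.4 were done with Excel pivot tables. As a minimal sketch, the same verification (users per table, length of the tracking period, rows per user) could also be scripted in base R. The toy data frame `toy_activity` and its values are invented for illustration and are not the FitBit files.

```r
# Toy long-format table: one row per user per day (values are made up).
toy_activity <- data.frame(
  Id = c(1, 1, 1, 2, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/14/2016", "4/12/2016", "4/13/2016"),
  TotalSteps = c(13162, 10735, 10460, 5000, 7200)
)

# Parse the month/day/year strings so that dates can be compared.
toy_activity$ActivityDate <- as.Date(toy_activity$ActivityDate, format = "%m/%d/%Y")

# Sample size: number of distinct users in the table.
n_users <- length(unique(toy_activity$Id))

# Length of the tracking period in days, counting both end points.
date_range <- range(toy_activity$ActivityDate)
n_days <- as.integer(diff(date_range)) + 1L

# Rows per user: in long format each user contributes several rows.
rows_per_user <- table(toy_activity$Id)

print(n_users)
print(n_days)
print(rows_per_user)
```

Counting both end points is what makes a period from April 12 to May 12 come out as 31 days.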

#4. Process Phase

I will carry out my analysis in R because of its accessibility, the amount of data involved, and the ability to create data visualizations to share my results with stakeholders.

##4.1 Installing packages and opening libraries

We will choose the packages that will help us in our analysis and load them. We will use the following packages:

* tidyverse
* here
* skimr
* janitor
* lubridate
* ggpubr
* ggrepel
* readr (also attached with tidyverse)

options(repos = c(CRAN = "https://cran.rstudio.com"))
install.packages(c("tidyverse", "here", "skimr", "janitor", "lubridate", "ggpubr", "ggrepel", "readr"))

## 
## The downloaded binary packages are in
##  /var/folders/28/yglm384s7jj70bpkxf3ckdvh0000gn/T//RtmpJGu7Qt/downloaded_packages

library("ggpubr")

## Loading required package: ggplot2

library("tidyverse")

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library("here")

## here() starts at /Users/admin/Downloads/Fitabase Data 4.12.16-5.12.16

library("skimr")
library("janitor")

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library("lubridate")
library("ggrepel")
library(readr)

##4.2 Importing datasets

Knowing the datasets we have, we will import those that will help us answer our business task. In our analysis we will focus on the following datasets:

* Daily_activity
* Daily_sleep
* Hourly_steps

Data on weight and heart rate were not considered because they contain data for only 8 and 7 users respectively.

daily_activity <- read_csv("dailyActivity_merged.csv")

## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

daily_sleep <- read_csv("sleepDay_merged.csv")

## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

hourly_steps <- read_csv("hourlySteps_merged.csv")

## Rows: 22099 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (2): Id, StepTotal
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
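
The column specifications above show Id parsed as a double. Since Id is an identifier rather than a quantity, one option is to read it as text so that it is never rounded or printed in scientific notation. The following is a minimal sketch using base R's `read.csv` on an inline sample; the two rows are copied from the preview of dailyActivity_merged.csv in section 4.3 and restricted to three columns. Reading Id as character is a suggestion, not what this analysis does.

```r
# Inline sample with the same header style as dailyActivity_merged.csv.
csv_text <- "Id,ActivityDate,TotalSteps
1503960366,4/12/2016,13162
1503960366,4/13/2016,10735"

# A named colClasses entry fixes the type of Id only;
# the remaining columns are still type-guessed.
sample_activity <- read.csv(text = csv_text, colClasses = c(Id = "character"))

str(sample_activity)
```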

##4.3 Preview our datasets

We will preview our selected data frames and check the structure of each column.

head(daily_activity)

## # A tibble: 6 × 15
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
##        <dbl> <chr>             <dbl>         <dbl>           <dbl>
## 1 1503960366 4/12/2016         13162          8.5             8.5 
## 2 1503960366 4/13/2016         10735          6.97            6.97
## 3 1503960366 4/14/2016         10460          6.74            6.74
## 4 1503960366 4/15/2016          9762          6.28            6.28
## 5 1503960366 4/16/2016         12669          8.16            8.16
## 6 1503960366 4/17/2016          9705          6.48            6.48
## # ℹ 10 more variables: LoggedActivitiesDistance <dbl>,
## #   VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## #   LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## #   VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## #   LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>

str(daily_activity)

## spc_tbl_ [940 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id                      : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ TotalSteps              : num [1:940] 13162 10735 10460 9762 12669 ...
##  $ TotalDistance           : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : num [1:940] 728 776 1218 726 773 ...
##  $ Calories                : num [1:940] 1985 1797 1776 1745 1863 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   ActivityDate = col_character(),
##   ..   TotalSteps = col_double(),
##   ..   TotalDistance = col_double(),
##   ..   TrackerDistance = col_double(),
##   ..   LoggedActivitiesDistance = col_double(),
##   ..   VeryActiveDistance = col_double(),
##   ..   ModeratelyActiveDistance = col_double(),
##   ..   LightActiveDistance = col_double(),
##   ..   SedentaryActiveDistance = col_double(),
##   ..   VeryActiveMinutes = col_double(),
##   ..   FairlyActiveMinutes = col_double(),
##   ..   LightlyActiveMinutes = col_double(),
##   ..   SedentaryMinutes = col_double(),
##   ..   Calories = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

head(daily_sleep)

## # A tibble: 6 × 5
##           Id SleepDay        TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
##        <dbl> <chr>                       <dbl>              <dbl>          <dbl>
## 1 1503960366 4/12/2016 12:0…                 1                327            346
## 2 1503960366 4/13/2016 12:0…                 2                384            407
## 3 1503960366 4/15/2016 12:0…                 1                412            442
## 4 1503960366 4/16/2016 12:0…                 2                340            367
## 5 1503960366 4/17/2016 12:0…                 1                700            712
## 6 1503960366 4/19/2016 12:0…                 1                304            320

str(daily_sleep)

## spc_tbl_ [413 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id                : num [1:413] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ SleepDay          : chr [1:413] "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
##  $ TotalSleepRecords : num [1:413] 1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: num [1:413] 327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : num [1:413] 346 407 442 367 712 320 377 364 384 449 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   SleepDay = col_character(),
##   ..   TotalSleepRecords = col_double(),
##   ..   TotalMinutesAsleep = col_double(),
##   ..   TotalTimeInBed = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

head(hourly_steps)

## # A tibble: 6 × 3
##           Id ActivityHour          StepTotal
##        <dbl> <chr>                     <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM       373
## 2 1503960366 4/12/2016 1:00:00 AM        160
## 3 1503960366 4/12/2016 2:00:00 AM        151
## 4 1503960366 4/12/2016 3:00:00 AM          0
## 5 1503960366 4/12/2016 4:00:00 AM          0
## 6 1503960366 4/12/2016 5:00:00 AM          0

str(hourly_steps)

## spc_tbl_ [22,099 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id          : num [1:22099] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityHour: chr [1:22099] "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
##  $ StepTotal   : num [1:22099] 373 160 151 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   ActivityHour = col_character(),
##   ..   StepTotal = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

##4.4 Cleaning and formatting

Now that we know more about our data structures, we will process them to look for any errors and inconsistencies.

###4.4.1 Verifying number of users

Before we continue with our cleaning, we want to confirm how many unique users each data frame contains. Although 30 is commonly treated as the minimum sample size and the sleep dataset falls below it, we will still keep the sleep dataset for practice only.

n_unique(daily_activity$Id)

## [1] 33

n_unique(daily_sleep$Id)

## [1] 24

n_unique(hourly_steps$Id)

## [1] 33

###4.4.2 Duplicates

We will now look for any duplicates:

sum(duplicated(daily_activity))

## [1] 0

sum(duplicated(daily_sleep))

## [1] 3

sum(duplicated(hourly_steps))

## [1] 0

###4.4.3 Remove duplicates and N/A

daily_sleep has 413 observations, 3 of which are duplicates. We remove duplicated rows and rows with missing values from all three data frames.

daily_activity <- daily_activity %>%
  distinct() %>%
  drop_na()

daily_sleep <- daily_sleep %>%
  distinct() %>%
  drop_na()

hourly_steps <- hourly_steps %>%
  distinct() %>%
  drop_na()

We will verify that duplicates have been removed:

sum(duplicated(daily_sleep))

## [1] 0
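
To make the effect of `distinct()` and `drop_na()` concrete, here is a minimal sketch on a toy table. The data frame `toy_sleep` and its values are invented for illustration; only the two cleaning functions are the ones used above.

```r
library(dplyr)
library(tidyr)

# Toy sleep table: row 3 repeats row 2, row 4 has a missing value.
toy_sleep <- data.frame(
  id = c(1, 1, 1, 2),
  totalminutesasleep = c(327, 384, 384, NA),
  totaltimeinbed = c(346, 407, 407, 410)
)

n_duplicates <- sum(duplicated(toy_sleep))     # full-row duplicates
n_missing <- sum(!complete.cases(toy_sleep))   # rows with at least one NA

cleaned <- toy_sleep %>%
  distinct() %>%  # keep the first copy of each duplicated row
  drop_na()       # drop rows that contain any NA

print(c(before = nrow(toy_sleep), after = nrow(cleaned)))
```

Comparing the row count before and after is a quick way to confirm that only the expected rows were removed.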

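###4.4.4 Clean and rename columns

We want to ensure that column names use valid syntax and the same format in all datasets, since we will merge them later on. clean_names() is called without assigning its result, so it only previews the snake_case names. The names actually kept are those produced by rename_with(tolower), which converts every column name to lower case (for example, ActivityDate becomes activitydate).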

clean_names(daily_activity)

## # A tibble: 940 × 15
##            id activity_date total_steps total_distance tracker_distance
##         <dbl> <chr>               <dbl>          <dbl>            <dbl>
##  1 1503960366 4/12/2016           13162           8.5              8.5 
##  2 1503960366 4/13/2016           10735           6.97             6.97
##  3 1503960366 4/14/2016           10460           6.74             6.74
##  4 1503960366 4/15/2016            9762           6.28             6.28
##  5 1503960366 4/16/2016           12669           8.16             8.16
##  6 1503960366 4/17/2016            9705           6.48             6.48
##  7 1503960366 4/18/2016           13019           8.59             8.59
##  8 1503960366 4/19/2016           15506           9.88             9.88
##  9 1503960366 4/20/2016           10544           6.68             6.68
## 10 1503960366 4/21/2016            9819           6.34             6.34
## # ℹ 930 more rows
## # ℹ 10 more variables: logged_activities_distance <dbl>,
## #   very_active_distance <dbl>, moderately_active_distance <dbl>,
## #   light_active_distance <dbl>, sedentary_active_distance <dbl>,
## #   very_active_minutes <dbl>, fairly_active_minutes <dbl>,
## #   lightly_active_minutes <dbl>, sedentary_minutes <dbl>, calories <dbl>

daily_activity <- rename_with(daily_activity, tolower)
clean_names(daily_sleep)

## # A tibble: 410 × 5
##          id sleep_day total_sleep_records total_minutes_asleep total_time_in_bed
##       <dbl> <chr>                   <dbl>                <dbl>             <dbl>
##  1   1.50e9 4/12/201…                   1                  327               346
##  2   1.50e9 4/13/201…                   2                  384               407
##  3   1.50e9 4/15/201…                   1                  412               442
##  4   1.50e9 4/16/201…                   2                  340               367
##  5   1.50e9 4/17/201…                   1                  700               712
##  6   1.50e9 4/19/201…                   1                  304               320
##  7   1.50e9 4/20/201…                   1                  360               377
##  8   1.50e9 4/21/201…                   1                  325               364
##  9   1.50e9 4/23/201…                   1                  361               384
## 10   1.50e9 4/24/201…                   1                  430               449
## # ℹ 400 more rows

daily_sleep <- rename_with(daily_sleep, tolower)
clean_names(hourly_steps)

## # A tibble: 22,099 × 3
##            id activity_hour         step_total
##         <dbl> <chr>                      <dbl>
##  1 1503960366 4/12/2016 12:00:00 AM        373
##  2 1503960366 4/12/2016 1:00:00 AM         160
##  3 1503960366 4/12/2016 2:00:00 AM         151
##  4 1503960366 4/12/2016 3:00:00 AM           0
##  5 1503960366 4/12/2016 4:00:00 AM           0
##  6 1503960366 4/12/2016 5:00:00 AM           0
##  7 1503960366 4/12/2016 6:00:00 AM           0
##  8 1503960366 4/12/2016 7:00:00 AM           0
##  9 1503960366 4/12/2016 8:00:00 AM         250
## 10 1503960366 4/12/2016 9:00:00 AM        1864
## # ℹ 22,089 more rows

hourly_steps <- rename_with(hourly_steps, tolower)
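
The two naming styles seen in this section differ only in the underscore. As a rough sketch in base R, the snippet below lower-cases a few of the original column names and also builds a snake_case version with a regular expression. The regex only approximates what `janitor::clean_names()` does for simple CamelCase names like these and is not its actual implementation.

```r
original_names <- c("Id", "ActivityDate", "TotalSteps", "TotalMinutesAsleep")

# Plain lower-casing, as done by rename_with(df, tolower).
lower_names <- tolower(original_names)

# Rough snake_case conversion: insert "_" between a lower-case letter or digit
# and the upper-case letter that follows it, then lower-case everything.
snake_names <- tolower(gsub("([a-z0-9])([A-Z])", "\\1_\\2", original_names))

print(data.frame(original_names, lower_names, snake_names))
```

Whichever style is chosen, later code has to refer to the columns by the names that were actually assigned.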

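###4.4.5 Consistency of date and time columns

Now that we have verified our column names and changed them to lower case, we will clean the date-time format of daily_activity and daily_sleep, since we will merge both data frames on the date. daily_activity holds plain dates, so we parse it with as_date. The time in daily_sleep carries no information here (the values shown are all 12:00:00 AM) and can be disregarded. Note that the code below nevertheless parses daily_sleep with as_datetime, which keeps the time component, whereas as_date would drop it and give both data frames the same column type.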

daily_activity <- daily_activity %>%
  rename(date = activitydate) %>%
  mutate(date = as_date(date, format = "%m/%d/%Y"))

daily_sleep <- daily_sleep %>%
  rename(date = sleepday) %>%
  mutate(date = as_datetime(date, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone()))

We will check our cleaned datasets:

head(daily_activity)

## # A tibble: 6 × 15
##           id date       totalsteps totaldistance trackerdistance
##        <dbl> <date>          <dbl>         <dbl>           <dbl>
## 1 1503960366 2016-04-12      13162          8.5             8.5 
## 2 1503960366 2016-04-13      10735          6.97            6.97
## 3 1503960366 2016-04-14      10460          6.74            6.74
## 4 1503960366 2016-04-15       9762          6.28            6.28
## 5 1503960366 2016-04-16      12669          8.16            8.16
## 6 1503960366 2016-04-17       9705          6.48            6.48
## # ℹ 10 more variables: loggedactivitiesdistance <dbl>,
## #   veryactivedistance <dbl>, moderatelyactivedistance <dbl>,
## #   lightactivedistance <dbl>, sedentaryactivedistance <dbl>,
## #   veryactiveminutes <dbl>, fairlyactiveminutes <dbl>,
## #   lightlyactiveminutes <dbl>, sedentaryminutes <dbl>, calories <dbl>

head(daily_sleep)

## # A tibble: 6 × 5
##       id date                totalsleeprecords totalminutesasleep totaltimeinbed
##    <dbl> <dttm>                          <dbl>              <dbl>          <dbl>
## 1 1.50e9 2016-04-12 00:00:00                 1                327            346
## 2 1.50e9 2016-04-13 00:00:00                 2                384            407
## 3 1.50e9 2016-04-15 00:00:00                 1                412            442
## 4 1.50e9 2016-04-16 00:00:00                 2                340            367
## 5 1.50e9 2016-04-17 00:00:00                 1                700            712
## 6 1.50e9 2016-04-19 00:00:00                 1                304            320
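
The difference between a date and a date-time column can be seen on a single value. The following is a minimal sketch with base R's `as.Date` and `as.POSIXct` rather than the lubridate helpers used above. The fixed `tz = "UTC"` is an assumption made so the example does not depend on the local time zone, and parsing `%p` assumes an English locale.

```r
sleep_day <- "4/12/2016 12:00:00 AM"   # format used in sleepDay_merged.csv
activity_date <- "4/12/2016"           # format used in dailyActivity_merged.csv

# Date-time parse: keeps the (midnight) time component.
sleep_dttm <- as.POSIXct(sleep_day, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Date-only parse: the format stops after the year, so the time is ignored.
sleep_date <- as.Date(sleep_day, format = "%m/%d/%Y")
activity_dt <- as.Date(activity_date, format = "%m/%d/%Y")

print(class(sleep_dttm))
print(class(sleep_date))
print(identical(sleep_date, activity_dt))
```

Parsing both columns to the same class keeps the join key unambiguous when the two data frames are merged.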

We apply the same conversion to the hourly dataset:

hourly_steps <- hourly_steps %>%
  rename(date_time = activityhour) %>%
  mutate(date_time = as.POSIXct(date_time, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone()))

head(hourly_steps)

## # A tibble: 6 × 3
##           id date_time           steptotal
##        <dbl> <dttm>                  <dbl>
## 1 1503960366 2016-04-12 00:00:00       373
## 2 1503960366 2016-04-12 01:00:00       160
## 3 1503960366 2016-04-12 02:00:00       151
## 4 1503960366 2016-04-12 03:00:00         0
## 5 1503960366 2016-04-12 04:00:00         0
## 6 1503960366 2016-04-12 05:00:00         0

##4.5 Merging Datasets

We will merge daily_activity and daily_sleep, using id and date as join keys, to look for correlations between variables. merge() performs an inner join by default, so only user-days present in both data frames are kept.

daily_activity_sleep <- merge(daily_activity, daily_sleep, by = c("id", "date"))
glimpse(daily_activity_sleep)

## Rows: 410
## Columns: 18
## $ id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ date                     <date> 2016-04-12, 2016-04-13, 2016-04-15, 2016-04-…
## $ totalsteps               <dbl> 13162, 10735, 9762, 12669, 9705, 15506, 10544…
## $ totaldistance            <dbl> 8.50, 6.97, 6.28, 8.16, 6.48, 9.88, 6.68, 6.3…
## $ trackerdistance          <dbl> 8.50, 6.97, 6.28, 8.16, 6.48, 9.88, 6.68, 6.3…
## $ loggedactivitiesdistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ veryactivedistance       <dbl> 1.88, 1.57, 2.14, 2.71, 3.19, 3.53, 1.96, 1.3…
## $ moderatelyactivedistance <dbl> 0.55, 0.69, 1.26, 0.41, 0.78, 1.32, 0.48, 0.3…
## $ lightactivedistance      <dbl> 6.06, 4.71, 2.83, 5.04, 2.51, 5.03, 4.24, 4.6…
## $ sedentaryactivedistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ veryactiveminutes        <dbl> 25, 21, 29, 36, 38, 50, 28, 19, 41, 39, 73, 3…
## $ fairlyactiveminutes      <dbl> 13, 19, 34, 10, 20, 31, 12, 8, 21, 5, 14, 23,…
## $ lightlyactiveminutes     <dbl> 328, 217, 209, 221, 164, 264, 205, 211, 262, …
## $ sedentaryminutes         <dbl> 728, 776, 726, 773, 539, 775, 818, 838, 732, …
## $ calories                 <dbl> 1985, 1797, 1745, 1863, 1728, 2035, 1786, 177…
## $ totalsleeprecords        <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ totalminutesasleep       <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, …
## $ totaltimeinbed           <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, …
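
The merged data frame has 410 rows, the same as the de-duplicated sleep table. A minimal sketch on toy tables shows why: base R's `merge()` keeps only matching key pairs unless told otherwise. The data frames `toy_activity` and `toy_sleep` and their values are invented for illustration.

```r
# User 1 has sleep data on one of two days,
# user 2 has activity data but no sleep data at all.
toy_activity <- data.frame(
  id = c(1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  totalsteps = c(13162, 10735, 4000)
)
toy_sleep <- data.frame(
  id = 1,
  date = as.Date("2016-04-12"),
  totalminutesasleep = 327
)

# Default merge(): inner join, only matching id/date pairs survive.
inner <- merge(toy_activity, toy_sleep, by = c("id", "date"))

# all.x = TRUE: left join, every activity row is kept and
# the sleep column is NA where no sleep record exists.
left <- merge(toy_activity, toy_sleep, by = c("id", "date"), all.x = TRUE)

print(nrow(inner))
print(nrow(left))
```

With the inner join, users who never logged sleep drop out of every analysis that is based on the merged table.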

#5. Analyze Phase and Share Phase

We will analyze trends among FitBit users and determine whether they can inform Bellabeat’s marketing strategy.

##5.1 Type of users per activity level

Since we don’t have any demographic variables for our sample, we want to determine the type of users from the data we have. We can classify users by activity level based on their daily number of steps:

* Sedentary - fewer than 5000 steps a day.
* Lightly active - between 5000 and 7499 steps a day.
* Fairly active - between 7500 and 9999 steps a day.
* Very active - 10000 or more steps a day.

The classification follows this article: https://www.10000steps.org.au/articles/counting-steps/

First we will calculate the daily steps average by user.

daily_average <- daily_activity_sleep %>%
  group_by(id) %>%
  summarise(
    mean_daily_steps = mean(totalsteps),
    mean_daily_calories = mean(calories),
    mean_daily_sleep = mean(totalminutesasleep)
  )

head(daily_average)

## # A tibble: 6 × 4
##           id mean_daily_steps mean_daily_calories mean_daily_sleep
##        <dbl>            <dbl>               <dbl>            <dbl>
## 1 1503960366           12406.               1872.             360.
## 2 1644430081            7968.               2978.             294 
## 3 1844505072            3477                1676.             652 
## 4 1927972279            1490                2316.             417 
## 5 2026352035            5619.               1541.             506.
## 6 2320127002            5079                1804               61

We will now classify our users by the daily average steps.

user_type <- daily_average %>%
  mutate(user_type = case_when(
    mean_daily_steps < 5000 ~ "sedentary",
    mean_daily_steps >= 5000 & mean_daily_steps < 7500 ~ "lightly active", 
    mean_daily_steps >= 7500 & mean_daily_steps < 10000 ~ "fairly active", 
    mean_daily_steps >= 10000 ~ "very active"
  ))

head(user_type)

## # A tibble: 6 × 5
##           id mean_daily_steps mean_daily_calories mean_daily_sleep user_type    
##        <dbl>            <dbl>               <dbl>            <dbl> <chr>        
## 1 1503960366           12406.               1872.             360. very active  
## 2 1644430081            7968.               2978.             294  fairly active
## 3 1844505072            3477                1676.             652  sedentary    
## 4 1927972279            1490                2316.             417  sedentary    
## 5 2026352035            5619.               1541.             506. lightly acti…
## 6 2320127002            5079                1804               61  lightly acti…
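
Average step counts are rarely whole numbers, so the category boundaries must leave no gap between, say, 7499 and 7500. As an alternative sketch, base R's `cut()` with half-open intervals covers every value exactly once. The helper name `classify_steps` is hypothetical and this is not the method used above.

```r
classify_steps <- function(mean_daily_steps) {
  cut(
    mean_daily_steps,
    breaks = c(-Inf, 5000, 7500, 10000, Inf),
    labels = c("sedentary", "lightly active", "fairly active", "very active"),
    right = FALSE  # intervals are [lower, upper): 7500 itself is "fairly active"
  )
}

# Boundary values, including averages that fall between whole step counts.
test_steps <- c(4999.9, 5000, 7499.5, 7500, 9999.5, 10000)
print(data.frame(test_steps, user_type = classify_steps(test_steps)))
```

Because the breaks are shared between neighbouring intervals, no average can fall through to NA.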

Now that we have a new column with the user type we will create a data frame with the percentage of each user type to better visualize them on a graph.

user_type_percent <- user_type %>%
  group_by(user_type) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(user_type) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales::percent(total_percent))

user_type_percent$user_type <- factor(user_type_percent$user_type , levels = c("very active", "fairly active", "lightly active", "sedentary"))


head(user_type_percent)

## # A tibble: 4 × 3
##   user_type      total_percent labels
##   <fct>                  <dbl> <chr> 
## 1 fairly active          0.375 38%   
## 2 lightly active         0.208 21%   
## 3 sedentary              0.208 21%   
## 4 very active            0.208 21%
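
The share of each user type is simply its count divided by the total number of users. A compact base R sketch of that calculation uses `table()` and `prop.table()`. The vector `user_types` is hypothetical: its counts (9, 5, 5, 5 out of 24 users) are inferred from the proportions in the table above and are an assumption, not values read from the data.

```r
# Hypothetical user types for 24 users (counts are an assumption).
user_types <- rep(
  c("fairly active", "lightly active", "sedentary", "very active"),
  times = c(9, 5, 5, 5)
)

counts <- table(user_types)
shares <- prop.table(counts)  # each count divided by the total

user_type_share <- data.frame(
  user_type = names(shares),
  total_percent = as.numeric(shares)
)
print(user_type_share)
```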

Below we can see that users are fairly evenly distributed across activity levels, based on their average daily steps. This suggests that users of all activity levels wear smart devices.

user_type_percent %>%
  ggplot(aes(x="",y=total_percent, fill=user_type)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+
  theme_minimal()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold")) +
  scale_fill_manual(values = c("#4B5320","#FFD300", "#FF8C00", "#660000")) +
  geom_text(aes(label = labels),
            position = position_stack(vjust = 0.5))+
  labs(title="User type distribution")

##5.2 Steps and minutes asleep per weekday

We now want to know on which days of the week users are more active and on which days they sleep more. We will also verify whether users walk the recommended number of steps and get the recommended amount of sleep.

Below we derive the weekday from our date column. We also calculate the average steps walked and minutes slept per weekday.

weekday_steps_sleep <- daily_activity_sleep %>%
  mutate(weekday = weekdays(date))

weekday_steps_sleep$weekday <- ordered(weekday_steps_sleep$weekday,
  levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))

weekday_steps_sleep <- weekday_steps_sleep %>%
  group_by(weekday) %>%
  summarize(daily_steps = mean(totalsteps), daily_sleep = mean(totalminutesasleep))

head(weekday_steps_sleep)

## # A tibble: 6 × 3
##   weekday   daily_steps daily_sleep
##   <ord>           <dbl>       <dbl>
## 1 Monday          9273.        420.
## 2 Tuesday         9183.        405.
## 3 Wednesday       8023.        435.
## 4 Thursday        8184.        401.
## 5 Friday          7901.        405.
## 6 Saturday        9871.        419.
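
`weekdays()` returns day names in the language of the current locale, so matching them against English levels only works on an English setup. As a hedged alternative sketch, the weekday can be derived from the locale-independent ISO weekday number (`"%u"`, Monday = 1) and mapped to fixed labels. The three sample dates are chosen for illustration.

```r
dates <- as.Date(c("2016-04-12", "2016-04-17", "2016-04-18"))

# "%u" gives the ISO weekday number as text: "1" = Monday, ..., "7" = Sunday.
weekday_number <- as.integer(format(dates, "%u"))

# Map the numbers to fixed English labels, ordered Monday to Sunday.
weekday_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                   "Friday", "Saturday", "Sunday")
weekday <- factor(weekday_names[weekday_number],
                  levels = weekday_names, ordered = TRUE)

print(weekday)
```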

ggarrange(
    ggplot(weekday_steps_sleep) +
      geom_col(aes(weekday, daily_steps), fill = "#FFD300") +
      geom_hline(yintercept = 7500) +
      labs(title = "Steps by weekday", x= "", y = "") +
      theme(axis.text.x = element_text(angle = 45,vjust = 0.5, hjust = 1)),
    ggplot(weekday_steps_sleep, aes(weekday, daily_sleep)) +
      geom_col(fill = "#FF8C00") +
      geom_hline(yintercept = 480) +
      labs(title = "Sleep Minutes by weekday", x= "", y = "") +
      theme(axis.text.x = element_text(angle = 45,vjust = 0.5, hjust = 1))
  )

From the graphs above we can determine the following:

* Users walk the recommended 7,500 steps a day on every day except Sunday.

* Users do not sleep the recommended 8 hours (480 minutes).

##5.3 Hourly steps throughout the day

Going deeper into our analysis, we want to know when exactly users are most active during the day.

We will use the hourly_steps data frame and separate the date_time column into date and time.

hourly_steps <- hourly_steps %>%
  separate(date_time, into = c("date", "time"), sep = " ") %>%
  mutate(date = ymd(date))

head(hourly_steps)

## # A tibble: 6 × 4
##           id date       time     steptotal
##        <dbl> <date>     <chr>        <dbl>
## 1 1503960366 2016-04-12 00:00:00       373
## 2 1503960366 2016-04-12 01:00:00       160
## 3 1503960366 2016-04-12 02:00:00       151
## 4 1503960366 2016-04-12 03:00:00         0
## 5 1503960366 2016-04-12 04:00:00         0
## 6 1503960366 2016-04-12 05:00:00         0

hourly_steps %>%
  group_by(time) %>%
  summarize(average_steps = mean(steptotal)) %>%
  ggplot() +
  geom_col(mapping = aes(x=time, y = average_steps, fill = average_steps)) + 
  labs(title = "Average Hourly steps per time of the day", x="", y="") + 
  scale_fill_gradient(low = "#660000", high = "#4B5320")+
  theme(axis.text.x = element_text(angle = 90))

We can see that users are more active between 8am and 7pm, walking the most steps during lunch time (12pm to 2pm) and in the evening (5pm to 7pm).
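
Splitting date_time on a space depends on how the date-time is converted to text. As an alternative sketch, the hour can be extracted directly with `format()` and the steps averaged per hour with `tapply()`. The vectors below are invented for illustration and `tz = "UTC"` is an assumption made to keep the example deterministic.

```r
date_time <- as.POSIXct(
  c("2016-04-12 00:00:00", "2016-04-12 13:00:00", "2016-04-13 13:00:00"),
  tz = "UTC"
)
steps <- c(373, 900, 1100)  # invented step counts

# format() always writes the hour, including "00" for midnight.
hour <- format(date_time, "%H")

# Average steps per hour of day across all dates.
average_steps <- tapply(steps, hour, mean)
print(average_steps)
```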

##5.4 Correlations

We will now determine whether there is any correlation between the following pairs of variables:

* Daily steps and daily sleep
* Daily steps and calories

ggarrange(
  # Axis labels follow the aes() mapping: daily steps on x, the compared variable on y.
  ggplot(daily_activity_sleep, aes(x = totalsteps, y = totalminutesasleep)) +
    geom_jitter() +
    geom_smooth(color = "#660000") +
    labs(title = "Daily steps vs Minutes asleep", x = "Daily steps", y = "Minutes asleep") +
    theme(panel.background = element_blank(),
          plot.title = element_text(size = 14)),
  ggplot(daily_activity_sleep, aes(x = totalsteps, y = calories)) +
    geom_jitter() +
    geom_smooth(color = "#660000") +
    labs(title = "Daily steps vs Calories", x = "Daily steps", y = "Calories") +
    theme(panel.background = element_blank(),
          plot.title = element_text(size = 16))
)

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Per our plots:

There is no apparent correlation between daily steps and the number of minutes users sleep per day.

By contrast, there is a positive correlation between daily steps and calories burned: the more steps taken, the more calories burned.
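
The plots give a visual impression only. A correlation coefficient could back it up with a number, and base R's `cor()` computes the Pearson coefficient. The sketch below uses the first seven rows of daily_activity_sleep as shown in the glimpse() output in section 4.5. These are seven days of a single user, so the result illustrates the calculation and says nothing about the full sample.

```r
# First seven rows of daily_activity_sleep (one user), copied from glimpse().
totalsteps <- c(13162, 10735, 9762, 12669, 9705, 15506, 10544)
calories <- c(1985, 1797, 1745, 1863, 1728, 2035, 1786)
totalminutesasleep <- c(327, 384, 412, 340, 700, 304, 360)

# Pearson correlation coefficient: values near 0 suggest no linear
# relationship, values near 1 or -1 a strong one.
r_steps_calories <- cor(totalsteps, calories)
r_steps_sleep <- cor(totalsteps, totalminutesasleep)

print(round(c(steps_calories = r_steps_calories, steps_sleep = r_steps_sleep), 2))
```

On the full data the same call would take the columns of daily_activity_sleep directly, for example `cor(daily_activity_sleep$totalsteps, daily_activity_sleep$calories)`.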

##5.5 Use of smart device

###5.5.1 Days used smart device

Now that we have seen some trends in activity, sleep and calories burned, we want to see how often the users in our sample use their device. That way we can plan our marketing strategy and see which features would encourage the use of smart devices.

We will calculate how many days each user wore their smart device and classify our sample into three categories, knowing that the date interval is 31 days:

* high use - users who use their device between 21 and 31 days.
* moderate use - users who use their device between 11 and 20 days.
* low use - users who use their device between 1 and 10 days.

First we will create a new data frame grouped by id, calculating the number of days used and creating a new column with the classification explained above. Because the calculation runs on the merged data frame, a day counts as used only when it has both an activity and a sleep record.

daily_use <- daily_activity_sleep %>%
  group_by(id) %>%
  summarize(days_used = n()) %>%
  mutate(usage = case_when(
    days_used >= 1 & days_used <= 10 ~ "low use",
    days_used >= 11 & days_used <= 20 ~ "moderate use",
    days_used >= 21 & days_used <= 31 ~ "high use"
  ))
  
head(daily_use)

## # A tibble: 6 × 3
##           id days_used usage   
##        <dbl>     <int> <chr>   
## 1 1503960366        25 high use
## 2 1644430081         4 low use 
## 3 1844505072         3 low use 
## 4 1927972279         5 low use 
## 5 2026352035        28 high use
## 6 2320127002         1 low use
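As a cross-check of the classification rule, the same bucketing can be sketched with base R's `cut()`. This is an alternative formulation, not the method used in this analysis; the `days_used` values are the six shown in the `head(daily_use)` output above.

```r
# days_used values copied from the head(daily_use) output above.
days_used <- c(25, 4, 3, 5, 28, 1)

# Intervals are right-closed by default: (0,10], (10,20], (20,31].
usage <- cut(
  days_used,
  breaks = c(0, 10, 20, 31),
  labels = c("low use", "moderate use", "high use")
)

print(data.frame(days_used, usage))
```

`cut()` returns a factor whose levels already follow the low, moderate, high order of the breaks, so the boundaries (10 and 20) live in one place instead of in three separate conditions.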

We will now create a percentage data frame to better visualize the results in the graph. We are also ordering our usage levels.

daily_use_percent <- daily_use %>%
  group_by(usage) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(usage) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales::percent(total_percent))

daily_use_percent$usage <- factor(daily_use_percent$usage, levels = c("high use", "moderate use", "low use"))

head(daily_use_percent)

## # A tibble: 3 × 3
##   usage        total_percent labels
##   <fct>                <dbl> <chr> 
## 1 high use             0.5   50%   
## 2 low use              0.375 38%   
## 3 moderate use         0.125 12%
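The share calculation can also be sketched in base R with `table()` and `prop.table()`. The vector of labels below is hypothetical (eight invented users) and serves only to illustrate the technique; `usage_percent` is an assumed name.

```r
# Hypothetical usage labels for eight users (invented for illustration).
usage <- c("high use", "high use", "high use", "high use",
           "low use", "low use", "low use", "moderate use")

# table() counts each category; prop.table() divides by the grand total.
shares <- prop.table(table(usage))

usage_percent <- data.frame(
  usage         = names(shares),
  total_percent = as.numeric(shares),
  labels        = sprintf("%.1f%%", 100 * as.numeric(shares))
)

print(usage_percent)
```

Printing one decimal place keeps shares such as 0.375 and 0.125 from being rounded to whole percentages, which is where the 38% and 12% labels in the table above come from.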

Now that we have our new table we can create our plot:

daily_use_percent %>%
  ggplot(aes(x="",y=total_percent, fill=usage)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+
  theme_minimal()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=18, face = "bold")) +
  geom_text(aes(label = labels),
            position = position_stack(vjust = 0.5))+
  scale_fill_manual(values = c("#006633","#00e673","#80ffbf"),
                    labels = c("High use - 21 to 31 days",
                                 "Moderate use - 11 to 20 days",
                                 "Low use - 1 to 10 days"))+
  labs(title="Daily use of smart device")

Analyzing our results we can see that:

*50% of the users in our sample use their device frequently - between 21 and 31 days.
*12% use their device between 11 and 20 days.
*38% of our sample rarely use their device - between 1 and 10 days.

###5.5.2 Time used smart device

To be more precise, we want to see how many minutes users wear their device per day. For that we will merge the daily_use data frame created above with daily_activity, so that we can also filter the results by daily use of the device.

daily_use_merged <- merge(daily_activity, daily_use, by=c ("id"))
head(daily_use_merged)

##           id       date totalsteps totaldistance trackerdistance
## 1 1503960366 2016-05-07      11992          7.71            7.71
## 2 1503960366 2016-05-06      12159          8.03            8.03
## 3 1503960366 2016-05-01      10602          6.81            6.81
## 4 1503960366 2016-04-30      14673          9.25            9.25
## 5 1503960366 2016-04-12      13162          8.50            8.50
## 6 1503960366 2016-04-13      10735          6.97            6.97
##   loggedactivitiesdistance veryactivedistance moderatelyactivedistance
## 1                        0               2.46                     2.12
## 2                        0               1.97                     0.25
## 3                        0               2.29                     1.60
## 4                        0               3.56                     1.42
## 5                        0               1.88                     0.55
## 6                        0               1.57                     0.69
##   lightactivedistance sedentaryactivedistance veryactiveminutes
## 1                3.13                       0                37
## 2                5.81                       0                24
## 3                2.92                       0                33
## 4                4.27                       0                52
## 5                6.06                       0                25
## 6                4.71                       0                21
##   fairlyactiveminutes lightlyactiveminutes sedentaryminutes calories days_used
## 1                  46                  175              833     1821        25
## 2                   6                  289              754     1896        25
## 3                  35                  246              730     1820        25
## 4                  34                  217              712     1947        25
## 5                  13                  328              728     1985        25
## 6                  19                  217              776     1797        25
##      usage
## 1 high use
## 2 high use
## 3 high use
## 4 high use
## 5 high use
## 6 high use

We need to create a new data frame that calculates the total number of minutes users wore the device each day and assigns one of three categories:

*All day - device was worn all day.
*More than half day - device was worn more than half of the day.
*Less than half day - device was worn less than half of the day.

minutes_worn <- daily_use_merged %>% 
  mutate(total_minutes_worn = veryactiveminutes+fairlyactiveminutes+lightlyactiveminutes+sedentaryminutes)%>%
  mutate (percent_minutes_worn = (total_minutes_worn/1440)*100) %>%
  mutate (worn = case_when(
    percent_minutes_worn == 100 ~ "All day",
    percent_minutes_worn < 100 & percent_minutes_worn >= 50~ "More than half day", 
    percent_minutes_worn < 50 & percent_minutes_worn > 0 ~ "Less than half day"
  ))

head(minutes_worn)

##           id       date totalsteps totaldistance trackerdistance
## 1 1503960366 2016-05-07      11992          7.71            7.71
## 2 1503960366 2016-05-06      12159          8.03            8.03
## 3 1503960366 2016-05-01      10602          6.81            6.81
## 4 1503960366 2016-04-30      14673          9.25            9.25
## 5 1503960366 2016-04-12      13162          8.50            8.50
## 6 1503960366 2016-04-13      10735          6.97            6.97
##   loggedactivitiesdistance veryactivedistance moderatelyactivedistance
## 1                        0               2.46                     2.12
## 2                        0               1.97                     0.25
## 3                        0               2.29                     1.60
## 4                        0               3.56                     1.42
## 5                        0               1.88                     0.55
## 6                        0               1.57                     0.69
##   lightactivedistance sedentaryactivedistance veryactiveminutes
## 1                3.13                       0                37
## 2                5.81                       0                24
## 3                2.92                       0                33
## 4                4.27                       0                52
## 5                6.06                       0                25
## 6                4.71                       0                21
##   fairlyactiveminutes lightlyactiveminutes sedentaryminutes calories days_used
## 1                  46                  175              833     1821        25
## 2                   6                  289              754     1896        25
## 3                  35                  246              730     1820        25
## 4                  34                  217              712     1947        25
## 5                  13                  328              728     1985        25
## 6                  19                  217              776     1797        25
##      usage total_minutes_worn percent_minutes_worn               worn
## 1 high use               1091             75.76389 More than half day
## 2 high use               1073             74.51389 More than half day
## 3 high use               1044             72.50000 More than half day
## 4 high use               1015             70.48611 More than half day
## 5 high use               1094             75.97222 More than half day
## 6 high use               1033             71.73611 More than half day
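As a worked example of the calculation, the first row of the output above can be reproduced by hand. The minute values are copied from that row; the `if`/`else` chain is a scalar restatement of the `case_when()` rule, written out only for illustration.

```r
# Minute columns from the first row of the minutes_worn output above.
veryactiveminutes    <- 37
fairlyactiveminutes  <- 46
lightlyactiveminutes <- 175
sedentaryminutes     <- 833

total_minutes_worn   <- veryactiveminutes + fairlyactiveminutes +
  lightlyactiveminutes + sedentaryminutes
percent_minutes_worn <- total_minutes_worn / 1440 * 100  # 1440 minutes in a day

# Same thresholds as the case_when() call, applied to a single value.
worn <- if (percent_minutes_worn == 100) {
  "All day"
} else if (percent_minutes_worn >= 50) {
  "More than half day"
} else if (percent_minutes_worn > 0) {
  "Less than half day"
} else {
  NA_character_
}

print(c(total_minutes_worn, round(percent_minutes_worn, 2)))
print(worn)
```

The four minute columns add up to 1091 minutes, which is about 75.76% of a 1440-minute day, so the record falls into "More than half day".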

As before, to better visualize our results we will create new data frames. In this case we will create four data frames so that we can arrange them later in a single visualization.

The first data frame covers all users and calculates the percentage of daily records that fall into each of the three wear-time categories.

The other three data frames are filtered by usage category, so that we can also compare the number of days used with the time worn per day.

minutes_worn_percent<- minutes_worn%>%
  group_by(worn) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(worn) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales::percent(total_percent))


minutes_worn_highuse <- minutes_worn%>%
  filter (usage == "high use")%>%
  group_by(worn) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(worn) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales::percent(total_percent))

minutes_worn_moduse <- minutes_worn%>%
  filter(usage == "moderate use") %>%
  group_by(worn) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(worn) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales::percent(total_percent))

minutes_worn_lowuse <- minutes_worn%>%
  filter (usage == "low use") %>%
  group_by(worn) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(worn) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales::percent(total_percent))

minutes_worn_highuse$worn <- factor(minutes_worn_highuse$worn, levels = c("All day", "More than half day", "Less than half day"))
minutes_worn_percent$worn <- factor(minutes_worn_percent$worn, levels = c("All day", "More than half day", "Less than half day"))
minutes_worn_moduse$worn <- factor(minutes_worn_moduse$worn, levels = c("All day", "More than half day", "Less than half day"))
minutes_worn_lowuse$worn <- factor(minutes_worn_lowuse$worn, levels = c("All day", "More than half day", "Less than half day"))

head(minutes_worn_percent)

## # A tibble: 3 × 3
##   worn               total_percent labels
##   <fct>                      <dbl> <chr> 
## 1 All day                   0.365  36%   
## 2 Less than half day        0.0351 4%    
## 3 More than half day        0.600  60%

head(minutes_worn_highuse)

## # A tibble: 3 × 3
##   worn               total_percent labels
##   <fct>                      <dbl> <chr> 
## 1 All day                   0.0676 6.8%  
## 2 Less than half day        0.0432 4.3%  
## 3 More than half day        0.889  88.9%

head(minutes_worn_moduse)

## # A tibble: 3 × 3
##   worn               total_percent labels
##   <fct>                      <dbl> <chr> 
## 1 All day                    0.267 27%   
## 2 Less than half day         0.04  4%    
## 3 More than half day         0.693 69%

head(minutes_worn_lowuse)

## # A tibble: 3 × 3
##   worn               total_percent labels
##   <fct>                      <dbl> <chr> 
## 1 All day                   0.802  80%   
## 2 Less than half day        0.0224 2%    
## 3 More than half day        0.175  18%
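The three per-group tables above can also be obtained in a single step with a cross-tabulation. The sketch below is an alternative, not the approach used in this analysis; the data frame `toy` is hypothetical (eight invented daily records) and only mirrors the `usage` and `worn` columns of `minutes_worn`.

```r
# Hypothetical daily records (invented for illustration): one row per user-day.
toy <- data.frame(
  usage = c("high use", "high use", "high use", "high use",
            "low use", "low use", "moderate use", "moderate use"),
  worn  = c("All day", "More than half day", "More than half day",
            "More than half day", "All day", "All day",
            "All day", "More than half day")
)

# Cross-tabulate usage by worn; margin = 1 makes each row (usage group) sum to 1.
shares_by_usage <- prop.table(table(toy$usage, toy$worn), margin = 1)

print(round(shares_by_usage, 2))
```

Each row of the result corresponds to one of the filtered data frames, which makes it easy to compare the usage groups side by side.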

Now that we have created the four data frames and also ordered worn level categories, we can visualize our results in the following plots. All the plots have been arranged together for a better visualization.

ggarrange(
  ggplot(minutes_worn_percent, aes(x="",y=total_percent, fill=worn)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+
  theme_minimal()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5)) +
    scale_fill_manual(values = c("#004d99", "#3399ff", "#cce6ff"))+
  geom_text(aes(label = labels),
            position = position_stack(vjust = 0.5), size = 3.5)+
  labs(title="Time worn per day", subtitle = "Total Users"),
  ggarrange(
  ggplot(minutes_worn_highuse, aes(x="",y=total_percent, fill=worn)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+
  theme_minimal()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5), 
        legend.position = "none")+
    scale_fill_manual(values = c("#004d99", "#3399ff", "#cce6ff"))+
  geom_text_repel(aes(label = labels),
            position = position_stack(vjust = 0.5), size = 3)+
  labs(title="", subtitle = "High use - Users"), 
  ggplot(minutes_worn_moduse, aes(x="",y=total_percent, fill=worn)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+
  theme_minimal()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold"), 
        plot.subtitle = element_text(hjust = 0.5),
        legend.position = "none") +
    scale_fill_manual(values = c("#004d99", "#3399ff", "#cce6ff"))+
  geom_text(aes(label = labels),
            position = position_stack(vjust = 0.5), size = 3)+
  labs(title="", subtitle = "Moderate use - Users"), 
  ggplot(minutes_worn_lowuse, aes(x="",y=total_percent, fill=worn)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+
  theme_minimal()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold"), 
        plot.subtitle = element_text(hjust = 0.5),
        legend.position = "none") +
    scale_fill_manual(values = c("#004d99", "#3399ff", "#cce6ff"))+
  geom_text(aes(label = labels),
            position = position_stack(vjust = 0.5), size = 3)+
  labs(title="", subtitle = "Low use - Users"), 
  ncol = 3), 
  nrow = 2)

Per our plots we can see that, across all users, the device was worn all day on 36% of the recorded days, more than half of the day on 60%, and less than half of the day on just 4%.

If we split the users by the number of days they used the device and then check how long they wore it on each of those days, we get the following results:

Just a reminder:

*high use - users who use their device between 21 and 31 days.
*moderate use - users who use their device between 11 and 20 days.
*low use - users who use their device between 1 and 10 days.

High users - Users who used their device between 21 and 31 days wore it all day on just 6.8% of their recorded days. On 88.9% of those days they wore it more than half of the day but not all day.

Moderate users - They wore the device all day on 27% of their recorded days and more than half of the day on 69%.

Low users - They wear the device longest on the days they do use it: on 80% of their recorded days they wore it all day.

#6. Conclusion (Act Phase)

Bellabeat’s mission is to empower women by providing them with the data to discover themselves. To respond to our business task and help Bellabeat with its mission, based on our results, I would advise using Bellabeat’s own tracking data for further analysis. The datasets used have a small sample and may be biased, since we did not have any demographic details about the users. Knowing that our main target is young and adult women, I would encourage continuing to look for trends in order to create a marketing strategy focused on them.

Ayoade Fakeye

2023-11-14