Google Capstone Project: Bellabeat Analysis

A Glimpse of Bellabeat

Bellabeat is a high-tech manufacturer of health-focused products for women. It was founded in 2013 by Urska Srsen and Sando Mur. Srsen, who has a background as an artist, developed beautifully designed technology that informs and inspires women around the world. Mur has a background as a mathematician. Although Bellabeat is a small company, it is a successful one: it has grown rapidly and quickly positioned itself as a tech-driven wellness company for women.

The Products

Bellabeat app: provides data related to activity, sleep, stress, menstrual cycle, and mindfulness habits that can help users better understand their current habits and make healthy decisions.

Leaf: a tracker that connects to the Bellabeat app to track activity, sleep, and stress.

Time: a wellness watch that combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress.

Spring: a water bottle that tracks daily water intake using smart technology.

Bellabeat membership: gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health, beauty, and mindfulness based on their lifestyle and goals.

ASK

What are some trends in smart device usage?
How could these trends apply to Bellabeat customers?
How could these trends help influence Bellabeat marketing strategy?

PREPARE

The dataset is available at https://www.kaggle.com/datasets/arashnic/fitbit. It is open to the public under the CC0: Public Domain license. I chose 5 of the 18 files: dailyActivity_merged.csv, hourlyIntensities_merged.csv, sleepDay_merged.csv, weightLogInfo_merged.csv, and heartrate_seconds_merged.csv.

PROCESS

I used the R programming language for this project. The data cleaning process has several steps: install and load the packages, import the datasets, determine which datasets to keep, find missing values and empty strings, check for duplicate data, standardize the formats, and clean the column names.

INSTALL THE PACKAGES

library(readr) #read_csv()
library(utils) #head()
library(tidyr) #gather() #extract() #drop_na()
library(ggplot2) #plotting
library(skimr) #skim_without_charts()
library(janitor) #clean_names()

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(dplyr) #mutate() #group_by()

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

IMPORT THE DATASET

# Import the datasets (of the 18 files, I chose 5 to analyze)
dailyActivity_merged <- read_csv("dailyActivity_merged.csv")

## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(dailyActivity_merged)

## # A tibble: 6 × 15
##       Id Activ…¹ Total…² Total…³ Track…⁴ Logge…⁵ VeryA…⁶ Moder…⁷ Light…⁸ Seden…⁹
##    <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 1.50e9 4/12/2…   13162    8.5     8.5        0    1.88   0.550    6.06       0
## 2 1.50e9 4/13/2…   10735    6.97    6.97       0    1.57   0.690    4.71       0
## 3 1.50e9 4/14/2…   10460    6.74    6.74       0    2.44   0.400    3.91       0
## 4 1.50e9 4/15/2…    9762    6.28    6.28       0    2.14   1.26     2.83       0
## 5 1.50e9 4/16/2…   12669    8.16    8.16       0    2.71   0.410    5.04       0
## 6 1.50e9 4/17/2…    9705    6.48    6.48       0    3.19   0.780    2.51       0
## # … with 5 more variables: VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## #   LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>, and
## #   abbreviated variable names ¹ActivityDate, ²TotalSteps, ³TotalDistance,
## #   ⁴TrackerDistance, ⁵LoggedActivitiesDistance, ⁶VeryActiveDistance,
## #   ⁷ModeratelyActiveDistance, ⁸LightActiveDistance, ⁹SedentaryActiveDistance

hourlyIntensities_merged <- read_csv("Fitabase Data 4.12.16-5.12.16/hourlyIntensities_merged.csv")

## Rows: 22099 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (3): Id, TotalIntensity, AverageIntensity
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(hourlyIntensities_merged)

## # A tibble: 6 × 4
##           Id ActivityHour          TotalIntensity AverageIntensity
##        <dbl> <chr>                          <dbl>            <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM             20            0.333
## 2 1503960366 4/12/2016 1:00:00 AM               8            0.133
## 3 1503960366 4/12/2016 2:00:00 AM               7            0.117
## 4 1503960366 4/12/2016 3:00:00 AM               0            0    
## 5 1503960366 4/12/2016 4:00:00 AM               0            0    
## 6 1503960366 4/12/2016 5:00:00 AM               0            0

sleepDay_merged <- read_csv("sleepDay_merged.csv")

## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(sleepDay_merged)

## # A tibble: 6 × 5
##           Id SleepDay              TotalSleepRecords TotalMinutesAsleep TotalT…¹
##        <dbl> <chr>                             <dbl>              <dbl>    <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327      346
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384      407
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412      442
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340      367
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700      712
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304      320
## # … with abbreviated variable name ¹TotalTimeInBed

weightLogInfo_merged <- read_csv("weightLogInfo_merged.csv")

## Rows: 67 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
## lgl (1): IsManualReport
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(weightLogInfo_merged)

## # A tibble: 6 × 8
##           Id Date                  WeightKg Weight…¹   Fat   BMI IsMan…²   LogId
##        <dbl> <chr>                    <dbl>    <dbl> <dbl> <dbl> <lgl>     <dbl>
## 1 1503960366 5/2/2016 11:59:59 PM      52.6     116.    22  22.6 TRUE    1.46e12
## 2 1503960366 5/3/2016 11:59:59 PM      52.6     116.    NA  22.6 TRUE    1.46e12
## 3 1927972279 4/13/2016 1:08:52 AM     134.      294.    NA  47.5 FALSE   1.46e12
## 4 2873212765 4/21/2016 11:59:59 PM     56.7     125.    NA  21.5 TRUE    1.46e12
## 5 2873212765 5/12/2016 11:59:59 PM     57.3     126.    NA  21.7 TRUE    1.46e12
## 6 4319703577 4/17/2016 11:59:59 PM     72.4     160.    25  27.5 TRUE    1.46e12
## # … with abbreviated variable names ¹WeightPounds, ²IsManualReport

heartrate_seconds_merged <- read_csv("Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv")

## Rows: 2483658 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Time
## dbl (2): Id, Value
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(heartrate_seconds_merged)

## # A tibble: 6 × 3
##           Id Time                 Value
##        <dbl> <chr>                <dbl>
## 1 2022484408 4/12/2016 7:21:00 AM    97
## 2 2022484408 4/12/2016 7:21:05 AM   102
## 3 2022484408 4/12/2016 7:21:10 AM   105
## 4 2022484408 4/12/2016 7:21:20 AM   103
## 5 2022484408 4/12/2016 7:21:25 AM   101
## 6 2022484408 4/12/2016 7:22:05 AM    95

DETERMINE THE DATASET

Before going further, I check the number of distinct users (Id values) in each dataset. If a dataset has fewer than 30, I exclude it.

merged_amount_dataset <- data.frame(dailyActivity=n_distinct(dailyActivity_merged$Id), 
                 hourlyIntensities=n_distinct(hourlyIntensities_merged$Id),
                 sleepDay=n_distinct(sleepDay_merged$Id), 
                 weightLogInfo=n_distinct(weightLogInfo_merged$Id), 
                 heartrate=n_distinct(heartrate_seconds_merged$Id))
merged_amount_dataset

##   dailyActivity hourlyIntensities sleepDay weightLogInfo heartrate
## 1            33                33       24             8        14

The result shows that only two tables, “dailyActivity” and “hourlyIntensities”, have at least 30 distinct users (33 each). Under the central limit theorem (CLT), sample sizes of 30 or more are often considered sufficient for the CLT to hold. Thus, I analyze only these two tables.

Reference for the CLT: https://www.investopedia.com/terms/c/central_limit_theorem.asp
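
The selection rule can be written as a short filter. This is a minimal sketch, not the code used in the report: the counts are copied from the output above, and the names `user_counts`, `min_users`, and `kept_tables` are introduced here for illustration only.

```r
# Distinct-user counts copied from the n_distinct() output above.
user_counts <- c(
  dailyActivity     = 33,
  hourlyIntensities = 33,
  sleepDay          = 24,
  weightLogInfo     = 8,
  heartrate         = 14
)

# Assumed cutoff: the common "n >= 30" rule of thumb cited in the text.
min_users <- 30

# Keep only the tables whose user count meets the cutoff.
kept_tables <- names(user_counts)[user_counts >= min_users]
print(kept_tables)
```

Keeping the cutoff in one variable makes it easy to see how the selection would change under a different threshold.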

DATA CLEANING

A. dailyActivity_merged data

Finding missing and empty values

skim_without_charts(dailyActivity_merged)

Data summary
Name	dailyActivity_merged
Number of rows	940
Number of columns	15
_______________________
Column type frequency:
character	1
numeric	14
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
ActivityDate	0	1	8	9	0	31	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100
Id	1	4.855407e+09	2.424805e+09	1503960366	2.320127e+09	4.445115e+09	6.962181e+09	8.877689e+09
TotalSteps	1	7.637910e+03	5.087150e+03	0	3.789750e+03	7.405500e+03	1.072700e+04	3.601900e+04
TotalDistance	1	5.490000e+00	3.920000e+00	0	2.620000e+00	5.240000e+00	7.710000e+00	2.803000e+01
TrackerDistance	1	5.480000e+00	3.910000e+00	0	2.620000e+00	5.240000e+00	7.710000e+00	2.803000e+01
LoggedActivitiesDistance	1	1.100000e-01	6.200000e-01	0	0.000000e+00	0.000000e+00	0.000000e+00	4.940000e+00
VeryActiveDistance	1	1.500000e+00	2.660000e+00	0	0.000000e+00	2.100000e-01	2.050000e+00	2.192000e+01
ModeratelyActiveDistance	1	5.700000e-01	8.800000e-01	0	0.000000e+00	2.400000e-01	8.000000e-01	6.480000e+00
LightActiveDistance	1	3.340000e+00	2.040000e+00	0	1.950000e+00	3.360000e+00	4.780000e+00	1.071000e+01
SedentaryActiveDistance	1	0.000000e+00	1.000000e-02	0	0.000000e+00	0.000000e+00	0.000000e+00	1.100000e-01
VeryActiveMinutes	1	2.116000e+01	3.284000e+01	0	0.000000e+00	4.000000e+00	3.200000e+01	2.100000e+02
FairlyActiveMinutes	1	1.356000e+01	1.999000e+01	0	0.000000e+00	6.000000e+00	1.900000e+01	1.430000e+02
LightlyActiveMinutes	1	1.928100e+02	1.091700e+02	0	1.270000e+02	1.990000e+02	2.640000e+02	5.180000e+02
SedentaryMinutes	1	9.912100e+02	3.012700e+02	0	7.297500e+02	1.057500e+03	1.229500e+03	1.440000e+03
Calories	1	2.303610e+03	7.181700e+02	0	1.828500e+03	2.134000e+03	2.793250e+03	4.900000e+03

The skim function gives a quick, broad overview of the data. Every variable has a complete_rate of 1, and ActivityDate has 0 in both the n_missing and empty columns, so there are no missing values or empty strings in this data.
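
The same two checks can be reproduced without skimr. The sketch below uses base R on a small toy data frame with hypothetical values (not the Fitbit data); `toy`, `n_missing`, and `n_empty` are names chosen for this example.

```r
# Toy data frame with one NA and one empty string per problem type
# (hypothetical values, for illustration only).
toy <- data.frame(
  activity_date = c("4/12/2016", "", NA),
  total_steps   = c(13162, NA, 9762),
  stringsAsFactors = FALSE
)

# Number of missing (NA) values in each column.
n_missing <- colSums(is.na(toy))

# Number of empty strings in each character column (NA is not counted as empty).
is_chr  <- vapply(toy, is.character, logical(1))
n_empty <- vapply(toy[is_chr], function(x) sum(x == "", na.rm = TRUE), integer(1))

print(n_missing)
print(n_empty)
```

On the real tables, both vectors should contain only zeros if the skim output above is right.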

Checking for duplicate data

unique(dailyActivity_merged$Id)

##  [1] 1503960366 1624580081 1644430081 1844505072 1927972279 2022484408
##  [7] 2026352035 2320127002 2347167796 2873212765 3372868164 3977333714
## [13] 4020332650 4057192912 4319703577 4388161847 4445114986 4558609924
## [19] 4702921684 5553957443 5577150313 6117666160 6290855005 6775888955
## [25] 6962181067 7007744171 7086361926 8053475328 8253242879 8378563200
## [31] 8583815059 8792009665 8877689391

sum(duplicated(dailyActivity_merged))

## [1] 0

No duplicate rows were detected; all observations are unique.
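
The two calls above answer different questions: unique() on the Id column lists the users, while duplicated() on the whole data frame flags repeated rows. A minimal sketch on toy data (hypothetical values) shows the difference:

```r
# Toy data: row 2 is an exact copy of row 1; rows 3 and 4 share an id
# but differ in steps (hypothetical values).
toy <- data.frame(
  id          = c(1, 1, 2, 2),
  total_steps = c(100, 100, 250, 300)
)

# Rows that are identical to an earlier row.
n_dup <- sum(duplicated(toy))

# Drop the repeated rows, keeping the first occurrence.
deduped <- toy[!duplicated(toy), ]

# Repeated ids are expected, because each user logs many days.
n_users <- length(unique(toy$id))
```

Only the exact copy counts as a duplicate here; a user appearing on several rows does not.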

Clean the names

dailyActivity_merged <- clean_names(dailyActivity_merged)
head(dailyActivity_merged)

## # A tibble: 6 × 15
##       id activ…¹ total…² total…³ track…⁴ logge…⁵ very_…⁶ moder…⁷ light…⁸ seden…⁹
##    <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 1.50e9 4/12/2…   13162    8.5     8.5        0    1.88   0.550    6.06       0
## 2 1.50e9 4/13/2…   10735    6.97    6.97       0    1.57   0.690    4.71       0
## 3 1.50e9 4/14/2…   10460    6.74    6.74       0    2.44   0.400    3.91       0
## 4 1.50e9 4/15/2…    9762    6.28    6.28       0    2.14   1.26     2.83       0
## 5 1.50e9 4/16/2…   12669    8.16    8.16       0    2.71   0.410    5.04       0
## 6 1.50e9 4/17/2…    9705    6.48    6.48       0    3.19   0.780    2.51       0
## # … with 5 more variables: very_active_minutes <dbl>,
## #   fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
## #   sedentary_minutes <dbl>, calories <dbl>, and abbreviated variable names
## #   ¹activity_date, ²total_steps, ³total_distance, ⁴tracker_distance,
## #   ⁵logged_activities_distance, ⁶very_active_distance,
## #   ⁷moderately_active_distance, ⁸light_active_distance,
## #   ⁹sedentary_active_distance

B. hourlyIntensities_merged data

Finding missing and empty values

skim_without_charts(hourlyIntensities_merged)

Data summary
Name	hourlyIntensities_merged
Number of rows	22099
Number of columns	4
_______________________
Column type frequency:
character	1
numeric	3
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
ActivityHour	0	1	19	21	0	736	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100
Id	1	4.848235e+09	2.4225e+09	1503960366	2320127002	4.445115e+09	6.962181e+09	8877689391
TotalIntensity	1	1.204000e+01	2.1130e+01	0	0	3.000000e+00	1.600000e+01	180
AverageIntensity	1	2.000000e-01	3.5000e-01	0	0	5.000000e-02	2.700000e-01	3

The skim function quickly provides a broad overview of a data frame. As the “n_missing” and “empty” columns show, the hourlyIntensities data has no missing values or empty strings to fix.

Checking for duplicate data

unique(hourlyIntensities_merged$Id)

##  [1] 1503960366 1624580081 1644430081 1844505072 1927972279 2022484408
##  [7] 2026352035 2320127002 2347167796 2873212765 3372868164 3977333714
## [13] 4020332650 4057192912 4319703577 4388161847 4445114986 4558609924
## [19] 4702921684 5553957443 5577150313 6117666160 6290855005 6775888955
## [25] 6962181067 7007744171 7086361926 8053475328 8253242879 8378563200
## [31] 8583815059 8792009665 8877689391

sum(duplicated(hourlyIntensities_merged))

## [1] 0

There are no duplicate rows; all observations are unique.

Standardization (separate the date and time)

hourlyIntensities_merged$ActivityHour <- as.POSIXct(hourlyIntensities_merged$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone())
hourlyIntensities_merged$time <- format(hourlyIntensities_merged$ActivityHour, format = "%H:%M:%S")
hourlyIntensities_merged$date <- format(hourlyIntensities_merged$ActivityHour, format = "%m/%d/%y")

head(hourlyIntensities_merged)

## # A tibble: 6 × 6
##           Id ActivityHour        TotalIntensity AverageIntensity time     date  
##        <dbl> <dttm>                       <dbl>            <dbl> <chr>    <chr> 
## 1 1503960366 2016-04-12 00:00:00             20            0.333 00:00:00 04/12…
## 2 1503960366 2016-04-12 01:00:00              8            0.133 01:00:00 04/12…
## 3 1503960366 2016-04-12 02:00:00              7            0.117 02:00:00 04/12…
## 4 1503960366 2016-04-12 03:00:00              0            0     03:00:00 04/12…
## 5 1503960366 2016-04-12 04:00:00              0            0     04:00:00 04/12…
## 6 1503960366 2016-04-12 05:00:00              0            0     05:00:00 04/12…
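
To see what the format strings do, the sketch below parses two timestamps written in the same 12-hour layout as ActivityHour. It assumes an English locale (so that `%p` recognizes AM/PM) and fixes the time zone to UTC instead of `Sys.timezone()` so the result is the same on any machine; `raw`, `parsed`, `time_part`, and `date_part` are illustrative names.

```r
# Two sample timestamps in the same layout as ActivityHour.
raw <- c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 PM")

# %I is the 12-hour clock hour and %p the AM/PM marker.
parsed <- as.POSIXct(raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Split into a 24-hour time string and a date string with a two-digit year.
time_part <- format(parsed, format = "%H:%M:%S", tz = "UTC")
date_part <- format(parsed, format = "%m/%d/%y", tz = "UTC")

print(time_part)
print(date_part)
```

Note that both derived columns are character strings, so they sort as text rather than as dates or times.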

Clean the names

hourlyIntensities_merged <- clean_names(hourlyIntensities_merged)
head(hourlyIntensities_merged)

## # A tibble: 6 × 6
##           id activity_hour       total_intensity average_intensity time    date 
##        <dbl> <dttm>                        <dbl>             <dbl> <chr>   <chr>
## 1 1503960366 2016-04-12 00:00:00              20             0.333 00:00:… 04/1…
## 2 1503960366 2016-04-12 01:00:00               8             0.133 01:00:… 04/1…
## 3 1503960366 2016-04-12 02:00:00               7             0.117 02:00:… 04/1…
## 4 1503960366 2016-04-12 03:00:00               0             0     03:00:… 04/1…
## 5 1503960366 2016-04-12 04:00:00               0             0     04:00:… 04/1…
## 6 1503960366 2016-04-12 05:00:00               0             0     05:00:… 04/1…

ANALYZE & SHARE

After cleaning, the data is ready for analysis.

A. dailyActivity_merged

Add a new column for active_minutes (I want to compare active minutes with sedentary minutes).

dailyActivity_merged <- mutate(dailyActivity_merged, active_minutes=very_active_minutes+fairly_active_minutes+lightly_active_minutes)

Add a new column that categorizes each daily record by total steps: least active (< 4363), active (4363 to 8442), and most active (> 8442).

dailyActivity_merged$Category <- ifelse(dailyActivity_merged$total_steps < 4363, 'least active', ifelse(dailyActivity_merged$total_steps >= 4363 & dailyActivity_merged$total_steps <= 8442, 'active', 'most active'))
head(dailyActivity_merged)

## # A tibble: 6 × 17
##       id activ…¹ total…² total…³ track…⁴ logge…⁵ very_…⁶ moder…⁷ light…⁸ seden…⁹
##    <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 1.50e9 4/12/2…   13162    8.5     8.5        0    1.88   0.550    6.06       0
## 2 1.50e9 4/13/2…   10735    6.97    6.97       0    1.57   0.690    4.71       0
## 3 1.50e9 4/14/2…   10460    6.74    6.74       0    2.44   0.400    3.91       0
## 4 1.50e9 4/15/2…    9762    6.28    6.28       0    2.14   1.26     2.83       0
## 5 1.50e9 4/16/2…   12669    8.16    8.16       0    2.71   0.410    5.04       0
## 6 1.50e9 4/17/2…    9705    6.48    6.48       0    3.19   0.780    2.51       0
## # … with 7 more variables: very_active_minutes <dbl>,
## #   fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
## #   sedentary_minutes <dbl>, calories <dbl>, active_minutes <dbl>,
## #   Category <chr>, and abbreviated variable names ¹activity_date,
## #   ²total_steps, ³total_distance, ⁴tracker_distance,
## #   ⁵logged_activities_distance, ⁶very_active_distance,
## #   ⁷moderately_active_distance, ⁸light_active_distance, …
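
As an alternative to the nested ifelse() call, the same three bands can be expressed with base R's cut(). This is a sketch under one assumption: step counts are whole numbers, so `< 4363` is the same as `<= 4362`. The helper name `categorize_steps` is hypothetical.

```r
# Assign each step count to one of the three bands used in the report.
# cut() uses right-closed intervals by default:
#   (-Inf, 4362]  -> least active
#   (4362, 8442]  -> active
#   (8442, Inf)   -> most active
categorize_steps <- function(steps) {
  cut(
    steps,
    breaks = c(-Inf, 4362, 8442, Inf),
    labels = c("least active", "active", "most active")
  )
}

# Values on both sides of each boundary.
example <- categorize_steps(c(0, 4362, 4363, 8442, 8443))
print(example)
```

cut() returns a factor whose levels run from least to most active, so a bar chart of the result is drawn in that order instead of alphabetically.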

Statistical summary

summary(dailyActivity_merged)

##        id            activity_date       total_steps    total_distance  
##  Min.   :1.504e+09   Length:940         Min.   :    0   Min.   : 0.000  
##  1st Qu.:2.320e+09   Class :character   1st Qu.: 3790   1st Qu.: 2.620  
##  Median :4.445e+09   Mode  :character   Median : 7406   Median : 5.245  
##  Mean   :4.855e+09                      Mean   : 7638   Mean   : 5.490  
##  3rd Qu.:6.962e+09                      3rd Qu.:10727   3rd Qu.: 7.713  
##  Max.   :8.878e+09                      Max.   :36019   Max.   :28.030  
##  tracker_distance logged_activities_distance very_active_distance
##  Min.   : 0.000   Min.   :0.0000             Min.   : 0.000      
##  1st Qu.: 2.620   1st Qu.:0.0000             1st Qu.: 0.000      
##  Median : 5.245   Median :0.0000             Median : 0.210      
##  Mean   : 5.475   Mean   :0.1082             Mean   : 1.503      
##  3rd Qu.: 7.710   3rd Qu.:0.0000             3rd Qu.: 2.053      
##  Max.   :28.030   Max.   :4.9421             Max.   :21.920      
##  moderately_active_distance light_active_distance sedentary_active_distance
##  Min.   :0.0000             Min.   : 0.000        Min.   :0.000000         
##  1st Qu.:0.0000             1st Qu.: 1.945        1st Qu.:0.000000         
##  Median :0.2400             Median : 3.365        Median :0.000000         
##  Mean   :0.5675             Mean   : 3.341        Mean   :0.001606         
##  3rd Qu.:0.8000             3rd Qu.: 4.782        3rd Qu.:0.000000         
##  Max.   :6.4800             Max.   :10.710        Max.   :0.110000         
##  very_active_minutes fairly_active_minutes lightly_active_minutes
##  Min.   :  0.00      Min.   :  0.00        Min.   :  0.0         
##  1st Qu.:  0.00      1st Qu.:  0.00        1st Qu.:127.0         
##  Median :  4.00      Median :  6.00        Median :199.0         
##  Mean   : 21.16      Mean   : 13.56        Mean   :192.8         
##  3rd Qu.: 32.00      3rd Qu.: 19.00        3rd Qu.:264.0         
##  Max.   :210.00      Max.   :143.00        Max.   :518.0         
##  sedentary_minutes    calories    active_minutes    Category        
##  Min.   :   0.0    Min.   :   0   Min.   :  0.0   Length:940        
##  1st Qu.: 729.8    1st Qu.:1828   1st Qu.:146.8   Class :character  
##  Median :1057.5    Median :2134   Median :247.0   Mode  :character  
##  Mean   : 991.2    Mean   :2304   Mean   :227.5                     
##  3rd Qu.:1229.5    3rd Qu.:2793   3rd Qu.:317.2                     
##  Max.   :1440.0    Max.   :4900   Max.   :552.0

From the summary statistics above, I found that users spent most of their time inactive or in low-activity states: the mean sedentary time (991.2 minutes) is far higher than the mean active time (227.5 minutes). The mean total steps is 7638, and according to a firstquotehealth article, 7000 steps a day is ideal (active) for adults (20-65 y.o.). Thus, even though sedentary minutes exceed active minutes, the average user still counts as active.

reference: https://firstquotehealth.com/health-insurance/news/recommended-steps-day
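
To put the comparison of the two means on one scale, the sedentary share of tracked minutes can be computed directly from the summary output. This is a small worked calculation, not part of the original analysis; the variable names are illustrative.

```r
# Means copied from the summary() output above (minutes per day).
mean_sedentary <- 991.2
mean_active    <- 227.5

# Share of the tracked minutes (sedentary + active) that is sedentary.
tracked_minutes <- mean_sedentary + mean_active
sedentary_share <- mean_sedentary / tracked_minutes

# As a percentage, rounded to one decimal place.
print(round(100 * sedentary_share, 1))
```

On these means, roughly four out of five tracked minutes are sedentary.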

Finding the trend between the variable total steps and calories (per-ID)

ggplot(data=dailyActivity_merged,aes(x=total_steps,y=calories))+
  geom_point(aes(color=id))+
  facet_wrap(~id) +
  labs(title="Total Steps Vs Calories per ID")

Each panel shows one user's total steps against calories. I found that the trend is almost the same for each user.

Total steps Vs Calories burned

ggplot(data=dailyActivity_merged,aes(x=total_steps,y=calories))+
  geom_point(color= "green") +
  geom_smooth(color = "red") +
  labs(title="Total Steps Vs Calories Burned")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

As expected, the plot shows a positive relationship between total steps and calories burned: the more steps taken, the more calories burned.

Make new table for sum of active minutes and sedentary minutes

amount_active_sedentary <- data.frame(sum_active_minutes = sum(dailyActivity_merged$active_minutes), sum_sedentary_minutes = sum(dailyActivity_merged$sedentary_minutes))

Convert horizontal value into a vertical value

amount_active_sedentary <- gather(amount_active_sedentary, key = "Category", value = "Value", sum_active_minutes, sum_sedentary_minutes)
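
The wide-to-long step can be seen on a toy table. Note that the tidyr documentation marks gather() as superseded by pivot_longer(), although gather() still works. The sketch below swaps in base R's stack() as a dependency-free stand-in for illustration, with hypothetical totals; it is not the method used in the report.

```r
# One-row wide table with hypothetical totals (not the real sums).
wide <- data.frame(sum_active_minutes = 100, sum_sedentary_minutes = 400)

# stack() returns two columns: `values` (the numbers) and `ind` (the source column).
long <- stack(wide)

# Rename and reorder to match the Category / Value layout used in the report.
names(long) <- c("Value", "Category")
long <- long[, c("Category", "Value")]
print(long)
```

The result has one row per former column, which is the shape ggplot2 needs to draw one bar per category.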

Make bar plot for active minutes and sedentary minutes

ggplot(amount_active_sedentary, aes(x = Category, y = Value, fill = Category)) +
  geom_bar(stat = "identity") +
  labs(title="Sum of Active Vs Sedentary Minutes")

The plot shows a large gap between total sedentary minutes and total active minutes, which means users spent most of their time inactive or in low-activity states.

Find the trend between active minutes and calories

ggplot(data=dailyActivity_merged,aes(x=active_minutes,y=calories))+
  geom_point(color="blue") +
  geom_smooth(color="red") +
  labs(title="Active Minutes Vs Calories")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The plot shows that calories burned tend to increase with active minutes.

Find the trend between total steps and total distance

ggplot(data=dailyActivity_merged,aes(x=total_steps,y=total_distance))+
  geom_point() +
  labs(title="Total Steps Vs Total Distance")

Based on the plot above, total steps are positively correlated with total distance: the more steps you take, the more distance you cover.
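
A scatter plot shows the direction of the relationship; a correlation coefficient puts a number on it. The sketch below applies cor() to the six rows of dailyActivity_merged printed by head() earlier, so it only illustrates the call and says nothing about the full table.

```r
# First six rows of TotalSteps and TotalDistance, copied from the head() output.
total_steps    <- c(13162, 10735, 10460, 9762, 12669, 9705)
total_distance <- c(8.5, 6.97, 6.74, 6.28, 8.16, 6.48)

# Pearson correlation: values near 1 indicate a strong positive linear relationship.
r <- cor(total_steps, total_distance)
print(r)
```

On the full data the equivalent call would be cor(dailyActivity_merged$total_steps, dailyActivity_merged$total_distance).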

Categories of steps

ggplot(dailyActivity_merged, aes(x = Category, fill = Category)) + 
  geom_bar() +
  labs(title = "Categories of Steps")

As described earlier, I categorized daily step counts to examine user behavior. In the result, most active has the highest count, followed by active, with least active last. That means users were recorded as active or most active more often than as least active.

Categories of steps per-ID

ggplot(dailyActivity_merged, aes(x = Category, fill = id)) + 
  geom_bar(position = position_dodge()) + 
  facet_wrap(~id) +
  labs(title="Categories of Active per ID")

This plot shows each user's distribution across the activity categories, and the pattern varies from user to user.

B. hourlyIntensities Data

Statistical summary

print(summary(hourlyIntensities_merged))

##        id            activity_hour                    total_intensity 
##  Min.   :1.504e+09   Min.   :2016-04-12 00:00:00.00   Min.   :  0.00  
##  1st Qu.:2.320e+09   1st Qu.:2016-04-19 01:00:00.00   1st Qu.:  0.00  
##  Median :4.445e+09   Median :2016-04-26 06:00:00.00   Median :  3.00  
##  Mean   :4.848e+09   Mean   :2016-04-26 11:46:42.58   Mean   : 12.04  
##  3rd Qu.:6.962e+09   3rd Qu.:2016-05-03 19:00:00.00   3rd Qu.: 16.00  
##  Max.   :8.878e+09   Max.   :2016-05-12 15:00:00.00   Max.   :180.00  
##  average_intensity     time               date          
##  Min.   :0.0000    Length:22099       Length:22099      
##  1st Qu.:0.0000    Class :character   Class :character  
##  Median :0.0500    Mode  :character   Mode  :character  
##  Mean   :0.2006                                         
##  3rd Qu.:0.2667                                         
##  Max.   :3.0000

Average total intensities Vs Time

int_new <- hourlyIntensities_merged %>%
  group_by(time) %>%
  drop_na() %>%
  summarise(mean_total_int = mean(total_intensity))

ggplot(data=int_new, aes(x=time, y=mean_total_int)) +
  geom_col(fill='black') +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title="Average Total Intensity vs. Time")

The plot shows two peaks in intensity. The first is between 12:00 and 14:00, which I assume is when users take a break and go for lunch. The second is between 17:00 and 19:00, which I assume is when users have finished work and are heading home. Hence, both peaks fall at times when users have just finished a stretch of work that took a lot of energy and concentration.

NB: these explanations are assumptions; further analysis is needed to confirm them.
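
Reading peaks off a bar chart can be cross-checked by ranking the hourly means. The sketch below uses base R on toy data with hypothetical intensities (not the Fitbit values); `toy`, `hourly_mean`, and `peak_hours` are names chosen for this example.

```r
# Toy hourly records: two readings per hour (hypothetical values).
toy <- data.frame(
  time = c("03:00:00", "03:00:00", "09:00:00", "09:00:00",
           "12:00:00", "12:00:00", "18:00:00", "18:00:00"),
  total_intensity = c(0, 2, 10, 12, 20, 24, 26, 30),
  stringsAsFactors = FALSE
)

# Mean intensity for each hour of the day.
hourly_mean <- aggregate(total_intensity ~ time, data = toy, FUN = mean)

# Sort from highest to lowest mean and take the two busiest hours.
ranked     <- hourly_mean[order(-hourly_mean$total_intensity), ]
peak_hours <- as.character(ranked$time[1:2])
print(peak_hours)
```

Applied to int_new, the same ranking would list the hours behind the two peaks described above.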

ACT

Conclusion

The most useful table to analyze is dailyActivity_merged. The activity categories show that users were recorded as most active more often than least active. Although sedentary minutes are higher than active minutes, calories burned rise with total steps and total active minutes. Intensity peaks between 12:00 and 14:00, when people usually break for lunch after half a day of work, and between 17:00 and 19:00, when people usually finish work and head home after spending energy all day.

Suggestions

I suggest the marketing team use the “Categories of Active per ID” result to identify our potential customers (users).

I suggest Bellabeat add a feature that tracks users’ location history; with place tracking we can better understand our customers’ activity. When I produced the Average Total Intensity vs. Time report, I could only assume what customers were doing based on what people commonly do each day. Some say 17:00 - 19:00 is when users go to the gym, but we have no clear evidence, so I highly recommend a place-tracking feature (while still protecting users’ privacy).

Because Bellabeat focuses on women, we can add another feature (or product) that helps women track their menstrual periods and reminds them when they are overworked or overly active during their periods.

Give Bellabeat customers daily, weekly, and annual dashboard reports. Daily: we can give them a report about their total steps and calories. Weekly: we can give them a report about total steps, calories, their peak time hour, categories of active, etc. Annual: we can give them a summary report of their health and give them some recommendations for a healthy life based on the report.

The most important thing is to be a “best friend” to our customers, always reminding them about their health. There is no better way to make a customer loyal than a psychological approach.

Khadijah, Sitti

2022-12-27