A Glimpse of Bellabeat
Bellabeat is a high-tech manufacturer of health-focused products for women. It was co-founded by Urska Srsen and Sando Mur. Srsen drew on her background as an artist to develop beautifully designed technology that informs and inspires women around the world; Mur has a background as a mathematician. Although Bellabeat is a small company, it is a successful one. Founded in 2013, it has grown rapidly and quickly positioned itself as a tech-driven wellness company for women.
The Products
Bellabeat app: provides data related to activity, sleep, stress, menstrual cycle, and mindfulness habits that can help users better understand their current habits and make healthy decisions.
Leaf: a tracker that connects to the Bellabeat app to track activity, sleep, and stress.
Time: a wellness watch that combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress.
Spring: a water bottle that tracks daily water intake using smart technology.
Bellabeat membership: gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.
ASK
PREPARE
The dataset is available at https://www.kaggle.com/datasets/arashnic/fitbit. I chose 5 of the 18 datasets. The data is open to the public under the CC0: Public Domain license. The datasets I used are: dailyActivity_merged.csv, hourlyIntensities_merged.csv, sleepDay_merged.csv, weightLogInfo_merged.csv, and heartrate_seconds_merged.csv.
PROCESS
I used the R programming language for this project. The data cleaning process has several steps: load the packages, import the datasets, determine which datasets to keep, find missing values and empty objects, check for duplicate data, standardize formats, and clean the column names.
INSTALL THE PACKAGES
library(readr) #read_csv()
library(utils) #head()
library(tidyr) #gather() #extract() #drop_na()
library(ggplot2) #plotting
library(skimr) #skim_without_charts()
library(janitor) #clean_names()
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(dplyr) #mutate() #group_by()
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
IMPORT THE DATASET
# Import the datasets (of the 18 datasets, I chose 5 to analyze)
dailyActivity_merged <- read_csv("dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(dailyActivity_merged)
## # A tibble: 6 × 15
## Id Activ…¹ Total…² Total…³ Track…⁴ Logge…⁵ VeryA…⁶ Moder…⁷ Light…⁸ Seden…⁹
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2… 13162 8.5 8.5 0 1.88 0.550 6.06 0
## 2 1.50e9 4/13/2… 10735 6.97 6.97 0 1.57 0.690 4.71 0
## 3 1.50e9 4/14/2… 10460 6.74 6.74 0 2.44 0.400 3.91 0
## 4 1.50e9 4/15/2… 9762 6.28 6.28 0 2.14 1.26 2.83 0
## 5 1.50e9 4/16/2… 12669 8.16 8.16 0 2.71 0.410 5.04 0
## 6 1.50e9 4/17/2… 9705 6.48 6.48 0 3.19 0.780 2.51 0
## # … with 5 more variables: VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## # LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>, and
## # abbreviated variable names ¹ActivityDate, ²TotalSteps, ³TotalDistance,
## # ⁴TrackerDistance, ⁵LoggedActivitiesDistance, ⁶VeryActiveDistance,
## # ⁷ModeratelyActiveDistance, ⁸LightActiveDistance, ⁹SedentaryActiveDistance
hourlyIntensities_merged <- read_csv("Fitabase Data 4.12.16-5.12.16/hourlyIntensities_merged.csv")
## Rows: 22099 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (3): Id, TotalIntensity, AverageIntensity
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(hourlyIntensities_merged)
## # A tibble: 6 × 4
## Id ActivityHour TotalIntensity AverageIntensity
## <dbl> <chr> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 20 0.333
## 2 1503960366 4/12/2016 1:00:00 AM 8 0.133
## 3 1503960366 4/12/2016 2:00:00 AM 7 0.117
## 4 1503960366 4/12/2016 3:00:00 AM 0 0
## 5 1503960366 4/12/2016 4:00:00 AM 0 0
## 6 1503960366 4/12/2016 5:00:00 AM 0 0
sleepDay_merged <- read_csv("sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(sleepDay_merged)
## # A tibble: 6 × 5
## Id SleepDay TotalSleepRecords TotalMinutesAsleep TotalT…¹
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 1 327 346
## 2 1503960366 4/13/2016 12:00:00 AM 2 384 407
## 3 1503960366 4/15/2016 12:00:00 AM 1 412 442
## 4 1503960366 4/16/2016 12:00:00 AM 2 340 367
## 5 1503960366 4/17/2016 12:00:00 AM 1 700 712
## 6 1503960366 4/19/2016 12:00:00 AM 1 304 320
## # … with abbreviated variable name ¹TotalTimeInBed
weightLogInfo_merged <- read_csv("weightLogInfo_merged.csv")
## Rows: 67 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
## lgl (1): IsManualReport
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(weightLogInfo_merged)
## # A tibble: 6 × 8
## Id Date WeightKg Weight…¹ Fat BMI IsMan…² LogId
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl>
## 1 1503960366 5/2/2016 11:59:59 PM 52.6 116. 22 22.6 TRUE 1.46e12
## 2 1503960366 5/3/2016 11:59:59 PM 52.6 116. NA 22.6 TRUE 1.46e12
## 3 1927972279 4/13/2016 1:08:52 AM 134. 294. NA 47.5 FALSE 1.46e12
## 4 2873212765 4/21/2016 11:59:59 PM 56.7 125. NA 21.5 TRUE 1.46e12
## 5 2873212765 5/12/2016 11:59:59 PM 57.3 126. NA 21.7 TRUE 1.46e12
## 6 4319703577 4/17/2016 11:59:59 PM 72.4 160. 25 27.5 TRUE 1.46e12
## # … with abbreviated variable names ¹WeightPounds, ²IsManualReport
heartrate_seconds_merged <- read_csv("Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv")
## Rows: 2483658 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Time
## dbl (2): Id, Value
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(heartrate_seconds_merged)
## # A tibble: 6 × 3
## Id Time Value
## <dbl> <chr> <dbl>
## 1 2022484408 4/12/2016 7:21:00 AM 97
## 2 2022484408 4/12/2016 7:21:05 AM 102
## 3 2022484408 4/12/2016 7:21:10 AM 105
## 4 2022484408 4/12/2016 7:21:20 AM 103
## 5 2022484408 4/12/2016 7:21:25 AM 101
## 6 2022484408 4/12/2016 7:22:05 AM 95
DETERMINE THE DATASET
Before going further, I check the number of distinct users (Id) in each dataset. If a dataset has fewer than 30 users, I exclude it.
merged_amount_dataset <- data.frame(dailyActivity=n_distinct(dailyActivity_merged$Id),
hourlyIntensities=n_distinct(hourlyIntensities_merged$Id),
sleepDay=n_distinct(sleepDay_merged$Id),
weightLogInfo=n_distinct(weightLogInfo_merged$Id),
heartrate=n_distinct(heartrate_seconds_merged$Id))
merged_amount_dataset
## dailyActivity hourlyIntensities sleepDay weightLogInfo heartrate
## 1 33 33 24 8 14
The result shows that only two tables, “dailyActivity” and “hourlyIntensities”, have more than 30 distinct users (33 each). Under the central limit theorem (CLT), a sample size of 30 or more is often considered sufficient for the CLT to hold. Thus, I will analyze only those two tables.
reference of CLT: https://www.investopedia.com/terms/c/central_limit_theorem.asp#:~:text=The%20central%20limit%20theorem%20%28CLT%29%20states%20that%20the,often%20considered%20sufficient%20for%20the%20CLT%20to%20hold.
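The 30-user cutoff can be applied directly to the counts above. The sketch below is only an illustration: it re-enters the five counts from the output by hand, and the names n_users, min_users, and kept_tables are my own, not part of the original analysis.

```r
# Distinct-user counts copied from the n_distinct() output above
n_users <- c(dailyActivity = 33, hourlyIntensities = 33,
             sleepDay = 24, weightLogInfo = 8, heartrate = 14)

# Rule of thumb from the text: keep tables with at least 30 users
min_users <- 30
kept_tables <- names(n_users)[n_users >= min_users]
print(kept_tables)
```

Keeping the threshold in one variable makes it easy to see how the selection would change under a looser or stricter cutoff.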
DATA CLEANING
A. dailyActivity_merged data
skim_without_charts(dailyActivity_merged)
| Name | dailyActivity_merged |
| Number of rows | 940 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 14 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ActivityDate | 0 | 1 | 8 | 9 | 0 | 31 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| Id | 0 | 1 | 4.855407e+09 | 2.424805e+09 | 1503960366 | 2.320127e+09 | 4.445115e+09 | 6.962181e+09 | 8.877689e+09 |
| TotalSteps | 0 | 1 | 7.637910e+03 | 5.087150e+03 | 0 | 3.789750e+03 | 7.405500e+03 | 1.072700e+04 | 3.601900e+04 |
| TotalDistance | 0 | 1 | 5.490000e+00 | 3.920000e+00 | 0 | 2.620000e+00 | 5.240000e+00 | 7.710000e+00 | 2.803000e+01 |
| TrackerDistance | 0 | 1 | 5.480000e+00 | 3.910000e+00 | 0 | 2.620000e+00 | 5.240000e+00 | 7.710000e+00 | 2.803000e+01 |
| LoggedActivitiesDistance | 0 | 1 | 1.100000e-01 | 6.200000e-01 | 0 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 4.940000e+00 |
| VeryActiveDistance | 0 | 1 | 1.500000e+00 | 2.660000e+00 | 0 | 0.000000e+00 | 2.100000e-01 | 2.050000e+00 | 2.192000e+01 |
| ModeratelyActiveDistance | 0 | 1 | 5.700000e-01 | 8.800000e-01 | 0 | 0.000000e+00 | 2.400000e-01 | 8.000000e-01 | 6.480000e+00 |
| LightActiveDistance | 0 | 1 | 3.340000e+00 | 2.040000e+00 | 0 | 1.950000e+00 | 3.360000e+00 | 4.780000e+00 | 1.071000e+01 |
| SedentaryActiveDistance | 0 | 1 | 0.000000e+00 | 1.000000e-02 | 0 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.100000e-01 |
| VeryActiveMinutes | 0 | 1 | 2.116000e+01 | 3.284000e+01 | 0 | 0.000000e+00 | 4.000000e+00 | 3.200000e+01 | 2.100000e+02 |
| FairlyActiveMinutes | 0 | 1 | 1.356000e+01 | 1.999000e+01 | 0 | 0.000000e+00 | 6.000000e+00 | 1.900000e+01 | 1.430000e+02 |
| LightlyActiveMinutes | 0 | 1 | 1.928100e+02 | 1.091700e+02 | 0 | 1.270000e+02 | 1.990000e+02 | 2.640000e+02 | 5.180000e+02 |
| SedentaryMinutes | 0 | 1 | 9.912100e+02 | 3.012700e+02 | 0 | 7.297500e+02 | 1.057500e+03 | 1.229500e+03 | 1.440000e+03 |
| Calories | 0 | 1 | 2.303610e+03 | 7.181700e+02 | 0 | 1.828500e+03 | 2.134000e+03 | 2.793250e+03 | 4.900000e+03 |
The skim function gives a quick, broad overview of the data. It shows that there are no missing values or empty objects in this table (see the “n_missing” and “empty” columns).
unique(dailyActivity_merged$Id)
## [1] 1503960366 1624580081 1644430081 1844505072 1927972279 2022484408
## [7] 2026352035 2320127002 2347167796 2873212765 3372868164 3977333714
## [13] 4020332650 4057192912 4319703577 4388161847 4445114986 4558609924
## [19] 4702921684 5553957443 5577150313 6117666160 6290855005 6775888955
## [25] 6962181067 7007744171 7086361926 8053475328 8253242879 8378563200
## [31] 8583815059 8792009665 8877689391
sum(duplicated(dailyActivity_merged))
## [1] 0
No duplicate rows were detected; all observations are unique.
dailyActivity_merged <- clean_names(dailyActivity_merged)
head(dailyActivity_merged)
## # A tibble: 6 × 15
## id activ…¹ total…² total…³ track…⁴ logge…⁵ very_…⁶ moder…⁷ light…⁸ seden…⁹
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2… 13162 8.5 8.5 0 1.88 0.550 6.06 0
## 2 1.50e9 4/13/2… 10735 6.97 6.97 0 1.57 0.690 4.71 0
## 3 1.50e9 4/14/2… 10460 6.74 6.74 0 2.44 0.400 3.91 0
## 4 1.50e9 4/15/2… 9762 6.28 6.28 0 2.14 1.26 2.83 0
## 5 1.50e9 4/16/2… 12669 8.16 8.16 0 2.71 0.410 5.04 0
## 6 1.50e9 4/17/2… 9705 6.48 6.48 0 3.19 0.780 2.51 0
## # … with 5 more variables: very_active_minutes <dbl>,
## # fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
## # sedentary_minutes <dbl>, calories <dbl>, and abbreviated variable names
## # ¹activity_date, ²total_steps, ³total_distance, ⁴tracker_distance,
## # ⁵logged_activities_distance, ⁶very_active_distance,
## # ⁷moderately_active_distance, ⁸light_active_distance,
## # ⁹sedentary_active_distance
B. hourlyIntensities_merged data
skim_without_charts(hourlyIntensities_merged)
| Name | hourlyIntensities_merged |
| Number of rows | 22099 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ActivityHour | 0 | 1 | 19 | 21 | 0 | 736 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| Id | 0 | 1 | 4.848235e+09 | 2.4225e+09 | 1503960366 | 2320127002 | 4.445115e+09 | 6.962181e+09 | 8877689391 |
| TotalIntensity | 0 | 1 | 1.204000e+01 | 2.1130e+01 | 0 | 0 | 3.000000e+00 | 1.600000e+01 | 180 |
| AverageIntensity | 0 | 1 | 2.000000e-01 | 3.5000e-01 | 0 | 0 | 5.000000e-02 | 2.700000e-01 | 3 |
The skim function quickly provides a broad overview of a data frame. As we can see, the hourlyIntensities data has no missing values or empty objects to fix. These are reported in the “n_missing” and “empty” columns.
unique(hourlyIntensities_merged$Id)
## [1] 1503960366 1624580081 1644430081 1844505072 1927972279 2022484408
## [7] 2026352035 2320127002 2347167796 2873212765 3372868164 3977333714
## [13] 4020332650 4057192912 4319703577 4388161847 4445114986 4558609924
## [19] 4702921684 5553957443 5577150313 6117666160 6290855005 6775888955
## [25] 6962181067 7007744171 7086361926 8053475328 8253242879 8378563200
## [31] 8583815059 8792009665 8877689391
sum(duplicated(hourlyIntensities_merged))
## [1] 0
There are no duplicate rows; all observations are unique.
hourlyIntensities_merged$ActivityHour=as.POSIXct(hourlyIntensities_merged$ActivityHour, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
hourlyIntensities_merged$time <- format(hourlyIntensities_merged$ActivityHour, format = "%H:%M:%S")
hourlyIntensities_merged$date <- format(hourlyIntensities_merged$ActivityHour, format = "%m/%d/%y")
head(hourlyIntensities_merged)
## # A tibble: 6 × 6
## Id ActivityHour TotalIntensity AverageIntensity time date
## <dbl> <dttm> <dbl> <dbl> <chr> <chr>
## 1 1503960366 2016-04-12 00:00:00 20 0.333 00:00:00 04/12…
## 2 1503960366 2016-04-12 01:00:00 8 0.133 01:00:00 04/12…
## 3 1503960366 2016-04-12 02:00:00 7 0.117 02:00:00 04/12…
## 4 1503960366 2016-04-12 03:00:00 0 0 03:00:00 04/12…
## 5 1503960366 2016-04-12 04:00:00 0 0 04:00:00 04/12…
## 6 1503960366 2016-04-12 05:00:00 0 0 05:00:00 04/12…
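The date-time conversion above can be tried on a few strings in isolation. This is a minimal sketch under some assumptions: the first two strings come from the head() output, the third is made up to show a PM value, the variable names are mine, and I use tz = "UTC" instead of Sys.timezone() so the result does not depend on the machine's local time zone.

```r
# Sample ActivityHour strings (the third one is a made-up PM example)
raw <- c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM", "4/12/2016 5:00:00 PM")

# %I is the 12-hour clock hour and %p is the AM/PM marker
parsed <- as.POSIXct(raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Split into 24-hour time and date strings, as in the analysis
time_part <- format(parsed, format = "%H:%M:%S")
date_part <- format(parsed, format = "%m/%d/%y")
print(data.frame(raw, time_part, date_part))
```

A string that does not match the format becomes NA, so checking sum(is.na(parsed)) after the conversion is a cheap safeguard.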
hourlyIntensities_merged <- clean_names(hourlyIntensities_merged)
head(hourlyIntensities_merged)
## # A tibble: 6 × 6
## id activity_hour total_intensity average_intensity time date
## <dbl> <dttm> <dbl> <dbl> <chr> <chr>
## 1 1503960366 2016-04-12 00:00:00 20 0.333 00:00:… 04/1…
## 2 1503960366 2016-04-12 01:00:00 8 0.133 01:00:… 04/1…
## 3 1503960366 2016-04-12 02:00:00 7 0.117 02:00:… 04/1…
## 4 1503960366 2016-04-12 03:00:00 0 0 03:00:… 04/1…
## 5 1503960366 2016-04-12 04:00:00 0 0 04:00:… 04/1…
## 6 1503960366 2016-04-12 05:00:00 0 0 05:00:… 04/1…
ANALYZE & SHARE
After cleaning the data, I confirmed that it is ready for analysis.
A. dailyActivity_merged
Add a new column for active_minutes (I want to see the trend between active minutes and sedentary minutes).
dailyActivity_merged <- mutate(dailyActivity_merged, active_minutes=very_active_minutes+fairly_active_minutes+lightly_active_minutes)
Add a new column that categorizes each record by total steps as least active, active, or most active (least active: < 4363; active: 4363 to 8442; most active: > 8442).
dailyActivity_merged$Category <- ifelse(dailyActivity_merged$total_steps < 4363, 'least active', ifelse(dailyActivity_merged$total_steps >= 4363 & dailyActivity_merged$total_steps <= 8442, 'active', 'most active'))
head(dailyActivity_merged)
## # A tibble: 6 × 17
## id activ…¹ total…² total…³ track…⁴ logge…⁵ very_…⁶ moder…⁷ light…⁸ seden…⁹
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2… 13162 8.5 8.5 0 1.88 0.550 6.06 0
## 2 1.50e9 4/13/2… 10735 6.97 6.97 0 1.57 0.690 4.71 0
## 3 1.50e9 4/14/2… 10460 6.74 6.74 0 2.44 0.400 3.91 0
## 4 1.50e9 4/15/2… 9762 6.28 6.28 0 2.14 1.26 2.83 0
## 5 1.50e9 4/16/2… 12669 8.16 8.16 0 2.71 0.410 5.04 0
## 6 1.50e9 4/17/2… 9705 6.48 6.48 0 3.19 0.780 2.51 0
## # … with 7 more variables: very_active_minutes <dbl>,
## # fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
## # sedentary_minutes <dbl>, calories <dbl>, active_minutes <dbl>,
## # Category <chr>, and abbreviated variable names ¹activity_date,
## # ²total_steps, ³total_distance, ⁴tracker_distance,
## # ⁵logged_activities_distance, ⁶very_active_distance,
## # ⁷moderately_active_distance, ⁸light_active_distance, …
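The same three categories can also be built with cut() instead of nested ifelse() calls. This is only a sketch of an alternative, not the method used above; it assumes step counts are whole numbers, and the sample values (apart from 13162, which is in the head() output) are boundary cases chosen for illustration.

```r
# Example step counts: one real value plus boundary cases
steps <- c(13162, 0, 4362, 4363, 8442, 8443)

# Intervals are (-Inf, 4362], (4362, 8442], (8442, Inf). For whole-number
# step counts this matches: < 4363, 4363 to 8442, > 8442.
category <- cut(steps,
                breaks = c(-Inf, 4362, 8442, Inf),
                labels = c("least active", "active", "most active"))
print(data.frame(steps, category))
```

cut() returns a factor whose levels follow the order of the labels, so bar plots would show the categories from least to most active rather than alphabetically.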
Statistical summary
summary(dailyActivity_merged)
## id activity_date total_steps total_distance
## Min. :1.504e+09 Length:940 Min. : 0 Min. : 0.000
## 1st Qu.:2.320e+09 Class :character 1st Qu.: 3790 1st Qu.: 2.620
## Median :4.445e+09 Mode :character Median : 7406 Median : 5.245
## Mean :4.855e+09 Mean : 7638 Mean : 5.490
## 3rd Qu.:6.962e+09 3rd Qu.:10727 3rd Qu.: 7.713
## Max. :8.878e+09 Max. :36019 Max. :28.030
## tracker_distance logged_activities_distance very_active_distance
## Min. : 0.000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 2.620 1st Qu.:0.0000 1st Qu.: 0.000
## Median : 5.245 Median :0.0000 Median : 0.210
## Mean : 5.475 Mean :0.1082 Mean : 1.503
## 3rd Qu.: 7.710 3rd Qu.:0.0000 3rd Qu.: 2.053
## Max. :28.030 Max. :4.9421 Max. :21.920
## moderately_active_distance light_active_distance sedentary_active_distance
## Min. :0.0000 Min. : 0.000 Min. :0.000000
## 1st Qu.:0.0000 1st Qu.: 1.945 1st Qu.:0.000000
## Median :0.2400 Median : 3.365 Median :0.000000
## Mean :0.5675 Mean : 3.341 Mean :0.001606
## 3rd Qu.:0.8000 3rd Qu.: 4.782 3rd Qu.:0.000000
## Max. :6.4800 Max. :10.710 Max. :0.110000
## very_active_minutes fairly_active_minutes lightly_active_minutes
## Min. : 0.00 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:127.0
## Median : 4.00 Median : 6.00 Median :199.0
## Mean : 21.16 Mean : 13.56 Mean :192.8
## 3rd Qu.: 32.00 3rd Qu.: 19.00 3rd Qu.:264.0
## Max. :210.00 Max. :143.00 Max. :518.0
## sedentary_minutes calories active_minutes Category
## Min. : 0.0 Min. : 0 Min. : 0.0 Length:940
## 1st Qu.: 729.8 1st Qu.:1828 1st Qu.:146.8 Class :character
## Median :1057.5 Median :2134 Median :247.0 Mode :character
## Mean : 991.2 Mean :2304 Mean :227.5
## 3rd Qu.:1229.5 3rd Qu.:2793 3rd Qu.:317.2
## Max. :1440.0 Max. :4900 Max. :552.0
From the summary statistics above, I found that users spend most of their time inactive or less active: mean sedentary minutes (991.2) are far higher than mean active minutes (227.5). At the same time, the mean of total steps is 7638, and according to the firstquotehealth article, 7000 steps per day is an ideal (active) level for adults aged 20-65. Thus, even though sedentary minutes exceed active minutes, the average user still counts as active.
reference: https://firstquotehealth.com/health-insurance/news/recommended-steps-day
Finding the trend between the variable total steps and calories (per-ID)
ggplot(data=dailyActivity_merged,aes(x=total_steps,y=calories))+
geom_point(aes(color=id))+
facet_wrap(~id) +
labs(title="Total Steps Vs Calories per ID")
Each panel shows one user's total steps against calories. I found that the trend is almost the same for every user.
Total steps Vs Calories burned
ggplot(data=dailyActivity_merged,aes(x=total_steps,y=calories))+
geom_point(color= "green") +
geom_smooth(color = "red") +
labs(title="Total Steps Vs Calories Burned")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
As we would expect, the plot shows a positive relationship between total steps and calories burned: the more steps you take, the more calories you burn.
Make new table for sum of active minutes and sedentary minutes
amount_active_sedentary <- data.frame(sum_active_minutes = sum(dailyActivity_merged$active_minutes), sum_sedentary_minutes = sum(dailyActivity_merged$sedentary_minutes))
Convert horizontal value into a vertical value
amount_active_sedentary <- gather(amount_active_sedentary, key = "Category", value = "Value", sum_active_minutes, sum_sedentary_minutes)
Make bar plot for active minutes and sedentary minutes
ggplot(amount_active_sedentary, aes(x = Category, y = Value, fill = Category)) +
geom_bar(stat = "identity") +
labs(title="Sum of Active Vs Sedentary Minutes")
The result shows a large gap between total sedentary minutes and total active minutes, which means users spend most of their time inactive or less active.
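To put a number on that gap, the two means from the summary() output can be turned into a share. This is a small sketch with variable names of my own; because both sums run over the same 940 rows, the ratio of the means equals the ratio of the sums plotted above.

```r
# Means taken from the summary() output above
mean_sedentary <- 991.2
mean_active <- 227.5

# Share of tracked minutes that are sedentary, and sedentary hours per day
sedentary_share <- mean_sedentary / (mean_sedentary + mean_active)
sedentary_hours <- mean_sedentary / 60
print(round(sedentary_share, 3))
print(round(sedentary_hours, 1))
```

With these inputs, roughly 81% of tracked minutes are sedentary, or about 16.5 hours per day on average.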
Find the trend between active minutes and calories
ggplot(data=dailyActivity_merged,aes(x=active_minutes,y=calories))+
geom_point(color="blue") +
geom_smooth(color="red") +
labs(title="Active Minutes Vs Calories")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The plot shows that calories burned tend to increase with active minutes.
Find the trend between total steps and total distance
ggplot(data=dailyActivity_merged,aes(x=total_steps,y=total_distance))+
geom_point() +
labs(title="Total Steps Vs Total Distance")
Based on the result above, total steps have a positive correlation with total distance. The more steps you take, the more distance you cover.
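The strength of this relationship can be checked with cor(). The sketch below uses only the six rows visible in the head() output, so it illustrates the call rather than giving a result for the full table; on the real data the same call would be cor(dailyActivity_merged$total_steps, dailyActivity_merged$total_distance).

```r
# First six rows of total_steps and total_distance from the head() output
total_steps <- c(13162, 10735, 10460, 9762, 12669, 9705)
total_distance <- c(8.5, 6.97, 6.74, 6.28, 8.16, 6.48)

# Pearson correlation: values near 1 mean a strong positive linear relationship
r <- cor(total_steps, total_distance)
print(r)
```

A coefficient close to 1 is what we would expect here, since distance is largely derived from steps.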
Categories of steps
ggplot(dailyActivity_merged, aes(x = Category, fill = Category)) +
geom_bar() +
labs(title = "Categories of Steps")
As mentioned earlier, I categorized the number of steps to see the behavior of Bellabeat’s users. In the result, most active ranks first, followed by active, with least active last. That means Bellabeat users fall into the most active category more often than the least active one.
Categories of steps per-ID
ggplot(dailyActivity_merged, aes(x = Category, fill = id)) +
geom_bar(position = position_dodge()) +
facet_wrap(~id) +
labs(title="Categories of Active per ID")
This plot shows the activity categories for each user, and we can see that the pattern varies from user to user.
B. hourlyIntensities Data
Statistical summary
print(summary(hourlyIntensities_merged))
## id activity_hour total_intensity
## Min. :1.504e+09 Min. :2016-04-12 00:00:00.00 Min. : 0.00
## 1st Qu.:2.320e+09 1st Qu.:2016-04-19 01:00:00.00 1st Qu.: 0.00
## Median :4.445e+09 Median :2016-04-26 06:00:00.00 Median : 3.00
## Mean :4.848e+09 Mean :2016-04-26 11:46:42.58 Mean : 12.04
## 3rd Qu.:6.962e+09 3rd Qu.:2016-05-03 19:00:00.00 3rd Qu.: 16.00
## Max. :8.878e+09 Max. :2016-05-12 15:00:00.00 Max. :180.00
## average_intensity time date
## Min. :0.0000 Length:22099 Length:22099
## 1st Qu.:0.0000 Class :character Class :character
## Median :0.0500 Mode :character Mode :character
## Mean :0.2006
## 3rd Qu.:0.2667
## Max. :3.0000
Average total intensities Vs Time
int_new <- hourlyIntensities_merged %>%
group_by(time) %>%
drop_na() %>%
summarise(mean_total_int = mean(total_intensity))
ggplot(data=int_new, aes(x=time, y=mean_total_int)) + geom_col(fill='black') +
theme(axis.text.x = element_text(angle = 90)) +
labs(title="Average Total Intensity vs. Time")
The plot shows two peaks in intensity. The first is between 12:00 - 14:00, when I assume users take a break and go out for lunch. The second is between 17:00 - 19:00, when I assume users have finished work and are heading home. Hence, both peaks fall at times when users have just finished a stretch of work that took a lot of energy and concentration.
NB: we need to do further analysis
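One simple step toward that further analysis is to rank the hours by mean intensity instead of reading the peaks off the plot. The sketch below uses a tiny data frame with invented values (they are hypothetical and not taken from the dataset) just to show the mechanics with base R; the names toy, mean_by_hour, and peak_hours are mine.

```r
# Hypothetical hourly records (values invented for illustration only)
toy <- data.frame(
  time = c("08:00:00", "12:00:00", "18:00:00", "08:00:00", "12:00:00", "18:00:00"),
  total_intensity = c(10, 20, 24, 14, 18, 26)
)

# Mean intensity per hour, then the two hours with the highest means
mean_by_hour <- tapply(toy$total_intensity, toy$time, mean)
peak_hours <- names(mean_by_hour)[order(mean_by_hour, decreasing = TRUE)][1:2]
print(mean_by_hour)
print(peak_hours)
```

On the real data, the same ranking could be applied to int_new, which already holds the mean total intensity for each hour.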
ACT
Conclusion
The most useful table to analyze is dailyActivity_merged. The step categories show that Bellabeat has more most-active users than least-active ones. Although sedentary minutes are higher than active minutes, calories burned increase with total steps and total active minutes. Users' intensity peaks between 12:00 - 14:00, when people usually break for lunch after half a day of work, and between 17:00 - 19:00, when people usually finish work and go home after spending energy all day.
Suggestions
I suggest the marketing team use the “Categories of Active per ID” result to identify our potential customers (users).
I suggest Bellabeat add a feature that tracks users’ location history; with location tracking we can better understand our customers’ activity. When I made the Average Total Intensity vs. Time report, I could only assume what customers were doing based on what people commonly do each day. Some say 17:00 - 19:00 is when users go to the gym, but we have no clear evidence, so I highly recommend a location tracking feature (while, of course, respecting users’ privacy).
Because Bellabeat focuses on women, we can extend menstrual cycle tracking with a feature (or product) that reminds users when they are overworked or overly active during their periods.
Give Bellabeat customers daily, weekly, and annual dashboard reports. Daily: we can give them a report about their total steps and calories. Weekly: we can give them a report about total steps, calories, their peak time hour, categories of active, etc. Annual: we can give them a summary report of their health and give them some recommendations for a healthy life based on the report.
The most important thing is to be a “best friend” to our customers, always reminding them about their health. There is no better approach to building customer loyalty than a psychological one.