Bellabeat is a high-tech company that describes itself as ‘the go-to wellness brand for women with an ecosystem of products and services focused on women’s health’. It manufactures health-focused smart products that collect data on activity, sleep, stress, hydration and reproductive health, with the goal of helping women understand their health and habits they were previously unaware of.
Since it was founded in 2013, Bellabeat has grown rapidly and positioned itself as a tech-driven wellness company for women. By 2016, Bellabeat had opened offices around the world and launched multiple products. Its products became available through a growing number of online retailers in addition to the e-commerce channel on its own website. The company has invested in traditional advertising media, such as radio, out-of-home billboards, print, and television, but focuses extensively on digital marketing. Bellabeat invests year-round in Google Search, maintains active Facebook and Instagram pages, and consistently engages consumers on Twitter. Additionally, Bellabeat runs video ads on YouTube and display ads on the Google Display Network to support campaigns around key marketing dates.
· Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.
· Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.
· Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.
· Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.
· Bellabeat membership: Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.
It is envisaged that focusing on one Bellabeat product and analyzing the FitBit Fitness Tracker Data will help key stakeholders at Bellabeat gain insights into how people already use their smart devices and reveal further opportunities for growth.
To analyze these data, answer the key business questions and make recommendations, I will follow the six steps of the data analysis process: Ask, Prepare, Process, Analyze, Share and Act.
a. Business task
Analyze data from selected users of FitBit smart devices to draw useful insights into trends, patterns and relationships between health parameters, identify potential opportunities for growth, and make high-level recommendations on marketing strategy to the Bellabeat marketing department.
b. Questions
What are some trends in smart device usage?
How could these trends apply to Bellabeat customers?
How could these trends help influence Bellabeat marketing strategy?
c. Key Stakeholders:
· Urška Sršen — Bellabeat’s co-founder and Chief Creative Officer
· Sando Mur — Mathematician and Bellabeat’s co-founder;
· Bellabeat marketing analytics team — A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy.
a. Data Source
For this analysis, Urška Sršen, Bellabeat’s Chief Creative Officer, approved the use of a public dataset that explores smart device users’ daily habits: the FitBit Fitness Tracker Data.
The FitBit Fitness Tracker Data is a public domain dataset made available by Möbius under a CC0 (Public Domain) license. The dataset comprises 18 .csv files containing personal fitness tracker data from thirty (30) FitBit users who consented to submit their personal data, including heart rate, sleep, intensity, physical activity and other data needed to assess their habits.
b. Data Assessment for credibility & integrity
To assess the credibility, reliability and integrity of the dataset, I will use the ROCCC (Reliable, Original, Comprehensive, Current and Cited) test.
Reliability: (LOW) Only 30 individuals took part in the survey. This is a very small sample on which to base far-reaching analysis and recommendations for the business task.
Originality: (LOW) The data comes from a third-party survey distributed via Amazon Mechanical Turk.
Comprehensive: (MEDIUM) The data covers the parameters required for Bellabeat’s business task.
Current: (LOW) The dataset was collected in 2016 (over 8 years ago) and covers a short period between March and May 2016. In my opinion, the data is somewhat stale given how much health-tracking methods have improved since then. Moreover, a two-month collection window is too short for such highly dynamic data.
Cited: (MEDIUM) The third-party dataset was made available by Möbius via Kaggle.
The data also has some limitations: it gives no information on key participant characteristics such as gender, age, location or lifestyle.
c. Data Selection
The 18 files were first opened in Excel for a preliminary review, with filtering and sorting used to check for blanks, inconsistent naming conventions, missing data and possible duplication of data.
From the review, I observed that the data in the daily_calories, daily_intensities and daily_steps data frames is already contained in the daily_activity data frame. These three data frames are therefore treated as already represented and left out of the analysis to avoid duplication.
I also noticed many empty cells in the Fat column of the weight log file, so the Fat column was removed.
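The containment observation above can also be checked programmatically rather than by eye. The sketch below is only an illustration: `activity_toy` and `calories_toy` are tiny hypothetical data frames with illustrative values standing in for the real files, and the column names `ActivityDate` and `ActivityDay` are assumed to match those in the FitBit files. It uses base R only.

```r
# Hypothetical miniature versions of two of the files (illustrative values).
activity_toy <- data.frame(
  Id = c(1, 1, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016"),
  TotalSteps = c(13162, 10735, 9705),
  Calories = c(1985, 1797, 1728)
)
calories_toy <- data.frame(
  Id = c(1, 1, 2),
  ActivityDay = c("4/12/2016", "4/13/2016", "4/12/2016"),
  Calories = c(1985, 1797, 1728)
)

# Build one comparable key per row: user, day and value.
key_activity <- paste(activity_toy$Id, activity_toy$ActivityDate, activity_toy$Calories)
key_calories <- paste(calories_toy$Id, calories_toy$ActivityDay, calories_toy$Calories)

# TRUE only if every calories row also appears in the activity table.
calories_contained <- all(key_calories %in% key_activity)
calories_contained
```

The same pattern could be repeated for the steps and intensities files by swapping in their value columns.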
d. RStudio Cloud: Installation and Loading of packages
I will use RStudio Cloud for this analysis because of its wide range of functionality for data manipulation, cleaning, analysis and visualization. The key packages needed for the analysis must first be installed.
# Install all required packages in one call
# (dplyr, ggplot2, readr, forcats and tidyr are also part of tidyverse)
install.packages(c("tidyverse", "janitor", "dplyr", "ggplot2", "readr", "forcats", "scales", "lubridate", "geosphere", "plotrix", "here", "skimr", "tidyr"))
These installed packages are then loaded to use their functionalities.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(dplyr)
library(ggplot2)
library(readr)
library(forcats)
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
library(lubridate)
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(geosphere)
library(plotrix)
##
## Attaching package: 'plotrix'
##
## The following object is masked from 'package:scales':
##
## rescale
library(here)
## here() starts at /cloud/project
library(skimr)
library(tidyr)
e. Data Importation
The datasets are now imported into RStudio and given simplified names using the assignment operator.
daily_activity <- read_csv("dailyActivity.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
daily_calories <- read_csv("dailyCalories.csv")
## Rows: 940 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, Calories
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
daily_intensities <- read_csv("dailyIntensities.csv")
## Rows: 940 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (9): Id, SedentaryMinutes, LightlyActiveMinutes, FairlyActiveMinutes, Ve...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
daily_steps <- read_csv("dailySteps.csv")
## Rows: 940 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, StepTotal
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
heart_rate <- read_csv("heartrate_seconds.csv")
## Rows: 1048575 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Time
## dbl (2): Id, Value
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
daily_sleep <- read_csv("sleepDay.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
weight_log_data <- read_csv("weightLogInfo.csv")
## Rows: 67 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (5): Id, WeightKg, WeightPounds, BMI, LogId
## lgl (1): IsManualReport
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
minute_METs <- read_csv("minuteMETsNarrow.csv")
## Rows: 1048575 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityMinute
## dbl (2): Id, METs
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
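Two of the imports above (heart_rate and minute_METs) report exactly 1,048,575 rows. An Excel worksheet holds at most 1,048,576 rows (2^20) including the header, so these files may have been truncated when they were opened and saved during the Excel review; this is an inference from the row counts, not something the source confirms. The sketch below flags row counts that sit exactly at that ceiling. The helper name `at_excel_limit` is hypothetical.

```r
# Excel worksheets hold at most 2^20 = 1,048,576 rows, header included.
excel_max_rows <- 2^20

# Hypothetical helper: TRUE when a row count sits exactly at the Excel ceiling
# (one row is assumed to be taken up by the header).
at_excel_limit <- function(n_rows) {
  n_rows == excel_max_rows - 1
}

# Row counts as reported by read_csv() above.
row_counts <- c(daily_activity = 940, heart_rate = 1048575,
                daily_sleep = 413, minute_METs = 1048575)

# Names of the data frames worth re-checking against the original files.
flagged <- names(row_counts)[at_excel_limit(row_counts)]
flagged
```

If a file is flagged, re-importing the untouched download would be one way to confirm whether rows were lost.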
Here, we explore the data frames to find areas of commonality and confirm that the data were imported correctly. Functions such as head(), colnames(), glimpse() and str() can be used for this.
head(daily_activity)
## # A tibble: 6 × 15
## Id Activ…¹ Total…² Total…³ Track…⁴ Logge…⁵ VeryA…⁶ Moder…⁷ Light…⁸ Seden…⁹
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2… 13162 8.5 8.5 0 1.88 0.550 6.06 0
## 2 1.50e9 4/13/2… 10735 6.97 6.97 0 1.57 0.690 4.71 0
## 3 1.50e9 4/14/2… 10460 6.74 6.74 0 2.44 0.400 3.91 0
## 4 1.50e9 4/15/2… 9762 6.28 6.28 0 2.14 1.26 2.83 0
## 5 1.50e9 4/16/2… 12669 8.16 8.16 0 2.71 0.410 5.04 0
## 6 1.50e9 4/17/2… 9705 6.48 6.48 0 3.19 0.780 2.51 0
## # … with 5 more variables: VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## # LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>, and
## # abbreviated variable names ¹ActivityDate, ²TotalSteps, ³TotalDistance,
## # ⁴TrackerDistance, ⁵LoggedActivitiesDistance, ⁶VeryActiveDistance,
## # ⁷ModeratelyActiveDistance, ⁸LightActiveDistance, ⁹SedentaryActiveDistance
colnames(daily_activity)
## [1] "Id" "ActivityDate"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories"
glimpse(daily_activity)
## Rows: 940
## Columns: 15
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ TotalSteps <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
head(heart_rate)
## # A tibble: 6 × 3
## Id Time Value
## <dbl> <chr> <dbl>
## 1 2022484408 4/12/2016 7:21 97
## 2 2022484408 4/12/2016 7:21 102
## 3 2022484408 4/12/2016 7:21 105
## 4 2022484408 4/12/2016 7:21 103
## 5 2022484408 4/12/2016 7:21 101
## 6 2022484408 4/12/2016 7:22 95
colnames(heart_rate)
## [1] "Id" "Time" "Value"
head(daily_sleep)
## # A tibble: 6 × 5
## Id SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 0:00 1 327 346
## 2 1503960366 4/13/2016 0:00 2 384 407
## 3 1503960366 4/15/2016 0:00 1 412 442
## 4 1503960366 4/16/2016 0:00 2 340 367
## 5 1503960366 4/17/2016 0:00 1 700 712
## 6 1503960366 4/19/2016 0:00 1 304 320
colnames(daily_sleep)
## [1] "Id" "SleepDay" "TotalSleepRecords"
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
head(weight_log_data)
## # A tibble: 6 × 7
## Id Date WeightKg WeightPounds BMI IsManualReport LogId
## <dbl> <chr> <dbl> <dbl> <dbl> <lgl> <dbl>
## 1 1503960366 5/2/2016 23:59 52.6 116. 22.6 TRUE 1.46e12
## 2 1503960366 5/3/2016 23:59 52.6 116. 22.6 TRUE 1.46e12
## 3 1927972279 4/13/2016 1:08 134. 294. 47.5 FALSE 1.46e12
## 4 2873212765 4/21/2016 23:59 56.7 125. 21.5 TRUE 1.46e12
## 5 2873212765 5/12/2016 23:59 57.3 126. 21.7 TRUE 1.46e12
## 6 4319703577 4/17/2016 23:59 72.4 160. 27.5 TRUE 1.46e12
colnames(weight_log_data)
## [1] "Id" "Date" "WeightKg" "WeightPounds"
## [5] "BMI" "IsManualReport" "LogId"
head(minute_METs)
## # A tibble: 6 × 3
## Id ActivityMinute METs
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 0:00 10
## 2 1503960366 4/12/2016 0:01 10
## 3 1503960366 4/12/2016 0:02 10
## 4 1503960366 4/12/2016 0:03 10
## 5 1503960366 4/12/2016 0:04 10
## 6 1503960366 4/12/2016 0:05 12
colnames(minute_METs)
## [1] "Id" "ActivityMinute" "METs"
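The previews above show that every date column was imported as character (`<chr>`). A minimal sketch of converting such values to proper date types in base R follows; the sample strings are copied from the previews, and `tz = "UTC"` is an assumption, since the source does not state a time zone.

```r
# Sample values copied from the previews above.
activity_date <- c("4/12/2016", "4/13/2016")
heart_time    <- c("4/12/2016 7:21", "4/12/2016 7:22")

# Date-only column: month/day/year.
activity_date_parsed <- as.Date(activity_date, format = "%m/%d/%Y")

# Date-time column; the time zone is an assumption, not given in the source.
heart_time_parsed <- as.POSIXct(heart_time, format = "%m/%d/%Y %H:%M", tz = "UTC")

class(activity_date_parsed)
format(heart_time_parsed, "%Y-%m-%d %H:%M")
```

Once parsed, the columns can be sorted, filtered by range and joined across data frames without string comparison surprises.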
We then run a quick summary of the data frames using the skim_without_charts() function, which provides a broader overview of each data frame.
skim_without_charts(daily_activity)
| Name | daily_activity |
| Number of rows | 940 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 14 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ActivityDate | 0 | 1 | 8 | 9 | 0 | 31 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| Id | 0 | 1 | 4.855407e+09 | 2.424805e+09 | 1503960366 | 2.320127e+09 | 4.445115e+09 | 6.962181e+09 | 8.877689e+09 |
| TotalSteps | 0 | 1 | 7.637910e+03 | 5.087150e+03 | 0 | 3.789750e+03 | 7.405500e+03 | 1.072700e+04 | 3.601900e+04 |
| TotalDistance | 0 | 1 | 5.490000e+00 | 3.920000e+00 | 0 | 2.620000e+00 | 5.240000e+00 | 7.710000e+00 | 2.803000e+01 |
| TrackerDistance | 0 | 1 | 5.480000e+00 | 3.910000e+00 | 0 | 2.620000e+00 | 5.240000e+00 | 7.710000e+00 | 2.803000e+01 |
| LoggedActivitiesDistance | 0 | 1 | 1.100000e-01 | 6.200000e-01 | 0 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 4.940000e+00 |
| VeryActiveDistance | 0 | 1 | 1.500000e+00 | 2.660000e+00 | 0 | 0.000000e+00 | 2.100000e-01 | 2.050000e+00 | 2.192000e+01 |
| ModeratelyActiveDistance | 0 | 1 | 5.700000e-01 | 8.800000e-01 | 0 | 0.000000e+00 | 2.400000e-01 | 8.000000e-01 | 6.480000e+00 |
| LightActiveDistance | 0 | 1 | 3.340000e+00 | 2.040000e+00 | 0 | 1.950000e+00 | 3.360000e+00 | 4.780000e+00 | 1.071000e+01 |
| SedentaryActiveDistance | 0 | 1 | 0.000000e+00 | 1.000000e-02 | 0 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.100000e-01 |
| VeryActiveMinutes | 0 | 1 | 2.116000e+01 | 3.284000e+01 | 0 | 0.000000e+00 | 4.000000e+00 | 3.200000e+01 | 2.100000e+02 |
| FairlyActiveMinutes | 0 | 1 | 1.356000e+01 | 1.999000e+01 | 0 | 0.000000e+00 | 6.000000e+00 | 1.900000e+01 | 1.430000e+02 |
| LightlyActiveMinutes | 0 | 1 | 1.928100e+02 | 1.091700e+02 | 0 | 1.270000e+02 | 1.990000e+02 | 2.640000e+02 | 5.180000e+02 |
| SedentaryMinutes | 0 | 1 | 9.912100e+02 | 3.012700e+02 | 0 | 7.297500e+02 | 1.057500e+03 | 1.229500e+03 | 1.440000e+03 |
| Calories | 0 | 1 | 2.303610e+03 | 7.181700e+02 | 0 | 1.828500e+03 | 2.134000e+03 | 2.793250e+03 | 4.900000e+03 |
skim_without_charts(daily_sleep)
| Name | daily_sleep |
| Number of rows | 413 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| SleepDay | 0 | 1 | 13 | 14 | 0 | 31 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| Id | 0 | 1 | 5.000979e+09 | 2.06036e+09 | 1503960366 | 3977333714 | 4702921684 | 6962181067 | 8792009665 |
| TotalSleepRecords | 0 | 1 | 1.120000e+00 | 3.50000e-01 | 1 | 1 | 1 | 1 | 3 |
| TotalMinutesAsleep | 0 | 1 | 4.194700e+02 | 1.18340e+02 | 58 | 361 | 433 | 490 | 796 |
| TotalTimeInBed | 0 | 1 | 4.586400e+02 | 1.27100e+02 | 61 | 403 | 463 | 526 | 961 |
skim_without_charts(weight_log_data)
| Name | weight_log_data |
| Number of rows | 67 |
| Number of columns | 7 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| logical | 1 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Date | 0 | 1 | 13 | 15 | 0 | 56 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| IsManualReport | 0 | 1 | 0.61 | TRU: 41, FAL: 26 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| Id | 0 | 1 | 7.009282e+09 | 1.950322e+09 | 1.50396e+09 | 6.962181e+09 | 6.962181e+09 | 8.877689e+09 | 8.877689e+09 |
| WeightKg | 0 | 1 | 7.204000e+01 | 1.392000e+01 | 5.26000e+01 | 6.140000e+01 | 6.250000e+01 | 8.505000e+01 | 1.335000e+02 |
| WeightPounds | 0 | 1 | 1.588100e+02 | 3.070000e+01 | 1.15960e+02 | 1.353600e+02 | 1.377900e+02 | 1.875000e+02 | 2.943200e+02 |
| BMI | 0 | 1 | 2.519000e+01 | 3.070000e+00 | 2.14500e+01 | 2.396000e+01 | 2.439000e+01 | 2.556000e+01 | 4.754000e+01 |
| LogId | 0 | 1 | 1.460000e+12 | 0.000000e+00 | 1.46000e+12 | 1.460000e+12 | 1.460000e+12 | 1.460000e+12 | 1.460000e+12 |
Running these summaries, we found that:
daily_activity has 940 observations
daily_sleep has 413 observations
weight_log_data has 67 observations
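The summaries above report no missing values, but they do not show whether any rows are exact duplicates, which was one of the checks named in the preliminary review. A sketch of that check in base R is shown below; `sleep_toy` is a hypothetical data frame with one deliberately repeated row, not the real data.

```r
# Hypothetical sleep records; the second and third rows are identical on purpose.
sleep_toy <- data.frame(
  Id = c(1, 2, 2, 3),
  SleepDay = c("4/12/2016 0:00", "4/12/2016 0:00", "4/12/2016 0:00", "4/13/2016 0:00"),
  TotalMinutesAsleep = c(327, 384, 384, 412),
  TotalTimeInBed = c(346, 407, 407, 442)
)

# duplicated() marks every repeat of an earlier row.
n_duplicates <- sum(duplicated(sleep_toy))

# Keep only the first copy of each row.
sleep_clean <- sleep_toy[!duplicated(sleep_toy), ]

n_duplicates
nrow(sleep_clean)
```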
Explore the number of users in each dataset using the common key, Id.
n_distinct(daily_activity$Id)
## [1] 33
This gives 33 distinct IDs, three more than the 30 participants stated in the dataset description; some users may have been recorded under more than one ID.
n_distinct(daily_sleep$Id)
## [1] 24
This gives 24 IDs, so sleep data is missing for 6 of the 30 stated participants (9 of the 33 IDs found in daily_activity).
n_distinct(weight_log_data$Id)
## [1] 8
Only 8 IDs logged weight data; 22 of the 30 stated participants (25 of the 33 IDs found in daily_activity) have no weight records.
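Beyond counting distinct IDs, it can help to see which users are missing from each table and what share of activity users each table covers. The sketch below uses short hypothetical ID vectors in place of daily_activity$Id, daily_sleep$Id and weight_log_data$Id.

```r
# Hypothetical Id vectors standing in for the Id columns of the three data frames.
activity_ids <- c(101, 101, 102, 103, 104)
sleep_ids    <- c(101, 103, 103)
weight_ids   <- c(104)

# Users tracked for activity but absent from the other tables.
no_sleep  <- setdiff(unique(activity_ids), unique(sleep_ids))
no_weight <- setdiff(unique(activity_ids), unique(weight_ids))

# Share of activity users covered by each table.
coverage <- c(
  sleep  = length(unique(sleep_ids))  / length(unique(activity_ids)),
  weight = length(unique(weight_ids)) / length(unique(activity_ids))
)

no_sleep
coverage
```

With the counts reported above, 24 of 33 IDs (about 73%) have sleep data and 8 of 33 (about 24%) have weight data.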
Explore Average Sleep time & Average Time in Bed
Avg_minutes_asleep <- daily_sleep %>% summarize(avg_sleeptime = mean(TotalMinutesAsleep))
Avg_minutes_asleep
## # A tibble: 1 × 1
## avg_sleeptime
## <dbl>
## 1 419.
Avg_TimeInBed <- daily_sleep %>%
summarize(avg_TimeInBed = mean(TotalTimeInBed))
Avg_TimeInBed
## # A tibble: 1 × 1
## avg_TimeInBed
## <dbl>
## 1 459.
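The above data exploration shows that participants spent, on average, about 40 minutes in bed awake (459 minutes in bed versus 419 minutes asleep). The data do not show whether this time fell before falling asleep or after waking.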
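The gap between time in bed and time asleep can also be computed per record rather than from the two overall means. The sketch below uses only the six rows shown in the head(daily_sleep) output earlier, so its result describes those rows and not the full dataset; the column names `MinutesAwakeInBed` and `SleepEfficiency` are hypothetical additions.

```r
# First six rows of daily_sleep, copied from the head() output earlier.
sleep_head <- data.frame(
  TotalMinutesAsleep = c(327, 384, 412, 340, 700, 304),
  TotalTimeInBed     = c(346, 407, 442, 367, 712, 320)
)

# Minutes spent in bed but not asleep, per record.
sleep_head$MinutesAwakeInBed <- sleep_head$TotalTimeInBed - sleep_head$TotalMinutesAsleep

# Share of time in bed actually spent asleep.
sleep_head$SleepEfficiency <- sleep_head$TotalMinutesAsleep / sleep_head$TotalTimeInBed

mean(sleep_head$MinutesAwakeInBed)
range(sleep_head$SleepEfficiency)
```

A per-record column like this makes it possible to compare users or weekdays instead of relying on one overall average.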
I converted the ActivityDate column of the daily_activity data set to days of the week (Monday to Sunday).
daily_activity <- daily_activity %>%
mutate(weekday1 = weekdays(as.Date(ActivityDate, "%m/%d/%Y")))
glimpse(daily_activity)
## Rows: 940
## Columns: 16
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ TotalSteps <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
## $ weekday1 <chr> "Tuesday", "Wednesday", "Thursday", "Friday",…
daily_activity$weekday1 <- ordered(daily_activity$weekday1, levels=c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))
activity_data <- daily_activity %>%
group_by(weekday1) %>%
summarize(count_of = n())
glimpse(activity_data)
## Rows: 7
## Columns: 2
## $ weekday1 <ord> Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday
## $ count_of <int> 120, 152, 150, 147, 126, 124, 121
ggplot(activity_data, aes(x=weekday1, y=count_of)) +
geom_bar(stat="identity",color="black",fill="#b75dab") +
labs(title="Tracker user count across the week", x="Day of the week", y="Count") +
geom_label(aes(label=count_of),color="black")
The visualization above shows that more tracker records were captured on Tuesday, Wednesday and Thursday.
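Part of this difference may come from the calendar rather than from user behaviour. The sketch below assumes that the 31 unique activity dates are consecutive and start on 4/12/2016 (a Tuesday, as the weekday1 column shows); that assumption is not verified in the source. It counts how often each weekday occurs in such a window and divides the record counts from activity_data by that number.

```r
# Assumption: the 31 unique dates are consecutive, starting 4/12/2016.
days <- seq(as.Date("2016-04-12"), by = "day", length.out = 31)

# ISO weekday number (1 = Monday ... 7 = Sunday); independent of locale.
iso_day <- as.integer(format(days, "%u"))
days_per_weekday <- tabulate(iso_day, nbins = 7)
names(days_per_weekday) <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                             "Friday", "Saturday", "Sunday")

# Record counts per weekday, copied from activity_data above.
records <- c(Monday = 120, Tuesday = 152, Wednesday = 150, Thursday = 147,
             Friday = 126, Saturday = 124, Sunday = 121)

# Average number of records per calendar day of each weekday.
records_per_day <- records / days_per_weekday
round(records_per_day, 1)
```

Under this assumption, Tuesday, Wednesday and Thursday each occur five times and the other days four times, so the per-day averages fall between roughly 29.4 and 31.5 records and the midweek peak largely disappears.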
ggplot(data=daily_activity,aes(x=TotalSteps,y=SedentaryMinutes, color=Calories)) +
geom_point() +
geom_smooth(method="lm",color="blue") +
labs(title="Total Steps vs. Sedentary Minutes",x="Total Steps",y="Sedentary Minutes")+
scale_color_gradient(low="#ffdca7",high="#422d9e")
## `geom_smooth()` using formula 'y ~ x'
From the visualization above, we can see an inverse relationship between total steps taken and sedentary minutes on any given day.
mean_steps <- mean(daily_activity$TotalSteps)
mean_steps
## [1] 7637.911
mean_calories <- mean(daily_activity$Calories)
mean_calories
## [1] 2303.61
ggplot(data=daily_activity, aes(x=TotalSteps,y=Calories,color=Calories)) +
geom_point() +
labs(title="Calories burned for every step taken",x="Total Steps Taken",y="Calories Burned") +
geom_smooth(method="lm") +
# Reference lines at the overall means
geom_hline(yintercept=mean_calories,color="yellow",lwd=1.0)+
geom_vline(xintercept=mean_steps,color="red",lwd=1.0) +
# annotate() draws each label once instead of once per data row
annotate("text",x=10000,y=500,label="Average Steps",angle=-90) +
annotate("text",x=29000,y=2500,label="Average Calories") +
scale_color_gradient(low="#ffdca7",high="#422d9e")
## `geom_smooth()` using formula 'y ~ x'
The visualization above shows a positive correlation between the steps taken and the calories burnt.
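The strength of a relationship like this can be quantified with a correlation coefficient instead of being read off the plot. The sketch below uses only the first six rows of daily_activity as shown in the glimpse() output earlier; they all belong to one user, so the result illustrates the calculation and says nothing reliable about the full dataset.

```r
# First six rows of daily_activity, copied from the glimpse() output earlier.
activity_head <- data.frame(
  TotalSteps = c(13162, 10735, 10460, 9762, 12669, 9705),
  Calories   = c(1985, 1797, 1776, 1745, 1863, 1728)
)

# Pearson measures linear association; Spearman measures monotonic association.
pearson_r  <- cor(activity_head$TotalSteps, activity_head$Calories, method = "pearson")
spearman_r <- cor(activity_head$TotalSteps, activity_head$Calories, method = "spearman")

c(pearson = pearson_r, spearman = spearman_r)
```

On the full data frame, the same call would be cor(daily_activity$TotalSteps, daily_activity$Calories).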
ggplot(data=daily_sleep, aes(x=TotalMinutesAsleep, y=TotalTimeInBed)) +
# geom_jitter() already draws the points, so a separate geom_point() layer would plot them twice
geom_jitter() +
labs(title="Time Asleep vs. Time in Bed",x="Time Asleep",y="Time in Bed") +
geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
The visualization above shows that total time in bed is positively correlated with total time asleep.
ggplot(data=daily_activity, aes(x=VeryActiveMinutes, y=Calories, color=Calories)) +
geom_point() +
geom_smooth(method="loess",color="blue") +
labs(title="Very Active Minutes vs. Calories",x="Very Active Minutes",y="Calories") +
scale_color_gradient(low="#ffdca7",high="#422d9e")
## `geom_smooth()` using formula 'y ~ x'
The visualization above shows a positive correlation between very active minutes and calories burned.
ggplot(data=daily_activity, aes(x=SedentaryMinutes,y=Calories,color=Calories)) +
geom_point() +
geom_smooth(method="loess",color="blue") +
labs(title="Sedentary Minutes vs. Calories Burned",x="Sedentary Minutes",y="Calories") +
scale_color_gradient(low="#ffdca7",high="#422d9e")
## `geom_smooth()` using formula 'y ~ x'
The visualization above shows a positive relationship at first that then turns negative: beyond a certain point, fewer calories are burned as sedentary minutes increase.
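A relationship that rises and then falls can be summarised by grouping the x variable into bands and comparing the mean of y in each band. The sketch below shows the idea on a small hypothetical data frame; the values in `toy` and the band boundaries (500 and 1000 minutes) are illustrative assumptions, not figures taken from the FitBit data.

```r
# Hypothetical values chosen to rise and then fall; not taken from the dataset.
toy <- data.frame(
  SedentaryMinutes = c(300, 450, 600, 700, 800, 900, 1100, 1200, 1300, 1440),
  Calories         = c(1800, 2100, 2400, 2600, 2700, 2500, 2200, 2000, 1900, 1700)
)

# Split sedentary minutes into three bands; the boundaries are arbitrary.
toy$band <- cut(toy$SedentaryMinutes, breaks = c(0, 500, 1000, 1440),
                include.lowest = TRUE)

# Mean calories within each band.
band_means <- tapply(toy$Calories, toy$band, mean)
band_means
```

Applied to daily_activity, a table like this would show where the turning point lies, which the smoothed curve only suggests visually.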