Bellabeat main website corporate image
This work is a case study from the eighth course, “Google Data Analytics Capstone”, of the “Google Data Analytics” certificate program.
Although it is not the first time I have performed a data analysis, both at the academic and the professional level, it is the first time I have followed the methodology proposed in the study program, using the R programming language and the database query language SQL.
Under normal circumstances, this work would not have been possible with the data we had from the start, but I prioritized putting the skills learned into practice and completing this case study in a relatively short period of time, less than a week.
Bellabeat is a small, successful high-tech company that makes health products for women. Its executives believe that analyzing usage data from competitors' smart devices could help unlock growth opportunities. Our task is to find insights about user behavior in that data and make suggestions.
To find new opportunities to grow the business, we will analyze competitor smart-device usage data to gain insights into how the devices are used, then apply these insights to one Bellabeat product and make recommendations.
The dataset we are advised to start from is public. It is a set of data on consumption habits, collected through “Amazon Mechanical Turk” between 10 April and 12 May 2016, in which the respondents (30 selected) agreed to share the data from their wearable devices (biometrics: minute-level output for physical activity, heart rate, and sleep monitoring) for prospective study purposes.
The information is stored in long format, although some specific tables are arranged in wide format. Notably, the most important tables, those that aggregate the information over larger time intervals (such as “dailyActivity_merged”), are configured in long format with respect to the date.
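As an illustration of the two layouts, a wide minute-level table can be reshaped into the long layout with tidyr. This is only a sketch: it assumes the wide Fitbit tables name their per-minute columns Calories00 through Calories59, and it operates on the fb_minuteCaloriesWide frame loaded below.
library(tidyr)
#Sketch (column names assumed): one column per minute becomes one row per minute
fb_minuteCaloriesLong <- pivot_longer(
  fb_minuteCaloriesWide,
  cols = starts_with("Calories"),
  names_to = "minute",
  names_prefix = "Calories",
  values_to = "Calories"
)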
The data does not meet the ROCCC criteria (Reliable, Original, Comprehensive, Current, Cited). The information is not reliable, since no attributes beyond user ID numbers are provided, so we cannot tell whether the information carries some kind of bias. For example, we do not know the gender of the users; the survey may have been answered only by men.
If so, what would be the point of applying the discoveries made here to a smart device designed for women? Likewise, we do not know the ethnicity, the nationality, and, most importantly, the age of the respondents.
Regarding the dataset's currency, the data is not current: it dates from 2016, six years ago. Where technology is concerned, six years is prehistory.
Finally, and not least, the information is not original: the dataset was reworked before being published on the “Kaggle” platform.
For all these reasons, we cannot consider the information entirely reliable.
Regarding data integrity, the datasets are in .csv format and meet the integrity requirements with a fair level of confidence. This is unsurprising, since the datasets were obtained from a platform whose members are passionate about data science. Even so, we confirmed their integrity by analyzing the datasets with several R functions.
Although the data is clearly compromised, we can still draw some conclusions that can help us meet our goals.
Under normal circumstances, a meeting with the stakeholders would have to be held. It would be necessary for them to agree to carry out their own survey and to provide primary data and information, that is, data in the company's own possession.
Also, if it did not exceed the scope and requirements of this work, I would propose incorporating other open data, such as this Apple dataset: Apple Watch and Fitbit data, which is much more complete and better aligned with the ROCCC criteria.
fb_dailyAct <- read.csv("fb_data/dailyActivity_merged.csv")
fb_dailyCal <- read.csv("fb_data/dailyCalories_merged.csv")
fb_dailyInt <- read.csv("fb_data/dailyIntensities_merged.csv")
fb_dailySteps <- read.csv("fb_data/dailySteps_merged.csv")
fb_heartrate_sec <- read.csv("fb_data/heartrate_seconds_merged.csv")
fb_hourlyCal <- read.csv("fb_data/hourlyCalories_merged.csv")
fb_hourlyInt <- read.csv("fb_data/hourlyIntensities_merged.csv")
fb_hourlySteps <- read.csv("fb_data/hourlySteps_merged.csv")
fb_minuteCaloriesNarrow <- read.csv("fb_data/minuteCaloriesNarrow_merged.csv")
fb_minuteCaloriesWide <- read.csv("fb_data/minuteCaloriesWide_merged.csv")
fb_minuteIntensitiesNarrow <- read.csv("fb_data/minuteIntensitiesNarrow_merged.csv")
fb_minuteIntensitiesWide <- read.csv("fb_data/minuteIntensitiesWide_merged.csv")
fb_minuteSleep <- read.csv("fb_data/minuteSleep_merged.csv")
fb_minuteStepsNarrow <- read.csv("fb_data/minuteStepsNarrow_merged.csv")
fb_minuteStepsWide <- read.csv("fb_data/minuteStepsWide_merged.csv")
fb_sleepDay <- read.csv("fb_data/sleepDay_merged.csv")
fb_weightLogInfo <- read.csv("fb_data/weightLogInfo_merged.csv")
fb_minuteMETsNarrow <- read.csv("fb_data/minuteMETsNarrow_merged.csv")
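As a quick sanity check on the imports, we can count the rows and the distinct users in the main table. A minimal sketch in base R (the Kaggle set is commonly reported to contain 33 distinct Ids despite the 30 announced respondents):
#Sanity check: rows and distinct users in the daily activity table
nrow(fb_dailyAct)
length(unique(fb_dailyAct$Id))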
Since the heart-rate table is relevant to the analysis, and since its size is considerable, we decided to work with this data from BigQuery, combining R and SQL, while also building some visualizations in Tableau:
library(DBI)
con <- dbConnect(odbc::odbc(), "Bellabeat", timeout = 10)
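As an alternative to the ODBC DSN, the connection could also be opened directly with the bigrquery package. A sketch, assuming the project and dataset names that appear in the SQL queries below:
#Sketch: direct BigQuery connection through bigrquery's DBI backend
library(bigrquery)
con <- dbConnect(
  bigrquery::bigquery(),
  project = "bellabeat-356005",
  dataset = "Bellabeat",
  billing = "bellabeat-356005"
)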
We will need to load an additional library:
library(RMySQL)
Loading the “heartrate_seconds_merged.csv” dataset into the RStudio environment from BigQuery:
fb_heartrate_sec <- dbReadTable(con, "fb_heartrate_sec")
As a result, we obtain this table:
head(fb_heartrate_sec)
## int64_field_0 Id Value time date
## 1 154299 2026352035 106 09:37:30 2016-04-25
## 2 154300 2026352035 108 09:37:35 2016-04-25
## 3 154326 2026352035 107 09:41:50 2016-04-25
## 4 154327 2026352035 108 09:41:55 2016-04-25
## 5 154328 2026352035 108 09:42:10 2016-04-25
## 6 154329 2026352035 107 09:42:25 2016-04-25
Finally, we have all the packages we need to work with R in combination with datasets hosted in BigQuery and to use SQL.
Loading the R libraries needed in our RStudio environment:
library("rmarkdown")
library("tidyr")
library("tibble")
library("ggplot2")
library("skimr")
library("tibble")
library("janitor")
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library("kableExtra")
## Warning in !is.null(rmarkdown::metadata$output) && rmarkdown::metadata$output
## %in% : 'length(x) = 3 > 1' in coercion to 'logical(1)'
library("dplyr")
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:kableExtra':
##
## group_rows
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ readr 2.1.2 ✔ stringr 1.4.0
## ✔ purrr 0.3.4 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::group_rows() masks kableExtra::group_rows()
## ✖ dplyr::lag() masks stats::lag()
As a summary of the visualization and full review of all the datasets, we show the most relevant results from the tables that group the data over a wide interval (daily activity) as a sample.
head(fb_dailyAct)
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 4/12/2016 13162 8.50 8.50
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 1.88 0.55
## 2 0 1.57 0.69
## 3 0 2.44 0.40
## 4 0 2.14 1.26
## 5 0 2.71 0.41
## 6 0 3.19 0.78
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 6.06 0 25
## 2 4.71 0 21
## 3 3.91 0 30
## 4 2.83 0 29
## 5 5.04 0 36
## 6 2.51 0 38
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 13 328 728 1985
## 2 19 217 776 1797
## 3 11 181 1218 1776
## 4 34 209 726 1745
## 5 10 221 773 1863
## 6 20 164 539 1728
skim_without_charts("fb_dailyAct")
| Name | "fb_dailyAct" |
| Number of rows | 1 |
| Number of columns | 1 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| data | 0 | 1 | 11 | 11 | 0 | 1 | 0 |
summarise(fb_dailyAct)
## data frame with 0 columns and 1 row
We decided to discard the datasets structured in small time intervals (minutes), since their data is redundant with the datasets built over broad time periods (daily grouped data). We therefore keep the “dailyActivity”, “sleepDay”, and “weightLogInfo” tables, which we will combine into a single table (fb_final_daily).
We also keep the “heartrate_seconds” table, as it is relevant to the research; we will aggregate its heart-rate values and create an additional variable with the average.
#Hourly Intensity
fb_hourlyInt$ActivityHour=as.POSIXct(fb_hourlyInt$ActivityHour, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
#New time variable
fb_hourlyInt$time <- format(fb_hourlyInt$ActivityHour, format = "%H:%M:%S")
#New date variable
fb_hourlyInt$date <- format(fb_hourlyInt$ActivityHour, format = "%m/%d/%Y")
#Erasing duplicates
fb_hourlyInt$ActivityHour <- NULL
#backup
write.csv(fb_hourlyInt, file= "fb_data/hourlyIntensities_merged2.csv")
#Hourly Calories format fixing
fb_hourlyCal$ActivityHour=as.POSIXct(fb_hourlyCal$ActivityHour, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
fb_hourlyCal$time <- format(fb_hourlyCal$ActivityHour, format = "%H:%M:%S")
fb_hourlyCal$date <- format(fb_hourlyCal$ActivityHour, format = "%m/%d/%Y")
fb_hourlyCal$ActivityHour <- NULL
write.csv(fb_hourlyCal, file= "fb_data/hourlyCalories_merged2.csv")
#Fixing date format in fb_dailyAct
fb_dailyAct$ActivityDate=as.POSIXct(fb_dailyAct$ActivityDate, format="%m/%d/%Y", tz=Sys.timezone())
fb_dailyAct$date <- format(fb_dailyAct$ActivityDate, format = "%m/%d/%Y")
fb_dailyAct$ActivityDate <- NULL
write.csv(fb_dailyAct, file= "fb_data/dailyActivity_merged2.csv")
#Fixing date format data in fb_sleepDay
fb_sleepDay$SleepDay=as.POSIXct(fb_sleepDay$SleepDay, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
fb_sleepDay$date <- format(fb_sleepDay$SleepDay, format = "%m/%d/%Y")
fb_sleepDay$SleepDay <- NULL
write.csv(fb_sleepDay, file= "fb_data/sleepDay_merged2.csv")
#Fixing date format in fb_heartrate_sec
fb_heartrate_sec$Time=as.POSIXct(fb_heartrate_sec$Time, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
fb_heartrate_sec$time <- format(fb_heartrate_sec$Time, format = "%H:%M:%S")
fb_heartrate_sec$date <- format(fb_heartrate_sec$Time, format = "%m/%d/%Y")
fb_heartrate_sec$Time <- NULL
write.csv(fb_heartrate_sec, file= "fb_data/heartrate_seconds_merged2.csv")
#Fixing date format in fb_hourlySteps
fb_hourlySteps$ActivityHour=as.POSIXct(fb_hourlySteps$ActivityHour, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
fb_hourlySteps$time <- format(fb_hourlySteps$ActivityHour, format = "%H:%M:%S")
fb_hourlySteps$date <- format(fb_hourlySteps$ActivityHour, format = "%m/%d/%Y")
fb_hourlySteps$ActivityHour <- NULL
write.csv(fb_hourlySteps, file= "fb_data/hourlySteps_merged2.csv")
#Fixing date format in fb_minuteCaloriesNarrow
fb_minuteCaloriesNarrow$ActivityMinute=as.POSIXct(fb_minuteCaloriesNarrow$ActivityMinute, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
fb_minuteCaloriesNarrow$time <- format(fb_minuteCaloriesNarrow$ActivityMinute, format = "%H:%M:%S")
fb_minuteCaloriesNarrow$date <- format(fb_minuteCaloriesNarrow$ActivityMinute, format = "%m/%d/%y")
fb_minuteCaloriesNarrow$ActivityMinute <- NULL
write.csv(fb_minuteCaloriesNarrow, file= "fb_data/minuteCaloriesNarrow_merged2.csv")
#Fixing date format in fb_minuteCaloriesWide
fb_minuteCaloriesWide$ActivityHour=as.POSIXct(fb_minuteCaloriesWide$ActivityHour, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
fb_minuteCaloriesWide$time <- format(fb_minuteCaloriesWide$ActivityHour, format = "%H:%M:%S")
fb_minuteCaloriesWide$date <- format(fb_minuteCaloriesWide$ActivityHour, format = "%m/%d/%y")
fb_minuteCaloriesWide$ActivityHour <- NULL
write.csv(fb_minuteCaloriesWide, file= "fb_data/minuteCaloriesWide_merged2.csv")
#Fixing date format fb_minuteIntensitiesNarrow
fb_minuteIntensitiesNarrow$ActivityMinute=as.POSIXct(fb_minuteIntensitiesNarrow$ActivityMinute, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
fb_minuteIntensitiesNarrow$time <- format(fb_minuteIntensitiesNarrow$ActivityMinute, format = "%H:%M:%S")
fb_minuteIntensitiesNarrow$date <- format(fb_minuteIntensitiesNarrow$ActivityMinute, format = "%m/%d/%y")
fb_minuteIntensitiesNarrow$ActivityMinute <- NULL
write.csv(fb_minuteIntensitiesNarrow, file= "fb_data/minuteIntensitiesNarrow_merged2.csv")
#Fixing date format in fb_minuteIntensitiesWide
fb_minuteIntensitiesWide$ActivityHour=as.POSIXct(fb_minuteIntensitiesWide$ActivityHour, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
fb_minuteIntensitiesWide$time <- format(fb_minuteIntensitiesWide$ActivityHour, format = "%H:%M:%S")
fb_minuteIntensitiesWide$date <- format(fb_minuteIntensitiesWide$ActivityHour, format = "%m/%d/%y")
fb_minuteIntensitiesWide$ActivityHour <- NULL
write.csv(fb_minuteIntensitiesWide, file= "fb_data/minuteIntensitiesWide_merged2.csv")
#Fixing date format in fb_minuteSleep
fb_minuteSleep$date=as.POSIXct(fb_minuteSleep$date, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
fb_minuteSleep$time <- format(fb_minuteSleep$date, format = "%H:%M:%S")
fb_minuteSleep$datev_2 <- format(fb_minuteSleep$date, format = "%m/%d/%Y")
fb_minuteSleep$date <- NULL
write.csv(fb_minuteSleep, file= "fb_data/minuteSleep_merged2.csv")
#Fixing date format in fb_minuteStepsNarrow
fb_minuteStepsNarrow$ActivityMinute=as.POSIXct(fb_minuteStepsNarrow$ActivityMinute, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
fb_minuteStepsNarrow$time <- format(fb_minuteStepsNarrow$ActivityMinute, format = "%H:%M:%S")
fb_minuteStepsNarrow$date <- format(fb_minuteStepsNarrow$ActivityMinute, format = "%m/%d/%y")
fb_minuteStepsNarrow$ActivityMinute <- NULL
write.csv(fb_minuteStepsNarrow, file= "fb_data/minuteStepsNarrow_merged2.csv")
#Fixing date format in fb_minuteStepsWide
fb_minuteStepsWide$ActivityHour=as.POSIXct(fb_minuteStepsWide$ActivityHour, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
fb_minuteStepsWide$time <- format(fb_minuteStepsWide$ActivityHour, format = "%H:%M:%S")
fb_minuteStepsWide$date <- format(fb_minuteStepsWide$ActivityHour, format = "%m/%d/%y")
fb_minuteStepsWide$ActivityHour <- NULL
write.csv(fb_minuteStepsWide, file= "fb_data/minuteStepsWide_merged2.csv")
#Fixing date format in fb_weightLogInfo
fb_weightLogInfo$Date=as.POSIXct(fb_weightLogInfo$Date, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
fb_weightLogInfo$time <- format(fb_weightLogInfo$Date, format = "%H:%M:%S")
fb_weightLogInfo$date <- format(fb_weightLogInfo$Date, format = "%m/%d/%Y")
fb_weightLogInfo$Date <- NULL
#Factorizing 'IsManualReport' and dropping unnecessary columns in 'fb_weightLogInfo'
fb_weightLogInfo <- fb_weightLogInfo %>%
select(-LogId) %>%
mutate(IsManualReport = as.factor(IsManualReport))
write.csv(fb_weightLogInfo, file= "fb_data/WeightLogInfo_merged2.csv")
#Fixing date format in fb_minuteMETsNarrow
fb_minuteMETsNarrow$ActivityMinute=as.POSIXct(fb_minuteMETsNarrow$ActivityMinute, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
fb_minuteMETsNarrow$time <- format(fb_minuteMETsNarrow$ActivityMinute, format = "%H:%M:%S")
fb_minuteMETsNarrow$date <- format(fb_minuteMETsNarrow$ActivityMinute, format = "%m/%d/%y")
fb_minuteMETsNarrow$ActivityMinute <- NULL
write.csv(fb_minuteMETsNarrow, file= "fb_data/minuteMETsNarrow_merged2.csv")
#fb_dailyCal backup
write.csv(fb_dailyCal, file= "fb_data/dailyCalories_merged2.csv")
#fb_dailyInt backup
write.csv(fb_dailyInt, file= "fb_data/dailyIntensities_merged2.csv")
#fb_dailySteps backup
write.csv(fb_dailySteps, file= "fb_data/dailySteps_merged2.csv")
#Renaming variables for uniformity
fb_dailyCal <- fb_dailyCal %>%
mutate(date = ActivityDay) %>%
select(-ActivityDay)
fb_dailyInt <- fb_dailyInt %>%
mutate(date = ActivityDay) %>%
select(-ActivityDay)
fb_dailySteps <- fb_dailySteps %>%
mutate(date = ActivityDay) %>%
select(-ActivityDay)
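The split-timestamp-and-backup pattern above repeats for every table; a small helper could factor it out. This is a sketch with a hypothetical function name (split_timestamp), shown with the hourly-intensity block as the usage example:
#Sketch of a helper: split a timestamp column into 'time' and 'date',
#drop the original column, and write a backup CSV
split_timestamp <- function(df, col, out_file,
                            fmt = "%m/%d/%Y %I:%M:%S %p") {
  df[[col]] <- as.POSIXct(df[[col]], format = fmt, tz = Sys.timezone())
  df$time <- format(df[[col]], format = "%H:%M:%S")
  df$date <- format(df[[col]], format = "%m/%d/%Y")
  df[[col]] <- NULL
  write.csv(df, file = out_file)
  df
}
#Equivalent to the hourly-intensity block above:
fb_hourlyInt <- split_timestamp(fb_hourlyInt, "ActivityHour",
                                "fb_data/hourlyIntensities_merged2.csv")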
We will combine three datasets (“fb_dailyAct”, “fb_sleepDay”, “fb_weightLogInfo”), after having cleaned and reviewed each of them and ensured that they contain variables of the same type and name, so that they are compatible and can be merged without problems:
fb_final_daily <- merge(merge(fb_dailyAct, fb_sleepDay, by= c('Id','date'), all = TRUE ), fb_weightLogInfo, by= c('Id','date'), all = TRUE)
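An equivalent way to write this three-way full outer join, as a sketch using dplyr and purrr (both already loaded with the tidyverse):
#Sketch: the same merge as a reduce over full joins
fb_final_daily <- purrr::reduce(
  list(fb_dailyAct, fb_sleepDay, fb_weightLogInfo),
  dplyr::full_join,
  by = c("Id", "date")
)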
Thus, we obtain the “fb_final_daily” dataset, from which we can work more comfortably and adequately, and of which we create a backup:
write.csv(fb_final_daily, file= "fb_data/fb_final_daily.csv")
fb_final_daily <- read.csv("fb_data/fb_final_daily.csv")
As we said before, due to size issues, we imported the “heartrate_seconds_merged.csv” table into BigQuery and then loaded it into the RStudio workspace:
fb_heartrate_sec <- dbReadTable(con, "fb_heartrate_sec")
obtaining the following table:
Figure 1: Heart-rate table in BigQuery
We need to obtain the average heart rate per user and day and to clean the variables, so we use SQL to perform these tasks:
-- !preview conn=con
SELECT
date AS ymd,
Id,
ROUND(AVG(Value),2) AS Heartrate
FROM `bellabeat-356005.Bellabeat.fb_heartrate_sec`
GROUP BY
date, Id
ORDER BY
Id
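For reference, the same daily aggregation could be expressed in R with dplyr instead of SQL. A sketch against the fb_heartrate_sec frame loaded earlier:
#Sketch: daily average heart rate per user, mirroring the SQL query
fb_heartrateAvg_r <- fb_heartrate_sec %>%
  group_by(ymd = date, Id) %>%
  summarise(Heartrate = round(mean(Value), 2), .groups = "drop") %>%
  arrange(Id)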
We save the result as a new BigQuery table, “c_fb_heartrateAvg”, and then load it into the RStudio environment:
fb_heartrateAvg <- dbReadTable(con, "c_fb_heartrateAvg")
obtaining the following table:
head(fb_heartrateAvg)
## ymd Id Heartrate
## 1 2016-04-12 2022484408 75.80
## 2 2016-04-13 2022484408 80.34
## 3 2016-04-14 2022484408 72.63
## 4 2016-04-15 2022484408 80.44
## 5 2016-04-16 2022484408 75.96
## 6 2016-04-17 2022484408 83.92
clean_names(fb_final_daily)
As a snapshot of the resulting process:
Figure 2: “clean_names” function summary
glimpse(fb_heartrateAvg)
## Rows: 334
## Columns: 3
## $ ymd <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-04-15, 2016-04-16,…
## $ Id <int64> 2022484408, 2022484408, 2022484408, 2022484408, 2022484408…
## $ Heartrate <dbl> 75.80, 80.34, 72.63, 80.44, 75.96, 83.92, 82.71, 81.95, 83.4…
head(fb_final_daily)
## X Id date TotalSteps TotalDistance VeryActiveDistance
## 1 1 1503960366 04/12/2016 13162 8.50 1.88
## 2 2 1503960366 04/13/2016 10735 6.97 1.57
## 3 3 1503960366 04/14/2016 10460 6.74 2.44
## 4 4 1503960366 04/15/2016 9762 6.28 2.14
## 5 5 1503960366 04/16/2016 12669 8.16 2.71
## 6 6 1503960366 04/17/2016 9705 6.48 3.19
## ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
## 1 0.55 6.06 0
## 2 0.69 4.71 0
## 3 0.40 3.91 0
## 4 1.26 2.83 0
## 5 0.41 5.04 0
## 6 0.78 2.51 0
## VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
## 1 25 13 328 728
## 2 21 19 217 776
## 3 30 11 181 1218
## 4 29 34 209 726
## 5 36 10 221 773
## 6 38 20 164 539
## Calories TotalSleepRecords TotalMinutesAsleep TotalTimeInBed WeightKg Fat BMI
## 1 1985 1 327 346 NA NA NA
## 2 1797 2 384 407 NA NA NA
## 3 1776 NA NA NA NA NA NA
## 4 1745 1 412 442 NA NA NA
## 5 1863 2 340 367 NA NA NA
## 6 1728 1 700 712 NA NA NA
## time
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 <NA>
skim_without_charts(fb_final_daily)
| Name | fb_final_daily |
| Number of rows | 943 |
| Number of columns | 21 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 19 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| date | 0 | 1.00 | 10 | 10 | 0 | 31 | 0 |
| time | 876 | 0.07 | 8 | 8 | 0 | 26 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| X | 0 | 1.00 | 4.720000e+02 | 2.723600e+02 | 1.00000e+00 | 2.365000e+02 | 4.720000e+02 | 7.075000e+02 | 9.430000e+02 |
| Id | 0 | 1.00 | 4.858486e+09 | 2.423712e+09 | 1.50396e+09 | 2.320127e+09 | 4.445115e+09 | 6.962181e+09 | 8.877689e+09 |
| TotalSteps | 0 | 1.00 | 7.652190e+03 | 5.086530e+03 | 0.00000e+00 | 3.795000e+03 | 7.439000e+03 | 1.073400e+04 | 3.601900e+04 |
| TotalDistance | 0 | 1.00 | 5.500000e+00 | 3.930000e+00 | 0.00000e+00 | 2.620000e+00 | 5.260000e+00 | 7.720000e+00 | 2.803000e+01 |
| VeryActiveDistance | 0 | 1.00 | 1.500000e+00 | 2.660000e+00 | 0.00000e+00 | 0.000000e+00 | 2.200000e-01 | 2.060000e+00 | 2.192000e+01 |
| ModeratelyActiveDistance | 0 | 1.00 | 5.700000e-01 | 8.800000e-01 | 0.00000e+00 | 0.000000e+00 | 2.400000e-01 | 8.100000e-01 | 6.480000e+00 |
| LightActiveDistance | 0 | 1.00 | 3.350000e+00 | 2.050000e+00 | 0.00000e+00 | 1.950000e+00 | 3.380000e+00 | 4.790000e+00 | 1.071000e+01 |
| SedentaryActiveDistance | 0 | 1.00 | 0.000000e+00 | 1.000000e-02 | 0.00000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.100000e-01 |
| VeryActiveMinutes | 0 | 1.00 | 2.124000e+01 | 3.295000e+01 | 0.00000e+00 | 0.000000e+00 | 4.000000e+00 | 3.200000e+01 | 2.100000e+02 |
| FairlyActiveMinutes | 0 | 1.00 | 1.363000e+01 | 2.000000e+01 | 0.00000e+00 | 0.000000e+00 | 7.000000e+00 | 1.900000e+01 | 1.430000e+02 |
| LightlyActiveMinutes | 0 | 1.00 | 1.930300e+02 | 1.093100e+02 | 0.00000e+00 | 1.270000e+02 | 1.990000e+02 | 2.640000e+02 | 5.180000e+02 |
| SedentaryMinutes | 0 | 1.00 | 9.903500e+02 | 3.012600e+02 | 0.00000e+00 | 7.290000e+02 | 1.057000e+03 | 1.229000e+03 | 1.440000e+03 |
| Calories | 0 | 1.00 | 2.307510e+03 | 7.208200e+02 | 0.00000e+00 | 1.829500e+03 | 2.140000e+03 | 2.796500e+03 | 4.900000e+03 |
| TotalSleepRecords | 530 | 0.44 | 1.120000e+00 | 3.500000e-01 | 1.00000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 3.000000e+00 |
| TotalMinutesAsleep | 530 | 0.44 | 4.194700e+02 | 1.183400e+02 | 5.80000e+01 | 3.610000e+02 | 4.330000e+02 | 4.900000e+02 | 7.960000e+02 |
| TotalTimeInBed | 530 | 0.44 | 4.586400e+02 | 1.271000e+02 | 6.10000e+01 | 4.030000e+02 | 4.630000e+02 | 5.260000e+02 | 9.610000e+02 |
| WeightKg | 876 | 0.07 | 7.204000e+01 | 1.392000e+01 | 5.26000e+01 | 6.140000e+01 | 6.250000e+01 | 8.505000e+01 | 1.335000e+02 |
| Fat | 941 | 0.00 | 2.350000e+01 | 2.120000e+00 | 2.20000e+01 | 2.275000e+01 | 2.350000e+01 | 2.425000e+01 | 2.500000e+01 |
| BMI | 876 | 0.07 | 2.519000e+01 | 3.070000e+00 | 2.14500e+01 | 2.396000e+01 | 2.439000e+01 | 2.556000e+01 | 4.754000e+01 |
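The summary also exposes heavy missingness: TotalSleepRecords, TotalMinutesAsleep, and TotalTimeInBed are absent in 530 of the 943 rows, and WeightKg and BMI in 876, so any weight-related conclusion rests on very few observations.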
#Dropping helper and redundant columns
fb_final_daily$X <- NULL
fb_final_daily$LoggedActivitiesDistance <- NULL
fb_final_daily$TrackerDistance <- NULL
fb_final_daily$IsManualReport <- NULL
fb_final_daily$Date <- NULL
fb_final_daily$WeightPounds <- NULL
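The same clean-up can be written in a single step. An equivalent sketch; any_of() tolerates columns that are already absent:
#Sketch: dropping the redundant columns with one select()
fb_final_daily <- fb_final_daily %>%
  select(-any_of(c("X", "LoggedActivitiesDistance", "TrackerDistance",
                   "IsManualReport", "Date", "WeightPounds")))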
To summarize the state of affairs at the start of this section: we have produced two tables with which we will proceed with the analysis: “fb_final_daily” and “fb_heartrateAvg”.
Before starting the analysis, it should be mentioned that this section contains insights and ideas from MIGUEL FZZZdesign.
First, we need to set a theme for the plots:
custom_theme_original <- function() {
theme(
panel.border = element_rect(colour = "black",
fill = NA,
linetype = 1),
panel.background = element_rect(fill = "white",
color = 'grey50'),
panel.grid.minor.y = element_blank(),
axis.text = element_text(colour = "blue",
face = "italic",
family = "Arial"),
axis.title = element_text(colour = "gray",
family = "Arial"),
axis.ticks = element_line(colour = "blue"),
plot.title = element_text(size=20,
hjust = 0.5,
family = "Arial"),
plot.subtitle=element_text(size=13,
hjust = 0.5),
plot.caption = element_text(colour = "brown",
                                face = "italic",
                                family = "Arial")
)
}
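Although the plots below style themselves inline, the custom theme defined above can be appended to any of them, for example:
#Example: applying the custom theme to a plot
ggplot(fb_heartrateAvg, aes(x = ymd, y = Heartrate)) +
  geom_point() +
  custom_theme_original()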
fb_heartrateAvg %>%
group_by(Id) %>%
ggplot(aes(x=ymd, y=Heartrate, color=Heartrate)) +
geom_point(alpha=0.3, position = position_jitter())+
geom_smooth()+
labs(title = "Daily heartrate average", subtitle= "Daily distribution with mean scores", x= "Date", y="Heart Rate", caption = "Plot 1")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
You can display an interactive plot by clicking [here](https://public.tableau.com/app/profile/anbamo/viz/BellabeatInsightsfromFitbitHeartRate-Date/Physiological)
fb_final_daily %>%
group_by(TotalDistance, Calories) %>%
ggplot(aes(x = TotalSteps, y = Calories, color = Calories)) +
geom_point(alpha=0.3, position = position_jitter()) +
geom_smooth() +
theme(legend.position = c(.8, .3),
legend.spacing.y = unit(1, "mm"),
panel.border = element_rect(colour = "black", fill=NA),
legend.background = element_blank(),
legend.box.background = element_rect(colour = "black")) +
labs(title = 'Calories burned by steps',
y = 'Calories',
x = 'Total Steps',
caption = 'Plot 2')
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Pearson correlation coefficient:
cor.test(fb_final_daily$TotalDistance, fb_final_daily$Calories, method = 'pearson', conf.level = 0.95)
##
## Pearson's product-moment correlation
##
## data: fb_final_daily$TotalDistance and fb_final_daily$Calories
## t = 26.002, df = 941, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6078539 0.6822785
## sample estimates:
## cor
## 0.6466023
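With a coefficient of about 0.65 and p < 2.2e-16, there is a moderately strong, statistically significant positive correlation between total distance and calories burned.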
You can display an interactive plot by clicking here
Plot 3: Daily Activity
Plot 4: Exercise Intensity
You can display an interactive plot by clicking here
fb_final_daily %>%
select(TotalMinutesAsleep) %>%
drop_na() %>%
mutate(sleep_quality = ifelse(TotalMinutesAsleep <= 420, 'Less than 7h',
ifelse(TotalMinutesAsleep <= 540, '7h to 9h',
'More than 9h'))) %>%
mutate(sleep_quality = factor(sleep_quality,
levels = c('Less than 7h','7h to 9h',
'More than 9h'))) %>%
ggplot(aes(x = TotalMinutesAsleep, fill = sleep_quality)) +
geom_histogram(position = 'dodge', bins = 30) +
scale_fill_manual(values=c("tan1", "#66CC99", "lightcoral")) +
theme(legend.position = c(.80, .80),
legend.title = element_blank(),
legend.spacing.y = unit(0, "mm"),
panel.border = element_rect(colour = "black", fill=NA),
legend.background = element_blank(),
legend.box.background = element_rect(colour = "black")) +
labs(
title = "Sleep distribution",
x = "Time slept (minutes)",
y = "Count",
caption = 'Plot 5'
)
fb_final_daily%>%
select(Id, TotalDistance, TotalMinutesAsleep) %>%
group_by(Id) %>%
summarise_all(list(~mean(., na.rm=TRUE))) %>%
drop_na() %>%
mutate(Id = factor(Id)) %>%
ggplot() +
geom_bar(aes(x = Id, y = TotalDistance), stat = "identity", fill = 'lightblue', alpha = 0.7) +
geom_point(aes(x = Id, y = TotalMinutesAsleep/60), color = 'gold4') +
geom_segment(aes(x = Id, xend = Id, y = 0, yend = TotalMinutesAsleep/60), color = 'gold4' ,group = 1) +
scale_y_continuous(limits=c(0, 12), name = "Total Distance",
sec.axis = sec_axis(~.*60, name = "Sleep in minutes")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(axis.title.y.right = element_text(color = "gold4"),
axis.ticks.y.right = element_line(color = "gold4"),
axis.text.y.right = element_text(color = "gold4")) +
labs(
title = "Distance vs sleep",
x = "Users",
caption = 'Plot 6'
)