Bellabeat is a high-tech manufacturer of health-focused products for women. It is a successful small company with the potential to become a major player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help the company uncover new growth opportunities.
In this phase I ask the right questions to understand the business task and identify the key stakeholders on the project.
Gain insights into how consumers use non-Bellabeat smart devices.
Here I gather the dataset to be used and assess its source, security, credibility, and integrity.
By verifying the metadata of our dataset, we can confirm that it is in the public domain (CC0). The owner has dedicated the work to the public domain by waiving all of his or her rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent permitted by law. You may copy, modify, distribute, and perform the work without asking permission.
These datasets were generated by respondents to a distributed survey via Amazon Mechanical Turk between March 12 and May 12, 2016. Thirty (30) Fitbit users agreed to submit personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. The variation in output reflects the use of different Fitbit trackers and individual tracking behaviors/preferences.
This Kaggle dataset contains personal fitness tracker data from thirty Fitbit users. It includes daily activity, steps, and heart rate data that can be used to investigate users’ habits.
In this phase we will carry out data cleaning and formatting tasks to ensure the variables are consistent and ready for analysis and visualization.
Setting up my R environment by loading the ‘tidyverse’ and other needed packages
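If any of these packages are not yet installed, a one-time install.packages() call takes care of it (a minimal sketch; run once before loading the libraries):
install.packages(c("tidyverse", "scales")) # One-time installation of the packages used below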
library(tidyverse) # Data import and wrangling
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggplot2) # For data visualization
library(dplyr) # For data manipulation
library(tidyr) # For tidying and reshaping data
library(scales) # For transforming numbers in percentage
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
getwd() # Displays the working directory
## [1] "C:/Users/Ola/Documents"
There are 18 CSV files in the dataset. Each of them contains data related to the device’s various functions: calories, activity level, daily steps, and so on.
To simplify the analysis, we will concentrate on daily data in this study.
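Before importing anything, a quick way to confirm the 18 files is to list the contents of the data folder (a minimal sketch, assuming the same folder path used in the read_csv() calls below):
list.files("Fitabase Data 4.12.16-5.12.16", pattern = "\\.csv$") # Lists all CSV files in the dataset folder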
daily_activity <- read_csv("Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(daily_activity)
daily_calories <- read_csv("Fitabase Data 4.12.16-5.12.16/dailyCalories_merged.csv")
## Rows: 940 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, Calories
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
daily_intensities <- read_csv("Fitabase Data 4.12.16-5.12.16/dailyIntensities_merged.csv")
## Rows: 940 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (9): Id, SedentaryMinutes, LightlyActiveMinutes, FairlyActiveMinutes, Ve...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
daily_steps <- read_csv("Fitabase Data 4.12.16-5.12.16/dailySteps_merged.csv")
## Rows: 940 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, StepTotal
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
daily_sleep <- read_csv("Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
weight_info <- read_csv("Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")
## Rows: 67 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
## lgl (1): IsManualReport
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
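As the readr messages above suggest, the column-specification output can be silenced by setting show_col_types = FALSE; for example (an optional tweak that simply re-reads the same file, not part of the original pipeline):
daily_activity <- read_csv("Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv",
                           show_col_types = FALSE) # Same import, without the column-spec message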
Let’s have a look at the various datasets to get a clear understanding of how they are structured and the similarities and overlap between them.
View(daily_activity)
View(daily_calories)
View(daily_intensities)
View(daily_steps)
View(daily_sleep)
View(weight_info)
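Note that View() only works in an interactive session; in a plain script, head() and dplyr::glimpse() give a similar quick look (a minimal sketch):
head(daily_activity) # First rows of the daily activity table
glimpse(daily_sleep) # Column names, types and sample values for the sleep table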
After examining the various datasets, we can conclude that daily_activity already contains the information in daily_calories, daily_steps, and daily_intensities, and that these four daily tables have the same number of observations (940 rows each). As a result, the three redundant data frames will be removed.
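One way to back this up before dropping the tables is a quick equality check between the overlapping columns (a sketch that assumes the rows are in the same order in each file):
all.equal(daily_activity$Calories, daily_calories$Calories) # Should be TRUE if daily_calories is redundant
all.equal(daily_activity$TotalSteps, daily_steps$StepTotal) # Should be TRUE if daily_steps is redundant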
rm(daily_calories, daily_intensities, daily_steps) # Removing the redundant tables
Before merging the datasets, let’s clean the date columns to make them homogeneous and convert them to the right data type.
# Cleaning the variables
daily_activity <- daily_activity %>%
  rename(Date = ActivityDate) %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%Y"))
daily_sleep <- daily_sleep %>%
  rename(Date = SleepDay) %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%Y"))
weight_info <- weight_info %>%
  select(-LogId) %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
  mutate(IsManualReport = as.factor(IsManualReport))
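After the conversion, it is worth confirming that the dates parsed into the expected April to May 2016 window (a quick sanity check, not part of the original pipeline):
range(daily_activity$Date) # Earliest and latest activity dates
range(daily_sleep$Date) # Earliest and latest sleep dates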
final_data <- merge(merge(daily_activity, daily_sleep, by=c('Id','Date'), all = TRUE), weight_info, by = c('Id','Date'), all = TRUE)
View(final_data)
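For readers who prefer tidyverse syntax, the same two-step merge can be written with full_join(), which, like merge(..., all = TRUE), keeps all rows from both sides (a sketch equivalent to the line above, not used further):
final_data_dplyr <- daily_activity %>%
  full_join(daily_sleep, by = c("Id", "Date")) %>% # Keep all activity and sleep rows
  full_join(weight_info, by = c("Id", "Date")) # Then attach the weight log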
final_data <- final_data %>%
select(-c(TrackerDistance, LoggedActivitiesDistance, TotalSleepRecords, WeightPounds, Fat, BMI, IsManualReport))
View(final_data)
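Before summarizing, it can also help to confirm the size of the merged table and how many distinct participants it covers (a minimal sketch; outputs omitted here):
dim(final_data) # Number of rows and columns after the merge
n_distinct(final_data$Id) # Number of distinct participants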
str(final_data)
## 'data.frame': 943 obs. of 16 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ Date : Date, format: "2016-04-12" "2016-04-13" ...
## $ TotalSteps : num 13162 10735 10460 9762 12669 ...
## $ TotalDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ VeryActiveDistance : num 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : num 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : num 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : num 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : num 728 776 1218 726 773 ...
## $ Calories : num 1985 1797 1776 1745 1863 ...
## $ TotalMinutesAsleep : num 327 384 NA 412 340 700 NA 304 360 325 ...
## $ TotalTimeInBed : num 346 407 NA 442 367 712 NA 320 377 364 ...
## $ WeightKg : num NA NA NA NA NA NA NA NA NA NA ...
We can see that the majority of the variables are numeric.
summary(final_data)
## Id Date TotalSteps TotalDistance
## Min. :1.504e+09 Min. :2016-04-12 Min. : 0 Min. : 0.000
## 1st Qu.:2.320e+09 1st Qu.:2016-04-19 1st Qu.: 3795 1st Qu.: 2.620
## Median :4.445e+09 Median :2016-04-26 Median : 7439 Median : 5.260
## Mean :4.858e+09 Mean :2016-04-26 Mean : 7652 Mean : 5.503
## 3rd Qu.:6.962e+09 3rd Qu.:2016-05-04 3rd Qu.:10734 3rd Qu.: 7.720
## Max. :8.878e+09 Max. :2016-05-12 Max. :36019 Max. :28.030
##
## VeryActiveDistance ModeratelyActiveDistance LightActiveDistance
## Min. : 0.000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.: 1.950
## Median : 0.220 Median :0.2400 Median : 3.380
## Mean : 1.504 Mean :0.5709 Mean : 3.349
## 3rd Qu.: 2.065 3rd Qu.:0.8050 3rd Qu.: 4.790
## Max. :21.920 Max. :6.4800 Max. :10.710
##
## SedentaryActiveDistance VeryActiveMinutes FairlyActiveMinutes
## Min. :0.000000 Min. : 0.00 Min. : 0.00
## 1st Qu.:0.000000 1st Qu.: 0.00 1st Qu.: 0.00
## Median :0.000000 Median : 4.00 Median : 7.00
## Mean :0.001601 Mean : 21.24 Mean : 13.63
## 3rd Qu.:0.000000 3rd Qu.: 32.00 3rd Qu.: 19.00
## Max. :0.110000 Max. :210.00 Max. :143.00
##
## LightlyActiveMinutes SedentaryMinutes Calories TotalMinutesAsleep
## Min. : 0 Min. : 0.0 Min. : 0 Min. : 58.0
## 1st Qu.:127 1st Qu.: 729.0 1st Qu.:1830 1st Qu.:361.0
## Median :199 Median :1057.0 Median :2140 Median :433.0
## Mean :193 Mean : 990.4 Mean :2308 Mean :419.5
## 3rd Qu.:264 3rd Qu.:1229.0 3rd Qu.:2796 3rd Qu.:490.0
## Max. :518 Max. :1440.0 Max. :4900 Max. :796.0
## NA's :530
## TotalTimeInBed WeightKg
## Min. : 61.0 Min. : 52.60
## 1st Qu.:403.0 1st Qu.: 61.40
## Median :463.0 Median : 62.50
## Mean :458.6 Mean : 72.04
## 3rd Qu.:526.0 3rd Qu.: 85.05
## Max. :961.0 Max. :133.50
## NA's :530 NA's :876
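The NA counts in the summary (530 missing sleep values and 876 missing weight values) can also be pulled out directly per column (a small sketch):
colSums(is.na(final_data)) # Missing values in each column of the merged table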
Finally, in this phase I share my recommendations and conclusions with the stakeholders, based on the findings of the analysis.
Thank you very much!
Special thanks to Miguel Fzzz for his contribution to the open-source community; his published case studies helped guide the approach used in this analysis.