# Before we begin, load all the required libraries
library(rmarkdown)
library(tidyverse)
library(lubridate)
library(janitor)
library(here)
library(skimr)
library(scales)Introduction
The Chief executive officer of BellaBeat has asked you to focus on a Bellabeat product and analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices. She then wants you to select one Bellabeat product to apply these insights to in your presentation.
Based on the provided information above, our business task is now clear. Use usage data from non-Bellabeat yet similar product. In our case we will use data from Fitbit consumers which is an american consumer electronics and fitness company owned by Google. It’s considered to be one of the largest companies in that market. According to Wikipedia, Fitbit has more than 29 million active users in their community and has sold more than 120 million devices. Their products are sold in 39,000 retail stores and in over 100 countries. Fitbit has a revenue of $1.13 billion in 2020. Thus choosing data from such a company to inform the marketing strategy for Bellabeat products was a really good decision.
Ask
Our main business task is to use these data to gain insights about:
Why consumers selected that company to get their smart devices from
What information those devices provide to them
How can Bellabeat use these data to improve their marketing strategy or even improve their product features
Prepare
For this project we shall use the Fitbit dataset from Kaggle. You may wanna take a look at the metadata for easier interpretation. Also have a quick look to know how the tracker actually works. Now According to the description,
This dataset was generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016 - 05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.
It consists of 18 csv files with data such as dates, times, duration of workouts, amounts of burned calories, number of sleeping hours and body weight info. As mentioned previously, this dataset can be downloaded from Kaggle and I saved a backup copy on my local device. Some data files are stored in wide format and some are stored in the long one. Some files are even stored in both formats.
It’s really important to note here that the CEO, who is our main stakeholder in this case, says that this data set might have some limitations and encourages us to consider adding other data to help address those limitations as we begin to work more with this data.
- Regarding the reliability of the source it’s important to mention that Kaggle doesn’t put strict conditions to ensure the datasets being uploaded meet specific criteria. Personally I’ve uploaded multiple datasets to Kaggle without having to answer any questions about its original source. The dataset uploader howerver states that the source is Zenodo website which in turn states that the owner of the dataset is RTI, an independent, nonprofit institute that provides research, development, and technical services to government and commercial clients worldwide. Hence we may say that the source of this datasets is reliable and the dataset is cited
- We haven’t collected this dataset by ourselves thus it’s a second party data
- This dataset was collected and uploaded about six years ago and later in this analysis we shall explore the dataset and decide whether it contains all the information we need or we have to look for other sources
- Despite Fitibit tracker is a product that’s being sold and used worldwide, This data is collected via Amazon Mechanical Turk where participants from only specific countries are eligible to take the survey and this raise concerns regarding the sampling bias. Other than that, I can’t see any other type of bias exist in the data
- The dataset owners made it an open access type. It’s available for download, copy and use by anyone for free
- Regarding ownership, according to the dataset owners thirty eligible Fitbit users consented to the submission of personal tracker data. However no additional information is provided with the description related to the transaction transparency. Thus we don’t know whether the participants were aware how exactly their data will processed or not
- IDs were used instead of participants names. Also no personally identifiable information (location, contact info, medical records, etc.) are found in the files
- In addition to the sampling bias, Compared to the number of customers, the size of the sample is quite small. These two are the main problems associated with this dataset which appeared so far
Process
We shall use R in both RStudio and Kaggle for this case study because I see R as a more powerful tool than spreadsheets and maybe later we repeat the case study with SQL and compare between the two tools. The dataset has 18 csv files so let’s explore and clean the dataset step by step.
1) Daily Activity
# Step_1
# First modify the dataframe columns names to a more standard format
dailyActivity_merged = read_csv(
"case_study/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv"
)
dailyActivity_merged = clean_names(dailyActivity_merged)
# Running the below code, we can see we have 33 id/person
dailyActivity_merged %>%
count(id)## # A tibble: 33 × 2
## id n
## <dbl> <int>
## 1 1503960366 31
## 2 1624580081 31
## 3 1644430081 30
## 4 1844505072 31
## 5 1927972279 31
## 6 2022484408 31
## 7 2026352035 31
## 8 2320127002 31
## 9 2347167796 18
## 10 2873212765 31
## # … with 23 more rows
As you can see we have a total number of IDs equals 33. But as mentioned earlier we should only have thirty ids. So could it Possible that there are some users with more than one device?!! Well as you can see below for some days we have the whole 33 participants working out on the same day, thus we shall consider we have 33 participants until we get strong valid evidence that proves otherwise. Another important note is that despite the description that says that the duration of study is between 3/12/2016 & 5/12/2016 (62 days), the first day in the dataset is 4/12/2016 thus the total number of days in the analysis is 31 days.
dailyActivity_merged %>%
group_by(activity_date) %>%
summarize(no_of_participants = n_distinct(id))## # A tibble: 31 × 2
## activity_date no_of_participants
## <chr> <int>
## 1 4/12/2016 33
## 2 4/13/2016 33
## 3 4/14/2016 33
## 4 4/15/2016 33
## 5 4/16/2016 32
## 6 4/17/2016 32
## 7 4/18/2016 32
## 8 4/19/2016 32
## 9 4/20/2016 32
## 10 4/21/2016 32
## # … with 21 more rows
# Step_2
# Remove any duplicate entries if exist
dailyActivity_merged = unique(dailyActivity_merged)# Step_3
# Check the data types of each column and modify if necessary
str(dailyActivity_merged)## tibble [940 × 15] (S3: tbl_df/tbl/data.frame)
## $ id : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ activity_date : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ total_steps : num [1:940] 13162 10735 10460 9762 12669 ...
## $ total_distance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ tracker_distance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ logged_activities_distance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ very_active_distance : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
## $ moderately_active_distance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
## $ light_active_distance : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
## $ sedentary_active_distance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ very_active_minutes : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
## $ fairly_active_minutes : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
## $ lightly_active_minutes : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
## $ sedentary_minutes : num [1:940] 728 776 1218 726 773 ...
## $ calories : num [1:940] 1985 1797 1776 1745 1863 ...
dailyActivity_merged$activity_date = mdy(dailyActivity_merged$activity_date)# Step_4
# Take a quick overview of the dataframe and check for any messing values
skim_without_charts(dailyActivity_merged)| Name | dailyActivity_merged |
| Number of rows | 940 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| Date | 1 |
| numeric | 14 |
| ________________________ | |
| Group variables | None |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| activity_date | 0 | 1 | 2016-04-12 | 2016-05-12 | 2016-04-26 | 31 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 4.855407e+09 | 2.424805e+09 | 1503960366 | 2.320127e+09 | 4.445115e+09 | 6.962181e+09 | 8.877689e+09 |
| total_steps | 0 | 1 | 7.637910e+03 | 5.087150e+03 | 0 | 3.789750e+03 | 7.405500e+03 | 1.072700e+04 | 3.601900e+04 |
| total_distance | 0 | 1 | 5.490000e+00 | 3.920000e+00 | 0 | 2.620000e+00 | 5.240000e+00 | 7.710000e+00 | 2.803000e+01 |
| tracker_distance | 0 | 1 | 5.480000e+00 | 3.910000e+00 | 0 | 2.620000e+00 | 5.240000e+00 | 7.710000e+00 | 2.803000e+01 |
| logged_activities_distance | 0 | 1 | 1.100000e-01 | 6.200000e-01 | 0 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 4.940000e+00 |
| very_active_distance | 0 | 1 | 1.500000e+00 | 2.660000e+00 | 0 | 0.000000e+00 | 2.100000e-01 | 2.050000e+00 | 2.192000e+01 |
| moderately_active_distance | 0 | 1 | 5.700000e-01 | 8.800000e-01 | 0 | 0.000000e+00 | 2.400000e-01 | 8.000000e-01 | 6.480000e+00 |
| light_active_distance | 0 | 1 | 3.340000e+00 | 2.040000e+00 | 0 | 1.950000e+00 | 3.360000e+00 | 4.780000e+00 | 1.071000e+01 |
| sedentary_active_distance | 0 | 1 | 0.000000e+00 | 1.000000e-02 | 0 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.100000e-01 |
| very_active_minutes | 0 | 1 | 2.116000e+01 | 3.284000e+01 | 0 | 0.000000e+00 | 4.000000e+00 | 3.200000e+01 | 2.100000e+02 |
| fairly_active_minutes | 0 | 1 | 1.356000e+01 | 1.999000e+01 | 0 | 0.000000e+00 | 6.000000e+00 | 1.900000e+01 | 1.430000e+02 |
| lightly_active_minutes | 0 | 1 | 1.928100e+02 | 1.091700e+02 | 0 | 1.270000e+02 | 1.990000e+02 | 2.640000e+02 | 5.180000e+02 |
| sedentary_minutes | 0 | 1 | 9.912100e+02 | 3.012700e+02 | 0 | 7.297500e+02 | 1.057500e+03 | 1.229500e+03 | 1.440000e+03 |
| calories | 0 | 1 | 2.303610e+03 | 7.181700e+02 | 0 | 1.828500e+03 | 2.134000e+03 | 2.793250e+03 | 4.900000e+03 |
But maybe despite not having any null values we still have rows that all of its values equal zero so let’s check.
# Step_5
# Check for any rows where every column has zero value
filter(dailyActivity_merged, if_all(everything(), ~ . == 0))## # A tibble: 0 × 15
## # … with 15 variables: id <dbl>, activity_date <date>, total_steps <dbl>,
## # total_distance <dbl>, tracker_distance <dbl>,
## # logged_activities_distance <dbl>, very_active_distance <dbl>,
## # moderately_active_distance <dbl>, light_active_distance <dbl>,
## # sedentary_active_distance <dbl>, very_active_minutes <dbl>,
## # fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
## # sedentary_minutes <dbl>, calories <dbl>
# Step_6
# Check for days where there are no input data
input_data = dplyr::select(dailyActivity_merged, !c("id", "activity_date"))
filter(input_data, if_all(everything(), ~ . == 0))## # A tibble: 0 × 13
## # … with 13 variables: total_steps <dbl>, total_distance <dbl>,
## # tracker_distance <dbl>, logged_activities_distance <dbl>,
## # very_active_distance <dbl>, moderately_active_distance <dbl>,
## # light_active_distance <dbl>, sedentary_active_distance <dbl>,
## # very_active_minutes <dbl>, fairly_active_minutes <dbl>,
## # lightly_active_minutes <dbl>, sedentary_minutes <dbl>, calories <dbl>
# Step_7
# Check if there's any IDs entered or entered inocrrectly (more or less than
# 10 numbers)
min(dailyActivity_merged$id)## [1] 1503960366
max(dailyActivity_merged$id)## [1] 8877689391
From the following code, we can see that the value of tracker distance equals the total distance which is the sum of four degrees of active distance ONLY when there isn’t any value entered for the logged activities. However we shall not consider that as a cross-field invalidation and leave the dataframe as it is for now.
dailyActivity_merged %>%
filter(
tracker_distance == rowSums(dailyActivity_merged[, 7:10]) &
total_distance == tracker_distance & logged_activities_distance == 0
)## # A tibble: 306 × 15
## id activity_date total_steps total_distance tracker_distance
## <dbl> <date> <dbl> <dbl> <dbl>
## 1 1503960366 2016-05-07 11992 7.71 7.71
## 2 1503960366 2016-05-12 0 0 0
## 3 1624580081 2016-04-12 8163 5.31 5.31
## 4 1624580081 2016-04-13 7007 4.55 4.55
## 5 1624580081 2016-04-16 5370 3.49 3.49
## 6 1624580081 2016-04-19 2916 1.90 1.90
## 7 1624580081 2016-04-20 4974 3.23 3.23
## 8 1624580081 2016-04-29 2390 1.55 1.55
## 9 1624580081 2016-05-07 2104 1.37 1.37
## 10 1624580081 2016-05-09 1732 1.13 1.13
## # … with 296 more rows, and 10 more variables:
## # logged_activities_distance <dbl>, very_active_distance <dbl>,
## # moderately_active_distance <dbl>, light_active_distance <dbl>,
## # sedentary_active_distance <dbl>, very_active_minutes <dbl>,
## # fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
## # sedentary_minutes <dbl>, calories <dbl>
The dataframes daily calories, daily intensities and daily steps are just a replication of the data in daily activity one. Thus they shall not be considered in this analysis. In general out of the 18 dataframes we’re going to use only six of them as the other 12 have replicate / unnecessary info. That said, let’s proceed to clean the remaining five.
2) Burned Calories per Hour
# Step_1
# First modify the dataframe columns names to a more standard format
hourlyCalories_merged = read_csv(
"case_study/Fitabase Data 4.12.16-5.12.16/hourlyCalories_merged.csv"
)
hourlyCalories_merged = clean_names(hourlyCalories_merged)
# Step_2
# Remove any duplicate entries if exist
hourlyCalories_merged = unique(hourlyCalories_merged)
# Step_3
# Check the data types of each column and modify if necessary
str(hourlyCalories_merged)## tibble [22,099 × 3] (S3: tbl_df/tbl/data.frame)
## $ id : num [1:22099] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ activity_hour: chr [1:22099] "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
## $ calories : num [1:22099] 81 61 59 47 48 48 48 47 68 141 ...
hourlyCalories_merged$activity_hour = mdy_hms(hourlyCalories_merged$activity_hour)
# Step_4
# Take a quick overview the dataframe and check for any missing values
# This is also useful to make sure all dates have been converted correctly
# Otherwise we'll end up having null values
skim_without_charts(hourlyCalories_merged)| Name | hourlyCalories_merged |
| Number of rows | 22099 |
| Number of columns | 3 |
| _______________________ | |
| Column type frequency: | |
| numeric | 2 |
| POSIXct | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 4.848235e+09 | 2.4225e+09 | 1503960366 | 2320127002 | 4445114986 | 6962181067 | 8877689391 |
| calories | 0 | 1 | 9.739000e+01 | 6.0700e+01 | 42 | 63 | 83 | 108 | 948 |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| activity_hour | 0 | 1 | 2016-04-12 | 2016-05-12 15:00:00 | 2016-04-26 06:00:00 | 736 |
# Step_5
# Check for any zero values
filter(hourlyCalories_merged, if_any(everything(), ~ . == 0))## # A tibble: 0 × 3
## # … with 3 variables: id <dbl>, activity_hour <dttm>, calories <dbl>
# Step_6
# Check if there's any IDs entered or entered incorrectly (more or less than
# 10 numbers)
min(hourlyCalories_merged$id)## [1] 1503960366
max(hourlyCalories_merged$id)## [1] 8877689391
# Step_7
# Check for any wrong calories values
hourlyCalories_merged %>%
filter(calories < 0)## # A tibble: 0 × 3
## # … with 3 variables: id <dbl>, activity_hour <dttm>, calories <dbl>
3) Intensities Values per Hour
# Step_1
# First modify the dataframe columns names to a more standard format
hourlyIntensities_merged = read_csv(
"case_study/Fitabase Data 4.12.16-5.12.16/hourlyIntensities_merged.csv"
)
hourlyIntensities_merged = clean_names(hourlyIntensities_merged)
# Step_2
# Remove any duplicate entries if exist
hourlyIntensities_merged = unique(hourlyIntensities_merged)
# Step_3
# Check the data types of each column and modify if necessary
str(hourlyIntensities_merged)## tibble [22,099 × 4] (S3: tbl_df/tbl/data.frame)
## $ id : num [1:22099] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ activity_hour : chr [1:22099] "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
## $ total_intensity : num [1:22099] 20 8 7 0 0 0 0 0 13 30 ...
## $ average_intensity: num [1:22099] 0.333 0.133 0.117 0 0 ...
hourlyIntensities_merged$activity_hour = mdy_hms(hourlyIntensities_merged$activity_hour)
# Step_4
# Take a quick overview the dataframe and check for any missing values
# This is also useful to make sure all dates have been converted correctly
# Otherwise we'll end up having null values
skim_without_charts(hourlyIntensities_merged)| Name | hourlyIntensities_merged |
| Number of rows | 22099 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| numeric | 3 |
| POSIXct | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 4.848235e+09 | 2.4225e+09 | 1503960366 | 2320127002 | 4.445115e+09 | 6.962181e+09 | 8877689391 |
| total_intensity | 0 | 1 | 1.204000e+01 | 2.1130e+01 | 0 | 0 | 3.000000e+00 | 1.600000e+01 | 180 |
| average_intensity | 0 | 1 | 2.000000e-01 | 3.5000e-01 | 0 | 0 | 5.000000e-02 | 2.700000e-01 | 3 |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| activity_hour | 0 | 1 | 2016-04-12 | 2016-05-12 15:00:00 | 2016-04-26 06:00:00 | 736 |
# Step_5
# Check for any row that has all zero values
filter(hourlyIntensities_merged, if_all(everything(), ~ . == 0))## # A tibble: 0 × 4
## # … with 4 variables: id <dbl>, activity_hour <dttm>, total_intensity <dbl>,
## # average_intensity <dbl>
# Step_6
# Check if there's any IDs entered or entered incorrectly (more or less than
# 10 numbers)
min(hourlyIntensities_merged$id)## [1] 1503960366
max(hourlyIntensities_merged$id)## [1] 8877689391
# Step_7
# Check for any wrong intensities values
hourlyIntensities_merged %>%
filter(total_intensity < 0 | average_intensity < 0)## # A tibble: 0 × 4
## # … with 4 variables: id <dbl>, activity_hour <dttm>, total_intensity <dbl>,
## # average_intensity <dbl>
4) Steps per Hour
# Step_1
# First modify the dataframe columns names to a more standard format
hourlySteps_merged = read_csv(
"case_study/Fitabase Data 4.12.16-5.12.16/hourlySteps_merged.csv"
)
hourlySteps_merged = clean_names(hourlySteps_merged)
# Step_2
# Remove any duplicate entries if exist
hourlySteps_merged = unique(hourlySteps_merged)
# Step_3
# Check the data types of each column and modify if necessary
str(hourlySteps_merged)## tibble [22,099 × 3] (S3: tbl_df/tbl/data.frame)
## $ id : num [1:22099] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ activity_hour: chr [1:22099] "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
## $ step_total : num [1:22099] 373 160 151 0 0 ...
hourlySteps_merged$activity_hour = mdy_hms(hourlySteps_merged$activity_hour)
# Step_4
# Take a quick overview the dataframe and check for any messing values
# This is also useful to make sure all dates have been converted correctly
# Otherwise we'll end up having null values
skim_without_charts(hourlySteps_merged)| Name | hourlySteps_merged |
| Number of rows | 22099 |
| Number of columns | 3 |
| _______________________ | |
| Column type frequency: | |
| numeric | 2 |
| POSIXct | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 4.848235e+09 | 2.4225e+09 | 1503960366 | 2320127002 | 4445114986 | 6962181067 | 8877689391 |
| step_total | 0 | 1 | 3.201700e+02 | 6.9038e+02 | 0 | 0 | 40 | 357 | 10554 |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| activity_hour | 0 | 1 | 2016-04-12 | 2016-05-12 15:00:00 | 2016-04-26 06:00:00 | 736 |
# Step_5
# Check for any row that has all zero values
filter(hourlySteps_merged, if_all(everything(), ~ . == 0))## # A tibble: 0 × 3
## # … with 3 variables: id <dbl>, activity_hour <dttm>, step_total <dbl>
# Step_6
# Check if there's any IDs entered or entered incorrectly (more or less than
# 10 numbers)
min(hourlySteps_merged$id)## [1] 1503960366
max(hourlySteps_merged$id)## [1] 8877689391
# Step_7
# Check for any wrong steps values
hourlySteps_merged %>%
filter(step_total < 0)## # A tibble: 0 × 3
## # … with 3 variables: id <dbl>, activity_hour <dttm>, step_total <dbl>
5) Sleep Periods
# Step_1
# First modify the dataframe columns names to a more standard format
sleepDay_merged = read_csv(
"case_study/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv"
)
sleepDay_merged = clean_names(sleepDay_merged)
# Step_2
# Remove any duplicate entries if exist
sleepDay_merged = unique(sleepDay_merged)
# Step_3
# Check the data types of each column and modify if necessary
str(sleepDay_merged)## tibble [410 × 5] (S3: tbl_df/tbl/data.frame)
## $ id : num [1:410] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ sleep_day : chr [1:410] "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
## $ total_sleep_records : num [1:410] 1 2 1 2 1 1 1 1 1 1 ...
## $ total_minutes_asleep: num [1:410] 327 384 412 340 700 304 360 325 361 430 ...
## $ total_time_in_bed : num [1:410] 346 407 442 367 712 320 377 364 384 449 ...
sleepDay_merged$sleep_day = mdy_hms(sleepDay_merged$sleep_day)
# Step_4
# Take a quick overview the dataframe and check for any messing values
# This is also useful to make sure all dates have been converted correctly
# Otherwise. we'll end up having null values
skim_without_charts(sleepDay_merged)| Name | sleepDay_merged |
| Number of rows | 410 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| numeric | 4 |
| POSIXct | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 4.994963e+09 | 2.060863e+09 | 1503960366 | 3.977334e+09 | 4702921684.0 | 6962181067 | 8792009665 |
| total_sleep_records | 0 | 1 | 1.120000e+00 | 3.500000e-01 | 1 | 1.000000e+00 | 1.0 | 1 | 3 |
| total_minutes_asleep | 0 | 1 | 4.191700e+02 | 1.186400e+02 | 58 | 3.610000e+02 | 432.5 | 490 | 796 |
| total_time_in_bed | 0 | 1 | 4.584800e+02 | 1.274600e+02 | 61 | 4.037500e+02 | 463.0 | 526 | 961 |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| sleep_day | 0 | 1 | 2016-04-12 | 2016-05-12 | 2016-04-27 | 31 |
# Step_5
# Check for any zero values
filter(sleepDay_merged, if_all(everything(), ~ . == 0))## # A tibble: 0 × 5
## # … with 5 variables: id <dbl>, sleep_day <dttm>, total_sleep_records <dbl>,
## # total_minutes_asleep <dbl>, total_time_in_bed <dbl>
# Step_6
# Check if there're no IDs entered or entered inocrrectly (more or less than
# 10 numbers)
min(sleepDay_merged$id)## [1] 1503960366
max(sleepDay_merged$id)## [1] 8792009665
# Step_7
# Check for any wrong values
sleepDay_merged %>%
filter(
total_sleep_records < 0 | total_minutes_asleep < 0 | total_time_in_bed < 0
)## # A tibble: 0 × 5
## # … with 5 variables: id <dbl>, sleep_day <dttm>, total_sleep_records <dbl>,
## # total_minutes_asleep <dbl>, total_time_in_bed <dbl>
6) Body Mass Info
# Step_1
# First modify the dataframe columns names to a more standard format
weightLogInfo_merged = read_csv(
"case_study/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv"
)
weightLogInfo_merged = clean_names(weightLogInfo_merged)
# Step_2
# Remove any duplicate entries if exist
weightLogInfo_merged = unique(weightLogInfo_merged)
# Step_3
# Check the data types of each column and modify if necessary
str(weightLogInfo_merged)## tibble [67 × 8] (S3: tbl_df/tbl/data.frame)
## $ id : num [1:67] 1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
## $ date : chr [1:67] "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
## $ weight_kg : num [1:67] 52.6 52.6 133.5 56.7 57.3 ...
## $ weight_pounds : num [1:67] 116 116 294 125 126 ...
## $ fat : num [1:67] 22 NA NA NA NA 25 NA NA NA NA ...
## $ bmi : num [1:67] 22.6 22.6 47.5 21.5 21.7 ...
## $ is_manual_report: logi [1:67] TRUE TRUE FALSE TRUE TRUE TRUE ...
## $ log_id : num [1:67] 1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
weightLogInfo_merged$date = mdy_hms(weightLogInfo_merged$date)
# Step_4
# Take a quick overview the dataframe and check for any messing values
# This is also useful to make sure all dates have been converted correctly
# Otherwise we'll end up having null values
skim_without_charts(weightLogInfo_merged)| Name | weightLogInfo_merged |
| Number of rows | 67 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| logical | 1 |
| numeric | 6 |
| POSIXct | 1 |
| ________________________ | |
| Group variables | None |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| is_manual_report | 0 | 1 | 0.61 | TRU: 41, FAL: 26 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 7.009282e+09 | 1.950322e+09 | 1.503960e+09 | 6.962181e+09 | 6.962181e+09 | 8.877689e+09 | 8.877689e+09 |
| weight_kg | 0 | 1.00 | 7.204000e+01 | 1.392000e+01 | 5.260000e+01 | 6.140000e+01 | 6.250000e+01 | 8.505000e+01 | 1.335000e+02 |
| weight_pounds | 0 | 1.00 | 1.588100e+02 | 3.070000e+01 | 1.159600e+02 | 1.353600e+02 | 1.377900e+02 | 1.875000e+02 | 2.943200e+02 |
| fat | 65 | 0.03 | 2.350000e+01 | 2.120000e+00 | 2.200000e+01 | 2.275000e+01 | 2.350000e+01 | 2.425000e+01 | 2.500000e+01 |
| bmi | 0 | 1.00 | 2.519000e+01 | 3.070000e+00 | 2.145000e+01 | 2.396000e+01 | 2.439000e+01 | 2.556000e+01 | 4.754000e+01 |
| log_id | 0 | 1.00 | 1.461772e+12 | 7.829948e+08 | 1.460444e+12 | 1.461079e+12 | 1.461802e+12 | 1.462375e+12 | 1.463098e+12 |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date | 0 | 1 | 2016-04-12 06:47:11 | 2016-05-12 23:59:59 | 2016-04-27 23:59:59 | 56 |
# Step_5
# Check for any zero values
filter(weightLogInfo_merged, if_all(everything(), ~ . == 0))## # A tibble: 0 × 8
## # … with 8 variables: id <dbl>, date <dttm>, weight_kg <dbl>,
## # weight_pounds <dbl>, fat <dbl>, bmi <dbl>, is_manual_report <lgl>,
## # log_id <dbl>
# Step_6
# Check if there're no IDs entered or entered inocrrectly (more or less than
# 10 numbers)
min(weightLogInfo_merged$id)## [1] 1503960366
max(weightLogInfo_merged$id)## [1] 8877689391
min(weightLogInfo_merged$log_id)## [1] 1.460444e+12
max(weightLogInfo_merged$log_id)## [1] 1.463098e+12
# Step_7
# Check for any wrong values
# We know from step_4 that the fat column has too many null values
weightLogInfo_merged %>%
drop_na(fat) %>%
filter(
weight_kg < 0 | weight_pounds < 0 | fat < 0 | bmi < 0 |
is_manual_report == 0
)## # A tibble: 0 × 8
## # … with 8 variables: id <dbl>, date <dttm>, weight_kg <dbl>,
## # weight_pounds <dbl>, fat <dbl>, bmi <dbl>, is_manual_report <lgl>,
## # log_id <dbl>
Act
The five recommendations shown below are based on the five findings in our dataset respectively. Before proceeding, allow me to remind you what I’ve said earlier about the limitations of this dataset and why we should collect more data by conducting another survey that’s more global, specifically addressed for women and has more participants. The CEO has also asked to do so.
It’s finally the weekend and everyone of us want to hang out with some friends, binge watch some shows, go for a quick trip and the list goes on. For me I can see the challenge here is not just to send a reminder via the app for users not to miss training on those days but rather to prepare some fun collections of different workouts that are mainly bodyweight ones, so they can be done anywhere. Thus no matter where you’re planning to go this weekend you don’t need to go look for a gym or a good running track and you don’t even have to carry those exercise tools with you. I think what we need to bring to the table is a diverse list of fun bodyweight programs and keep updating those programs so that users are looking forward to exploring them each new weekend. We may also bring group workout programs to encourage friends to do them while hanging out.
There are many fitness experts that strongly advise to workout in the early morning to get the best results for the body. You may wanna read this article by the TIME for instance. We may develop a collection of morning workouts. To keep our users motivated we may send daily notifications such as links to articles, studies, etc. that are showing more information on the benefits of waking up early and practice before going to work. This will also help the users to not miss workouts on Fridays.
Sending quick info & charts to users about their sleep quality last night and how that will affect the day ahead can be a good start. Also a creative reminders that are set at a user_defined times each day to tell the user to go to bed and turn off any distractions
& 5 We need to conduct another survey for Fitbit users asking questions about
a. Whether they use their tracker while they’re asleep or not
b. Is the sleep-related information they receive from the tracker helpful or not
c. Asking whether the tracker itself is comfortable to wear while asleep or not and if not why it’s uncomfortable
d. Do they record their weight, BMI and fat info or not
e. Is the weight-related information they receive from the tracker helpful or not
f. On average how long does it take to record this weight-related info via the app
p.s. I know the following four questions are out of scope of this case study but since we’re conducting a survey anyway let’s gather this info for future considerations
g. If they’re planning to get a new smart scale what physical and technical features are they looking for and what’s the desired price range
h. why they’ve chosen that brand to get their current tracker from
i. If they’re planning to get a new fitness tracker what physical and technical features are they looking for and what’s the desired price range
j. Would they use their trackers to look for dietary plans and why
k. Finally we may request feedback and suggestions. For instance when they’re looking for information about health status, sleep quality, etc., what they’re expecting