Main data observed
n_distinct(daily_activity$Id)
## [1] 33
n_distinct(sleep_day$Id)
## [1] 24
n_distinct(weight_log_info$Id)
## [1] 8
Before loading packages, we must select the CRAN mirror to be used.
chooseCRANmirror(graphics = getOption("menu.graphics"), ind = 1, local.only = FALSE)
In order to start cleaning, processing and analysing the project data, we
need to install the required packages by running the
install.packages() function.
Once a package is installed, we can load it by running the
library() function, once for each package.
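The install-then-load step described above could be sketched as follows. The package list is an assumption inferred from the functions used later in this document (read_csv(), as_date(), clean_names()), missing_packages() is a hypothetical helper, and the repos URL is the generic CRAN cloud mirror rather than the mirror selected above.

```r
# Assumed package list, inferred from the functions used in this analysis
required_packages <- c("tidyverse", "lubridate", "janitor")

# Hypothetical helper: return the packages from `pkgs` that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install only what is missing, then attach every required package
to_install <- missing_packages(required_packages)
if (length(to_install) > 0) {
  install.packages(to_install, repos = "https://cloud.r-project.org")
}
invisible(lapply(required_packages, library, character.only = TRUE))
```

Installing only the missing packages keeps repeated knits fast, since install.packages() is skipped once everything is already in place.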
Bellabeat is a successful small company, but it has the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. The Bellabeat Case Study focuses on one of Bellabeat’s products and analyzes smart device data to gain insight into how consumers are using their smart devices. The insights discovered will then help guide the marketing strategy for the company. The results and conclusions of the analysis will be presented to the Bellabeat executive team along with high-level recommendations for Bellabeat’s marketing strategy.
Bellabeat website: https://bellabeat.com
Bellabeat products and membership (in app subscription):
Sršen knows that an analysis of Bellabeat’s available consumer data would reveal more opportunities for growth. She has asked the marketing analytics team to focus on a Bellabeat product and analyze smart device usage data in order to gain insight into how people are already using their smart devices. Then, using this information, she would like high-level recommendations for how these trends can inform Bellabeat marketing strategy.
The scope of work is to obtain data on consumer usage of the Bellabeat app on smartphones, considering one of the products, and to analyse growth opportunities based on that data. The insights and recommendations will be presented to the Bellabeat CCO (Chief Creative Officer) and the marketing team. The presentation may also be used to support the CCO in sharing the results and in presenting a proposal for the resulting marketing growth opportunity, for approval by the Bellabeat executive team.
The public dataset suggested as the starting point is the following one (CC0: Public Domain, made available through Mobius, a data scientist): https://www.kaggle.com/arashnic/fitbit. Other public datasets can be used to complement it wherever needed and justified.
The data is assumed to be close to ROCCC (reliable, original, comprehensive, current and cited), with limitations on the quantity of data (distinct user records) and on aspects not covered (e.g. age). The sample is small: the dataset is described as covering 30 users, while the files analysed here contain 33 distinct Ids for daily activity, 24 for sleep and only 8 for weight. “Fat” information is relevant but mostly not available. User age is not available either, although it naturally has an impact on physical activity, calories, etc. User gender should also be considered for a possible future commercial expansion of scope to include men and not only women.
The data is stored on Kaggle as a set of 18 csv files. The Id field identifies the user. The data is organized in wide format (a column for each variable).
Regarding the right to use, CC0 Public Domain is the “no copyright reserved” option in the Creative Commons toolkit: it effectively means relinquishing all copyright and similar rights held in a work and dedicating those rights to the public domain.
The data to consider is currently stored in external .csv files. In order to
view and clean it in R, we need to import it. The
readr package, part of the tidyverse, has a number
of functions for “reading in” or importing data, including .csv
files.
In the chunk below, we use the read_csv() function to
import data from .csv files in the project folder called “Fitabase_Data”
(https://www.kaggle.com/arashnic/fitbit) and save each file as a
data frame. The data describing daily activity, sleep and weight is
selected to start the analysis.
daily_activity <- read_csv("C:/Users/josef/My_Documents/_Strategy/Development_AI-ML-DS/Tools_Data_Analysis/R/R-Projects/Case-Study-Bellabeat/Fitabase_Data/dailyActivity_merged.csv", show_col_types = FALSE)
sleep_day <- read_csv("C:/Users/josef/My_Documents/_Strategy/Development_AI-ML-DS/Tools_Data_Analysis/R/R-Projects/Case-study-Bellabeat/Fitabase_Data/sleepDay_merged.csv", show_col_types = FALSE)
weight_log_info <- read_csv("C:/Users/josef/My_Documents/_Strategy/Development_AI-ML-DS/Tools_Data_Analysis/R/R-Projects/Case-study-Bellabeat/Fitabase_Data/weightLogInfo_merged.csv",show_col_types = FALSE)
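The three imports above repeat a long absolute path that only exists on one machine. A possible alternative, sketched here under the assumption that the “Fitabase_Data” folder sits in the working directory of the project, is a small wrapper around read_csv(). The name read_fitabase() is hypothetical and not part of readr.

```r
library(readr)

# Hypothetical helper: read one csv file from a project-relative data folder
read_fitabase <- function(file_name, data_dir = "Fitabase_Data") {
  read_csv(file.path(data_dir, file_name), show_col_types = FALSE)
}

# Only attempt the import when the assumed folder is actually present
if (dir.exists("Fitabase_Data")) {
  daily_activity  <- read_fitabase("dailyActivity_merged.csv")
  sleep_day       <- read_fitabase("sleepDay_merged.csv")
  weight_log_info <- read_fitabase("weightLogInfo_merged.csv")
}
```

Using file.path() with a relative folder keeps the chunk portable across machines and avoids typing the same directory three times.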
Before starting to clean up the data, we take some time to explore
it, using the head() function in the code chunk below:
head(daily_activity)
## # A tibble: 6 × 15
## Id Activ…¹ Total…² Total…³ Track…⁴ Logge…⁵ VeryA…⁶ Moder…⁷ Light…⁸ Seden…⁹
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2… 13162 8.5 8.5 0 1.88 0.550 6.06 0
## 2 1.50e9 4/13/2… 10735 6.97 6.97 0 1.57 0.690 4.71 0
## 3 1.50e9 4/14/2… 10460 6.74 6.74 0 2.44 0.400 3.91 0
## 4 1.50e9 4/15/2… 9762 6.28 6.28 0 2.14 1.26 2.83 0
## 5 1.50e9 4/16/2… 12669 8.16 8.16 0 2.71 0.410 5.04 0
## 6 1.50e9 4/17/2… 9705 6.48 6.48 0 3.19 0.780 2.51 0
## # … with 5 more variables: VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## # LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>, and
## # abbreviated variable names ¹ActivityDate, ²TotalSteps, ³TotalDistance,
## # ⁴TrackerDistance, ⁵LoggedActivitiesDistance, ⁶VeryActiveDistance,
## # ⁷ModeratelyActiveDistance, ⁸LightActiveDistance, ⁹SedentaryActiveDistance
head(sleep_day)
## # A tibble: 6 × 5
## Id SleepDay TotalSleepRecords TotalMinutesAsleep TotalT…¹
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 1 327 346
## 2 1503960366 4/13/2016 12:00:00 AM 2 384 407
## 3 1503960366 4/15/2016 12:00:00 AM 1 412 442
## 4 1503960366 4/16/2016 12:00:00 AM 2 340 367
## 5 1503960366 4/17/2016 12:00:00 AM 1 700 712
## 6 1503960366 4/19/2016 12:00:00 AM 1 304 320
## # … with abbreviated variable name ¹TotalTimeInBed
head(weight_log_info)
## # A tibble: 6 × 8
## Id Date WeightKg Weight…¹ Fat BMI IsMan…² LogId
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl>
## 1 1503960366 5/2/2016 11:59:59 PM 52.6 116. 22 22.6 TRUE 1.46e12
## 2 1503960366 5/3/2016 11:59:59 PM 52.6 116. NA 22.6 TRUE 1.46e12
## 3 1927972279 4/13/2016 1:08:52 AM 134. 294. NA 47.5 FALSE 1.46e12
## 4 2873212765 4/21/2016 11:59:59 PM 56.7 125. NA 21.5 TRUE 1.46e12
## 5 2873212765 5/12/2016 11:59:59 PM 57.3 126. NA 21.7 TRUE 1.46e12
## 6 4319703577 4/17/2016 11:59:59 PM 72.4 160. 25 27.5 TRUE 1.46e12
## # … with abbreviated variable names ¹WeightPounds, ²IsManualReport
We also use colnames() to check the names of the columns
in the data frames.
colnames(daily_activity)
## [1] "Id" "ActivityDate"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories"
colnames(sleep_day)
## [1] "Id" "SleepDay" "TotalSleepRecords"
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
colnames(weight_log_info)
## [1] "Id" "Date" "WeightKg" "WeightPounds"
## [5] "Fat" "BMI" "IsManualReport" "LogId"
n_distinct(daily_activity$Id)
## [1] 33
n_distinct(sleep_day$Id)
## [1] 24
n_distinct(weight_log_info$Id)
## [1] 8
str(daily_activity)
## spc_tbl_ [940 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ TotalSteps : num [1:940] 13162 10735 10460 9762 12669 ...
## $ TotalDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : num [1:940] 728 776 1218 726 773 ...
## $ Calories : num [1:940] 1985 1797 1776 1745 1863 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. ActivityDate = col_character(),
## .. TotalSteps = col_double(),
## .. TotalDistance = col_double(),
## .. TrackerDistance = col_double(),
## .. LoggedActivitiesDistance = col_double(),
## .. VeryActiveDistance = col_double(),
## .. ModeratelyActiveDistance = col_double(),
## .. LightActiveDistance = col_double(),
## .. SedentaryActiveDistance = col_double(),
## .. VeryActiveMinutes = col_double(),
## .. FairlyActiveMinutes = col_double(),
## .. LightlyActiveMinutes = col_double(),
## .. SedentaryMinutes = col_double(),
## .. Calories = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(sleep_day)
## spc_tbl_ [413 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:413] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ SleepDay : chr [1:413] "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
## $ TotalSleepRecords : num [1:413] 1 2 1 2 1 1 1 1 1 1 ...
## $ TotalMinutesAsleep: num [1:413] 327 384 412 340 700 304 360 325 361 430 ...
## $ TotalTimeInBed : num [1:413] 346 407 442 367 712 320 377 364 384 449 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. SleepDay = col_character(),
## .. TotalSleepRecords = col_double(),
## .. TotalMinutesAsleep = col_double(),
## .. TotalTimeInBed = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(weight_log_info)
## spc_tbl_ [67 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:67] 1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
## $ Date : chr [1:67] "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
## $ WeightKg : num [1:67] 52.6 52.6 133.5 56.7 57.3 ...
## $ WeightPounds : num [1:67] 116 116 294 125 126 ...
## $ Fat : num [1:67] 22 NA NA NA NA 25 NA NA NA NA ...
## $ BMI : num [1:67] 22.6 22.6 47.5 21.5 21.7 ...
## $ IsManualReport: logi [1:67] TRUE TRUE FALSE TRUE TRUE TRUE ...
## $ LogId : num [1:67] 1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. Date = col_character(),
## .. WeightKg = col_double(),
## .. WeightPounds = col_double(),
## .. Fat = col_double(),
## .. BMI = col_double(),
## .. IsManualReport = col_logical(),
## .. LogId = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
RStudio Desktop was selected for all phases of the project, as it is well suited to data analysis and supports R Markdown. Daily activity, sleep day and weight log information were selected from the several csv files available. Calories would also be interesting, but were not considered because age information is missing. Data from the csv files was imported into RStudio and checked for missing values and duplicated rows. Column names were also checked with clean_names(), which returns names that are unique and consist only of the _ character, numbers, and letters. The cleaned names are only displayed below; the later chunks keep using the original column names.
sum(is.na(daily_activity))
## [1] 0
sum(is.na(sleep_day))
## [1] 0
sum(is.na(weight_log_info))
## [1] 65
Note: in weight_log_info the “Fat” attribute is mostly NA. We keep the column, but “Fat” is therefore not suitable for use in the analysis.
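To see which columns hold the missing values, rather than only the overall total, the NA count can be broken down per column. The sketch below uses a toy data frame built from rounded values of the first weight_log_info rows printed earlier; weight_sample and na_per_column are illustrative names.

```r
# Toy data frame with rounded values from the first weight_log_info rows
weight_sample <- data.frame(
  WeightKg = c(52.6, 52.6, 133.5, 56.7),
  Fat      = c(22, NA, NA, NA),
  BMI      = c(22.6, 22.6, 47.5, 21.5)
)

# is.na() gives a logical matrix; colSums() counts the TRUE values per column
na_per_column <- colSums(is.na(weight_sample))
na_per_column
```

On the full data the same call would be colSums(is.na(weight_log_info)).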
sum(duplicated(daily_activity))
## [1] 0
sum(duplicated(sleep_day))
## [1] 3
sum(duplicated(weight_log_info))
## [1] 0
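Before dropping duplicates it can be useful to look at the affected rows. The following is a minimal sketch on made-up data (the Ids and values are illustrative, not taken from the dataset) showing how duplicated() can list every copy of a repeated row and how unique() removes the repeats.

```r
# Made-up sample in which the first two rows are identical
sleep_sample <- data.frame(
  Id = c(1, 1, 2, 2),
  SleepDay = c("4/12/2016", "4/12/2016", "4/12/2016", "4/13/2016"),
  TotalMinutesAsleep = c(327, 327, 384, 412)
)

# Flag a row if it repeats an earlier row OR is repeated by a later one
is_dup <- duplicated(sleep_sample) | duplicated(sleep_sample, fromLast = TRUE)
duplicate_rows <- sleep_sample[is_dup, ]

# unique() keeps the first copy of each row, similar to dplyr's distinct()
deduplicated <- unique(sleep_sample)
```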
sleep_day <- sleep_day %>%
distinct() %>%
drop_na()
sum(duplicated(daily_activity))
## [1] 0
sum(duplicated(sleep_day))
## [1] 0
sum(duplicated(weight_log_info))
## [1] 0
clean_names(daily_activity)
## # A tibble: 940 × 15
## id activity…¹ total…² total…³ track…⁴ logge…⁵ very_…⁶ moder…⁷ light…⁸
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 13162 8.5 8.5 0 1.88 0.550 6.06
## 2 1503960366 4/13/2016 10735 6.97 6.97 0 1.57 0.690 4.71
## 3 1503960366 4/14/2016 10460 6.74 6.74 0 2.44 0.400 3.91
## 4 1503960366 4/15/2016 9762 6.28 6.28 0 2.14 1.26 2.83
## 5 1503960366 4/16/2016 12669 8.16 8.16 0 2.71 0.410 5.04
## 6 1503960366 4/17/2016 9705 6.48 6.48 0 3.19 0.780 2.51
## 7 1503960366 4/18/2016 13019 8.59 8.59 0 3.25 0.640 4.71
## 8 1503960366 4/19/2016 15506 9.88 9.88 0 3.53 1.32 5.03
## 9 1503960366 4/20/2016 10544 6.68 6.68 0 1.96 0.480 4.24
## 10 1503960366 4/21/2016 9819 6.34 6.34 0 1.34 0.350 4.65
## # … with 930 more rows, 6 more variables: sedentary_active_distance <dbl>,
## # very_active_minutes <dbl>, fairly_active_minutes <dbl>,
## # lightly_active_minutes <dbl>, sedentary_minutes <dbl>, calories <dbl>, and
## # abbreviated variable names ¹activity_date, ²total_steps, ³total_distance,
## # ⁴tracker_distance, ⁵logged_activities_distance, ⁶very_active_distance,
## # ⁷moderately_active_distance, ⁸light_active_distance
clean_names(sleep_day)
## # A tibble: 410 × 5
## id sleep_day total_sleep_records total_minutes_…¹ total…²
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 1 327 346
## 2 1503960366 4/13/2016 12:00:00 AM 2 384 407
## 3 1503960366 4/15/2016 12:00:00 AM 1 412 442
## 4 1503960366 4/16/2016 12:00:00 AM 2 340 367
## 5 1503960366 4/17/2016 12:00:00 AM 1 700 712
## 6 1503960366 4/19/2016 12:00:00 AM 1 304 320
## 7 1503960366 4/20/2016 12:00:00 AM 1 360 377
## 8 1503960366 4/21/2016 12:00:00 AM 1 325 364
## 9 1503960366 4/23/2016 12:00:00 AM 1 361 384
## 10 1503960366 4/24/2016 12:00:00 AM 1 430 449
## # … with 400 more rows, and abbreviated variable names ¹total_minutes_asleep,
## # ²total_time_in_bed
clean_names(weight_log_info)
## # A tibble: 67 × 8
## id date weight…¹ weigh…² fat bmi is_ma…³ log_id
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl>
## 1 1503960366 5/2/2016 11:59:59 PM 52.6 116. 22 22.6 TRUE 1.46e12
## 2 1503960366 5/3/2016 11:59:59 PM 52.6 116. NA 22.6 TRUE 1.46e12
## 3 1927972279 4/13/2016 1:08:52 AM 134. 294. NA 47.5 FALSE 1.46e12
## 4 2873212765 4/21/2016 11:59:59 PM 56.7 125. NA 21.5 TRUE 1.46e12
## 5 2873212765 5/12/2016 11:59:59 PM 57.3 126. NA 21.7 TRUE 1.46e12
## 6 4319703577 4/17/2016 11:59:59 PM 72.4 160. 25 27.5 TRUE 1.46e12
## 7 4319703577 5/4/2016 11:59:59 PM 72.3 159. NA 27.4 TRUE 1.46e12
## 8 4558609924 4/18/2016 11:59:59 PM 69.7 154. NA 27.2 TRUE 1.46e12
## 9 4558609924 4/25/2016 11:59:59 PM 70.3 155. NA 27.5 TRUE 1.46e12
## 10 4558609924 5/1/2016 11:59:59 PM 69.9 154. NA 27.3 TRUE 1.46e12
## # … with 57 more rows, and abbreviated variable names ¹weight_kg,
## # ²weight_pounds, ³is_manual_report
Dates were converted to ensure consistency and simplicity in the subsequent analysis phase.
The date columns are converted to the Date type (printed as yyyy-mm-dd) and stored in new columns named “Date_Activity”, “Date_Sleep” and “Date_weight”. The format strings follow the strftime conventions (https://strftime.org/).
daily_activity <- daily_activity %>%
mutate(Date_Activity = as_date(ActivityDate, format = "%m/%d/%Y"))
head(daily_activity)
## # A tibble: 6 × 16
## Id Activ…¹ Total…² Total…³ Track…⁴ Logge…⁵ VeryA…⁶ Moder…⁷ Light…⁸ Seden…⁹
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2… 13162 8.5 8.5 0 1.88 0.550 6.06 0
## 2 1.50e9 4/13/2… 10735 6.97 6.97 0 1.57 0.690 4.71 0
## 3 1.50e9 4/14/2… 10460 6.74 6.74 0 2.44 0.400 3.91 0
## 4 1.50e9 4/15/2… 9762 6.28 6.28 0 2.14 1.26 2.83 0
## 5 1.50e9 4/16/2… 12669 8.16 8.16 0 2.71 0.410 5.04 0
## 6 1.50e9 4/17/2… 9705 6.48 6.48 0 3.19 0.780 2.51 0
## # … with 6 more variables: VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## # LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>,
## # Date_Activity <date>, and abbreviated variable names ¹ActivityDate,
## # ²TotalSteps, ³TotalDistance, ⁴TrackerDistance, ⁵LoggedActivitiesDistance,
## # ⁶VeryActiveDistance, ⁷ModeratelyActiveDistance, ⁸LightActiveDistance,
## # ⁹SedentaryActiveDistance
sleep_day <- sleep_day %>%
mutate(Date_Sleep = as_date(SleepDay,format ="%m/%d/%Y %I:%M:%S %p"))
head(sleep_day)
## # A tibble: 6 × 6
## Id SleepDay TotalSleepRecords TotalM…¹ Total…² Date_Sleep
## <dbl> <chr> <dbl> <dbl> <dbl> <date>
## 1 1503960366 4/12/2016 12:00:00 AM 1 327 346 2016-04-12
## 2 1503960366 4/13/2016 12:00:00 AM 2 384 407 2016-04-13
## 3 1503960366 4/15/2016 12:00:00 AM 1 412 442 2016-04-15
## 4 1503960366 4/16/2016 12:00:00 AM 2 340 367 2016-04-16
## 5 1503960366 4/17/2016 12:00:00 AM 1 700 712 2016-04-17
## 6 1503960366 4/19/2016 12:00:00 AM 1 304 320 2016-04-19
## # … with abbreviated variable names ¹TotalMinutesAsleep, ²TotalTimeInBed
weight_log_info <- weight_log_info %>%
mutate(Date_weight = as_date(Date,format ="%m/%d/%Y %I:%M:%S %p"))
head(weight_log_info)
## # A tibble: 6 × 9
## Id Date Weigh…¹ Weigh…² Fat BMI IsMan…³ LogId Date_wei…⁴
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl> <date>
## 1 1503960366 5/2/2016 11… 52.6 116. 22 22.6 TRUE 1.46e12 2016-05-02
## 2 1503960366 5/3/2016 11… 52.6 116. NA 22.6 TRUE 1.46e12 2016-05-03
## 3 1927972279 4/13/2016 1… 134. 294. NA 47.5 FALSE 1.46e12 2016-04-13
## 4 2873212765 4/21/2016 1… 56.7 125. NA 21.5 TRUE 1.46e12 2016-04-21
## 5 2873212765 5/12/2016 1… 57.3 126. NA 21.7 TRUE 1.46e12 2016-05-12
## 6 4319703577 4/17/2016 1… 72.4 160. 25 27.5 TRUE 1.46e12 2016-04-17
## # … with abbreviated variable names ¹WeightKg, ²WeightPounds, ³IsManualReport,
## # ⁴Date_weight
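As a quick check of the format strings used in the conversions above, the same strftime codes can be tried on single sample strings. This sketch uses base R's as.Date() instead of as_date(), which is a swapped-in function chosen only so the example runs without extra packages.

```r
# %m = month, %d = day, %Y = four-digit year, matching "4/12/2016"
activity_date <- as.Date("4/12/2016", format = "%m/%d/%Y")

# as.Date() ignores characters beyond the given format, so the time part
# of the sleep timestamps can simply be left out of the format string
sleep_date <- as.Date("4/12/2016 12:00:00 AM", format = "%m/%d/%Y")

format(activity_date)                  # "2016-04-12"
identical(activity_date, sleep_date)   # TRUE
```

Both columns end up on the same Date scale, which is what makes a later comparison or join by day possible.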
How many unique participants are there in each dataframe? It looks like there may be more participants in the daily activity dataset than the sleep dataset.
n_distinct(daily_activity$Id)
## [1] 33
n_distinct(sleep_day$Id)
## [1] 24
n_distinct(weight_log_info$Id)
## [1] 8
How many observations are there in each dataframe?
nrow(daily_activity)
## [1] 940
nrow(sleep_day)
## [1] 410
nrow(weight_log_info)
## [1] 67
What are some quick summary statistics we’d want to know about each data frame?
For the daily activity dataframe:
daily_activity %>%
select(TotalSteps,
TotalDistance,
SedentaryMinutes) %>%
summary()
## TotalSteps TotalDistance SedentaryMinutes
## Min. : 0 Min. : 0.000 Min. : 0.0
## 1st Qu.: 3790 1st Qu.: 2.620 1st Qu.: 729.8
## Median : 7406 Median : 5.245 Median :1057.5
## Mean : 7638 Mean : 5.490 Mean : 991.2
## 3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.:1229.5
## Max. :36019 Max. :28.030 Max. :1440.0
For the sleep dataframe:
sleep_day %>%
select(TotalSleepRecords,
TotalMinutesAsleep,
TotalTimeInBed) %>%
summary()
## TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## Min. :1.00 Min. : 58.0 Min. : 61.0
## 1st Qu.:1.00 1st Qu.:361.0 1st Qu.:403.8
## Median :1.00 Median :432.5 Median :463.0
## Mean :1.12 Mean :419.2 Mean :458.5
## 3rd Qu.:1.00 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :3.00 Max. :796.0 Max. :961.0
For the weight dataframe:
weight_log_info %>%
select(WeightKg,
# Fat,
BMI) %>%
summary()
## WeightKg BMI
## Min. : 52.60 Min. :21.45
## 1st Qu.: 61.40 1st Qu.:23.96
## Median : 62.50 Median :24.39
## Mean : 72.04 Mean :25.19
## 3rd Qu.: 85.05 3rd Qu.:25.56
## Max. :133.50 Max. :47.54
Distance average by customer:
daily_activity_mean <- daily_activity %>%
group_by(Id) %>%
summarise(daily_activity_average = mean(TrackerDistance, na.rm = TRUE))
daily_activity_mean
## # A tibble: 33 × 2
## Id daily_activity_average
## <dbl> <dbl>
## 1 1503960366 7.81
## 2 1624580081 3.91
## 3 1644430081 5.30
## 4 1844505072 1.71
## 5 1927972279 0.635
## 6 2022484408 8.08
## 7 2026352035 3.45
## 8 2320127002 3.19
## 9 2347167796 6.36
## 10 2873212765 5.10
## # … with 23 more rows
Sleep average by customer:
sleep_mean <- sleep_day %>%
group_by(Id) %>%
summarise(sleep_average = mean(TotalMinutesAsleep, na.rm = TRUE))
sleep_mean
## # A tibble: 24 × 2
## Id sleep_average
## <dbl> <dbl>
## 1 1503960366 360.
## 2 1644430081 294
## 3 1844505072 652
## 4 1927972279 417
## 5 2026352035 506.
## 6 2320127002 61
## 7 2347167796 447.
## 8 3977333714 294.
## 9 4020332650 349.
## 10 4319703577 477.
## # … with 14 more rows
Weight average by customer:
weight_mean <- weight_log_info %>%
group_by(Id) %>%
summarise(weight_average = mean(WeightKg, na.rm = TRUE))
weight_mean
## # A tibble: 8 × 2
## Id weight_average
## <dbl> <dbl>
## 1 1503960366 52.6
## 2 1927972279 134.
## 3 2873212765 57
## 4 4319703577 72.4
## 5 4558609924 69.6
## 6 5577150313 90.7
## 7 6962181067 61.6
## 8 8877689391 85.1
What does this tell us about this sample of people’s activities?
At first glance, the available data is limited in both quantity and completeness. In the near future, more data should be logged so that it can be used to better serve the customer experience.
What’s the relationship between steps taken in a day and sedentary minutes? How could this help inform the customer segments that we can market to? E.g. position this more as a way to get started in walking more? Or to measure steps that you’re already taking?
ggplot(data=daily_activity, aes(x=TotalSteps, y=SedentaryMinutes)) + geom_point()
ggplot(data = daily_activity) +
geom_point(mapping = aes(x = TotalSteps, y = Calories, color = TotalDistance))
What’s the relationship between minutes asleep and time in bed? You might expect it to be almost completely linear - are there any unexpected trends?
ggplot(data=sleep_day, aes(x=TotalMinutesAsleep, y=TotalTimeInBed)) + geom_point()
What’s the relationship between weight and BMI? You might expect it to be almost completely linear - are there any unexpected trends? Notes: there is almost no “Fat” data, and only 8 distinct persons made weight logs. This is an opportunity for improvement!
ggplot(data=weight_log_info, aes(x=WeightKg, y=BMI)) + geom_point()
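One reason the weight and BMI points need not fall on a single line is that BMI is weight in kilograms divided by the square of height in metres, so users of different heights lie on different lines. The sketch below backs out the implied height from rounded values of three rows printed earlier; bmi_sample and implied_height_m are illustrative names, and the heights are derived quantities, not data from the files.

```r
# Rounded WeightKg and BMI values from three weight_log_info rows
bmi_sample <- data.frame(
  WeightKg = c(52.6, 133.5, 56.7),
  BMI      = c(22.6, 47.5, 21.5)
)

# BMI = kg / m^2, so the implied height in metres is sqrt(kg / BMI)
bmi_sample$implied_height_m <- sqrt(bmi_sample$WeightKg / bmi_sample$BMI)
bmi_sample
```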
What could these trends tell you about how to help market this product? Or areas where you might want to explore further?
combined_data_activity_weight <- merge(daily_activity_mean, weight_mean, by="Id")
combined_data_weight_sleep <- merge(weight_mean, sleep_mean, by="Id")
Take a look at how many participants are in this data set.
n_distinct(combined_data_activity_weight$Id)
## [1] 8
colnames(combined_data_activity_weight)
## [1] "Id" "daily_activity_average" "weight_average"
n_distinct(combined_data_weight_sleep$Id)
## [1] 6
colnames(combined_data_weight_sleep)
## [1] "Id" "weight_average" "sleep_average"
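Note that the participant Ids present in the daily activity dataset but missing from the weight dataset have been filtered out by merge(), which performs an inner join by default. Alternatively, we could use an outer join (merge() with all = TRUE, or full_join() from dplyr) to keep them in the dataset.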
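The difference between the default inner join and an outer join can be sketched on two small made-up tables (the Ids and averages are illustrative only):

```r
# Made-up per-user averages: three users with activity, two with weight logs
activity_sample <- data.frame(
  Id = c(1, 2, 3),
  daily_activity_average = c(7.8, 3.9, 5.3)
)
weight_sample <- data.frame(
  Id = c(1, 3),
  weight_average = c(52.6, 57.0)
)

# Default merge(): inner join, keeps only Ids present in both tables
inner <- merge(activity_sample, weight_sample, by = "Id")

# all = TRUE: outer join, keeps every Id and fills the gaps with NA
outer <- merge(activity_sample, weight_sample, by = "Id", all = TRUE)
```

With the outer join, the user without weight logs stays in the result with weight_average set to NA, so any later plot or summary has to handle those missing values explicitly.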
Now we can explore some different relationships between activity and sleep as well. For example, participants who sleep more also take more steps or fewer steps per day? Is there a relationship at all? How could these answers help inform the marketing strategy of how you position this new product?
ggplot(data=combined_data_activity_weight, aes(x=daily_activity_average, y=weight_average)) + geom_point()
ggplot(data=combined_data_weight_sleep, aes(x=sleep_average, y=weight_average)) + geom_point()
The available data is scarce and incomplete in some respects (e.g. age). It is highly recommended that the app logs are thoroughly revised to include more useful data and that customers are incentivised to enable data collection. Nevertheless, preliminary recommendations for Bellabeat are given below.
Naturally the most prominent step for Bellabeat business expansion would be to create a second line of products for men, adapting the name (e.g. Letbeat, Letusbeat, Letitbeat).
Sleeping well seems to be of outstanding relevance, so both the app and the products offered by Bellabeat would become much more useful by providing guidance, motivation and support on sleep. Sleep apps, most of them based on white noise, may serve as examples for adding similar capabilities to Bellabeat's sleep initiatives. As users like to listen to music while they perform activities (walking, etc.), this would also be a good add-on to monetize.
Advertising recommendations for health and meals based on usage would also help further monetization, in a way that the user can enable, disable or configure.
Missing age information should be corrected as a matter of urgency. It is relevant for analysing the several age groups, and from that complementary information we would derive suitable recommendations for business enhancement. Fat information is also missing and, given the nature of the activity, it would be good to have it available as well.