R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

Using RStudio, when clicking the Knit button, a document will be generated that includes both content as well as the output of any embedded R code chunks within the document.

Loading packages

Prior to loading packages, we must select the CRAN mirror to be used.

chooseCRANmirror(graphics = getOption("menu.graphics"), ind = 1, local.only = FALSE)

In order to start cleaning, processing and analysing the project data, we need to install the required packages by running the install.packages() function.

Once a package is installed, we can load it by running the library() function, once for each package.
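
The install-then-load step described above can be sketched as follows. This is a minimal sketch under assumptions: the package list (tidyverse, lubridate, janitor) is inferred from the functions used later in this document, and missing_packages() is a hypothetical helper, not part of any package. Installation is guarded by interactive() so that knitting the document does not trigger downloads.

```r
# Assumed package list, inferred from the functions used in this document
required_packages <- c("tidyverse", "lubridate", "janitor")

# Hypothetical helper: returns the packages that are not yet installed
missing_packages <- function(pkgs) {
  pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
}

# Install only what is missing, and only in an interactive session
to_install <- missing_packages(required_packages)
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}

# Load every required package that is available
available <- setdiff(required_packages, missing_packages(required_packages))
invisible(lapply(available, library, character.only = TRUE))
```

Checking before installing avoids reinstalling packages on every run of the document.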

Introduction

Bellabeat is a successful small company, but it has the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. The Bellabeat Case Study focuses on one of Bellabeat’s products and analyzes smart device data to gain insight into how consumers are using their smart devices. The insights discovered will then help guide marketing strategy for the company. The results and conclusions of the analysis will be presented to the Bellabeat executive team along with high-level recommendations for Bellabeat’s marketing strategy.

Bellabeat website: https://bellabeat.com

Bellabeat products and membership (in-app subscription):

App: https://play.google.com/store/apps/details?id=com.bellabeat.cacao
Leaf: https://bellabeat.com/ivy/
Time: https://bellabeat.com/time/
Spring: https://bellabeat.com/spring/

Sršen knows that an analysis of Bellabeat’s available consumer data would reveal more opportunities for growth. She has asked the marketing analytics team to focus on a Bellabeat product and analyze smart device usage data in order to gain insight into how people are already using their smart devices. Then, using this information, she would like high-level recommendations for how these trends can inform Bellabeat marketing strategy.

ASK - Summary of the business task

The scope of work is to obtain data on consumer usage of the Bellabeat app on smartphones, considering one of the products, and to analyse growth opportunities based on it. The insights and recommendations will be presented to Bellabeat’s CCO (Chief Creative Officer) and the marketing team. The presentation will eventually also support the CCO in sharing the results and in proposing the derived marketing growth opportunity to the Bellabeat executive team for approval.

PREPARE - A description of all data sources used

The public dataset suggested as the starting point is the following (CC0: Public Domain, made available through Mobius, a data scientist): https://www.kaggle.com/arashnic/fitbit. Other public datasets can be used as a complement wherever needed and justified.

The data will be assumed to be near ROCCC (reliable, original, comprehensive, current and cited), but with limitations on the quantity of data (distinct user records) and on aspects not covered (e.g. age). The sample is relatively small, nominally only 30 users, which limits how far the conclusions can be generalised; this limitation is highlighted further below. Weight information is even more limited, covering only 8 distinct users. “Fat” information is mostly not available, although it is relevant. User age would be relevant to consider, as it naturally has an impact on physical activity, calories, etc., but information on age is not available. User gender should also be considered for an eventual future commercial expansion of scope to include men and not only women.

The data is stored in Kaggle as a set of 18 csv files. The Id field identifies the users. The dataset is said to cover 30 distinct users, although the files used here contain 33 distinct Ids for daily activity, 24 for sleep and 8 for weight. The data is organized in the wide format (a column for each variable).

In terms of right to use, CC0 Public Domain is the “no copyright reserved” option in the Creative Commons toolkit. It effectively means relinquishing all copyright and similar rights that you hold in a work and dedicating those rights to the public domain.

Importing data

The data to consider is currently stored in external .csv files. In order to view and clean it in R, we need to import it. The readr package, part of the tidyverse, has a number of functions for “reading in” or importing data, including .csv files.

In the chunk below, we use the read_csv() function to import data from the .csv files in the project folder called “Fitabase_Data” (https://www.kaggle.com/arashnic/fitbit) and save them as data frames. The data describing daily activity, sleep and weight logs is selected to start the analysis.

daily_activity <- read_csv("C:/Users/josef/My_Documents/_Strategy/Development_AI-ML-DS/Tools_Data_Analysis/R/R-Projects/Case-Study-Bellabeat/Fitabase_Data/dailyActivity_merged.csv", show_col_types = FALSE)

sleep_day <- read_csv("C:/Users/josef/My_Documents/_Strategy/Development_AI-ML-DS/Tools_Data_Analysis/R/R-Projects/Case-study-Bellabeat/Fitabase_Data/sleepDay_merged.csv", show_col_types = FALSE)

weight_log_info <- read_csv("C:/Users/josef/My_Documents/_Strategy/Development_AI-ML-DS/Tools_Data_Analysis/R/R-Projects/Case-study-Bellabeat/Fitabase_Data/weightLogInfo_merged.csv",show_col_types = FALSE)
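
The three imports above repeat the same long absolute path. A minimal sketch of one way to avoid that, assuming the folder location is kept in a single variable: data_dir and read_fitabase() are hypothetical names introduced only for illustration, and a temporary folder with two sample rows stands in for the real “Fitabase_Data” folder so that the sketch runs on its own.

```r
library(readr)

# Hypothetical stand-in for the real "Fitabase_Data" project folder
data_dir <- file.path(tempdir(), "Fitabase_Data")
dir.create(data_dir, showWarnings = FALSE)

# Two sample rows so the sketch is self-contained
writeLines(
  c("Id,ActivityDate,TotalSteps",
    "1503960366,4/12/2016,13162",
    "1503960366,4/13/2016,10735"),
  file.path(data_dir, "dailyActivity_merged.csv")
)

# Hypothetical helper: builds the full path once and reads the file
read_fitabase <- function(file_name, dir = data_dir) {
  read_csv(file.path(dir, file_name), show_col_types = FALSE)
}

activity_demo <- read_fitabase("dailyActivity_merged.csv")
activity_demo
```

With this layout, moving the project to another machine only requires changing data_dir.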

Getting to know the project data

Before starting to clean the data, we will take some time to explore it, using the head() function in the code chunk below:

head(daily_activity)

## # A tibble: 6 × 15
##       Id Activ…¹ Total…² Total…³ Track…⁴ Logge…⁵ VeryA…⁶ Moder…⁷ Light…⁸ Seden…⁹
##    <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 1.50e9 4/12/2…   13162    8.5     8.5        0    1.88   0.550    6.06       0
## 2 1.50e9 4/13/2…   10735    6.97    6.97       0    1.57   0.690    4.71       0
## 3 1.50e9 4/14/2…   10460    6.74    6.74       0    2.44   0.400    3.91       0
## 4 1.50e9 4/15/2…    9762    6.28    6.28       0    2.14   1.26     2.83       0
## 5 1.50e9 4/16/2…   12669    8.16    8.16       0    2.71   0.410    5.04       0
## 6 1.50e9 4/17/2…    9705    6.48    6.48       0    3.19   0.780    2.51       0
## # … with 5 more variables: VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## #   LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>, and
## #   abbreviated variable names ¹ActivityDate, ²TotalSteps, ³TotalDistance,
## #   ⁴TrackerDistance, ⁵LoggedActivitiesDistance, ⁶VeryActiveDistance,
## #   ⁷ModeratelyActiveDistance, ⁸LightActiveDistance, ⁹SedentaryActiveDistance

head(sleep_day)

## # A tibble: 6 × 5
##           Id SleepDay              TotalSleepRecords TotalMinutesAsleep TotalT…¹
##        <dbl> <chr>                             <dbl>              <dbl>    <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327      346
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384      407
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412      442
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340      367
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700      712
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304      320
## # … with abbreviated variable name ¹TotalTimeInBed

head(weight_log_info)

## # A tibble: 6 × 8
##           Id Date                  WeightKg Weight…¹   Fat   BMI IsMan…²   LogId
##        <dbl> <chr>                    <dbl>    <dbl> <dbl> <dbl> <lgl>     <dbl>
## 1 1503960366 5/2/2016 11:59:59 PM      52.6     116.    22  22.6 TRUE    1.46e12
## 2 1503960366 5/3/2016 11:59:59 PM      52.6     116.    NA  22.6 TRUE    1.46e12
## 3 1927972279 4/13/2016 1:08:52 AM     134.      294.    NA  47.5 FALSE   1.46e12
## 4 2873212765 4/21/2016 11:59:59 PM     56.7     125.    NA  21.5 TRUE    1.46e12
## 5 2873212765 5/12/2016 11:59:59 PM     57.3     126.    NA  21.7 TRUE    1.46e12
## 6 4319703577 4/17/2016 11:59:59 PM     72.4     160.    25  27.5 TRUE    1.46e12
## # … with abbreviated variable names ¹WeightPounds, ²IsManualReport

We also use colnames() to check the names of the columns in the data frames.

colnames(daily_activity)

##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"

colnames(sleep_day)

## [1] "Id"                 "SleepDay"           "TotalSleepRecords" 
## [4] "TotalMinutesAsleep" "TotalTimeInBed"

colnames(weight_log_info)

## [1] "Id"             "Date"           "WeightKg"       "WeightPounds"  
## [5] "Fat"            "BMI"            "IsManualReport" "LogId"

Main data observed

n_distinct(daily_activity$Id)

## [1] 33

n_distinct(sleep_day$Id)

## [1] 24

n_distinct(weight_log_info$Id)

## [1] 8

Familiarizing with data and column datatypes

str(daily_activity)

## spc_tbl_ [940 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id                      : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ TotalSteps              : num [1:940] 13162 10735 10460 9762 12669 ...
##  $ TotalDistance           : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : num [1:940] 728 776 1218 726 773 ...
##  $ Calories                : num [1:940] 1985 1797 1776 1745 1863 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   ActivityDate = col_character(),
##   ..   TotalSteps = col_double(),
##   ..   TotalDistance = col_double(),
##   ..   TrackerDistance = col_double(),
##   ..   LoggedActivitiesDistance = col_double(),
##   ..   VeryActiveDistance = col_double(),
##   ..   ModeratelyActiveDistance = col_double(),
##   ..   LightActiveDistance = col_double(),
##   ..   SedentaryActiveDistance = col_double(),
##   ..   VeryActiveMinutes = col_double(),
##   ..   FairlyActiveMinutes = col_double(),
##   ..   LightlyActiveMinutes = col_double(),
##   ..   SedentaryMinutes = col_double(),
##   ..   Calories = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

str(sleep_day)

## spc_tbl_ [413 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id                : num [1:413] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ SleepDay          : chr [1:413] "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
##  $ TotalSleepRecords : num [1:413] 1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: num [1:413] 327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : num [1:413] 346 407 442 367 712 320 377 364 384 449 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   SleepDay = col_character(),
##   ..   TotalSleepRecords = col_double(),
##   ..   TotalMinutesAsleep = col_double(),
##   ..   TotalTimeInBed = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

str(weight_log_info)

## spc_tbl_ [67 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id            : num [1:67] 1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
##  $ Date          : chr [1:67] "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
##  $ WeightKg      : num [1:67] 52.6 52.6 133.5 56.7 57.3 ...
##  $ WeightPounds  : num [1:67] 116 116 294 125 126 ...
##  $ Fat           : num [1:67] 22 NA NA NA NA 25 NA NA NA NA ...
##  $ BMI           : num [1:67] 22.6 22.6 47.5 21.5 21.7 ...
##  $ IsManualReport: logi [1:67] TRUE TRUE FALSE TRUE TRUE TRUE ...
##  $ LogId         : num [1:67] 1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   Date = col_character(),
##   ..   WeightKg = col_double(),
##   ..   WeightPounds = col_double(),
##   ..   Fat = col_double(),
##   ..   BMI = col_double(),
##   ..   IsManualReport = col_logical(),
##   ..   LogId = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

PROCESS - Documentation of cleaning and manipulation of data

The RStudio desktop tool is selected for all phases of the project, as it is well suited to data analysis and offers R Markdown. Daily activity, sleep day and weight log information were selected from among the csv files available. Calories would also be interesting; however, because age information is missing, they were not considered. The data from the csv files was imported into RStudio and checked for missing values and duplicated records. Column names were passed through clean_names() so that the resulting names are unique and consist only of the _ character, numbers, and letters.

Checking for missing values

sum(is.na(daily_activity))

## [1] 0

sum(is.na(sleep_day))

## [1] 0

sum(is.na(weight_log_info))

## [1] 65

Note: in weight_log_info the “Fat” attribute is mostly NA. We will keep the column, although “Fat” is therefore not suitable for use in the analysis.
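
sum(is.na()) gives a single total per data frame. To see which column holds the missing values, the count can be broken down per column. The sketch below uses a small toy data frame, weight_demo, built from the first six weight records shown earlier; it is an illustration and not the full dataset.

```r
# Toy data frame built from the first six weight records shown earlier
weight_demo <- data.frame(
  WeightKg = c(52.6, 52.6, 133.5, 56.7, 57.3, 72.4),
  Fat      = c(22, NA, NA, NA, NA, 25),
  BMI      = c(22.6, 22.6, 47.5, 21.5, 21.7, 27.5)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(weight_demo))
na_per_column
```

In this sample, all missing values fall in the Fat column.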

Checking for duplicates

sum(duplicated(daily_activity))

## [1] 0

sum(duplicated(sleep_day))

## [1] 3

sum(duplicated(weight_log_info))

## [1] 0

Removing duplicates and NA from applicable tables

sleep_day <- sleep_day %>% 
  distinct() %>% 
  drop_na()
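
As a minimal sketch of what this pipeline does, the toy tibble below (sleep_demo, a hypothetical three-row sample with one repeated record) is inspected for repeated rows and then cleaned in the same way. Note that drop_na() without arguments removes every row containing any NA, so applying the same pipeline to weight_log_info would discard every record with a missing Fat value.

```r
library(dplyr)
library(tidyr)

# Hypothetical sample: the third row repeats the second
sleep_demo <- tibble(
  Id = c(1503960366, 1503960366, 1503960366),
  SleepDay = c("4/12/2016 12:00:00 AM",
               "4/13/2016 12:00:00 AM",
               "4/13/2016 12:00:00 AM"),
  TotalMinutesAsleep = c(327, 384, 384)
)

# Look at the repeated rows before removing them
duplicate_rows <- sleep_demo[duplicated(sleep_demo), ]

# Keep one copy of each row, then drop rows containing any NA
sleep_clean <- sleep_demo %>%
  distinct() %>%
  drop_na()
```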

Checking if duplicates were removed from tables

sum(duplicated(daily_activity))

## [1] 0

sum(duplicated(sleep_day))

## [1] 0

sum(duplicated(weight_log_info))

## [1] 0

Cleaning the data

Cleaning datasets

Note that the calls below print a cleaned copy of each data frame without assigning the result, so the subsequent chunks continue to use the original column names (e.g. ActivityDate, TotalSteps).

clean_names(daily_activity)

## # A tibble: 940 × 15
##            id activity…¹ total…² total…³ track…⁴ logge…⁵ very_…⁶ moder…⁷ light…⁸
##         <dbl> <chr>        <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 1503960366 4/12/2016    13162    8.5     8.5        0    1.88   0.550    6.06
##  2 1503960366 4/13/2016    10735    6.97    6.97       0    1.57   0.690    4.71
##  3 1503960366 4/14/2016    10460    6.74    6.74       0    2.44   0.400    3.91
##  4 1503960366 4/15/2016     9762    6.28    6.28       0    2.14   1.26     2.83
##  5 1503960366 4/16/2016    12669    8.16    8.16       0    2.71   0.410    5.04
##  6 1503960366 4/17/2016     9705    6.48    6.48       0    3.19   0.780    2.51
##  7 1503960366 4/18/2016    13019    8.59    8.59       0    3.25   0.640    4.71
##  8 1503960366 4/19/2016    15506    9.88    9.88       0    3.53   1.32     5.03
##  9 1503960366 4/20/2016    10544    6.68    6.68       0    1.96   0.480    4.24
## 10 1503960366 4/21/2016     9819    6.34    6.34       0    1.34   0.350    4.65
## # … with 930 more rows, 6 more variables: sedentary_active_distance <dbl>,
## #   very_active_minutes <dbl>, fairly_active_minutes <dbl>,
## #   lightly_active_minutes <dbl>, sedentary_minutes <dbl>, calories <dbl>, and
## #   abbreviated variable names ¹activity_date, ²total_steps, ³total_distance,
## #   ⁴tracker_distance, ⁵logged_activities_distance, ⁶very_active_distance,
## #   ⁷moderately_active_distance, ⁸light_active_distance

clean_names(sleep_day)

## # A tibble: 410 × 5
##            id sleep_day             total_sleep_records total_minutes_…¹ total…²
##         <dbl> <chr>                               <dbl>            <dbl>   <dbl>
##  1 1503960366 4/12/2016 12:00:00 AM                   1              327     346
##  2 1503960366 4/13/2016 12:00:00 AM                   2              384     407
##  3 1503960366 4/15/2016 12:00:00 AM                   1              412     442
##  4 1503960366 4/16/2016 12:00:00 AM                   2              340     367
##  5 1503960366 4/17/2016 12:00:00 AM                   1              700     712
##  6 1503960366 4/19/2016 12:00:00 AM                   1              304     320
##  7 1503960366 4/20/2016 12:00:00 AM                   1              360     377
##  8 1503960366 4/21/2016 12:00:00 AM                   1              325     364
##  9 1503960366 4/23/2016 12:00:00 AM                   1              361     384
## 10 1503960366 4/24/2016 12:00:00 AM                   1              430     449
## # … with 400 more rows, and abbreviated variable names ¹total_minutes_asleep,
## #   ²total_time_in_bed

clean_names(weight_log_info)

## # A tibble: 67 × 8
##            id date                  weight…¹ weigh…²   fat   bmi is_ma…³  log_id
##         <dbl> <chr>                    <dbl>   <dbl> <dbl> <dbl> <lgl>     <dbl>
##  1 1503960366 5/2/2016 11:59:59 PM      52.6    116.    22  22.6 TRUE    1.46e12
##  2 1503960366 5/3/2016 11:59:59 PM      52.6    116.    NA  22.6 TRUE    1.46e12
##  3 1927972279 4/13/2016 1:08:52 AM     134.     294.    NA  47.5 FALSE   1.46e12
##  4 2873212765 4/21/2016 11:59:59 PM     56.7    125.    NA  21.5 TRUE    1.46e12
##  5 2873212765 5/12/2016 11:59:59 PM     57.3    126.    NA  21.7 TRUE    1.46e12
##  6 4319703577 4/17/2016 11:59:59 PM     72.4    160.    25  27.5 TRUE    1.46e12
##  7 4319703577 5/4/2016 11:59:59 PM      72.3    159.    NA  27.4 TRUE    1.46e12
##  8 4558609924 4/18/2016 11:59:59 PM     69.7    154.    NA  27.2 TRUE    1.46e12
##  9 4558609924 4/25/2016 11:59:59 PM     70.3    155.    NA  27.5 TRUE    1.46e12
## 10 4558609924 5/1/2016 11:59:59 PM      69.9    154.    NA  27.3 TRUE    1.46e12
## # … with 57 more rows, and abbreviated variable names ¹weight_kg,
## #   ²weight_pounds, ³is_manual_report
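
The calls above print the cleaned names but do not store them. If the cleaned names were to be kept, the result would need to be assigned; a minimal sketch on a one-row toy data frame (activity_demo and activity_clean are hypothetical names). Later code would then have to refer to the snake_case names instead of the original ones.

```r
library(janitor)

# One-row toy data frame with the original column naming style
activity_demo <- data.frame(
  Id = 1503960366,
  ActivityDate = "4/12/2016",
  TotalSteps = 13162
)

# Assign the result so that the cleaned names are actually kept
activity_clean <- clean_names(activity_demo)
names(activity_clean)
```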

ANALYSE - A summary of Bellabeat Case Study

Dates were converted to ensure consistency and simplicity during the subsequent analysis.

Changing the datatype of the date column

Converting the format to yyyy-mm-dd and storing the result in new columns named “Date_Activity”, “Date_Sleep” and “Date_weight”. The format strings follow the strftime convention (https://strftime.org/).

daily_activity <- daily_activity %>%
  mutate(Date_Activity = as_date(ActivityDate, format = "%m/%d/%Y"))
head(daily_activity)

## # A tibble: 6 × 16
##       Id Activ…¹ Total…² Total…³ Track…⁴ Logge…⁵ VeryA…⁶ Moder…⁷ Light…⁸ Seden…⁹
##    <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 1.50e9 4/12/2…   13162    8.5     8.5        0    1.88   0.550    6.06       0
## 2 1.50e9 4/13/2…   10735    6.97    6.97       0    1.57   0.690    4.71       0
## 3 1.50e9 4/14/2…   10460    6.74    6.74       0    2.44   0.400    3.91       0
## 4 1.50e9 4/15/2…    9762    6.28    6.28       0    2.14   1.26     2.83       0
## 5 1.50e9 4/16/2…   12669    8.16    8.16       0    2.71   0.410    5.04       0
## 6 1.50e9 4/17/2…    9705    6.48    6.48       0    3.19   0.780    2.51       0
## # … with 6 more variables: VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## #   LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>,
## #   Date_Activity <date>, and
## #   abbreviated variable names ¹ActivityDate, ²TotalSteps, ³TotalDistance,
## #   ⁴TrackerDistance, ⁵LoggedActivitiesDistance, ⁶VeryActiveDistance,
## #   ⁷ModeratelyActiveDistance, ⁸LightActiveDistance, ⁹SedentaryActiveDistance

sleep_day <- sleep_day %>%
  mutate(Date_Sleep = as_date(SleepDay,format ="%m/%d/%Y %I:%M:%S %p"))
head(sleep_day)

## # A tibble: 6 × 6
##           Id SleepDay              TotalSleepRecords TotalM…¹ Total…² Date_Sleep
##        <dbl> <chr>                             <dbl>    <dbl>   <dbl> <date>    
## 1 1503960366 4/12/2016 12:00:00 AM                 1      327     346 2016-04-12
## 2 1503960366 4/13/2016 12:00:00 AM                 2      384     407 2016-04-13
## 3 1503960366 4/15/2016 12:00:00 AM                 1      412     442 2016-04-15
## 4 1503960366 4/16/2016 12:00:00 AM                 2      340     367 2016-04-16
## 5 1503960366 4/17/2016 12:00:00 AM                 1      700     712 2016-04-17
## 6 1503960366 4/19/2016 12:00:00 AM                 1      304     320 2016-04-19
## # … with abbreviated variable names ¹TotalMinutesAsleep, ²TotalTimeInBed

weight_log_info <- weight_log_info %>%
  mutate(Date_weight = as_date(Date,format ="%m/%d/%Y %I:%M:%S %p"))
head(weight_log_info)

## # A tibble: 6 × 9
##           Id Date         Weigh…¹ Weigh…²   Fat   BMI IsMan…³   LogId Date_wei…⁴
##        <dbl> <chr>          <dbl>   <dbl> <dbl> <dbl> <lgl>     <dbl> <date>    
## 1 1503960366 5/2/2016 11…    52.6    116.    22  22.6 TRUE    1.46e12 2016-05-02
## 2 1503960366 5/3/2016 11…    52.6    116.    NA  22.6 TRUE    1.46e12 2016-05-03
## 3 1927972279 4/13/2016 1…   134.     294.    NA  47.5 FALSE   1.46e12 2016-04-13
## 4 2873212765 4/21/2016 1…    56.7    125.    NA  21.5 TRUE    1.46e12 2016-04-21
## 5 2873212765 5/12/2016 1…    57.3    126.    NA  21.7 TRUE    1.46e12 2016-05-12
## 6 4319703577 4/17/2016 1…    72.4    160.    25  27.5 TRUE    1.46e12 2016-04-17
## # … with abbreviated variable names ¹WeightKg, ²WeightPounds, ³IsManualReport,
## #   ⁴Date_weight
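
As an alternative to spelling out the format strings, lubridate also offers order-based parsers. The sketch below is a minimal illustration on a few sample strings in the same layout as the data; it is an alternative technique and not the method used in the chunks above.

```r
library(lubridate)

# Sample strings in the two layouts found in the data
activity_dates <- c("4/12/2016", "4/13/2016")
sleep_stamps   <- c("4/12/2016 12:00:00 AM", "5/2/2016 11:59:59 PM")

# mdy() parses month/day/year text into Date values
date_activity <- mdy(activity_dates)

# mdy_hms() parses the date-time text; as_date() then drops the time part
date_sleep <- as_date(mdy_hms(sleep_stamps))

date_activity
date_sleep
```

Either way, the new columns hold proper Date values that sort and plot chronologically, which character strings such as "4/12/2016" do not.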

Understanding some summary statistics

How many unique participants are there in each dataframe? It looks like there may be more participants in the daily activity dataset than the sleep dataset.

n_distinct(daily_activity$Id)

## [1] 33

n_distinct(sleep_day$Id)

## [1] 24

n_distinct(weight_log_info$Id)

## [1] 8

How many observations are there in each dataframe?

nrow(daily_activity)

## [1] 940

nrow(sleep_day)

## [1] 410

nrow(weight_log_info)

## [1] 67
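
The six separate calls above can also be collected into one overview table. A minimal sketch in base R, using tiny toy data frames (activity_demo, sleep_demo and weight_demo are hypothetical stand-ins for the three real data frames):

```r
# Toy stand-ins for the three data frames
activity_demo <- data.frame(Id = c(1503960366, 1503960366, 1927972279))
sleep_demo    <- data.frame(Id = c(1503960366, 1503960366))
weight_demo   <- data.frame(Id = 1927972279)

frames <- list(
  daily_activity  = activity_demo,
  sleep_day       = sleep_demo,
  weight_log_info = weight_demo
)

# One row per data frame: number of rows and number of distinct Ids
overview <- data.frame(
  dataset      = names(frames),
  observations = vapply(frames, nrow, integer(1)),
  participants = vapply(frames, function(df) length(unique(df$Id)), integer(1)),
  row.names    = NULL
)
overview
```

Having both counts side by side makes it easier to see how many records each participant contributes on average.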

What are some quick summary statistics we’d want to know about each data frame?

For the daily activity dataframe:

daily_activity %>%  
  select(TotalSteps,
         TotalDistance,
         SedentaryMinutes) %>%
  summary()

##    TotalSteps    TotalDistance    SedentaryMinutes
##  Min.   :    0   Min.   : 0.000   Min.   :   0.0  
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 729.8  
##  Median : 7406   Median : 5.245   Median :1057.5  
##  Mean   : 7638   Mean   : 5.490   Mean   : 991.2  
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:1229.5  
##  Max.   :36019   Max.   :28.030   Max.   :1440.0

For the sleep dataframe:

sleep_day %>%  
  select(TotalSleepRecords,
  TotalMinutesAsleep,
  TotalTimeInBed) %>%
  summary()

##  TotalSleepRecords TotalMinutesAsleep TotalTimeInBed 
##  Min.   :1.00      Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:1.00      1st Qu.:361.0      1st Qu.:403.8  
##  Median :1.00      Median :432.5      Median :463.0  
##  Mean   :1.12      Mean   :419.2      Mean   :458.5  
##  3rd Qu.:1.00      3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :3.00      Max.   :796.0      Max.   :961.0

For the weight dataframe:

weight_log_info %>%  
  select(WeightKg,
         #  Fat, 
        BMI) %>%
  summary()

##     WeightKg           BMI       
##  Min.   : 52.60   Min.   :21.45  
##  1st Qu.: 61.40   1st Qu.:23.96  
##  Median : 62.50   Median :24.39  
##  Mean   : 72.04   Mean   :25.19  
##  3rd Qu.: 85.05   3rd Qu.:25.56  
##  Max.   :133.50   Max.   :47.54

Distance average by customer:

daily_activity_mean <- daily_activity %>% 
  group_by(Id) %>% 
  summarise(daily_activity_average = mean(TrackerDistance, na.rm = TRUE))
daily_activity_mean

## # A tibble: 33 × 2
##            Id daily_activity_average
##         <dbl>                  <dbl>
##  1 1503960366                  7.81 
##  2 1624580081                  3.91 
##  3 1644430081                  5.30 
##  4 1844505072                  1.71 
##  5 1927972279                  0.635
##  6 2022484408                  8.08 
##  7 2026352035                  3.45 
##  8 2320127002                  3.19 
##  9 2347167796                  6.36 
## 10 2873212765                  5.10 
## # … with 23 more rows

Sleep average by customer:

sleep_mean <- sleep_day %>% 
  group_by(Id) %>% 
  summarise(sleep_average = mean(TotalMinutesAsleep, na.rm = TRUE))
sleep_mean

## # A tibble: 24 × 2
##            Id sleep_average
##         <dbl>         <dbl>
##  1 1503960366          360.
##  2 1644430081          294 
##  3 1844505072          652 
##  4 1927972279          417 
##  5 2026352035          506.
##  6 2320127002           61 
##  7 2347167796          447.
##  8 3977333714          294.
##  9 4020332650          349.
## 10 4319703577          477.
## # … with 14 more rows

Weight average by customer:

weight_mean <- weight_log_info %>% 
  group_by(Id) %>% 
  summarise(weight_average = mean(WeightKg, na.rm = TRUE))
weight_mean

## # A tibble: 8 × 2
##           Id weight_average
##        <dbl>          <dbl>
## 1 1503960366           52.6
## 2 1927972279          134. 
## 3 2873212765           57  
## 4 4319703577           72.4
## 5 4558609924           69.6
## 6 5577150313           90.7
## 7 6962181067           61.6
## 8 8877689391           85.1
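
The per-user averages above are computed one variable at a time. When several averages per user are needed, dplyr's across() can compute them in a single summarise() call. A minimal sketch on a toy tibble: the first user's rows are taken from the data shown earlier, while the second user's values are hypothetical.

```r
library(dplyr)

# First user: rows shown earlier; second user: hypothetical values
activity_demo <- tibble(
  Id            = c(1503960366, 1503960366, 1624580081, 1624580081),
  TotalSteps    = c(13162, 10735, 8000, 6000),
  TotalDistance = c(8.5, 6.97, 5.2, 3.9)
)

# One row per user, with the mean of each selected column
activity_means <- activity_demo %>%
  group_by(Id) %>%
  summarise(across(c(TotalSteps, TotalDistance), ~ mean(.x, na.rm = TRUE)))

activity_means
```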

What does this tell us about this sample of people’s activities?

At first glance, the available data is limited in both quantity and completeness. In the near future, more data should be logged so that it can be used to better serve the customer experience.

Plotting a few explorations

What’s the relationship between steps taken in a day and sedentary minutes? How could this help inform the customer segments that we can market to? E.g. position this more as a way to get started in walking more? Or to measure steps that you’re already taking?

ggplot(data=daily_activity, aes(x=TotalSteps, y=SedentaryMinutes)) + geom_point()

ggplot(data = daily_activity) +
  geom_point(mapping = aes(x = TotalSteps, y = Calories, color = TotalDistance))

What’s the relationship between minutes asleep and time in bed? You might expect it to be almost completely linear - are there any unexpected trends?

ggplot(data=sleep_day, aes(x=TotalMinutesAsleep, y=TotalTimeInBed)) + geom_point()
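
To put a number on "almost completely linear", the correlation between the two variables can be computed, together with the minutes spent awake in bed. The sketch below is a minimal illustration on the six sleep records shown earlier; the full data may behave differently, and MinutesAwakeInBed is a hypothetical column name.

```r
# The six sleep records shown earlier
sleep_demo <- data.frame(
  TotalMinutesAsleep = c(327, 384, 412, 340, 700, 304),
  TotalTimeInBed     = c(346, 407, 442, 367, 712, 320)
)

# Pearson correlation: values close to 1 indicate a near-linear relationship
r <- cor(sleep_demo$TotalMinutesAsleep, sleep_demo$TotalTimeInBed)

# Time in bed that was not spent asleep
sleep_demo$MinutesAwakeInBed <-
  sleep_demo$TotalTimeInBed - sleep_demo$TotalMinutesAsleep

r
sleep_demo
```

Records with an unusually large MinutesAwakeInBed are the ones that would sit furthest from the line in the scatter plot.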

What’s the relationship between weight and BMI? You might expect it to be almost completely linear - are there any unexpected trends? Notes: there is almost no “Fat” data, and only 8 distinct persons made weight logs. This is an opportunity for improvement!

ggplot(data=weight_log_info, aes(x=WeightKg, y=BMI)) + geom_point()
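
One reason the points need not fall on a single line is that BMI is defined as weight in kilograms divided by the square of height in metres, so people of different heights lie on different lines. The sketch below is a minimal illustration, assuming the BMI column follows this standard definition; it recovers the implied height from three of the records shown earlier, and ImpliedHeightM is a hypothetical column name.

```r
# Three weight records shown earlier (values rounded as printed)
weight_demo <- data.frame(
  Id       = c(1503960366, 1927972279, 2873212765),
  WeightKg = c(52.6, 133.5, 56.7),
  BMI      = c(22.6, 47.5, 21.5)
)

# BMI = weight / height^2, so height = sqrt(weight / BMI)
weight_demo$ImpliedHeightM <- sqrt(weight_demo$WeightKg / weight_demo$BMI)

weight_demo
```

Within one person the implied height stays constant, so that person's weight and BMI values move along a straight line through the origin.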

What could these trends tell you about how to help market this product? Or areas where you might want to explore further?

Merging datasets together

We merge daily_activity_mean with weight_mean, and weight_mean with sleep_mean.

combined_data_activity_weight <- merge(daily_activity_mean, weight_mean, by="Id")
combined_data_weight_sleep <- merge(weight_mean, sleep_mean, by="Id")

Take a look at how many participants are in this data set.

n_distinct(combined_data_activity_weight$Id)

## [1] 8

colnames(combined_data_activity_weight)

## [1] "Id"                     "daily_activity_average" "weight_average"

n_distinct(combined_data_weight_sleep$Id)

## [1] 6

colnames(combined_data_weight_sleep)

## [1] "Id"             "weight_average" "sleep_average"

Note that merge() performs an inner join by default, so the participant Ids that appear in the daily activity data but have no weight (or sleep) records have been filtered out. Alternatively, we could keep them by passing all = TRUE to merge() or by using dplyr’s full_join().
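
The difference between the default merge and one that keeps unmatched participants can be sketched on a toy subset. The values are rounded figures from the per-user averages printed earlier, and activity_demo and weight_demo are hypothetical names.

```r
# Toy subset: three users with activity data, two of them with weight data
activity_demo <- data.frame(
  Id = c(1503960366, 1624580081, 1927972279),
  daily_activity_average = c(7.81, 3.91, 0.635)
)
weight_demo <- data.frame(
  Id = c(1503960366, 1927972279),
  weight_average = c(52.6, 134)
)

# Default: inner join, only Ids present in both data frames
inner <- merge(activity_demo, weight_demo, by = "Id")

# all.x = TRUE keeps every activity user; missing weights become NA
kept <- merge(activity_demo, weight_demo, by = "Id", all.x = TRUE)

nrow(inner)
kept
```

Keeping the unmatched rows makes the missing weight logs visible instead of silently shrinking the sample.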

Now we can explore some different relationships between activity and sleep as well. For example, do participants who sleep more also take more steps or fewer steps per day? Is there a relationship at all? How could these answers help inform the marketing strategy of how you position this new product?

ggplot(data=combined_data_activity_weight, aes(x=daily_activity_average, y=weight_average)) + geom_point()

ggplot(data=combined_data_weight_sleep, aes(x=sleep_average, y=weight_average)) + geom_point()

ACT - Top high-level content recommendations based on the analysis done

The data available is scarce and incomplete in some respects (e.g. age). It is highly recommended that the app logs are thoroughly revised to include more useful data and that customers are given incentives to enable data collection. Nevertheless, preliminary recommendations for Bellabeat are presented below.

Naturally, the most prominent step for Bellabeat’s business expansion would be to create a second line of products for men, adapting the name (e.g. Letbeat, Letusbeat, Letitbeat).

Sleeping well seems to be of outstanding relevance, so both the app and the products offered by Bellabeat would become considerably more useful by providing guidance, motivation and support on sleep. Sleep apps, most of them based on white noise, may serve as examples for adding similar capabilities to Bellabeat’s initiatives in the sleep domain. As users like to listen to music while they perform activities (walking, etc.), this would also be a good add-on to monetize.

Advertising recommendations for health and meals based on usage will also help to monetize further, in a way that the user can enable, disable or configure.

Missing age information should be corrected as a matter of urgency. It is relevant for analysing the different age groups, and from that complementary information we would derive suitable recommendations for business enhancement. Fat information is also missing and, given the nature of the activity, it would be good to have it available as well.

Bellabeat Case Study 2022

José Neto - Google Data Analytics - Capstone - Portfolio

28/11/2022
