R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

Using RStudio, when clicking the Knit button, a document will be generated that includes both content as well as the output of any embedded R code chunks within the document.

Loading packages

Before loading packages, we must select the CRAN mirror to be used.

chooseCRANmirror(graphics = getOption("menu.graphics"), ind = 1, local.only = FALSE)

To start cleaning, processing and analysing the project data, we first need to install the required packages by running install.packages().

Once a package is installed, we can load it by running library(), once for each package.
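The install-then-load step described above is not shown in the rendered output, so the following is only a minimal sketch of how it might look. The package list is an assumption inferred from the functions called later (read_csv, %>%, ggplot, clean_names, as_date), and missing_packages() is a hypothetical helper, not part of the original analysis.

```r
# Packages this analysis appears to rely on (an assumption based on the
# functions used later in the document).
required_packages <- c("tidyverse", "janitor", "lubridate")

# Hypothetical helper: return the subset of `pkgs` that is not installed.
missing_packages <- function(pkgs) {
  pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
}

to_install <- missing_packages(required_packages)
if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from the chosen mirror
}

# Attach only the packages that are actually available.
for (pkg in setdiff(required_packages, to_install)) {
  library(pkg, character.only = TRUE)
}
```

Checking before installing avoids re-downloading packages every time the document is knitted.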

Introduction

Bellabeat is a successful small company with the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. This case study focuses on one of Bellabeat’s products and analyzes smart device data to gain insight into how consumers are using their smart devices. The insights discovered will then help guide the company's marketing strategy. The results and conclusions of the analysis will be presented to the Bellabeat executive team along with high-level recommendations for Bellabeat’s marketing strategy.

Bellabeat website: https://bellabeat.com

Bellabeat products and membership (in-app subscription):

App: https://play.google.com/store/apps/details?id=com.bellabeat.cacao
Leaf: https://bellabeat.com/ivy/
Time: https://bellabeat.com/time/
Spring: https://bellabeat.com/spring/

Sršen knows that an analysis of Bellabeat’s available consumer data would reveal more opportunities for growth. She has asked the marketing analytics team to focus on a Bellabeat product and analyze smart device usage data in order to gain insight into how people are already using their smart devices. Then, using this information, she would like high-level recommendations for how these trends can inform Bellabeat marketing strategy.

ASK - Summary of the business task

The scope of work is to obtain data on consumer usage of the Bellabeat app on smartphones, considering one of the products, and to analyse growth opportunities on that basis. The insights and recommendations will be presented to the Bellabeat CCO (Chief Creative Officer) and the marketing team. The presentation will later also support the CCO in sharing the results and in proposing the resulting marketing growth opportunity to the Bellabeat executive team for approval.

PREPARE - A description of all data sources used

The public dataset suggested as the starting point is the following (CC0: Public Domain, made available through Mobius, a data scientist): https://www.kaggle.com/arashnic/fitbit. Other public datasets can be used to complement it wherever needed and justified.

The data is assumed to be close to ROCCC (reliable, original, comprehensive, current and cited), with limitations in the quantity of data (distinct user records) and in aspects not covered (e.g. age). The sample is small: the dataset is described as covering 30 users, while the files used here contain 33 distinct Ids for daily activity, 24 for sleep and only 8 for weight. “Fat” information is relevant but mostly not available. User age would also be relevant, as it naturally affects physical activity, calories, etc., but it is not available. User gender should be considered as well, in view of a possible future commercial expansion to include men and not only women.

The data is stored on Kaggle as a set of 18 csv files. The Id field identifies the user. The data is organized in wide format (a column for each variable).

Regarding the right to use, CC0 Public Domain is the “no copyright reserved” option in the Creative Commons toolkit: it effectively means relinquishing all copyright and similar rights held in a work and dedicating those rights to the public domain.

Importing data

The data to consider is currently stored in external .csv files. In order to view and clean it in R, we need to import it. The readr package, part of the tidyverse, has a number of functions for “reading in” or importing data, including .csv files.

In the chunk below, we use the read_csv() function to import data from .csv files in the project folder called “Fitabase_Data” (https://www.kaggle.com/arashnic/fitbit) and save it as data frames. The data describing daily activity, sleep and weight is selected to start the analysis.

daily_activity <- read_csv("C:/Users/josef/My_Documents/_Strategy/Development_AI-ML-DS/R/R-Projects/Bellabeat-Case-Study/Fitabase_Data/dailyActivity_merged.csv")

## Rows: 940 Columns: 15

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

sleep_day <- read_csv("C:/Users/josef/My_Documents/_Strategy/Development_AI-ML-DS/R/R-Projects/Bellabeat-Case-Study/Fitabase_Data/sleepDay_merged.csv")

## Rows: 413 Columns: 5

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

weight_log_info <- read_csv("C:/Users/josef/My_Documents/_Strategy/Development_AI-ML-DS/R/R-Projects/Bellabeat-Case-Study/Fitabase_Data/weightLogInfo_merged.csv")

## Rows: 67 Columns: 8

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): Date
## dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
## lgl (1): IsManualReport

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
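The messages above suggest specifying the column types. As a hedged sketch of what that could look like, the example below writes a tiny stand-in file (two rows copied from the head() output shown later, not the full dataset) and reads it with an explicit col_types specification. Reading Id as character is an assumption made here because it is an identifier rather than a quantity; the original analysis keeps it as a double.

```r
library(readr)

# Tiny stand-in file so the sketch runs without the Kaggle download.
csv_path <- file.path(tempdir(), "dailyActivity_example.csv")
writeLines(c(
  "Id,ActivityDate,TotalSteps",
  "1503960366,4/12/2016,13162",
  "1503960366,4/13/2016,10735"
), csv_path)

# Declaring the column types makes the import explicit and quiets the
# column-specification message.
activity_example <- read_csv(
  csv_path,
  col_types = cols(
    Id = col_character(),           # identifier, not a number to compute on
    ActivityDate = col_character(), # converted to a date in a later step
    TotalSteps = col_double()
  )
)

activity_example
```

With the real files, a path relative to the project folder, such as file.path("Fitabase_Data", "dailyActivity_merged.csv"), would keep the document portable across machines.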

Getting to know the project data

Before starting to clean up the data, we will take some time to explore it, using the head() function in the code chunk below:

head(daily_activity)

## # A tibble: 6 x 15
##        Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie~
##     <dbl> <chr>             <dbl>         <dbl>           <dbl>            <dbl>
## 1  1.50e9 4/12/2016         13162          8.5             8.5                 0
## 2  1.50e9 4/13/2016         10735          6.97            6.97                0
## 3  1.50e9 4/14/2016         10460          6.74            6.74                0
## 4  1.50e9 4/15/2016          9762          6.28            6.28                0
## 5  1.50e9 4/16/2016         12669          8.16            8.16                0
## 6  1.50e9 4/17/2016          9705          6.48            6.48                0
## # ... with 9 more variables: VeryActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>

head(sleep_day)

## # A tibble: 6 x 5
##           Id SleepDay          TotalSleepRecor~ TotalMinutesAsle~ TotalTimeInBed
##        <dbl> <chr>                        <dbl>             <dbl>          <dbl>
## 1 1503960366 4/12/2016 12:00:~                1               327            346
## 2 1503960366 4/13/2016 12:00:~                2               384            407
## 3 1503960366 4/15/2016 12:00:~                1               412            442
## 4 1503960366 4/16/2016 12:00:~                2               340            367
## 5 1503960366 4/17/2016 12:00:~                1               700            712
## 6 1503960366 4/19/2016 12:00:~                1               304            320

head(weight_log_info)

## # A tibble: 6 x 8
##           Id Date      WeightKg WeightPounds   Fat   BMI IsManualReport    LogId
##        <dbl> <chr>        <dbl>        <dbl> <dbl> <dbl> <lgl>             <dbl>
## 1 1503960366 5/2/2016~     52.6         116.    22  22.6 TRUE            1.46e12
## 2 1503960366 5/3/2016~     52.6         116.    NA  22.6 TRUE            1.46e12
## 3 1927972279 4/13/201~    134.          294.    NA  47.5 FALSE           1.46e12
## 4 2873212765 4/21/201~     56.7         125.    NA  21.5 TRUE            1.46e12
## 5 2873212765 5/12/201~     57.3         126.    NA  21.7 TRUE            1.46e12
## 6 4319703577 4/17/201~     72.4         160.    25  27.5 TRUE            1.46e12

We also use colnames() to check the names of the columns in the data frames.

colnames(daily_activity)

##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"

colnames(sleep_day)

## [1] "Id"                 "SleepDay"           "TotalSleepRecords" 
## [4] "TotalMinutesAsleep" "TotalTimeInBed"

colnames(weight_log_info)

## [1] "Id"             "Date"           "WeightKg"       "WeightPounds"  
## [5] "Fat"            "BMI"            "IsManualReport" "LogId"

Main data observed

n_distinct(daily_activity$Id)

## [1] 33

n_distinct(sleep_day$Id)

## [1] 24

n_distinct(weight_log_info$Id)

## [1] 8
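Since the three tables cover different numbers of users, it may help to see which users overlap. The sketch below uses base R set operations on hypothetical Id vectors (illustrative values standing in for daily_activity$Id, sleep_day$Id and weight_log_info$Id).

```r
# Hypothetical Id vectors; the values are illustrative only.
activity_ids <- c(101, 102, 103, 104)
sleep_ids    <- c(101, 103, 104)
weight_ids   <- c(103)

# Users present in every table
in_all_tables <- Reduce(intersect, list(activity_ids, sleep_ids, weight_ids))

# Users tracked for activity but with no sleep records
activity_without_sleep <- setdiff(activity_ids, sleep_ids)

in_all_tables           # 103
activity_without_sleep  # 102
```

With the real data, unique(daily_activity$Id) and its counterparts would take the place of the hypothetical vectors.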

Familiarizing with data and column datatypes

str(daily_activity)

## spec_tbl_df [940 x 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id                      : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ TotalSteps              : num [1:940] 13162 10735 10460 9762 12669 ...
##  $ TotalDistance           : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : num [1:940] 728 776 1218 726 773 ...
##  $ Calories                : num [1:940] 1985 1797 1776 1745 1863 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   ActivityDate = col_character(),
##   ..   TotalSteps = col_double(),
##   ..   TotalDistance = col_double(),
##   ..   TrackerDistance = col_double(),
##   ..   LoggedActivitiesDistance = col_double(),
##   ..   VeryActiveDistance = col_double(),
##   ..   ModeratelyActiveDistance = col_double(),
##   ..   LightActiveDistance = col_double(),
##   ..   SedentaryActiveDistance = col_double(),
##   ..   VeryActiveMinutes = col_double(),
##   ..   FairlyActiveMinutes = col_double(),
##   ..   LightlyActiveMinutes = col_double(),
##   ..   SedentaryMinutes = col_double(),
##   ..   Calories = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

str(sleep_day)

## spec_tbl_df [413 x 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id                : num [1:413] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ SleepDay          : chr [1:413] "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
##  $ TotalSleepRecords : num [1:413] 1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: num [1:413] 327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : num [1:413] 346 407 442 367 712 320 377 364 384 449 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   SleepDay = col_character(),
##   ..   TotalSleepRecords = col_double(),
##   ..   TotalMinutesAsleep = col_double(),
##   ..   TotalTimeInBed = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

str(weight_log_info)

## spec_tbl_df [67 x 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id            : num [1:67] 1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
##  $ Date          : chr [1:67] "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
##  $ WeightKg      : num [1:67] 52.6 52.6 133.5 56.7 57.3 ...
##  $ WeightPounds  : num [1:67] 116 116 294 125 126 ...
##  $ Fat           : num [1:67] 22 NA NA NA NA 25 NA NA NA NA ...
##  $ BMI           : num [1:67] 22.6 22.6 47.5 21.5 21.7 ...
##  $ IsManualReport: logi [1:67] TRUE TRUE FALSE TRUE TRUE TRUE ...
##  $ LogId         : num [1:67] 1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   Date = col_character(),
##   ..   WeightKg = col_double(),
##   ..   WeightPounds = col_double(),
##   ..   Fat = col_double(),
##   ..   BMI = col_double(),
##   ..   IsManualReport = col_logical(),
##   ..   LogId = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

PROCESS - Documentation of cleaning and manipulation of data

RStudio Desktop is used for all phases of the project, as it is well suited to data analysis and provides R Markdown. Daily activity, sleep day and weight log information were selected from the csv files available. Calories would also be interesting, but they were not considered because age information is missing. Data from the csv files was imported into RStudio and checked for missing values and duplicated rows. Column names were also cleaned so that the resulting names are unique and consist only of the _ character, numbers, and letters.

Checking for missing values

sum(is.na(daily_activity))

## [1] 0

sum(is.na(sleep_day))

## [1] 0

sum(is.na(weight_log_info))

## [1] 65

Note: In weight_log_info the “Fat” attribute is mostly NA. We keep the column, but “Fat” is therefore not suitable for use in the analysis.
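sum(is.na()) gives one total per table. To see which column carries the missing values, a per-column count can be used. This is a minimal sketch on a hypothetical three-row data frame shaped like weight_log_info, not the real data.

```r
# Hypothetical stand-in for weight_log_info (illustrative values).
weight_example <- data.frame(
  Id       = c(1, 1, 2),
  WeightKg = c(52.6, 52.6, 133.5),
  Fat      = c(22, NA, NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(weight_example))
na_per_column

# Share of missing values per column, in percent
na_share <- round(100 * colMeans(is.na(weight_example)), 1)
na_share
```

A per-column view also shows why drop_na() should be applied with care: on a table where one column is mostly NA, it would discard most rows.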

Checking for duplicates

sum(duplicated(daily_activity))

## [1] 0

sum(duplicated(sleep_day))

## [1] 3

sum(duplicated(weight_log_info))

## [1] 0

Removing duplicates and NA from applicable tables

sleep_day <- sleep_day %>% 
  distinct() %>% 
  drop_na()

Checking if duplicates were removed from tables

sum(duplicated(daily_activity))

## [1] 0

sum(duplicated(sleep_day))

## [1] 0

sum(duplicated(weight_log_info))

## [1] 0
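Before dropping duplicates it can be useful to look at the rows that will be removed. The following base R sketch does this on a hypothetical data frame shaped like sleep_day (illustrative values); it shows one possible approach rather than the method used above, which relies on distinct().

```r
# Hypothetical stand-in for sleep_day with one fully duplicated row.
sleep_example <- data.frame(
  Id                 = c(1, 1, 1, 2),
  SleepDay           = c("4/12/2016", "4/13/2016", "4/13/2016", "4/12/2016"),
  TotalMinutesAsleep = c(327, 384, 384, 412)
)

# TRUE for every row that repeats an earlier row in all columns
is_dup <- duplicated(sleep_example)

# Inspect the rows that would be dropped
sleep_example[is_dup, ]

# Keep only the first occurrence of each row
sleep_deduplicated <- sleep_example[!is_dup, ]
nrow(sleep_deduplicated)
```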

Cleaning the data

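Cleaning datasets

Note: the clean_names() calls below only print the cleaned result; they do not overwrite the data frames, so the original column names (e.g. TotalSteps) remain in use in the following sections.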

clean_names(daily_activity)

## # A tibble: 940 x 15
##            id activity_date total_steps total_distance tracker_distance
##         <dbl> <chr>               <dbl>          <dbl>            <dbl>
##  1 1503960366 4/12/2016           13162           8.5              8.5 
##  2 1503960366 4/13/2016           10735           6.97             6.97
##  3 1503960366 4/14/2016           10460           6.74             6.74
##  4 1503960366 4/15/2016            9762           6.28             6.28
##  5 1503960366 4/16/2016           12669           8.16             8.16
##  6 1503960366 4/17/2016            9705           6.48             6.48
##  7 1503960366 4/18/2016           13019           8.59             8.59
##  8 1503960366 4/19/2016           15506           9.88             9.88
##  9 1503960366 4/20/2016           10544           6.68             6.68
## 10 1503960366 4/21/2016            9819           6.34             6.34
## # ... with 930 more rows, and 10 more variables:
## #   logged_activities_distance <dbl>, very_active_distance <dbl>,
## #   moderately_active_distance <dbl>, light_active_distance <dbl>,
## #   sedentary_active_distance <dbl>, very_active_minutes <dbl>,
## #   fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
## #   sedentary_minutes <dbl>, calories <dbl>

clean_names(sleep_day)

## # A tibble: 410 x 5
##            id sleep_day     total_sleep_reco~ total_minutes_as~ total_time_in_b~
##         <dbl> <chr>                     <dbl>             <dbl>            <dbl>
##  1 1503960366 4/12/2016 12~                 1               327              346
##  2 1503960366 4/13/2016 12~                 2               384              407
##  3 1503960366 4/15/2016 12~                 1               412              442
##  4 1503960366 4/16/2016 12~                 2               340              367
##  5 1503960366 4/17/2016 12~                 1               700              712
##  6 1503960366 4/19/2016 12~                 1               304              320
##  7 1503960366 4/20/2016 12~                 1               360              377
##  8 1503960366 4/21/2016 12~                 1               325              364
##  9 1503960366 4/23/2016 12~                 1               361              384
## 10 1503960366 4/24/2016 12~                 1               430              449
## # ... with 400 more rows

clean_names(weight_log_info)

## # A tibble: 67 x 8
##            id date  weight_kg weight_pounds   fat   bmi is_manual_report  log_id
##         <dbl> <chr>     <dbl>         <dbl> <dbl> <dbl> <lgl>              <dbl>
##  1 1503960366 5/2/~      52.6          116.    22  22.6 TRUE             1.46e12
##  2 1503960366 5/3/~      52.6          116.    NA  22.6 TRUE             1.46e12
##  3 1927972279 4/13~     134.           294.    NA  47.5 FALSE            1.46e12
##  4 2873212765 4/21~      56.7          125.    NA  21.5 TRUE             1.46e12
##  5 2873212765 5/12~      57.3          126.    NA  21.7 TRUE             1.46e12
##  6 4319703577 4/17~      72.4          160.    25  27.5 TRUE             1.46e12
##  7 4319703577 5/4/~      72.3          159.    NA  27.4 TRUE             1.46e12
##  8 4558609924 4/18~      69.7          154.    NA  27.2 TRUE             1.46e12
##  9 4558609924 4/25~      70.3          155.    NA  27.5 TRUE             1.46e12
## 10 4558609924 5/1/~      69.9          154.    NA  27.3 TRUE             1.46e12
## # ... with 57 more rows
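clean_names() returns a new data frame rather than changing its input. The sketch below illustrates the difference between printing the result and saving it, using a one-row hypothetical data frame; activity_clean is a name introduced here for illustration only.

```r
library(janitor)

# Hypothetical one-row stand-in for daily_activity.
activity_example <- data.frame(
  Id           = 1503960366,
  ActivityDate = "4/12/2016",
  TotalSteps   = 13162
)

# Printing only: the original object keeps its names
clean_names(activity_example)
names(activity_example)

# Assigning the result keeps the cleaned names for later steps
activity_clean <- clean_names(activity_example)
names(activity_clean)
```

If the cleaned names were saved in this way, later code would need to refer to total_steps instead of TotalSteps, and so on.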

ANALYSE - A summary of Bellabeat Case Study

The date columns were converted to ensure consistency and simplicity in the subsequent analysis.

Changing the datatype of the date column

The date columns are converted to yyyy-mm-dd format and stored in new columns named “Date_Activity”, “Date_Sleep” and “Date_weight”. The format strings follow the strftime convention (https://strftime.org/).

daily_activity <- daily_activity %>%
  mutate(Date_Activity = as_date(ActivityDate, format = "%m/%d/%Y"))
head(daily_activity)

## # A tibble: 6 x 16
##        Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie~
##     <dbl> <chr>             <dbl>         <dbl>           <dbl>            <dbl>
## 1  1.50e9 4/12/2016         13162          8.5             8.5                 0
## 2  1.50e9 4/13/2016         10735          6.97            6.97                0
## 3  1.50e9 4/14/2016         10460          6.74            6.74                0
## 4  1.50e9 4/15/2016          9762          6.28            6.28                0
## 5  1.50e9 4/16/2016         12669          8.16            8.16                0
## 6  1.50e9 4/17/2016          9705          6.48            6.48                0
## # ... with 10 more variables: VeryActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>, Date_Activity <date>

sleep_day <- sleep_day %>%
  mutate(Date_Sleep = as_date(SleepDay,format ="%m/%d/%Y %I:%M:%S %p"))
head(sleep_day)

## # A tibble: 6 x 6
##        Id SleepDay   TotalSleepRecor~ TotalMinutesAsl~ TotalTimeInBed Date_Sleep
##     <dbl> <chr>                 <dbl>            <dbl>          <dbl> <date>    
## 1  1.50e9 4/12/2016~                1              327            346 2016-04-12
## 2  1.50e9 4/13/2016~                2              384            407 2016-04-13
## 3  1.50e9 4/15/2016~                1              412            442 2016-04-15
## 4  1.50e9 4/16/2016~                2              340            367 2016-04-16
## 5  1.50e9 4/17/2016~                1              700            712 2016-04-17
## 6  1.50e9 4/19/2016~                1              304            320 2016-04-19

weight_log_info <- weight_log_info %>%
  mutate(Date_weight = as_date(Date,format ="%m/%d/%Y %I:%M:%S %p"))
head(weight_log_info)

## # A tibble: 6 x 9
##           Id Date      WeightKg WeightPounds   Fat   BMI IsManualReport    LogId
##        <dbl> <chr>        <dbl>        <dbl> <dbl> <dbl> <lgl>             <dbl>
## 1 1503960366 5/2/2016~     52.6         116.    22  22.6 TRUE            1.46e12
## 2 1503960366 5/3/2016~     52.6         116.    NA  22.6 TRUE            1.46e12
## 3 1927972279 4/13/201~    134.          294.    NA  47.5 FALSE           1.46e12
## 4 2873212765 4/21/201~     56.7         125.    NA  21.5 TRUE            1.46e12
## 5 2873212765 5/12/201~     57.3         126.    NA  21.7 TRUE            1.46e12
## 6 4319703577 4/17/201~     72.4         160.    25  27.5 TRUE            1.46e12
## # ... with 1 more variable: Date_weight <date>
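For comparison, the same conversion can be sketched in base R with as.Date(). This is an alternative to the as_date() calls above, not the method used in this analysis. The input strings are copied from the outputs shown earlier. as.Date() reads only as much of each string as the format requires, so the trailing time of day can be left out of the format.

```r
# Date strings as they appear in ActivityDate and SleepDay.
activity_dates <- c("4/12/2016", "4/13/2016")
sleep_stamps   <- c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM")

# %m/%d/%Y = month/day/four-digit year
parsed_activity <- as.Date(activity_dates, format = "%m/%d/%Y")

# Trailing characters after the matched format are ignored by as.Date()
parsed_sleep <- as.Date(sleep_stamps, format = "%m/%d/%Y")

format(parsed_activity)  # "2016-04-12" "2016-04-13"
format(parsed_sleep)     # "2016-04-12" "2016-04-13"
```

Leaving the time of day out of the format also avoids the locale-dependent %p (AM/PM) specifier.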

Understanding some summary statistics

How many unique participants are there in each dataframe? It looks like there may be more participants in the daily activity dataset than the sleep dataset.

n_distinct(daily_activity$Id)

## [1] 33

n_distinct(sleep_day$Id)

## [1] 24

n_distinct(weight_log_info$Id)

## [1] 8

How many observations are there in each dataframe?

nrow(daily_activity)

## [1] 940

nrow(sleep_day)

## [1] 410

nrow(weight_log_info)

## [1] 67

What are some quick summary statistics we’d want to know about each data frame?

For the daily activity dataframe:

daily_activity %>%  
  select(TotalSteps,
         TotalDistance,
         SedentaryMinutes) %>%
  summary()

##    TotalSteps    TotalDistance    SedentaryMinutes
##  Min.   :    0   Min.   : 0.000   Min.   :   0.0  
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 729.8  
##  Median : 7406   Median : 5.245   Median :1057.5  
##  Mean   : 7638   Mean   : 5.490   Mean   : 991.2  
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:1229.5  
##  Max.   :36019   Max.   :28.030   Max.   :1440.0

For the sleep dataframe:

sleep_day %>%  
  select(TotalSleepRecords,
  TotalMinutesAsleep,
  TotalTimeInBed) %>%
  summary()

##  TotalSleepRecords TotalMinutesAsleep TotalTimeInBed 
##  Min.   :1.00      Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:1.00      1st Qu.:361.0      1st Qu.:403.8  
##  Median :1.00      Median :432.5      Median :463.0  
##  Mean   :1.12      Mean   :419.2      Mean   :458.5  
##  3rd Qu.:1.00      3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :3.00      Max.   :796.0      Max.   :961.0

For the weight dataframe:

weight_log_info %>%  
  select(WeightKg,
         #  Fat, 
        BMI) %>%
  summary()

##     WeightKg           BMI       
##  Min.   : 52.60   Min.   :21.45  
##  1st Qu.: 61.40   1st Qu.:23.96  
##  Median : 62.50   Median :24.39  
##  Mean   : 72.04   Mean   :25.19  
##  3rd Qu.: 85.05   3rd Qu.:25.56  
##  Max.   :133.50   Max.   :47.54

Distance average by customer:

daily_activity_mean <- daily_activity %>% 
  group_by(Id) %>% 
  summarise(daily_activity_average = mean(TrackerDistance, na.rm = TRUE))
daily_activity_mean

## # A tibble: 33 x 2
##            Id daily_activity_average
##         <dbl>                  <dbl>
##  1 1503960366                  7.81 
##  2 1624580081                  3.91 
##  3 1644430081                  5.30 
##  4 1844505072                  1.71 
##  5 1927972279                  0.635
##  6 2022484408                  8.08 
##  7 2026352035                  3.45 
##  8 2320127002                  3.19 
##  9 2347167796                  6.36 
## 10 2873212765                  5.10 
## # ... with 23 more rows

Sleep average by customer:

sleep_mean <- sleep_day %>% 
  group_by(Id) %>% 
  summarise(sleep_average = mean(TotalMinutesAsleep, na.rm = TRUE))
sleep_mean

## # A tibble: 24 x 2
##            Id sleep_average
##         <dbl>         <dbl>
##  1 1503960366          360.
##  2 1644430081          294 
##  3 1844505072          652 
##  4 1927972279          417 
##  5 2026352035          506.
##  6 2320127002           61 
##  7 2347167796          447.
##  8 3977333714          294.
##  9 4020332650          349.
## 10 4319703577          477.
## # ... with 14 more rows

Weight average by customer:

weight_mean <- weight_log_info %>% 
  group_by(Id) %>% 
  summarise(weight_average = mean(WeightKg, na.rm = TRUE))
weight_mean

## # A tibble: 8 x 2
##           Id weight_average
##        <dbl>          <dbl>
## 1 1503960366           52.6
## 2 1927972279          134. 
## 3 2873212765           57  
## 4 4319703577           72.4
## 5 4558609924           69.6
## 6 5577150313           90.7
## 7 6962181067           61.6
## 8 8877689391           85.1

What does this tell us about the activities of this sample of people?

At first glance, the available data is limited in both quantity and completeness. In the near future more data should be logged so that it can be used to better serve the customer experience.

Plotting a few explorations

What’s the relationship between steps taken in a day and sedentary minutes? How could this help inform the customer segments that we can market to? E.g. position this more as a way to get started in walking more? Or to measure steps that you’re already taking?

ggplot(data=daily_activity, aes(x=TotalSteps, y=SedentaryMinutes)) + geom_point()

ggplot(data = daily_activity) +
  geom_point(mapping = aes(x = TotalSteps, y = Calories, color = TotalDistance))

What’s the relationship between minutes asleep and time in bed? You might expect it to be almost completely linear - are there any unexpected trends?

ggplot(data=sleep_day, aes(x=TotalMinutesAsleep, y=TotalTimeInBed)) + geom_point()

What’s the relationship between weight and BMI? You might expect it to be almost completely linear - are there any unexpected trends? Notes: there is almost no “Fat” data, and only 8 distinct persons made weight logs. This is an opportunity for improvement!

ggplot(data=weight_log_info, aes(x=WeightKg, y=BMI)) + geom_point()
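A scatter plot shows the shape of a relationship; a correlation coefficient can put a number on how linear it is. The sketch below is an addition, not part of the original analysis. It uses only the first five TotalSteps and SedentaryMinutes values visible in the str() output above, so the resulting value says nothing about the full dataset.

```r
# First five rows of daily_activity, as printed by str() above.
steps     <- c(13162, 10735, 10460, 9762, 12669)
sedentary <- c(728, 776, 1218, 726, 773)

# Pearson correlation: -1 = perfectly decreasing, 0 = no linear
# relationship, 1 = perfectly increasing.
r <- cor(steps, sedentary)
r
```

On the full tables the equivalent calls would be, for example, cor(daily_activity$TotalSteps, daily_activity$SedentaryMinutes) or cor(sleep_day$TotalMinutesAsleep, sleep_day$TotalTimeInBed).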

What could these trends tell you about how to help market this product? Or areas where you might want to explore further?

Merging datasets together

Two merges are performed on Id: daily_activity_mean with weight_mean, and weight_mean with sleep_mean.

combined_data_activity_weight <- merge(daily_activity_mean, weight_mean, by="Id")
combined_data_weight_sleep <- merge(weight_mean, sleep_mean, by="Id")

Take a look at how many participants are in this data set.

n_distinct(combined_data_activity_weight$Id)

## [1] 8

colnames(combined_data_activity_weight)

## [1] "Id"                     "daily_activity_average" "weight_average"

n_distinct(combined_data_weight_sleep$Id)

## [1] 6

colnames(combined_data_weight_sleep)

## [1] "Id"             "weight_average" "sleep_average"

Note that the daily activity dataset contains more participant Ids, which were filtered out because merge() by default keeps only the Ids present in both tables. Alternatively, we could use an outer join (merge() with all = TRUE, or dplyr's full_join()) to keep them in the dataset.

Now we can also explore relationships between the merged averages of activity, weight and sleep. For example, do participants who sleep more take more steps or fewer steps per day? Is there a relationship at all? How could these answers help inform the marketing strategy of how you position this new product?
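The difference between the default merge and one that keeps unmatched participants can be sketched as follows. The two data frames are hypothetical stand-ins for daily_activity_mean and weight_mean, with illustrative values.

```r
# Hypothetical per-user averages (illustrative values).
activity_mean_example <- data.frame(
  Id = c(1, 2, 3),
  daily_activity_average = c(7.81, 3.91, 5.30)
)
weight_mean_example <- data.frame(
  Id = c(1, 3, 4),
  weight_average = c(52.6, 57.0, 72.4)
)

# Default merge: only Ids present in both tables (inner join)
inner <- merge(activity_mean_example, weight_mean_example, by = "Id")

# all = TRUE: keep every Id from either table, filling gaps with NA
full <- merge(activity_mean_example, weight_mean_example, by = "Id", all = TRUE)

# all.x = TRUE: keep every Id from the first table only (left join)
left <- merge(activity_mean_example, weight_mean_example, by = "Id", all.x = TRUE)

nrow(inner)  # 2
nrow(full)   # 4
nrow(left)   # 3
```

Keeping unmatched rows preserves the larger sample for activity-only questions, at the cost of NA values that plots and summaries then have to handle.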

ggplot(data=combined_data_activity_weight, aes(x=daily_activity_average, y=weight_average)) + geom_point()

ggplot(data=combined_data_weight_sleep, aes(x=sleep_average, y=weight_average)) + geom_point()

ACT - Top high-level content recommendations based on the analysis done

The available data is scarce and incomplete in some respects (e.g. age). It is highly recommended that the app logs be thoroughly revised to include more useful data and that customers be incentivised to enable data collection. Nevertheless, preliminary recommendations for Bellabeat are given below.

Naturally the most prominent step for Bellabeat business expansion would be to create a second line of products for men, adapting the name (e.g. Letbeat, Letusbeat, Letitbeat).

Sleeping well seems to be of outstanding relevance, so the usefulness of both the app and the products offered by Bellabeat would improve greatly by providing guidance, motivation and support on sleep. Sleep apps, mostly based on white noise, may serve as examples for adding similar capabilities to Bellabeat initiatives in the sleep domain. As users like to listen to music while they perform activities (walking, etc.), this would also be a good add-on to monetize.

Advertising recommendations for health and meals based on usage would also help further monetization, in a way that users can enable, disable or configure.

Missing age information should be corrected as a matter of urgency, since it is needed to analyse users by age group and to derive suitable recommendations for business enhancement. Fat information is also missing and, given the nature of the activity, it would be good to have it available as well.

Bellabeat Case Study

José Neto - Google Data Analytics - Capstone - Portfolio

23/12/2021
