Scenario :


This is my version of Google Data Analytics Case Study 2, a ‘Wellness Technology’ company case study.

Bellabeat is a high-tech company that manufactures health-focused smart products. The course describes this case study as another ‘tangible’ way to demonstrate my knowledge and skills, so here we are :


I will be performing many real-world tasks of a junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but it has the potential to become a larger player in the global smart device market.


I joined this team six months ago and have been busy learning about Bellabeat’s mission and business goals — as well as how I, as a junior data analyst, can help Bellabeat achieve them.


Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. I have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights I discover will then help guide marketing strategy for the company, and I will present my analysis to the Bellabeat executive team along with high-level recommendations for Bellabeat’s marketing strategy.


Characters and Products :


Characters


  • Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer.


  • Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team.


  • Bellabeat Marketing Analytics Team: A team of Data Analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy.


Products


  • Bellabeat app : The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits.


  • Leaf : Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.


  • Time : This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress.


  • Spring : This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.


  • Bellabeat membership : Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.


Before we begin this project, here are the key phases I’ll be following to complete the data analysis process:





PHASE 1 : Ask


Sršen asks me to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices. She then wants me to select one Bellabeat product and apply these insights to it in my presentation.




Deliverable : a clear statement of the business task, i.e. insights into how consumers use non-Bellabeat smart devices that can inform the marketing strategy for one Bellabeat product.



PHASE 2 : Prepare


I’ll be using a public dataset, the FitBit fitness tracker data (Fitabase export, 4.12.2016 to 5.12.2016), to analyze and identify trends.




I will use the ROCCC framework (Reliable, Original, Comprehensive, Current, Cited) to assess the credibility and integrity of the data.
  • Reliability: The data is not reliable. There is no supporting information (for example, margin of error or the exact smart device models used), and the small sample size limits the analysis that can be done.

  • Originality: This is not an original dataset; it is third-party data collected through a survey distributed via Amazon Mechanical Turk.

  • Comprehensiveness: The data is not comprehensive. There is no information about the participants (gender, age, health status, etc.), so the sample may be biased, which would skew any insights drawn from it.

  • Current: The data was collected in 2016, so it is now outdated.

  • Cited: The data was collected through Amazon Mechanical Turk, and we have no information on whether this is a credible source.


The dataset clearly does not meet the ROCCC standard; therefore, insights from this analysis can only suggest general directions rather than firm conclusions.


  • Downloaded the data and stored it appropriately.
  • Identified how it’s organized (a quick look at the folder contents is sketched below).
  • Determined the credibility of the data.
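As a quick, optional sketch of how the downloaded files are organized (assuming the CSVs were unzipped into the Fitabase folder used in the next phase):

# List the CSV files in the data folder to see how the export is organized.
list.files("D:/Case_Study/Data/Bellabeat/Fitabase Data 4.12.16-5.12.16",
           pattern = "\\.csv$")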



PHASE 3 : Process



R is well suited to statistical analysis and data visualization, so I chose RStudio to clean and merge the appropriate datasets for further analysis.


Setting up the Environment


  • Dependencies


# install.packages("tidyverse")
# install.packages("lubridate")
# install.packages("ggplot2")
# install.packages("janitor")


  • Libraries


library(tidyverse)
library(lubridate)
library(ggplot2)
library(janitor)


  • Working Directory


setwd("D:/Case_Study/Data/Bellabeat/Fitabase Data 4.12.16-5.12.16")


> Categorized Data Collection


  • Hourly Data
hour_cal <- read.csv("hourlyCalories_merged.csv")
hour_inten <- read.csv("hourlyIntensities_merged.csv")
hour_steps <- read.csv("hourlySteps_merged.csv")


  • Minute Data
min_cal <- read.csv("minuteCaloriesNarrow_merged.csv")
min_cal_wide <- read.csv("minuteCaloriesWide_merged.csv")
min_inten <- read.csv("minuteIntensitiesNarrow_merged.csv")
min_inten_wide <- read.csv("minuteIntensitiesWide_merged.csv")
min_mets <- read.csv("minuteMetsNarrow_merged.csv")
min_sleep <- read.csv("minuteSleep_merged.csv")
min_steps <- read.csv("minuteStepsNarrow_merged.csv")
min_steps_wide <- read.csv("minuteStepsWide_merged.csv")


  • Daily Data
daily_sleep <- read.csv("sleepDay_merged.csv")
daily_act <- read.csv("dailyActivity_merged.csv")
daily_cal <- read.csv("dailyCalories_merged.csv")
daily_inten <- read.csv("dailyIntensities_merged.csv")
daily_steps <- read.csv("dailySteps_merged.csv")


  • Weight Data (Incomplete)
weight_log <- read.csv("weightLogInfo_merged.csv")


  • Heart Rate Data (Incomplete)
heart_sec <- read.csv("heartrate_seconds_merged.csv")


  • I’ll drop these datasets (a minimal clean-up sketch follows this list):
  1. weight_log (too incomplete to be useful)
  2. all the dataframes ending with “wide” (redundant with the narrow versions)
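A minimal clean-up sketch, assuming the data frames were loaded with the names used above:

# Drop the incomplete weight log and the redundant "wide" data frames;
# the narrow versions carry the same information in a tidier layout.
rm(weight_log, min_cal_wide, min_inten_wide, min_steps_wide)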



PHASE 4 : Analyse


Working on Daily Datasets :


  • Data Cleaning & Manipulation on “daily_sleep” :
daily_sleep <- daily_sleep %>% 
  clean_names() %>% 
  rename(act_date = sleep_day,
         sleep_min = total_minutes_asleep,
         inbed_min = total_time_in_bed) %>%
  distinct()

daily_sleep$act_date <- as.Date(daily_sleep$act_date, format = "%m/%d/%Y %H:%M:%S %p")
str(daily_sleep)
## 'data.frame':    410 obs. of  5 variables:
##  $ id                 : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ act_date           : Date, format: "2016-04-12" "2016-04-13" ...
##  $ total_sleep_records: int  1 2 1 2 1 1 1 1 1 1 ...
##  $ sleep_min          : int  327 384 412 340 700 304 360 325 361 430 ...
##  $ inbed_min          : int  346 407 442 367 712 320 377 364 384 449 ...


  • Statistical Summary
summary(daily_sleep)
##        id               act_date          total_sleep_records   sleep_min    
##  Min.   :1.504e+09   Min.   :2016-04-12   Min.   :1.00        Min.   : 58.0  
##  1st Qu.:3.977e+09   1st Qu.:2016-04-19   1st Qu.:1.00        1st Qu.:361.0  
##  Median :4.703e+09   Median :2016-04-27   Median :1.00        Median :432.5  
##  Mean   :4.995e+09   Mean   :2016-04-26   Mean   :1.12        Mean   :419.2  
##  3rd Qu.:6.962e+09   3rd Qu.:2016-05-04   3rd Qu.:1.00        3rd Qu.:490.0  
##  Max.   :8.792e+09   Max.   :2016-05-12   Max.   :3.00        Max.   :796.0  
##    inbed_min    
##  Min.   : 61.0  
##  1st Qu.:403.8  
##  Median :463.0  
##  Mean   :458.5  
##  3rd Qu.:526.0  
##  Max.   :961.0


  • Now daily_sleep is ready for the next phase.


  • The average sleep duration is about 419 minutes (almost 7 hours), which is generally healthy; a quick per-user check is sketched below.
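A quick per-user check behind that estimate (a sketch on the cleaned daily_sleep frame):

# Average sleep per user, in hours, to support the "almost 7 hours" figure.
daily_sleep %>% 
  group_by(id) %>% 
  summarise(avg_sleep_hrs = round(mean(sleep_min) / 60, 1)) %>% 
  arrange(desc(avg_sleep_hrs))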



  • Data Cleaning & Manipulation on “daily_act” :
daily_act <- daily_act %>% 
  clean_names() %>% 
  rename(act_date = activity_date) %>% 
  mutate(act_date = as.Date(act_date, format = "%m/%d/%Y")) %>% 
  distinct()
  
str(daily_act)
## 'data.frame':    940 obs. of  15 variables:
##  $ id                        : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ act_date                  : Date, format: "2016-04-12" "2016-04-13" ...
##  $ total_steps               : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
##  $ total_distance            : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ tracker_distance          : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ logged_activities_distance: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ very_active_distance      : num  1.88 1.57 2.44 2.14 2.71 ...
##  $ moderately_active_distance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ light_active_distance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ sedentary_active_distance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ very_active_minutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ fairly_active_minutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ lightly_active_minutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ sedentary_minutes         : int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ calories                  : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...


  • Statistical Summary
summary(daily_act)
##        id               act_date           total_steps    total_distance  
##  Min.   :1.504e+09   Min.   :2016-04-12   Min.   :    0   Min.   : 0.000  
##  1st Qu.:2.320e+09   1st Qu.:2016-04-19   1st Qu.: 3790   1st Qu.: 2.620  
##  Median :4.445e+09   Median :2016-04-26   Median : 7406   Median : 5.245  
##  Mean   :4.855e+09   Mean   :2016-04-26   Mean   : 7638   Mean   : 5.490  
##  3rd Qu.:6.962e+09   3rd Qu.:2016-05-04   3rd Qu.:10727   3rd Qu.: 7.713  
##  Max.   :8.878e+09   Max.   :2016-05-12   Max.   :36019   Max.   :28.030  
##  tracker_distance logged_activities_distance very_active_distance
##  Min.   : 0.000   Min.   :0.0000             Min.   : 0.000      
##  1st Qu.: 2.620   1st Qu.:0.0000             1st Qu.: 0.000      
##  Median : 5.245   Median :0.0000             Median : 0.210      
##  Mean   : 5.475   Mean   :0.1082             Mean   : 1.503      
##  3rd Qu.: 7.710   3rd Qu.:0.0000             3rd Qu.: 2.053      
##  Max.   :28.030   Max.   :4.9421             Max.   :21.920      
##  moderately_active_distance light_active_distance sedentary_active_distance
##  Min.   :0.0000             Min.   : 0.000        Min.   :0.000000         
##  1st Qu.:0.0000             1st Qu.: 1.945        1st Qu.:0.000000         
##  Median :0.2400             Median : 3.365        Median :0.000000         
##  Mean   :0.5675             Mean   : 3.341        Mean   :0.001606         
##  3rd Qu.:0.8000             3rd Qu.: 4.782        3rd Qu.:0.000000         
##  Max.   :6.4800             Max.   :10.710        Max.   :0.110000         
##  very_active_minutes fairly_active_minutes lightly_active_minutes
##  Min.   :  0.00      Min.   :  0.00        Min.   :  0.0         
##  1st Qu.:  0.00      1st Qu.:  0.00        1st Qu.:127.0         
##  Median :  4.00      Median :  6.00        Median :199.0         
##  Mean   : 21.16      Mean   : 13.56        Mean   :192.8         
##  3rd Qu.: 32.00      3rd Qu.: 19.00        3rd Qu.:264.0         
##  Max.   :210.00      Max.   :143.00        Max.   :518.0         
##  sedentary_minutes    calories   
##  Min.   :   0.0    Min.   :   0  
##  1st Qu.: 729.8    1st Qu.:1828  
##  Median :1057.5    Median :2134  
##  Mean   : 991.2    Mean   :2304  
##  3rd Qu.:1229.5    3rd Qu.:2793  
##  Max.   :1440.0    Max.   :4900


  • Average sedentary time is about 991 minutes, almost 17 hours a day, which is implausibly high. Looking closer, the minimum and maximum sedentary minutes are 0 and 1,440 (a full 24 hours), which points to tracking errors or days when the device was not worn rather than genuine behaviour. Either the sedentary tracking needs to be fixed or upgraded, or these records should be treated with caution (a quick sanity check is sketched below).
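A minimal sanity-check sketch on the cleaned daily_act frame:

# Count daily records reporting a full 24 hours (1440 minutes) of sedentary
# time; these are likely days when the device was not worn at all.
daily_act %>% 
  filter(sedentary_minutes == 1440) %>% 
  summarise(suspect_days = n(), users_affected = n_distinct(id))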



  • Data Cleaning & Manipulation on “daily_cal” :


daily_cal <- daily_cal %>% 
  clean_names() %>% 
  rename(act_date = activity_day) %>%
  mutate(act_date = as.Date(act_date, format = "%m/%d/%Y")) %>% 
  distinct()

str(daily_cal)
## 'data.frame':    940 obs. of  3 variables:
##  $ id      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ act_date: Date, format: "2016-04-12" "2016-04-13" ...
##  $ calories: int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...


  • Statistical Summary
summary(daily_cal)
##        id               act_date             calories   
##  Min.   :1.504e+09   Min.   :2016-04-12   Min.   :   0  
##  1st Qu.:2.320e+09   1st Qu.:2016-04-19   1st Qu.:1828  
##  Median :4.445e+09   Median :2016-04-26   Median :2134  
##  Mean   :4.855e+09   Mean   :2016-04-26   Mean   :2304  
##  3rd Qu.:6.962e+09   3rd Qu.:2016-05-04   3rd Qu.:2793  
##  Max.   :8.878e+09   Max.   :2016-05-12   Max.   :4900


  • Looks good



  • Data Cleaning & Manipulation on “daily_inten” :
daily_inten <- daily_inten %>% 
  clean_names() %>% 
  rename(act_date = activity_day) %>% 
  mutate(act_date = as.Date(act_date, format = "%m/%d/%Y")) %>% 
  distinct()

str(daily_inten)
## 'data.frame':    940 obs. of  10 variables:
##  $ id                        : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ act_date                  : Date, format: "2016-04-12" "2016-04-13" ...
##  $ sedentary_minutes         : int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ lightly_active_minutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ fairly_active_minutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ very_active_minutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ sedentary_active_distance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ light_active_distance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ moderately_active_distance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ very_active_distance      : num  1.88 1.57 2.44 2.14 2.71 ...


  • Statistical Summary
summary(daily_inten)
##        id               act_date          sedentary_minutes
##  Min.   :1.504e+09   Min.   :2016-04-12   Min.   :   0.0   
##  1st Qu.:2.320e+09   1st Qu.:2016-04-19   1st Qu.: 729.8   
##  Median :4.445e+09   Median :2016-04-26   Median :1057.5   
##  Mean   :4.855e+09   Mean   :2016-04-26   Mean   : 991.2   
##  3rd Qu.:6.962e+09   3rd Qu.:2016-05-04   3rd Qu.:1229.5   
##  Max.   :8.878e+09   Max.   :2016-05-12   Max.   :1440.0   
##  lightly_active_minutes fairly_active_minutes very_active_minutes
##  Min.   :  0.0          Min.   :  0.00        Min.   :  0.00     
##  1st Qu.:127.0          1st Qu.:  0.00        1st Qu.:  0.00     
##  Median :199.0          Median :  6.00        Median :  4.00     
##  Mean   :192.8          Mean   : 13.56        Mean   : 21.16     
##  3rd Qu.:264.0          3rd Qu.: 19.00        3rd Qu.: 32.00     
##  Max.   :518.0          Max.   :143.00        Max.   :210.00     
##  sedentary_active_distance light_active_distance moderately_active_distance
##  Min.   :0.000000          Min.   : 0.000        Min.   :0.0000            
##  1st Qu.:0.000000          1st Qu.: 1.945        1st Qu.:0.0000            
##  Median :0.000000          Median : 3.365        Median :0.2400            
##  Mean   :0.001606          Mean   : 3.341        Mean   :0.5675            
##  3rd Qu.:0.000000          3rd Qu.: 4.782        3rd Qu.:0.8000            
##  Max.   :0.110000          Max.   :10.710        Max.   :6.4800            
##  very_active_distance
##  Min.   : 0.000      
##  1st Qu.: 0.000      
##  Median : 0.210      
##  Mean   : 1.503      
##  3rd Qu.: 2.053      
##  Max.   :21.920


  • Looks good



  • Data Cleaning & Manipulation on “daily_steps” :
daily_steps <- daily_steps %>% 
  clean_names() %>% 
  rename(act_date = activity_day) %>% 
  mutate(act_date = as.Date(act_date, format = "%m/%d/%Y")) %>% 
  distinct()

str(daily_steps)
## 'data.frame':    940 obs. of  3 variables:
##  $ id        : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ act_date  : Date, format: "2016-04-12" "2016-04-13" ...
##  $ step_total: int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...


  • Statistical Summary
summary(daily_steps)
##        id               act_date            step_total   
##  Min.   :1.504e+09   Min.   :2016-04-12   Min.   :    0  
##  1st Qu.:2.320e+09   1st Qu.:2016-04-19   1st Qu.: 3790  
##  Median :4.445e+09   Median :2016-04-26   Median : 7406  
##  Mean   :4.855e+09   Mean   :2016-04-26   Mean   : 7638  
##  3rd Qu.:6.962e+09   3rd Qu.:2016-05-04   3rd Qu.:10727  
##  Max.   :8.878e+09   Max.   :2016-05-12   Max.   :36019


  • Looks good



Working on Minute Datasets :


  • Data Cleaning & Manipulation on “min_cal” :
min_cal <- min_cal %>% 
  clean_names() %>% 
  rename(act_date = activity_minute)%>% 
  distinct()


min_cal$date <- mdy_hms(min_cal$act_date)
min_cal$time <- format(as.POSIXct(min_cal$date), format = "%H:%M %p")

min_cal <- min_cal %>% 
  select(-c(act_date))

str(min_cal)
## 'data.frame':    1325580 obs. of  4 variables:
##  $ id      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ calories: num  0.786 0.786 0.786 0.786 0.786 ...
##  $ date    : POSIXct, format: "2016-04-12 00:00:00" "2016-04-12 00:01:00" ...
##  $ time    : chr  "00:00 AM" "00:01 AM" "00:02 AM" "00:03 AM" ...


  • Statistical Summary
summary(min_cal)
##        id               calories            date                       
##  Min.   :1.504e+09   Min.   : 0.0000   Min.   :2016-04-12 00:00:00.00  
##  1st Qu.:2.320e+09   1st Qu.: 0.9357   1st Qu.:2016-04-19 01:51:00.00  
##  Median :4.445e+09   Median : 1.2176   Median :2016-04-26 06:27:00.00  
##  Mean   :4.848e+09   Mean   : 1.6231   Mean   :2016-04-26 12:09:55.15  
##  3rd Qu.:6.962e+09   3rd Qu.: 1.4327   3rd Qu.:2016-05-03 18:55:00.00  
##  Max.   :8.878e+09   Max.   :19.7499   Max.   :2016-05-12 15:59:00.00  
##      time          
##  Length:1325580    
##  Class :character  
##  Mode  :character  
##                    
##                    
## 


  • Looks good



  • Data Cleaning & Manipulation on “min_inten” :
min_inten <- min_inten %>% 
  clean_names() %>% 
  rename(act_date = activity_minute) %>% 
  distinct()
  
min_inten$date <- mdy_hms(min_inten$act_date)
min_inten$time <- format(as.POSIXct(min_inten$date), format = "%H:%M %p")
min_inten <- min_inten %>% 
  select(-c(act_date))

str(min_inten)
## 'data.frame':    1325580 obs. of  4 variables:
##  $ id       : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ intensity: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ date     : POSIXct, format: "2016-04-12 00:00:00" "2016-04-12 00:01:00" ...
##  $ time     : chr  "00:00 AM" "00:01 AM" "00:02 AM" "00:03 AM" ...


  • Statistical Summary
summary(min_inten)
##        id              intensity           date                       
##  Min.   :1.504e+09   Min.   :0.0000   Min.   :2016-04-12 00:00:00.00  
##  1st Qu.:2.320e+09   1st Qu.:0.0000   1st Qu.:2016-04-19 01:51:00.00  
##  Median :4.445e+09   Median :0.0000   Median :2016-04-26 06:27:00.00  
##  Mean   :4.848e+09   Mean   :0.2006   Mean   :2016-04-26 12:09:55.15  
##  3rd Qu.:6.962e+09   3rd Qu.:0.0000   3rd Qu.:2016-05-03 18:55:00.00  
##  Max.   :8.878e+09   Max.   :3.0000   Max.   :2016-05-12 15:59:00.00  
##      time          
##  Length:1325580    
##  Class :character  
##  Mode  :character  
##                    
##                    
## 


  • Looks good



  • Data Cleaning & Manipulation on “min_mets” :
min_mets <- min_mets %>% 
  clean_names() %>% 
  rename(act_date = activity_minute) %>%
  rename(mets = me_ts) %>% 
  distinct()

min_mets$date <- mdy_hms(min_mets$act_date)
min_mets$time <- format(as.POSIXct(min_mets$date), format = "%H:%M %p")

min_mets <- min_mets %>% 
  select(-c(act_date))

str(min_mets)
## 'data.frame':    1325580 obs. of  4 variables:
##  $ id  : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ mets: int  10 10 10 10 10 12 12 12 12 12 ...
##  $ date: POSIXct, format: "2016-04-12 00:00:00" "2016-04-12 00:01:00" ...
##  $ time: chr  "00:00 AM" "00:01 AM" "00:02 AM" "00:03 AM" ...


  • Statistical Summary
summary(min_mets)
##        id                 mets             date                       
##  Min.   :1.504e+09   Min.   :  0.00   Min.   :2016-04-12 00:00:00.00  
##  1st Qu.:2.320e+09   1st Qu.: 10.00   1st Qu.:2016-04-19 01:51:00.00  
##  Median :4.445e+09   Median : 10.00   Median :2016-04-26 06:27:00.00  
##  Mean   :4.848e+09   Mean   : 14.69   Mean   :2016-04-26 12:09:55.15  
##  3rd Qu.:6.962e+09   3rd Qu.: 11.00   3rd Qu.:2016-05-03 18:55:00.00  
##  Max.   :8.878e+09   Max.   :157.00   Max.   :2016-05-12 15:59:00.00  
##      time          
##  Length:1325580    
##  Class :character  
##  Mode  :character  
##                    
##                    
## 


  • Looks good



  • Data Cleaning & Manipulation on “min_sleep” :
min_sleep <- min_sleep %>% 
  clean_names() %>% 
  rename(act_date = date) %>% 
  distinct()

min_sleep$date <- mdy_hms(min_sleep$act_date)
min_sleep$time <- format(as.POSIXct(min_sleep$date), format = "%H:%M %p")

min_sleep <- min_sleep %>% 
  select(-c(act_date))

str(min_sleep)
## 'data.frame':    187978 obs. of  5 variables:
##  $ id    : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ value : int  3 2 1 1 1 1 1 2 2 2 ...
##  $ log_id: num  1.14e+10 1.14e+10 1.14e+10 1.14e+10 1.14e+10 ...
##  $ date  : POSIXct, format: "2016-04-12 02:47:30" "2016-04-12 02:48:30" ...
##  $ time  : chr  "02:47 AM" "02:48 AM" "02:49 AM" "02:50 AM" ...


  • Statistical Summary
summary(min_sleep)
##        id                value           log_id         
##  Min.   :1.504e+09   Min.   :1.000   Min.   :1.137e+10  
##  1st Qu.:3.977e+09   1st Qu.:1.000   1st Qu.:1.144e+10  
##  Median :4.703e+09   Median :1.000   Median :1.150e+10  
##  Mean   :4.997e+09   Mean   :1.096   Mean   :1.150e+10  
##  3rd Qu.:6.962e+09   3rd Qu.:1.000   3rd Qu.:1.155e+10  
##  Max.   :8.792e+09   Max.   :3.000   Max.   :1.162e+10  
##       date                            time          
##  Min.   :2016-04-11 20:48:00.00   Length:187978     
##  1st Qu.:2016-04-19 02:48:00.00   Class :character  
##  Median :2016-04-26 21:48:00.00   Mode  :character  
##  Mean   :2016-04-26 13:31:23.11                     
##  3rd Qu.:2016-05-03 23:47:00.00                     
##  Max.   :2016-05-12 09:56:00.00


  • The “min_sleep” dataset comes with no documentation explaining its fields (for example, what the value codes 1, 2, and 3 represent). After going through it I found the data too ambiguous to interpret, so it cannot be used for this analysis.


  • Dropping the ‘min_sleep’ dataset :
rm(min_sleep)


  • Data Cleaning & Manipulation on “min_steps” :
min_steps <- min_steps %>% 
  clean_names() %>% 
  rename(act_date = activity_minute) %>% 
  distinct()

min_steps$date <- mdy_hms(min_steps$act_date)
min_steps$time <- format(as.POSIXct(min_steps$date), format = "%H:%M %p")

min_steps <- min_steps %>% 
  select(-c(act_date))
str(min_steps)
## 'data.frame':    1325580 obs. of  4 variables:
##  $ id   : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ steps: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ date : POSIXct, format: "2016-04-12 00:00:00" "2016-04-12 00:01:00" ...
##  $ time : chr  "00:00 AM" "00:01 AM" "00:02 AM" "00:03 AM" ...


  • Statistical Summary
summary(min_steps)
##        id                steps              date                       
##  Min.   :1.504e+09   Min.   :  0.000   Min.   :2016-04-12 00:00:00.00  
##  1st Qu.:2.320e+09   1st Qu.:  0.000   1st Qu.:2016-04-19 01:51:00.00  
##  Median :4.445e+09   Median :  0.000   Median :2016-04-26 06:27:00.00  
##  Mean   :4.848e+09   Mean   :  5.336   Mean   :2016-04-26 12:09:55.15  
##  3rd Qu.:6.962e+09   3rd Qu.:  0.000   3rd Qu.:2016-05-03 18:55:00.00  
##  Max.   :8.878e+09   Max.   :220.000   Max.   :2016-05-12 15:59:00.00  
##      time          
##  Length:1325580    
##  Class :character  
##  Mode  :character  
##                    
##                    
## 


  • Looks good



Working on the heart_sec Dataset :


  • Data Cleaning & Manipulation on “heart_sec” :
heart_sec <- heart_sec %>% 
  clean_names() %>% 
  rename(act_date = time, bpm = value) %>%
  distinct()

heart_sec$date <- mdy_hms(heart_sec$act_date)
heart_sec$time <- format(as.POSIXct(heart_sec$date), format = "%H:%M:%S %p")

heart_sec <- heart_sec %>% 
  select(-c(act_date))

str(heart_sec)
## 'data.frame':    2483658 obs. of  4 variables:
##  $ id  : num  2.02e+09 2.02e+09 2.02e+09 2.02e+09 2.02e+09 ...
##  $ bpm : int  97 102 105 103 101 95 91 93 94 93 ...
##  $ date: POSIXct, format: "2016-04-12 07:21:00" "2016-04-12 07:21:05" ...
##  $ time: chr  "07:21:00 AM" "07:21:05 AM" "07:21:10 AM" "07:21:20 AM" ...


  • Statistical Summary
summary(heart_sec)
##        id                 bpm              date                       
##  Min.   :2.022e+09   Min.   : 36.00   Min.   :2016-04-12 00:00:00.00  
##  1st Qu.:4.388e+09   1st Qu.: 63.00   1st Qu.:2016-04-19 06:18:10.00  
##  Median :5.554e+09   Median : 73.00   Median :2016-04-26 20:28:50.00  
##  Mean   :5.514e+09   Mean   : 77.33   Mean   :2016-04-26 19:43:52.24  
##  3rd Qu.:6.962e+09   3rd Qu.: 88.00   3rd Qu.:2016-05-04 08:00:20.00  
##  Max.   :8.878e+09   Max.   :203.00   Max.   :2016-05-12 16:20:00.00  
##      time          
##  Length:2483658    
##  Class :character  
##  Mode  :character  
##                    
##                    
## 


  • Looks good overall, although the heart-rate range is wide: readings as low as 36 bpm and as high as 203 bpm appear in the data. The extremes most likely reflect deep rest on the low end and intense exercise on the high end, but they are worth flagging to users (a per-user summary is sketched below).
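A small sketch of a per-user heart-rate profile on the cleaned heart_sec frame:

# Summarise heart rate per user to see whether the extreme readings come
# from a few individuals or are spread across the sample.
heart_sec %>% 
  group_by(id) %>% 
  summarise(min_bpm = min(bpm),
            avg_bpm = round(mean(bpm)),
            max_bpm = max(bpm)) %>% 
  arrange(desc(max_bpm))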



Working on Hourly Datasets :


  • Data Cleaning & Manipulation on “hour_steps” :
hour_steps <- hour_steps %>% 
  clean_names() %>% 
  rename(act_date = activity_hour) %>% 
  distinct()

hour_steps$date <- mdy_hms(hour_steps$act_date)
hour_steps$time <- format(as.POSIXct(hour_steps$date), format = "%H:%M %p")

hour_steps <- hour_steps %>% 
  select(-c(act_date))

str(hour_steps)
## 'data.frame':    22099 obs. of  4 variables:
##  $ id        : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ step_total: int  373 160 151 0 0 0 0 0 250 1864 ...
##  $ date      : POSIXct, format: "2016-04-12 00:00:00" "2016-04-12 01:00:00" ...
##  $ time      : chr  "00:00 AM" "01:00 AM" "02:00 AM" "03:00 AM" ...


  • Statistical Summary
summary(hour_steps)
##        id              step_total           date                       
##  Min.   :1.504e+09   Min.   :    0.0   Min.   :2016-04-12 00:00:00.00  
##  1st Qu.:2.320e+09   1st Qu.:    0.0   1st Qu.:2016-04-19 01:00:00.00  
##  Median :4.445e+09   Median :   40.0   Median :2016-04-26 06:00:00.00  
##  Mean   :4.848e+09   Mean   :  320.2   Mean   :2016-04-26 11:46:42.58  
##  3rd Qu.:6.962e+09   3rd Qu.:  357.0   3rd Qu.:2016-05-03 19:00:00.00  
##  Max.   :8.878e+09   Max.   :10554.0   Max.   :2016-05-12 15:00:00.00  
##      time          
##  Length:22099      
##  Class :character  
##  Mode  :character  
##                    
##                    
## 


  • Looks good



  • Data Cleaning & Manipulation on “hour_inten” :
hour_inten <- hour_inten %>% 
  clean_names() %>% 
  rename(act_date = activity_hour) %>% 
  distinct()

hour_inten$date <- mdy_hms(hour_inten$act_date)
hour_inten$time <- format(as.POSIXct(hour_inten$date), format = "%H:%M %p")

hour_inten <- hour_inten %>% 
  select(-c(act_date))

str(hour_inten)
## 'data.frame':    22099 obs. of  5 variables:
##  $ id               : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ total_intensity  : int  20 8 7 0 0 0 0 0 13 30 ...
##  $ average_intensity: num  0.333 0.133 0.117 0 0 ...
##  $ date             : POSIXct, format: "2016-04-12 00:00:00" "2016-04-12 01:00:00" ...
##  $ time             : chr  "00:00 AM" "01:00 AM" "02:00 AM" "03:00 AM" ...


  • Statistical Summary
summary(hour_inten)
##        id            total_intensity  average_intensity
##  Min.   :1.504e+09   Min.   :  0.00   Min.   :0.0000   
##  1st Qu.:2.320e+09   1st Qu.:  0.00   1st Qu.:0.0000   
##  Median :4.445e+09   Median :  3.00   Median :0.0500   
##  Mean   :4.848e+09   Mean   : 12.04   Mean   :0.2006   
##  3rd Qu.:6.962e+09   3rd Qu.: 16.00   3rd Qu.:0.2667   
##  Max.   :8.878e+09   Max.   :180.00   Max.   :3.0000   
##       date                            time          
##  Min.   :2016-04-12 00:00:00.00   Length:22099      
##  1st Qu.:2016-04-19 01:00:00.00   Class :character  
##  Median :2016-04-26 06:00:00.00   Mode  :character  
##  Mean   :2016-04-26 11:46:42.58                     
##  3rd Qu.:2016-05-03 19:00:00.00                     
##  Max.   :2016-05-12 15:00:00.00


  • Looks good



  • Data Cleaning & Manipulation on “hour_cal” :
hour_cal <- hour_cal %>% 
  clean_names() %>% 
  rename(act_date = activity_hour)%>% 
  distinct()

hour_cal$date <- mdy_hms(hour_cal$act_date)
hour_cal$time <- format(as.POSIXct(hour_cal$date), format = "%H:%M %p")


hour_cal <- hour_cal %>% 
  select(-c(act_date))

str(hour_cal)
## 'data.frame':    22099 obs. of  4 variables:
##  $ id      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ calories: int  81 61 59 47 48 48 48 47 68 141 ...
##  $ date    : POSIXct, format: "2016-04-12 00:00:00" "2016-04-12 01:00:00" ...
##  $ time    : chr  "00:00 AM" "01:00 AM" "02:00 AM" "03:00 AM" ...


  • Statistical Summary
summary(hour_cal)
##        id               calories           date                       
##  Min.   :1.504e+09   Min.   : 42.00   Min.   :2016-04-12 00:00:00.00  
##  1st Qu.:2.320e+09   1st Qu.: 63.00   1st Qu.:2016-04-19 01:00:00.00  
##  Median :4.445e+09   Median : 83.00   Median :2016-04-26 06:00:00.00  
##  Mean   :4.848e+09   Mean   : 97.39   Mean   :2016-04-26 11:46:42.58  
##  3rd Qu.:6.962e+09   3rd Qu.:108.00   3rd Qu.:2016-05-03 19:00:00.00  
##  Max.   :8.878e+09   Max.   :948.00   Max.   :2016-05-12 15:00:00.00  
##      time          
##  Length:22099      
##  Class :character  
##  Mode  :character  
##                    
##                    
## 


  • Looks good



Finding relationships between the different datasets and merging them :


  • hourly data
hourly_data <- hour_cal %>% 
  left_join(hour_inten, by = c("id", "date", "time")) %>% 
  left_join(hour_steps, by = c("id", "date", "time")) %>% 
  arrange(time) %>% 
  distinct()

hourly_data <- hourly_data %>% 
  select(-c(average_intensity))
str(hourly_data)
## 'data.frame':    22099 obs. of  6 variables:
##  $ id             : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ calories       : int  81 69 56 60 77 47 82 47 54 54 ...
##  $ date           : POSIXct, format: "2016-04-12 00:00:00" "2016-04-13 00:00:00" ...
##  $ time           : chr  "00:00 AM" "00:00 AM" "00:00 AM" "00:00 AM" ...
##  $ total_intensity: int  20 14 4 6 15 0 21 0 2 2 ...
##  $ step_total     : int  373 144 81 83 459 0 416 0 16 17 ...


  • Statistical Summary
summary(hourly_data)
##        id               calories           date                       
##  Min.   :1.504e+09   Min.   : 42.00   Min.   :2016-04-12 00:00:00.00  
##  1st Qu.:2.320e+09   1st Qu.: 63.00   1st Qu.:2016-04-19 01:00:00.00  
##  Median :4.445e+09   Median : 83.00   Median :2016-04-26 06:00:00.00  
##  Mean   :4.848e+09   Mean   : 97.39   Mean   :2016-04-26 11:46:42.58  
##  3rd Qu.:6.962e+09   3rd Qu.:108.00   3rd Qu.:2016-05-03 19:00:00.00  
##  Max.   :8.878e+09   Max.   :948.00   Max.   :2016-05-12 15:00:00.00  
##      time           total_intensity    step_total     
##  Length:22099       Min.   :  0.00   Min.   :    0.0  
##  Class :character   1st Qu.:  0.00   1st Qu.:    0.0  
##  Mode  :character   Median :  3.00   Median :   40.0  
##                     Mean   : 12.04   Mean   :  320.2  
##                     3rd Qu.: 16.00   3rd Qu.:  357.0  
##                     Max.   :180.00   Max.   :10554.0


  • Looks good; the hour-of-day time column built above also makes it easy to profile activity across the day (a small sketch follows).
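An optional sketch, assuming the merged hourly_data frame above:

# Average steps per hour-of-day slot, using the `time` column created during
# cleaning (zero-padded hours sort correctly as text).
hourly_data %>% 
  group_by(time) %>% 
  summarise(avg_steps = round(mean(step_total))) %>% 
  arrange(time)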


  • Minute data
minute_data <- min_mets %>%
  left_join(min_cal, by = c("id", "date", "time")) %>% 
  left_join(min_inten, by = c("id", "date", "time")) %>% 
  left_join(min_steps, by = c("id", "date", "time")) %>% 
  distinct()

str(minute_data)
## 'data.frame':    1325580 obs. of  7 variables:
##  $ id       : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ mets     : int  10 10 10 10 10 12 12 12 12 12 ...
##  $ date     : POSIXct, format: "2016-04-12 00:00:00" "2016-04-12 00:01:00" ...
##  $ time     : chr  "00:00 AM" "00:01 AM" "00:02 AM" "00:03 AM" ...
##  $ calories : num  0.786 0.786 0.786 0.786 0.786 ...
##  $ intensity: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ steps    : int  0 0 0 0 0 0 0 0 0 0 ...


  • Statistical Summary
summary(minute_data)
##        id                 mets             date                       
##  Min.   :1.504e+09   Min.   :  0.00   Min.   :2016-04-12 00:00:00.00  
##  1st Qu.:2.320e+09   1st Qu.: 10.00   1st Qu.:2016-04-19 01:51:00.00  
##  Median :4.445e+09   Median : 10.00   Median :2016-04-26 06:27:00.00  
##  Mean   :4.848e+09   Mean   : 14.69   Mean   :2016-04-26 12:09:55.15  
##  3rd Qu.:6.962e+09   3rd Qu.: 11.00   3rd Qu.:2016-05-03 18:55:00.00  
##  Max.   :8.878e+09   Max.   :157.00   Max.   :2016-05-12 15:59:00.00  
##      time              calories         intensity          steps        
##  Length:1325580     Min.   : 0.0000   Min.   :0.0000   Min.   :  0.000  
##  Class :character   1st Qu.: 0.9357   1st Qu.:0.0000   1st Qu.:  0.000  
##  Mode  :character   Median : 1.2176   Median :0.0000   Median :  0.000  
##                     Mean   : 1.6231   Mean   :0.2006   Mean   :  5.336  
##                     3rd Qu.: 1.4327   3rd Qu.:0.0000   3rd Qu.:  0.000  
##                     Max.   :19.7499   Max.   :3.0000   Max.   :220.000


  • Looks good


Creating dataframes for data visualization :


  • Usable datasets (so far):
  1. daily_act
  2. hourly_data
  3. minute_data
  4. daily_sleep
  5. heart_sec



PHASE 5 : Data Visualization



1. daily_act

daily_act <- daily_act %>% 
  select(c(id, act_date, 
           total_steps, 
           sedentary_minutes,
           calories))
str(daily_act)
## 'data.frame':    940 obs. of  5 variables:
##  $ id               : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ act_date         : Date, format: "2016-04-12" "2016-04-13" ...
##  $ total_steps      : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
##  $ sedentary_minutes: int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ calories         : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...


daily_act %>% 
  ggplot(aes(sedentary_minutes, total_steps, 
             color = sedentary_minutes)) +
  geom_point(size = 2, alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = 'purple') +
  labs( x = "Total Sedentary Minutes", y = "Total Steps Taken",
        color = "Sedentary Minutes", 
        title = "Relation Between Daily Sedentary Time By Steps Taken ",
        caption = "Data Analyst : JP")+
  annotate("text", x=220, y= 30000, label= "R^2 =  0.10" , color= "red", 
           fontface = "bold"  , size = 5, angle = 25) +
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'

summary(lm(sedentary_minutes ~ total_steps, daily_act))$r.squared
## [1] 0.1072455



daily_act %>% 
  ggplot(aes(calories, total_steps, color = sedentary_minutes)) +
  geom_point(size = 2, alpha = 0.4) +
  geom_smooth(method = 'lm' , se = FALSE, color = 'purple') +
  labs(x = "Calories",
       y = "Total Steps Taken",
       color = "Sedentary Minutes",
       title = "Relationship Between Calories by Steps Taken",
       subtitle = "Linear Regression Model has Small fit for this relationship",
       caption = "Data Analyst : JP") +
  annotate("text", x=500, y= 30000, label = "R^2 =  0.34", color = "darkgreen",
           fontface = "bold", size = 5, angle = 25 ) +
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'

summary(lm(calories ~ total_steps , daily_act))$r.squared
## [1] 0.3499528



daily_sleep %>% 
  ggplot(aes(sleep_min, inbed_min, color = factor(total_sleep_records))) +
  geom_point(size = 2, alpha = 0.5) + 
  geom_smooth(method = 'lm' , se = FALSE, color = "#330000") +
  labs(x = "Total Minutes Sleep",
       y = "Total Minute in Bed",
       color = "Sleep(s)",
       title = "Relationship Between Sleep vs In Bed Time",
       subtitle = "Linear Regression Model has Strong fit for this relationship",
       caption = "Data Analyst : JP") +
  annotate("text", x=175, y= 775, label = "R^2 =  0.86", color = "darkgreen",
           fontface = "bold", size = 5, angle = 25 ) +
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'

summary(lm(sleep_min~inbed_min, daily_sleep))$r.squared
## [1] 0.8656858




daily_sleep_act <- daily_sleep %>% 
  left_join(daily_act, by = c("id", "act_date")) %>% 
  distinct()

str(daily_sleep_act)
## 'data.frame':    410 obs. of  8 variables:
##  $ id                 : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ act_date           : Date, format: "2016-04-12" "2016-04-13" ...
##  $ total_sleep_records: int  1 2 1 2 1 1 1 1 1 1 ...
##  $ sleep_min          : int  327 384 412 340 700 304 360 325 361 430 ...
##  $ inbed_min          : int  346 407 442 367 712 320 377 364 384 449 ...
##  $ total_steps        : int  13162 10735 9762 12669 9705 15506 10544 9819 14371 10039 ...
##  $ sedentary_minutes  : int  728 776 726 773 539 775 818 838 732 709 ...
##  $ calories           : int  1985 1797 1745 1863 1728 2035 1786 1775 1949 1788 ...
rm(daily_sleep_act)



2. hourly_data


hourly_data %>% 
  filter(step_total < 6000) %>% 
  ggplot(aes(total_intensity, step_total)) +
  geom_point(color = "blue", size = 2, alpha = 0.3)+
  geom_smooth(method = 'lm', se = FALSE, color = "red") +
  labs(x = "Intensity",
       y = "Steps Taken",
       title = "Relationship Between Intensity & Total Steps",
       subtitle = "Linear Regression Model has Strong fit for this relationship",
       caption = "Data Analyst : JP") +
  annotate("text", x=10, y= 5000, label = "R^2 =  0.80", color = "darkgreen",
           fontface = "bold", size = 5, angle = 25 ) +
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'

summary(lm(total_intensity~step_total, hourly_data))$r.squared
## [1] 0.8027856


hourly_data %>% 
  filter(calories < 600) %>% 
  ggplot(aes(total_intensity, calories)) +
  geom_point(color = "blue", size = 2, alpha = 0.3)+
  geom_smooth(method = 'lm', se = FALSE, color = "red") +
  labs(x = "Intensity",
       y = "Calories",
       title = "Relationship Between Intensity & Calories",
       subtitle = "Linear Regression Model has Strong fit for this relationship",
       caption = "Data Analyst : JP") +
  annotate("text", x=10, y= 475, label = "R^2 =  0.80", color = "darkgreen",
           fontface = "bold", size = 5, angle = 25 ) +
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'

summary(lm(total_intensity~calories, hourly_data))$r.squared
## [1] 0.8039204



hourly_data %>% 
  filter(calories < 600) %>% 
  ggplot(aes(step_total, calories)) +
  geom_point(color = "blue", size = 2, alpha = 0.3)+
  geom_smooth(method = 'lm', se = FALSE, color = "red") +
  labs(x = "Steps Taken",
       y = "Calories",
       title = "Relationship Between Steps Taken & Calories",
       subtitle = "Linear Regression Model has Strong fit for this relationship",
       caption = "Data Analyst : JP") +
  annotate("text", x=6800, y= 150, label = "R^2 =  0.66", color = "darkgreen",
           fontface = "bold", size = 5, angle = 25 ) +
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'

summary(lm(step_total~calories, hourly_data))$r.squared
## [1] 0.6641728




3. Minute Data


minute_data %>% 
  ggplot(aes(mets, calories)) +
  geom_line(color = "blue", size = 0.5, alpha = 0.3)+
  geom_smooth(method = 'lm', se = FALSE, color = "red") +
  labs(x = "METs",
       y = "Calories",
       title = "Relationship Between METs & Calories",
       subtitle = "Linear Regression Model has Strong fit for this relationship",
       caption = "Data Analyst : JP") +
  annotate("text", x=25, y= 17, label = "R^2 =  0.91", color = "darkgreen",
           fontface = "bold", size = 5, angle = 25 ) +
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'

summary(lm(mets~calories, minute_data))$r.squared
## [1] 0.9138607




PHASE 6 : Act


The Act phase will be carried out by the company’s executive team, so I’m passing the documented report on to :

Urška Sršen & the Team


Recommendations for Bellabeat Marketing Strategy:


  1. Bellabeat should improve its notification system, reminding users to exercise and helping them keep track of their health, and build more engagement into the app with marketing tie-ins, for example letting steps taken be redeemed in the app for discounts on Bellabeat products.


  2. Bellabeat should upgrade its sedentary-tracking hardware and advertise a trade-in offer for the new version with more functions and more precise tracking. The company would benefit from more customers and from richer, more accurate data for the next round of analysis.


  3. Bellabeat should encourage better sleep habits among its customers, for example by suggesting an ideal bedtime, and promote the Bellabeat membership, since it offers 24/7 access to fully personalized guidance. It should also add alerts for when a user’s heart rate rises or drops abnormally; these are the kinds of real engagement customers will not ignore.



Recommendations based on the limitations of the datasets:


  1. A larger sample size and a longer collection period are needed for in-depth, precise statistical analysis.

  2. Data should be collected from primary or secondary sources to increase the credibility and reliability of the datasets.



  • Saved and exported the cleaned data :

write.csv(hourly_data, file = "hourly_data.csv")
write.csv(minute_data, file = "minute_data.csv")
write.csv(daily_act, file = "daily_act.csv")
write.csv(daily_sleep, file = "daily_sleep.csv")