Scenario

I’m a junior data analyst at Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a small, successful company with the potential to become a larger player in the global smart device market. Urška Sršen, Bellabeat’s cofounder and Chief Creative Officer, believes that analyzing smart device fitness data could help the company grow. I have been asked to focus on one of Bellabeat’s products and analyze smart device data to learn how consumers use their smart devices. The insights will shape the company’s marketing strategy. I will present the analysis and recommendations to Bellabeat’s executive team.

Introduction

About the company

Bellabeat was founded by Urška Sršen and Sando Mur. Sršen used her background as an artist to develop beautifully designed technology that informs and inspires women around the world. By collecting data on activity, sleep, stress, and reproductive health, Bellabeat empowers women with knowledge about their own health and habits. Founded in 2013, Bellabeat grew rapidly and quickly positioned itself as a tech-driven wellness company for women. By 2016, Bellabeat had opened multiple offices and launched a range of products. Bellabeat products are sold through a growing number of online retailers in addition to the company’s own website. The company invests in traditional advertising media such as radio, billboards, print, and television, but focuses primarily on digital marketing.

Products

  1. Bellabeat app

    The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.

  2. Leaf

    Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.

  3. Time

    This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.

  4. Spring

    This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.

  5. Bellabeat membership

    Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.

Data Analysis Process

1.0 PHASE 1: ASK

First, the analyst needs to define what the project will look like and what would qualify as a successful result. To determine these things, the analyst asks effective questions and collaborates with the leaders and managers who have a stake in the outcome of the analysis.

  • Ask effective questions
  • Define the problem
  • Use structured thinking
  • Communicate with others
  • What is the problem you are trying to solve?

1.1 Identify the business task

The business task is to analyze usage data from non-Bellabeat smart devices in order to gain insight into how people are already using their smart devices, and then, using that data, to generate recommendations that help the Bellabeat marketing strategy team understand the trends. These recommendations and trend insights will be used to enhance the features, functionality, and service quality of the Bellabeat app. The business task can be summed up as follows:

  1. Analyze non-Bellabeat smart device usage data to gain insights that help the Bellabeat marketing strategy team understand usage trends.

  2. Utilize the trends and insights provided by smart device usage data to improve the features, functionality, and overall service quality of the Bellabeat app.

1.2 Consider key stakeholders

Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer

Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team

Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy. You joined this team six months ago and have been busy learning about Bellabeat’s mission and business goals, as well as how you, as a junior data analyst, can help Bellabeat achieve them.

2.0 PHASE 2: PREPARE

• Understand how data is generated and collected

• Identify and use different data formats, types, and structures

• Make sure data is unbiased and credible

• Organize and protect data

2.1 Dataset overview

This case study uses the FitBit Fitness Tracker Data, which is publicly available on Kaggle. The dataset was generated from the responses of thirty individuals to a distributed survey conducted via Amazon Mechanical Turk between March 12 and May 12, 2016.

The thirty participants are eligible Fitbit users who consented to the submission of their personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.

The FitBit Fitness Tracker Data consists of 18 CSV files, several of which are in long format. Where helpful, long-format data can be converted to wide format to reduce dimensionality for analysis, as sketched below.
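
For illustration, here is a minimal long-to-wide conversion with tidyr::pivot_wider(); the toy column names are assumptions for demonstration, not the exact Fitbit schema.

# Toy long-format table: one row per (Id, Minute) pair
long_steps <- data.frame(
  Id     = c(1503960366, 1503960366, 1503960366),
  Minute = c("00:00:00", "00:01:00", "00:02:00"),
  Steps  = c(0, 4, 12)
)

# Wide format: one row per Id, one column per minute
wide_steps <- tidyr::pivot_wider(long_steps,
                                 names_from = Minute,
                                 values_from = Steps)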

2.2 Data credibility

The ROCCC framework (Reliable, Original, Comprehensive, Current, Cited) will be applied to check for data credibility issues.

  1. Reliable

    It is not very reliable. The dataset covers only 30 Fitbit users, which is insufficient to represent the roughly eight million active Fitbit users in 2016. By the standard margin-of-error formula, a sample of 30 carries a margin of error of about 18% at a 95% confidence level (about 23.5% at 99%), far too wide for reliable inference; a quick computation is sketched after this list.

    That said, a common rule of thumb derived from the Central Limit Theorem (CLT) treats 30 as the minimum usable sample size, so the dataset is not without value.

    However, it depends on the stakes: the more consequential the decision, the larger the sample needed for reliable results.

  2. Original

    It is not original. The dataset was produced by respondents to a distributed survey via Amazon Mechanical Turk, which makes it third-party data. It would have been preferable if Fitbit had provided the data directly.

  3. Comprehensive

    It is not fully comprehensive. The data lack certain information (e.g., sex, age, pre-existing health conditions) that would support a more precise analysis.

    Moreover, the data were gathered over a span of only two months, which is short; data covering at least a year would be preferable.

    Finally, how were the thirty participants chosen? What sampling strategy was used? Do they come from a randomly selected sample drawn from randomly selected locations?

  4. Current

    It is not current. The data were collected between March 12 and May 12, 2016, which is now roughly six years ago.

  5. Cited

    It is cited. The datasets are credited to the respondents of a distributed survey conducted via Amazon Mechanical Turk, and the source is documented on Kaggle.
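
The margin-of-error figure quoted under Reliable can be reproduced with the standard formula MOE = z * sqrt(p(1 - p) / n), using the conservative p = 0.5; the finite-population correction is negligible against roughly eight million users. A quick sketch in base R:

# Margin of error for a sample of 30, conservative p = 0.5
n <- 30
p <- 0.5
1.96  * sqrt(p * (1 - p) / n)  # ~0.179, about 18% at 95% confidence
2.576 * sqrt(p * (1 - p) / n)  # ~0.235, about 23.5% at 99% confidence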

2.3 How the data is organized

2.3.1 Loading Packages
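
A minimal sketch of the package setup, inferred from the functions used throughout this report (skim_without_charts() comes from skimr, get_dupes() from janitor, melt() from reshape2; read.csv() is base R):

# Load the packages this analysis relies on (an assumed set)
library(tidyverse)   # dplyr, tidyr, ggplot2, purrr
library(skimr)       # skim_without_charts()
library(janitor)     # get_dupes()
library(reshape2)    # melt(), for the correlation heatmap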

2.3.2 Importing Data Sets

dailyActivity_merged <- read.csv("https://raw.githubusercontent.com/soonkienyuan/DataAnalytics_Capstone_case_study2/main/datasets/dailyActivity_merged.csv")

dailyCalories_merged  <- read.csv("https://raw.githubusercontent.com/soonkienyuan/DataAnalytics_Capstone_case_study2/main/datasets/dailyCalories_merged.csv")

dailyIntensities_merged <- read.csv("https://raw.githubusercontent.com/soonkienyuan/DataAnalytics_Capstone_case_study2/main/datasets/dailyIntensities_merged.csv")

dailySteps_merged <- read.csv("https://raw.githubusercontent.com/soonkienyuan/DataAnalytics_Capstone_case_study2/main/datasets/dailySteps_merged.csv")

sleepDay_merged <- read.csv("https://raw.githubusercontent.com/soonkienyuan/DataAnalytics_Capstone_case_study2/main/datasets/sleepDay_merged.csv")

weightLogInfo_merged <- read.csv("https://raw.githubusercontent.com/soonkienyuan/DataAnalytics_Capstone_case_study2/main/datasets/weightLogInfo_merged.csv")

2.4 Dataset structure

2.4.1 dailyActivity_merged

# first view of the data
head(dailyActivity_merged)
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366    4/12/2016      13162          8.50            8.50
## 2 1503960366    4/13/2016      10735          6.97            6.97
## 3 1503960366    4/14/2016      10460          6.74            6.74
## 4 1503960366    4/15/2016       9762          6.28            6.28
## 5 1503960366    4/16/2016      12669          8.16            8.16
## 6 1503960366    4/17/2016       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728
#structure of dataset
str(dailyActivity_merged)
## 'data.frame':    940 obs. of  15 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ TotalSteps              : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
##  $ TotalDistance           : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ Calories                : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
#skimming

skim_without_charts(dailyActivity_merged)
Data summary
Name dailyActivity_merged
Number of rows 940
Number of columns 15
_______________________
Column type frequency:
character 1
numeric 14
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ActivityDate 0 1 8 9 0 31 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
Id 0 1 4.855407e+09 2.424805e+09 1503960366 2.320127e+09 4.445115e+09 6.962181e+09 8.877689e+09
TotalSteps 0 1 7.637910e+03 5.087150e+03 0 3.789750e+03 7.405500e+03 1.072700e+04 3.601900e+04
TotalDistance 0 1 5.490000e+00 3.920000e+00 0 2.620000e+00 5.240000e+00 7.710000e+00 2.803000e+01
TrackerDistance 0 1 5.480000e+00 3.910000e+00 0 2.620000e+00 5.240000e+00 7.710000e+00 2.803000e+01
LoggedActivitiesDistance 0 1 1.100000e-01 6.200000e-01 0 0.000000e+00 0.000000e+00 0.000000e+00 4.940000e+00
VeryActiveDistance 0 1 1.500000e+00 2.660000e+00 0 0.000000e+00 2.100000e-01 2.050000e+00 2.192000e+01
ModeratelyActiveDistance 0 1 5.700000e-01 8.800000e-01 0 0.000000e+00 2.400000e-01 8.000000e-01 6.480000e+00
LightActiveDistance 0 1 3.340000e+00 2.040000e+00 0 1.950000e+00 3.360000e+00 4.780000e+00 1.071000e+01
SedentaryActiveDistance 0 1 0.000000e+00 1.000000e-02 0 0.000000e+00 0.000000e+00 0.000000e+00 1.100000e-01
VeryActiveMinutes 0 1 2.116000e+01 3.284000e+01 0 0.000000e+00 4.000000e+00 3.200000e+01 2.100000e+02
FairlyActiveMinutes 0 1 1.356000e+01 1.999000e+01 0 0.000000e+00 6.000000e+00 1.900000e+01 1.430000e+02
LightlyActiveMinutes 0 1 1.928100e+02 1.091700e+02 0 1.270000e+02 1.990000e+02 2.640000e+02 5.180000e+02
SedentaryMinutes 0 1 9.912100e+02 3.012700e+02 0 7.297500e+02 1.057500e+03 1.229500e+03 1.440000e+03
Calories 0 1 2.303610e+03 7.181700e+02 0 1.828500e+03 2.134000e+03 2.793250e+03 4.900000e+03
#how many participant records exist in the datasets?

as.data.frame(table(dailyActivity_merged$Id)) %>% rename(Id =Var1 )
##            Id Freq
## 1  1503960366   31
## 2  1624580081   31
## 3  1644430081   30
## 4  1844505072   31
## 5  1927972279   31
## 6  2022484408   31
## 7  2026352035   31
## 8  2320127002   31
## 9  2347167796   18
## 10 2873212765   31
## 11 3372868164   20
## 12 3977333714   30
## 13 4020332650   31
## 14 4057192912    4
## 15 4319703577   31
## 16 4388161847   31
## 17 4445114986   31
## 18 4558609924   31
## 19 4702921684   31
## 20 5553957443   31
## 21 5577150313   30
## 22 6117666160   28
## 23 6290855005   29
## 24 6775888955   26
## 25 6962181067   31
## 26 7007744171   26
## 27 7086361926   31
## 28 8053475328   31
## 29 8253242879   19
## 30 8378563200   31
## 31 8583815059   31
## 32 8792009665   29
## 33 8877689391   31
#How many participants are in the dataset?

length(table(dailyActivity_merged$Id))
## [1] 33
#or 
n_distinct(dailyActivity_merged$Id)
## [1] 33

2.4.2 dailyCalories_merged

# first view of the data
head(dailyCalories_merged)
##           Id ActivityDay Calories
## 1 1503960366   4/12/2016     1985
## 2 1503960366   4/13/2016     1797
## 3 1503960366   4/14/2016     1776
## 4 1503960366   4/15/2016     1745
## 5 1503960366   4/16/2016     1863
## 6 1503960366   4/17/2016     1728
#structure of dataset
str(dailyCalories_merged)
## 'data.frame':    940 obs. of  3 variables:
##  $ Id         : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDay: chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ Calories   : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
#skimming

skim_without_charts(dailyCalories_merged)
Data summary
Name dailyCalories_merged
Number of rows 940
Number of columns 3
_______________________
Column type frequency:
character 1
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ActivityDay 0 1 8 9 0 31 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
Id 0 1 4.855407e+09 2.424805e+09 1503960366 2320127002.0 4445114986 6.962181e+09 8877689391
Calories 0 1 2.303610e+03 7.181700e+02 0 1828.5 2134 2.793250e+03 4900
#how many participant records exist in the datasets?

as.data.frame(table(dailyCalories_merged$Id)) %>% rename(Id =Var1 )
##            Id Freq
## 1  1503960366   31
## 2  1624580081   31
## 3  1644430081   30
## 4  1844505072   31
## 5  1927972279   31
## 6  2022484408   31
## 7  2026352035   31
## 8  2320127002   31
## 9  2347167796   18
## 10 2873212765   31
## 11 3372868164   20
## 12 3977333714   30
## 13 4020332650   31
## 14 4057192912    4
## 15 4319703577   31
## 16 4388161847   31
## 17 4445114986   31
## 18 4558609924   31
## 19 4702921684   31
## 20 5553957443   31
## 21 5577150313   30
## 22 6117666160   28
## 23 6290855005   29
## 24 6775888955   26
## 25 6962181067   31
## 26 7007744171   26
## 27 7086361926   31
## 28 8053475328   31
## 29 8253242879   19
## 30 8378563200   31
## 31 8583815059   31
## 32 8792009665   29
## 33 8877689391   31
#How many participants are in the dataset?

length(table(dailyCalories_merged$Id))
## [1] 33
#or 
n_distinct(dailyCalories_merged$Id)
## [1] 33

2.4.3 dailyIntensities_merged

# first view of the data
head(dailyIntensities_merged)
##           Id ActivityDay SedentaryMinutes LightlyActiveMinutes
## 1 1503960366   4/12/2016              728                  328
## 2 1503960366   4/13/2016              776                  217
## 3 1503960366   4/14/2016             1218                  181
## 4 1503960366   4/15/2016              726                  209
## 5 1503960366   4/16/2016              773                  221
## 6 1503960366   4/17/2016              539                  164
##   FairlyActiveMinutes VeryActiveMinutes SedentaryActiveDistance
## 1                  13                25                       0
## 2                  19                21                       0
## 3                  11                30                       0
## 4                  34                29                       0
## 5                  10                36                       0
## 6                  20                38                       0
##   LightActiveDistance ModeratelyActiveDistance VeryActiveDistance
## 1                6.06                     0.55               1.88
## 2                4.71                     0.69               1.57
## 3                3.91                     0.40               2.44
## 4                2.83                     1.26               2.14
## 5                5.04                     0.41               2.71
## 6                2.51                     0.78               3.19
#structure of dataset
str(dailyIntensities_merged)
## 'data.frame':    940 obs. of  10 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDay             : chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...
#skimming

skim_without_charts(dailyIntensities_merged)
Data summary
Name dailyIntensities_merged
Number of rows 940
Number of columns 10
_______________________
Column type frequency:
character 1
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ActivityDay 0 1 8 9 0 31 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
Id 0 1 4.855407e+09 2.424805e+09 1503960366 2.320127e+09 4.445115e+09 6.962181e+09 8.877689e+09
SedentaryMinutes 0 1 9.912100e+02 3.012700e+02 0 7.297500e+02 1.057500e+03 1.229500e+03 1.440000e+03
LightlyActiveMinutes 0 1 1.928100e+02 1.091700e+02 0 1.270000e+02 1.990000e+02 2.640000e+02 5.180000e+02
FairlyActiveMinutes 0 1 1.356000e+01 1.999000e+01 0 0.000000e+00 6.000000e+00 1.900000e+01 1.430000e+02
VeryActiveMinutes 0 1 2.116000e+01 3.284000e+01 0 0.000000e+00 4.000000e+00 3.200000e+01 2.100000e+02
SedentaryActiveDistance 0 1 0.000000e+00 1.000000e-02 0 0.000000e+00 0.000000e+00 0.000000e+00 1.100000e-01
LightActiveDistance 0 1 3.340000e+00 2.040000e+00 0 1.950000e+00 3.360000e+00 4.780000e+00 1.071000e+01
ModeratelyActiveDistance 0 1 5.700000e-01 8.800000e-01 0 0.000000e+00 2.400000e-01 8.000000e-01 6.480000e+00
VeryActiveDistance 0 1 1.500000e+00 2.660000e+00 0 0.000000e+00 2.100000e-01 2.050000e+00 2.192000e+01
#how many participant records exist in the datasets?

as.data.frame(table(dailyIntensities_merged$Id)) %>% rename(Id =Var1 )
##            Id Freq
## 1  1503960366   31
## 2  1624580081   31
## 3  1644430081   30
## 4  1844505072   31
## 5  1927972279   31
## 6  2022484408   31
## 7  2026352035   31
## 8  2320127002   31
## 9  2347167796   18
## 10 2873212765   31
## 11 3372868164   20
## 12 3977333714   30
## 13 4020332650   31
## 14 4057192912    4
## 15 4319703577   31
## 16 4388161847   31
## 17 4445114986   31
## 18 4558609924   31
## 19 4702921684   31
## 20 5553957443   31
## 21 5577150313   30
## 22 6117666160   28
## 23 6290855005   29
## 24 6775888955   26
## 25 6962181067   31
## 26 7007744171   26
## 27 7086361926   31
## 28 8053475328   31
## 29 8253242879   19
## 30 8378563200   31
## 31 8583815059   31
## 32 8792009665   29
## 33 8877689391   31
#How many participants are in the dataset?

length(table(dailyIntensities_merged$Id))
## [1] 33
#or 
n_distinct(dailyIntensities_merged$Id)
## [1] 33

2.4.4 dailySteps_merged

# first view of the data
head(dailySteps_merged)
##           Id ActivityDay StepTotal
## 1 1503960366   4/12/2016     13162
## 2 1503960366   4/13/2016     10735
## 3 1503960366   4/14/2016     10460
## 4 1503960366   4/15/2016      9762
## 5 1503960366   4/16/2016     12669
## 6 1503960366   4/17/2016      9705
#structure of dataset
str(dailySteps_merged)
## 'data.frame':    940 obs. of  3 variables:
##  $ Id         : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDay: chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ StepTotal  : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
#skimming

skim_without_charts(dailySteps_merged)
Data summary
Name dailySteps_merged
Number of rows 940
Number of columns 3
_______________________
Column type frequency:
character 1
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ActivityDay 0 1 8 9 0 31 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
Id 0 1 4.855407e+09 2.424805e+09 1503960366 2.320127e+09 4445114986.0 6962181067 8877689391
StepTotal 0 1 7.637910e+03 5.087150e+03 0 3.789750e+03 7405.5 10727 36019
#how many participant records exist in the datasets?

as.data.frame(table(dailySteps_merged$Id)) %>% rename(Id =Var1 )
##            Id Freq
## 1  1503960366   31
## 2  1624580081   31
## 3  1644430081   30
## 4  1844505072   31
## 5  1927972279   31
## 6  2022484408   31
## 7  2026352035   31
## 8  2320127002   31
## 9  2347167796   18
## 10 2873212765   31
## 11 3372868164   20
## 12 3977333714   30
## 13 4020332650   31
## 14 4057192912    4
## 15 4319703577   31
## 16 4388161847   31
## 17 4445114986   31
## 18 4558609924   31
## 19 4702921684   31
## 20 5553957443   31
## 21 5577150313   30
## 22 6117666160   28
## 23 6290855005   29
## 24 6775888955   26
## 25 6962181067   31
## 26 7007744171   26
## 27 7086361926   31
## 28 8053475328   31
## 29 8253242879   19
## 30 8378563200   31
## 31 8583815059   31
## 32 8792009665   29
## 33 8877689391   31
#How many participants are in the dataset?

length(table(dailySteps_merged$Id))
## [1] 33
#or 
n_distinct(dailySteps_merged$Id)
## [1] 33

2.4.5 sleepDay_merged

# first view of the data
head(sleepDay_merged)
##           Id              SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304
##   TotalTimeInBed
## 1            346
## 2            407
## 3            442
## 4            367
## 5            712
## 6            320
#structure of dataset
str(sleepDay_merged)
## 'data.frame':    413 obs. of  5 variables:
##  $ Id                : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ SleepDay          : chr  "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
##  $ TotalSleepRecords : int  1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: int  327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : int  346 407 442 367 712 320 377 364 384 449 ...
#skimming

skim_without_charts(sleepDay_merged)
Data summary
Name sleepDay_merged
Number of rows 413
Number of columns 5
_______________________
Column type frequency:
character 1
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
SleepDay 0 1 20 21 0 31 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
Id 0 1 5.000979e+09 2.06036e+09 1503960366 3977333714 4702921684 6962181067 8792009665
TotalSleepRecords 0 1 1.120000e+00 3.50000e-01 1 1 1 1 3
TotalMinutesAsleep 0 1 4.194700e+02 1.18340e+02 58 361 433 490 796
TotalTimeInBed 0 1 4.586400e+02 1.27100e+02 61 403 463 526 961
#how many participant records exist in the datasets?

as.data.frame(table(sleepDay_merged$Id)) %>% rename(Id =Var1 )
##            Id Freq
## 1  1503960366   25
## 2  1644430081    4
## 3  1844505072    3
## 4  1927972279    5
## 5  2026352035   28
## 6  2320127002    1
## 7  2347167796   15
## 8  3977333714   28
## 9  4020332650    8
## 10 4319703577   26
## 11 4388161847   24
## 12 4445114986   28
## 13 4558609924    5
## 14 4702921684   28
## 15 5553957443   31
## 16 5577150313   26
## 17 6117666160   18
## 18 6775888955    3
## 19 6962181067   31
## 20 7007744171    2
## 21 7086361926   24
## 22 8053475328    3
## 23 8378563200   32
## 24 8792009665   15
#How many participants are in the dataset?

length(table(sleepDay_merged$Id))
## [1] 24
#or 
n_distinct(sleepDay_merged$Id)
## [1] 24

2.4.6 weightLogInfo_merged

# first view of the data
head(weightLogInfo_merged)
##           Id                  Date WeightKg WeightPounds Fat   BMI
## 1 1503960366  5/2/2016 11:59:59 PM     52.6     115.9631  22 22.65
## 2 1503960366  5/3/2016 11:59:59 PM     52.6     115.9631  NA 22.65
## 3 1927972279  4/13/2016 1:08:52 AM    133.5     294.3171  NA 47.54
## 4 2873212765 4/21/2016 11:59:59 PM     56.7     125.0021  NA 21.45
## 5 2873212765 5/12/2016 11:59:59 PM     57.3     126.3249  NA 21.69
## 6 4319703577 4/17/2016 11:59:59 PM     72.4     159.6147  25 27.45
##   IsManualReport        LogId
## 1           True 1.462234e+12
## 2           True 1.462320e+12
## 3          False 1.460510e+12
## 4           True 1.461283e+12
## 5           True 1.463098e+12
## 6           True 1.460938e+12
#structure of dataset
str(weightLogInfo_merged)
## 'data.frame':    67 obs. of  8 variables:
##  $ Id            : num  1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
##  $ Date          : chr  "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
##  $ WeightKg      : num  52.6 52.6 133.5 56.7 57.3 ...
##  $ WeightPounds  : num  116 116 294 125 126 ...
##  $ Fat           : int  22 NA NA NA NA 25 NA NA NA NA ...
##  $ BMI           : num  22.6 22.6 47.5 21.5 21.7 ...
##  $ IsManualReport: chr  "True" "True" "False" "True" ...
##  $ LogId         : num  1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
#skimming

skim_without_charts(weightLogInfo_merged)
Data summary
Name weightLogInfo_merged
Number of rows 67
Number of columns 8
_______________________
Column type frequency:
character 2
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Date 0 1 19 21 0 56 0
IsManualReport 0 1 4 5 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
Id 0 1.00 7.009282e+09 1.950322e+09 1.503960e+09 6.962181e+09 6.962181e+09 8.877689e+09 8.877689e+09
WeightKg 0 1.00 7.204000e+01 1.392000e+01 5.260000e+01 6.140000e+01 6.250000e+01 8.505000e+01 1.335000e+02
WeightPounds 0 1.00 1.588100e+02 3.070000e+01 1.159600e+02 1.353600e+02 1.377900e+02 1.875000e+02 2.943200e+02
Fat 65 0.03 2.350000e+01 2.120000e+00 2.200000e+01 2.275000e+01 2.350000e+01 2.425000e+01 2.500000e+01
BMI 0 1.00 2.519000e+01 3.070000e+00 2.145000e+01 2.396000e+01 2.439000e+01 2.556000e+01 4.754000e+01
LogId 0 1.00 1.461772e+12 7.829948e+08 1.460444e+12 1.461079e+12 1.461802e+12 1.462375e+12 1.463098e+12
#how many participant records exist in the datasets?

as.data.frame(table(weightLogInfo_merged$Id)) %>% rename(Id =Var1 )
##           Id Freq
## 1 1503960366    2
## 2 1927972279    1
## 3 2873212765    2
## 4 4319703577    2
## 5 4558609924    5
## 6 5577150313    1
## 7 6962181067   30
## 8 8877689391   24
#How many participants are in the dataset?

length(table(weightLogInfo_merged$Id))
## [1] 8
#or 
n_distinct(weightLogInfo_merged$Id)
## [1] 8

3.0 PHASE 3: PROCESS

Notice that the dailyActivity_merged dataset is simply a composite of the dailyCalories_merged, dailyIntensities_merged, and dailySteps_merged datasets.

Hence, the datasets used in this study will be dailyActivity_merged, sleepDay_merged, and weightLogInfo_merged.
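
A quick way to confirm this (a sketch; it assumes the rows of the daily tables line up in the same order, as the previews in section 2.4 suggest):

# Each daily table should match the corresponding columns of dailyActivity_merged
all(dailyActivity_merged$Calories   == dailyCalories_merged$Calories)
all(dailyActivity_merged$TotalSteps == dailySteps_merged$StepTotal)
all(dailyActivity_merged$SedentaryMinutes == dailyIntensities_merged$SedentaryMinutes)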

3.1 Loading Packages
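
The packages this phase relies on (an assumed set, inferred from the functions called below):

library(janitor)    # get_dupes()
library(tidyverse)  # distinct(), separate(), the %>% pipe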

3.2 Checking for duplicate data

dailyActivity_merged %>% get_dupes()
## No variable names specified - using all columns.
## No duplicate combinations found of: Id, ActivityDate, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDistance, VeryActiveDistance, ModeratelyActiveDistance, LightActiveDistance, ... and 6 other variables
##  [1] Id                       ActivityDate             TotalSteps              
##  [4] TotalDistance            TrackerDistance          LoggedActivitiesDistance
##  [7] VeryActiveDistance       ModeratelyActiveDistance LightActiveDistance     
## [10] SedentaryActiveDistance  VeryActiveMinutes        FairlyActiveMinutes     
## [13] LightlyActiveMinutes     SedentaryMinutes         Calories                
## [16] dupe_count              
## <0 rows> (or 0-length row.names)
sleepDay_merged %>% get_dupes()
## No variable names specified - using all columns.
##           Id              SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 4388161847  5/5/2016 12:00:00 AM                 1                471
## 2 4388161847  5/5/2016 12:00:00 AM                 1                471
## 3 4702921684  5/7/2016 12:00:00 AM                 1                520
## 4 4702921684  5/7/2016 12:00:00 AM                 1                520
## 5 8378563200 4/25/2016 12:00:00 AM                 1                388
## 6 8378563200 4/25/2016 12:00:00 AM                 1                388
##   TotalTimeInBed dupe_count
## 1            495          2
## 2            495          2
## 3            543          2
## 4            543          2
## 5            402          2
## 6            402          2
weightLogInfo_merged %>% get_dupes()
## No variable names specified - using all columns.
## No duplicate combinations found of: Id, Date, WeightKg, WeightPounds, Fat, BMI, IsManualReport, LogId
## [1] Id             Date           WeightKg       WeightPounds   Fat           
## [6] BMI            IsManualReport LogId          dupe_count    
## <0 rows> (or 0-length row.names)

I discovered three duplicated rows in sleepDay_merged by using the get_dupes() function from the janitor package.

By running skim_without_charts() from the skimr package, I found that the Fat variable in weightLogInfo_merged contains 65 NA values.

Hence, I decided to drop the Fat column in weightLogInfo_merged and remove the duplicated rows in sleepDay_merged.

3.3 Remove duplicates and NA

sleepDay_merged <- sleepDay_merged %>% distinct()
weightLogInfo_merged <- subset(weightLogInfo_merged, select = -c(Fat))
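
As a quick check (an expected value, not output from the original run), the sleep table should now contain 410 rows: the original 413 minus the 3 duplicated rows.

nrow(sleepDay_merged)  # expect 410 (413 - 3 duplicates)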

3.4 Transforming data

The date and time are recorded together in the same column in both the sleep and weight datasets. Since I may use dates to analyze the data across the three datasets, it will be most helpful to split them into separate Date and Time columns.

weightLogInfo_merged <- weightLogInfo_merged %>% separate(Date, into=c("Date", "Time"), sep=" ")
## Warning: Expected 2 pieces. Additional pieces discarded in 67 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
sleepDay_merged <- sleepDay_merged %>% separate(SleepDay, into=c("Date", "Time"), sep=" ")
## Warning: Expected 2 pieces. Additional pieces discarded in 410 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
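
The warnings appear because the timestamps contain two spaces ("4/12/2016 12:00:00 AM"), so splitting on every space discards the trailing AM/PM piece. The discarded pieces are not needed here, but if the full time string mattered, extra = "merge" would keep everything after the first split together. A toy sketch:

# With extra = "merge", everything after the first space stays in one piece
tibble(SleepDay = "4/12/2016 12:00:00 AM") %>%
  separate(SleepDay, into = c("Date", "Time"), sep = " ", extra = "merge")
# yields Date = "4/12/2016" and Time = "12:00:00 AM", with no warning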

4.0 PHASE 4: ANALYZE

4.1 Loading Packages
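
The packages this phase relies on (an assumed set, inferred from the functions called below):

library(tidyverse)  # dplyr verbs, ggplot2, purrr::reduce()
library(reshape2)   # melt(), for the correlation heatmap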

4.2 Quick summary statistics

For the dailyActivity_merged dataset:

dailyActivity_merged %>%  
  select(TotalSteps,
         TotalDistance,
         SedentaryMinutes,
         Calories) %>%
  summary()
##    TotalSteps    TotalDistance    SedentaryMinutes    Calories   
##  Min.   :    0   Min.   : 0.000   Min.   :   0.0   Min.   :   0  
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 729.8   1st Qu.:1828  
##  Median : 7406   Median : 5.245   Median :1057.5   Median :2134  
##  Mean   : 7638   Mean   : 5.490   Mean   : 991.2   Mean   :2304  
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:1229.5   3rd Qu.:2793  
##  Max.   :36019   Max.   :28.030   Max.   :1440.0   Max.   :4900

For the sleepDay_merged dataset:

sleepDay_merged %>%  
  select(TotalSleepRecords,
  TotalMinutesAsleep,
  TotalTimeInBed) %>%
  summary()
##  TotalSleepRecords TotalMinutesAsleep TotalTimeInBed 
##  Min.   :1.00      Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:1.00      1st Qu.:361.0      1st Qu.:403.8  
##  Median :1.00      Median :432.5      Median :463.0  
##  Mean   :1.12      Mean   :419.2      Mean   :458.5  
##  3rd Qu.:1.00      3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :3.00      Max.   :796.0      Max.   :961.0

For the weightLogInfo_merged dataset:

weightLogInfo_merged %>% 
  select(WeightKg,BMI ) %>% 
  summary()
##     WeightKg           BMI       
##  Min.   : 52.60   Min.   :21.45  
##  1st Qu.: 61.40   1st Qu.:23.96  
##  Median : 62.50   Median :24.39  
##  Mean   : 72.04   Mean   :25.19  
##  3rd Qu.: 85.05   3rd Qu.:25.56  
##  Max.   :133.50   Max.   :47.54

4.3 Data Exploration

4.3.1 dailyActivity_merged

ggplot(data=dailyActivity_merged, aes(x=TotalSteps, y=TotalDistance)) + geom_point() +geom_smooth(method=lm) 
## `geom_smooth()` using formula 'y ~ x'

ggplot(data=dailyActivity_merged, aes(x=TotalSteps, y=SedentaryMinutes)) + geom_point() + geom_smooth(method=lm) 
## `geom_smooth()` using formula 'y ~ x'

ggplot(data=dailyActivity_merged, aes(x=Calories, y=TotalDistance)) + geom_point() + geom_smooth(method=lm) 
## `geom_smooth()` using formula 'y ~ x'

Trends for dailyActivity_merged

  • TotalSteps and TotalDistance are very strongly correlated: the more steps a user takes, the farther they travel (the sketch after this list quantifies the relationships).

  • The scatter plot of TotalSteps against SedentaryMinutes suggests that the more sedentary minutes a user logs, the fewer steps they tend to take.

  • In other words, the more time a user spends on sedentary activities, the less likely they are to engage in physical activity.

  • The scatter plot of Calories against TotalDistance suggests that the greater the total distance, the more calories are burned.

  • In other words, the farther a user walks, the greater the chance of reaching a caloric deficit.
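
The relationships claimed above can be quantified directly with Pearson correlations (a quick sketch; the exact values are not reproduced here):

# Correlations backing the trends above
cor(dailyActivity_merged$TotalSteps, dailyActivity_merged$TotalDistance)
cor(dailyActivity_merged$TotalSteps, dailyActivity_merged$SedentaryMinutes)
cor(dailyActivity_merged$Calories,   dailyActivity_merged$TotalDistance)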

4.3.2 sleepDay_merged

Heatmap

# correlation matrix
cormat <- sleepDay_merged %>%  
  select(TotalSleepRecords,
         TotalMinutesAsleep,
         TotalTimeInBed) %>%
  cor() %>% round(2)


#reshape the data for heatmap
melted_cormat <- melt(cormat)

#data visualization
# a single fill scale; the duplicated scale_fill call in the original
# triggered a "Scale for 'fill' is already present" warning
ggplot(data = melted_cormat, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  ggtitle("sleepDay_merged dataset") +
  theme(plot.title = element_text(size = 20, face = "bold")) +
  geom_text(aes(label = round(value, 3))) +
  scale_fill_gradient(low = "red", high = "lightgreen")

Trends for sleepDay_merged

  • Based on the heat map, the correlation between TotalTimeInBed and TotalMinutesAsleep is very high and positive.

  • In other words, as sleep duration increases, so does total time in bed.

4.3.3 weightLogInfo_merged

ggplot(data=weightLogInfo_merged, aes(x=WeightKg, y=BMI)) + geom_point() + geom_smooth(method=lm) 
## `geom_smooth()` using formula 'y ~ x'

Trends for weightLogInfo_merged

  • No notable trend was found.

  • BMI is a derived measure that relates body weight to height, so weight alone tracks it only loosely; a worked example follows below.
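
For reference, BMI is weight in kilograms divided by height in meters squared. A worked example (the height is a hypothetical value, since this dataset does not record it):

# BMI = weight (kg) / height (m)^2
weight_kg <- 72.04       # mean WeightKg from the summary in section 4.2
height_m  <- 1.69        # hypothetical height; not present in the data
weight_kg / height_m^2   # ~25.2, close to the dataset's mean BMI of 25.19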

4.4 Data merging

I decided to merge dailyActivity_merged and sleepDay_merged.

weightLogInfo_merged will not be combined: it contains data from only 8 users, whereas the other datasets cover 33 users, which makes it unrepresentative and unsuitable for this merge. Hence, weightLogInfo_merged is excluded.

#put all data frames into list
data_list <- list(dailyActivity_merged, sleepDay_merged)

#merge all data frames in list
combined_data<- data_list %>% reduce(full_join, by='Id')
ncol(combined_data)
## [1] 20
n_distinct(combined_data$Id)
## [1] 33
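
Note that full_join() on Id alone pairs every activity row with every sleep row for the same user. If a strictly day-level view were needed, joining on both the user and the calendar date would avoid that inflation (a sketch, assuming the date strings align after the separate() step in section 3.4):

# Day-level merge: pair each activity day with that day's sleep record only
combined_by_day <- dailyActivity_merged %>%
  rename(Date = ActivityDate) %>%
  inner_join(sleepDay_merged, by = c("Id", "Date"))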

5.0 PHASE 5: SHARE

5.1 Tableau Dashboard

write.csv(combined_data,"C:\\Users\\soonk\\Documents\\GitHub\\DataAnalytics_Capstone_case_study2\\datasets\\combined_data.csv", row.names = FALSE)

Part of this phase is completed in Tableau; refer to the Tableau workbook (share.twbx).

I created a dashboard to make it easier to spot trends and obtain insights from the data.

5.2 Minutes Asleep vs. Sedentary Minutes

ggplot(data=combined_data, aes(x=TotalMinutesAsleep, y=SedentaryMinutes)) + 
geom_point(color='darkblue') + geom_smooth() +
  labs(title="Minutes Asleep vs. Sedentary Minutes")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 227 rows containing non-finite values (stat_smooth).
## Warning: Removed 227 rows containing missing values (geom_point).

Interpretation

  • No notable trends or findings were observed.

5.3 Total Minutes Asleep vs. Total Time in Bed

ggplot(data=combined_data, aes(x=TotalMinutesAsleep, y=TotalTimeInBed)) + 
  geom_point()+ labs(title="Total Minutes Asleep vs. Total Time in Bed")+geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 227 rows containing non-finite values (stat_smooth).
## Warning: Removed 227 rows containing missing values (geom_point).

Interpretation

  • Total time in bed appears positively related to total time asleep. To help Bellabeat users get better sleep and improve the overall customer experience, Bellabeat should consider implementing a notification system that prompts users to lie down and wind down for bed.

5.4 Total Steps vs. Calories

ggplot(data=combined_data, aes(x=TotalSteps, y=Calories)) + 
  geom_point() + geom_smooth() + labs(title="Total Steps vs. Calories")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Interpretation

  • The scatter plot of Calories against TotalSteps suggests that the higher the step count, the more calories are burned.

  • In other words, the more a user walks, the greater the chance of reaching a caloric deficit.

  • Consequently, a reward system should be considered to motivate users to walk and exercise more. Consider a comprehensive reward system with targeted tasks: for example, if a user's total steps exceed 5,000, they receive 500 points. Points can be redeemed for rewards such as Bellabeat merchandise or discounts on other Bellabeat products, allowing the company to promote the wider product line while increasing revenue and attracting potential customers.

5.5 Percentage of Activity in Minutes

very_active_min <- sum(combined_data$VeryActiveMinutes)
fairly_active_min <- sum(combined_data$FairlyActiveMinutes)
lightly_activemin <- sum(combined_data$LightlyActiveMinutes)
sedentary_min <- sum(combined_data$SedentaryMinutes)
total_min <- very_active_min + fairly_active_min + lightly_activemin + sedentary_min


min_list <- c(very_active_min,fairly_active_min,lightly_activemin,sedentary_min)

lbls <- c("VeryActive","FairlyActive","LightlyActive","Sedentary")


pct <- round(min_list/total_min*100)

lbls <- paste(lbls, pct)
lbls <- paste(lbls, "%", sep="")

pie(min_list, labels = lbls, col = c("purple", "violetred1", "green3","cyan"), main = "Percentage of Activity in Minutes")

Interpretation

  • The average user spends 79% of their time sedentary; very active and fairly active minutes together make up only 2% of the total, and lightly active minutes account for the remaining 19%.

  • This is not an ideal profile for users of fitness-tracking devices.

6.0 PHASE 6: ACT

6.1 Revisiting Business Task

The business task is to analyze usage data from non-Bellabeat smart devices in order to gain insight into how people are already using their smart devices, and then, using that data, to generate recommendations that help the Bellabeat marketing strategy team understand the trends. These recommendations and trend insights will be used to enhance the features, functionality, and service quality of the Bellabeat app. The business task can be summed up as follows:

  1. Analyze non-Bellabeat smart device usage data to gain insights that help the Bellabeat marketing strategy team understand usage trends.

  2. Utilize the trends and insights provided by smart device usage data to improve the features, functionality, and overall service quality of the Bellabeat app.

6.2 Recommendations

6.2.1 Complete Reward System

  • A reward system should be devised to recognize users who perform well, in an effort to attract and retain customers.

  • Rewards should be given to users who reach various levels based on the number of daily steps taken.

  • To advance to the next level, the user must maintain a certain level of activity for a set period (a week or a month). At each level, the user would earn a certain number of points redeemable for Bellabeat merchandise or discounts on other Bellabeat products.

  • This strategy will indirectly promote and increase sales of the wider Bellabeat product line while also enhancing the brand's reputation in the marketplace.

6.2.2 Social Media Contest

  • A social media contest encourages interaction, followers, leads, and brand exposure in exchange for prizes and incentives.

  • Rewards may be given to followers for liking, commenting on, and sharing Bellabeat product content (Facebook, Instagram, TikTok, YouTube, and so on). This increases the brand's reach and buzz.

  • Like, share, or comment to win a Bellabeat product.

  • Creative video or photo contests to win a Bellabeat product.

6.2.3 In-App Competitions and Rankings

  • Bellabeat could enable in-app tournaments against friends or other users in the same city or state to encourage engagement in physical activity.

  • An animated, creative total-steps ranking page for users in the same city or state would encourage them to be more active and spend less time sedentary.

6.2.4 Notifications and Sleep Time

  • Participants slept less than eight hours a night on average, so notifications and sleep time should be a priority.

  • Total time in bed is positively correlated with total time asleep, according to our results. Users could set a bedtime and receive a reminder a few minutes beforehand. Breathing guidance, podcasts with relaxing music, and sleep techniques can also help customers fall asleep.

  • Bellabeat should therefore build a feature set that includes sleep guidance, heart rate monitoring during sleep, and a bedtime reminder in order to improve users' quality of sleep.

6.3 Future Works

  • Use a larger, more representative sample (targeting a 95% confidence level and a 5% margin of error).

  • Select participants randomly, with no selection bias.

  • Collect data for at least one year.

  • Collect more demographic attributes, such as age, sex, and height.

  • Use more recent data, ideally from within the past year.

  • Obtain an original (first-party) data source, or verify secondary and third-party data for integrity and trustworthiness.