Google Capstone Study 2

Introduction

Over the course of the Google Data Analytics (DA) Certificate program, I have learned what it means to be a data analyst and to work through the Data Analytics Life Cycle (DALC). A data analyst is responsible for (but not limited to) collecting, storing, and organizing data so that a company can make data-driven decisions. The DALC consists of the following steps:

  1. Ask - define a clear summary of the business objective.
  2. Prepare - provide a description of all data sources used.
  3. Process - document any data cleaning or manipulation tactics.
  4. Analyze - provide a summary of analysis.
  5. Share - provide supporting visualizations and key findings.
  6. Act - provide top high-level recommendations or next steps based on analysis.

As I went through each phase of the DALC, I learned a series of technical tools such as Excel, Google Sheets, SQL, Tableau, PowerPoint, Google Slides, and RStudio. While gaining these technical skills, I also developed communication, critical thinking, problem-solving, research, and analytical skills. While this capstone project will not reflect all of these skills, it will showcase my ability to apply new skills as I work toward earning my Google DA Certificate.

Ask

1.1 Business Objective

Bellabeat wants to gather insights into how people already use their smart devices. These insights will inform business decisions that shape its marketing strategy. The strategy will allow Bellabeat to empower women with knowledge of their own activity, sleep, stress, and reproductive health through those smart devices.

1.2 Business Goals

1.3 Key Stakeholders

Prepare & Process

2.1 Useful Data Tools

This capstone project uses several tools to prepare, process, analyze, and share the data within each dataset. The work starts in Excel and then moves into RStudio, where insights are gathered for analysis. Other resources such as search engines, supplemental guides, and templates were also consulted to complete this capstone project.

2.2 Data Collection, Storage, Organization, & Use

For this capstone project, third-party data was collected from the FitBit Fitness Tracker Data set published by Möbius in the Kaggle community. This Kaggle dataset contains personal fitness tracker data from thirty Fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. No personal information such as names or addresses is disclosed. The data is accessible and transparent to anyone interested in analyzing the dataset. For this capstone project, the dataset has been stored in an RStudio Cloud project and on a desktop. Before beginning any data exploration, I identified the contents of 18 different CSV files in an assortment of long and wide data formats. The data seems to be categorized into three levels of granularity: seconds, minutes, and daily.

2.3 Loading Packages

The following packages are installed and loaded because they are among the most commonly used for data exploration, cleaning, manipulation, analysis, and visualization. Each package provides its own set of functions. Note: not every package may be used in this analysis.

library("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library("janitor")
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library("lubridate")
## Loading required package: timechange
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library("dplyr")
library("ggplot2")

2.4 Importing Datasets & Data Exploration

The following datasets were chosen for import after an initial review of the data. This initial review included identifying the field names of each dataset or CSV file, which gives us some insight into how to develop our analysis. The data is relevant to the business’s needs, every field has its own attributes assigned, and each field follows proper naming conventions across all datasets. This supports the accuracy, completeness, and consistency of the data, and therefore its integrity.

activity <- read.csv("/cloud/project/Fitabase_Data_4.12.16_5.12.16/dailyActivity_merged.csv")
sleep <- read.csv("/cloud/project/Fitabase_Data_4.12.16_5.12.16/sleepDay_merged.csv")
weight <- read.csv("/cloud/project/Fitabase_Data_4.12.16_5.12.16/weightLogInfo_merged.csv")
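The field names of every CSV file can be checked without loading each file in full. The following is a minimal sketch of one way to do that; `peek_fields` is a hypothetical helper name that is not part of the original analysis, and it assumes every file in the folder has a header row.

```r
# Hypothetical helper: return the column names of every CSV file in a folder.
peek_fields <- function(data_dir) {
  csv_files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)
  # Read at most one data row per file; only the header is needed.
  fields <- lapply(csv_files, function(path) names(read.csv(path, nrows = 1)))
  names(fields) <- basename(csv_files)
  fields
}

# Example call, using the folder from the imports above:
# peek_fields("/cloud/project/Fitabase_Data_4.12.16_5.12.16")
```

The result is a named list with one character vector of field names per file, which makes it easy to compare naming conventions across files.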

2.5 Data Cleansing

Based on an exploration of several tables, data integrity appears valid. Each field’s heading follows proper naming conventions, each attribute is assigned to its field, and the data is consistent throughout each dataset, making the datasets ready for analysis, as shown below.

head(activity)
##           Id ActivityDay TotalSteps TotalDistance TrackerDistance
## 1 1503960366   4/12/2016      13162          8.50            8.50
## 2 1503960366   4/13/2016      10735          6.97            6.97
## 3 1503960366   4/14/2016      10460          6.74            6.74
## 4 1503960366   4/15/2016       9762          6.28            6.28
## 5 1503960366   4/16/2016      12669          8.16            8.16
## 6 1503960366   4/17/2016       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728
colnames(activity)
##  [1] "Id"                       "ActivityDay"             
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"
head(sleep)
##           Id              SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304
##   TotalTimeInBed
## 1            346
## 2            407
## 3            442
## 4            367
## 5            712
## 6            320
colnames(sleep)
## [1] "Id"                 "SleepDay"           "TotalSleepRecords" 
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
head(weight)
##           Id                  Date WeightKg WeightPounds Fat   BMI
## 1 1503960366  5/2/2016 11:59:59 PM     52.6     115.9631  22 22.65
## 2 1503960366  5/3/2016 11:59:59 PM     52.6     115.9631  NA 22.65
## 3 1927972279  4/13/2016 1:08:52 AM    133.5     294.3171  NA 47.54
## 4 2873212765 4/21/2016 11:59:59 PM     56.7     125.0021  NA 21.45
## 5 2873212765 5/12/2016 11:59:59 PM     57.3     126.3249  NA 21.69
## 6 4319703577 4/17/2016 11:59:59 PM     72.4     159.6147  25 27.45
##   IsManualReport        LogId
## 1           True 1.462234e+12
## 2           True 1.462320e+12
## 3          False 1.460510e+12
## 4           True 1.461283e+12
## 5           True 1.463098e+12
## 6           True 1.460938e+12
colnames(weight)
## [1] "Id"             "Date"           "WeightKg"       "WeightPounds"  
## [5] "Fat"            "BMI"            "IsManualReport" "LogId"
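The output above shows dates stored as character strings and missing values (NA) in the Fat field, so a few basic checks can back up the integrity claim. The following is a minimal sketch in base R, run on small toy data frames so it is self-contained. The toy rows mirror values from the output above, except for one duplicate row added on purpose for illustration; the object names ending in `_toy` and `_clean` are assumptions, not part of the original analysis.

```r
# Toy data mirroring the head() output above. The third activity row is a
# deliberate duplicate, added only to demonstrate the duplicate check.
activity_toy <- data.frame(
  Id = c(1503960366, 1503960366, 1503960366),
  ActivityDay = c("4/12/2016", "4/13/2016", "4/13/2016"),
  TotalSteps = c(13162, 10735, 10735)
)
weight_toy <- data.frame(
  Id = c(1503960366, 1503960366, 1927972279),
  Date = c("5/2/2016 11:59:59 PM", "5/3/2016 11:59:59 PM", "4/13/2016 1:08:52 AM"),
  Fat = c(22, NA, NA)
)

# 1. Convert character dates into proper date and date-time types.
activity_toy$ActivityDay <- as.Date(activity_toy$ActivityDay, format = "%m/%d/%Y")
weight_toy$Date <- as.POSIXct(weight_toy$Date,
                              format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# 2. Count exact duplicate rows, then drop them.
n_duplicates <- sum(duplicated(activity_toy))
activity_clean <- activity_toy[!duplicated(activity_toy), ]

# 3. Count missing values per column.
missing_per_column <- colSums(is.na(weight_toy))
```

The same three steps could be applied to the imported `activity`, `sleep`, and `weight` data frames. The `lubridate` package loaded earlier offers parsing helpers for the same purpose; base R is used here only to keep the sketch free of dependencies.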

Analyze

3.1 A Few Summary Statistics

The following output shows how many participants and how many records each dataset contains.

Number of participants per dataset:

n_distinct(activity$Id)
## [1] 33
n_distinct(sleep$Id)
## [1] 24
n_distinct(weight$Id)
## [1] 8

Note: the activity dataset contains 33 distinct Ids, three more than the thirty users mentioned in the dataset description.

Number of records per dataset:

nrow(activity)
## [1] 940
nrow(sleep)
## [1] 413
nrow(weight)
## [1] 67
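
The participant and record counts above can also be collected into a single table for easier comparison. The following is a minimal sketch; `dataset_overview` is a hypothetical helper name, and it assumes every data frame has an `Id` column, as the three datasets imported above do.

```r
# Hypothetical helper: one row per dataset with participant and record counts.
dataset_overview <- function(datasets) {
  data.frame(
    dataset = names(datasets),
    participants = vapply(datasets, function(df) length(unique(df$Id)), integer(1)),
    records = vapply(datasets, nrow, integer(1)),
    row.names = NULL
  )
}

# Example call with the data frames imported above:
# dataset_overview(list(activity = activity, sleep = sleep, weight = weight))
```

Keeping both counts side by side makes it easier to see how few participants and records the weight dataset has compared with the other two.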

Below are summary statistics on all activity for participants:

summary(activity)
##        Id            ActivityDay          TotalSteps    TotalDistance   
##  Min.   :1.504e+09   Length:940         Min.   :    0   Min.   : 0.000  
##  1st Qu.:2.320e+09   Class :character   1st Qu.: 3790   1st Qu.: 2.620  
##  Median :4.445e+09   Mode  :character   Median : 7406   Median : 5.245  
##  Mean   :4.855e+09                      Mean   : 7638   Mean   : 5.490  
##  3rd Qu.:6.962e+09                      3rd Qu.:10727   3rd Qu.: 7.713  
##  Max.   :8.878e+09                      Max.   :36019   Max.   :28.030  
##  TrackerDistance  LoggedActivitiesDistance VeryActiveDistance
##  Min.   : 0.000   Min.   :0.0000           Min.   : 0.000    
##  1st Qu.: 2.620   1st Qu.:0.0000           1st Qu.: 0.000    
##  Median : 5.245   Median :0.0000           Median : 0.210    
##  Mean   : 5.475   Mean   :0.1082           Mean   : 1.503    
##  3rd Qu.: 7.710   3rd Qu.:0.0000           3rd Qu.: 2.053    
##  Max.   :28.030   Max.   :4.9421           Max.   :21.920    
##  ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
##  Min.   :0.0000           Min.   : 0.000      Min.   :0.000000       
##  1st Qu.:0.0000           1st Qu.: 1.945      1st Qu.:0.000000       
##  Median :0.2400           Median : 3.365      Median :0.000000       
##  Mean   :0.5675           Mean   : 3.341      Mean   :0.001606       
##  3rd Qu.:0.8000           3rd Qu.: 4.782      3rd Qu.:0.000000       
##  Max.   :6.4800           Max.   :10.710      Max.   :0.110000       
##  VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
##  Min.   :  0.00    Min.   :  0.00      Min.   :  0.0        Min.   :   0.0  
##  1st Qu.:  0.00    1st Qu.:  0.00      1st Qu.:127.0        1st Qu.: 729.8  
##  Median :  4.00    Median :  6.00      Median :199.0        Median :1057.5  
##  Mean   : 21.16    Mean   : 13.56      Mean   :192.8        Mean   : 991.2  
##  3rd Qu.: 32.00    3rd Qu.: 19.00      3rd Qu.:264.0        3rd Qu.:1229.5  
##  Max.   :210.00    Max.   :143.00      Max.   :518.0        Max.   :1440.0  
##     Calories   
##  Min.   :   0  
##  1st Qu.:1828  
##  Median :2134  
##  Mean   :2304  
##  3rd Qu.:2793  
##  Max.   :4900

Below are summary statistics on all sleep activity for participants:

summary(sleep)
##        Id              SleepDay         TotalSleepRecords TotalMinutesAsleep
##  Min.   :1.504e+09   Length:413         Min.   :1.000     Min.   : 58.0     
##  1st Qu.:3.977e+09   Class :character   1st Qu.:1.000     1st Qu.:361.0     
##  Median :4.703e+09   Mode  :character   Median :1.000     Median :433.0     
##  Mean   :5.001e+09                      Mean   :1.119     Mean   :419.5     
##  3rd Qu.:6.962e+09                      3rd Qu.:1.000     3rd Qu.:490.0     
##  Max.   :8.792e+09                      Max.   :3.000     Max.   :796.0     
##  TotalTimeInBed 
##  Min.   : 61.0  
##  1st Qu.:403.0  
##  Median :463.0  
##  Mean   :458.6  
##  3rd Qu.:526.0  
##  Max.   :961.0

Share

4.1 Visualizations

The following pie chart shows each dataset’s number of distinct participants as a proportion of the whole. The chart is consistent with the distinct participant counts found for each dataset in the summary statistics.
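
The original chart is not reproduced here. The following is a minimal sketch of one way to draw a comparable pie chart with ggplot2, using the distinct participant counts reported in section 3.1. The object names `participants` and `pie_chart`, the title, and the styling are assumptions rather than the original code.

```r
library(ggplot2)

# Distinct participant counts reported in section 3.1.
participants <- data.frame(
  dataset = c("activity", "sleep", "weight"),
  n = c(33, 24, 8)
)

# Share of the summed counts. A participant can appear in more than one
# dataset, so these shares compare datasets rather than split one group.
participants$share <- participants$n / sum(participants$n)

# A pie chart in ggplot2 is a single stacked bar drawn in polar coordinates.
pie_chart <- ggplot(participants, aes(x = "", y = n, fill = dataset)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  labs(title = "Participants per dataset", fill = "Dataset") +
  theme_void()

# print(pie_chart)  # draws the chart in an interactive session
```

Because the same participant can contribute to several datasets, a bar chart of the three counts may be easier to read than a pie chart, but the pie form is kept here to match the description above.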

Act

5.1 Limitations & Conclusions

To further analyze daily activity, it is necessary to consult another analyst on the time stamp data. The time stamps can be used to determine when participants are tracking their data. Tracking time stamps also makes it possible to determine when a participant is sleeping and for how long, which will provide further insight into sleep health. Analyzing both together can give us further insight into the sleep data because we would know when the participant is active or asleep. This knowledge is important because sleep data needs to be more accurate and more consistent with daily activity. One way to achieve this is to develop new technology that tracks the battery life of participants’ devices. Tracking battery life can show when participants are recording data and help determine whether removing the smart device before bed is a key factor in how many sleep records are collected. Lastly, Bellabeat should focus on new technologies that can better track weight data. This could go as far as a smart scale linked to any smart devices the participants have. Seamless data tracking could make participants more likely to track their weight.

5.2 Marketing Strategies & Next Steps

The marketing analytics team should recommend new marketing strategies to Bellabeat’s cofounders. First, find an analyst to provide further insights into the current data. Next, market a longer battery life for participants’ smart devices, which should increase data collection. The analytics team can also look into introducing a new product line, a smart weight scale, which should help with the collection of weight data. Finally, further data could be gathered by prompting users to share feedback while using the application.

I Hope You Enjoyed My Analysis!

Kevin Ketchum