Over the course of the Google Data Analytics (DA) Certificate program, I have learned what it means to be a data analyst and to work through the Data Analytics Life Cycle (DALC). A data analyst is responsible for (but not limited to) collecting, storing, and organizing data so that a company can make data-driven decisions. The DALC consists of the following phases: Ask, Prepare, Process, Analyze, Share, and Act.
As I went through each phase of the DALC, I learned a series of technical tools such as Excel, Google Sheets, SQL, Tableau, PowerPoint, Google Slides, and RStudio. While gaining these technical skills, I also developed communication, critical thinking, problem-solving, research, and analytical skills. While this capstone project will not reflect all of these skills, it will showcase my ability to apply new skills as I work toward earning my Google DA Certificate.
Ask
Bellabeat has determined that they want to gather insights into how people are already using their smart devices. The insights gathered will help drive business decisions that will evolve their marketing strategy. This strategy will allow Bellabeat to empower women with knowledge of their own personal activities, sleep, stress, and reproductive health through those smart devices.
Prepare & Process
During this capstone project, several tools will be used to prepare, process, analyze, and share the data within each dataset. Starting with Excel and then moving into RStudio, insights will be gathered for analysis. Other resources such as search engines, supplemental guides, and templates will also be used to complete this capstone project.
For the purpose of this capstone project, third-party data was collected from the FitBit Fitness Tracker Data set, made available by Möbius through the Kaggle community. This Kaggle dataset contains personal fitness tracker data from thirty Fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. No personal information such as names or addresses is disclosed. The data will be accessible and transparent to anyone interested in the analysis of the dataset. For this capstone project, the dataset has been stored in an RStudio Cloud project and on a desktop. Before beginning any data exploration, I identified the contents of 18 different CSV files in an assortment of long and wide data formats. The data appears to be organized at three levels of granularity: seconds, minutes, and days.
The following packages are loaded because they are commonly used for data exploration, cleaning, manipulation, analysis, and visualization. Each package provides its own set of functions. Note: not all of the packages may be used in this analysis.
library("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library("janitor")
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library("lubridate")
## Loading required package: timechange
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library("dplyr")
library("ggplot2")
The following datasets were chosen for import after an initial review of the data. This initial review included identifying the field names of each dataset (CSV file). Knowing the field names gives us some insight into how to develop our analysis. The data is relevant to the business's needs, each field has its own attributes, and the fields follow consistent naming conventions across all datasets. This suggests the data is accurate, complete, and consistent, which supports its integrity.
activity <- read.csv("/cloud/project/Fitabase_Data_4.12.16_5.12.16/dailyActivity_merged.csv")
sleep <- read.csv("/cloud/project/Fitabase_Data_4.12.16_5.12.16/sleepDay_merged.csv")
weight <- read.csv("/cloud/project/Fitabase_Data_4.12.16_5.12.16/weightLogInfo_merged.csv")
Exploring some of the tables supports the data’s integrity. Each field’s heading follows consistent naming conventions, each attribute is assigned to its field, and the data is consistent throughout each dataset, making the datasets suitable for analysis, as seen below.
head(activity)
## Id ActivityDay TotalSteps TotalDistance TrackerDistance
## 1 1503960366 4/12/2016 13162 8.50 8.50
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 1.88 0.55
## 2 0 1.57 0.69
## 3 0 2.44 0.40
## 4 0 2.14 1.26
## 5 0 2.71 0.41
## 6 0 3.19 0.78
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 6.06 0 25
## 2 4.71 0 21
## 3 3.91 0 30
## 4 2.83 0 29
## 5 5.04 0 36
## 6 2.51 0 38
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 13 328 728 1985
## 2 19 217 776 1797
## 3 11 181 1218 1776
## 4 34 209 726 1745
## 5 10 221 773 1863
## 6 20 164 539 1728
colnames(activity)
## [1] "Id" "ActivityDay"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories"
head(sleep)
## Id SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM 1 327
## 2 1503960366 4/13/2016 12:00:00 AM 2 384
## 3 1503960366 4/15/2016 12:00:00 AM 1 412
## 4 1503960366 4/16/2016 12:00:00 AM 2 340
## 5 1503960366 4/17/2016 12:00:00 AM 1 700
## 6 1503960366 4/19/2016 12:00:00 AM 1 304
## TotalTimeInBed
## 1 346
## 2 407
## 3 442
## 4 367
## 5 712
## 6 320
colnames(sleep)
## [1] "Id" "SleepDay" "TotalSleepRecords"
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
head(weight)
## Id Date WeightKg WeightPounds Fat BMI
## 1 1503960366 5/2/2016 11:59:59 PM 52.6 115.9631 22 22.65
## 2 1503960366 5/3/2016 11:59:59 PM 52.6 115.9631 NA 22.65
## 3 1927972279 4/13/2016 1:08:52 AM 133.5 294.3171 NA 47.54
## 4 2873212765 4/21/2016 11:59:59 PM 56.7 125.0021 NA 21.45
## 5 2873212765 5/12/2016 11:59:59 PM 57.3 126.3249 NA 21.69
## 6 4319703577 4/17/2016 11:59:59 PM 72.4 159.6147 25 27.45
## IsManualReport LogId
## 1 True 1.462234e+12
## 2 True 1.462320e+12
## 3 False 1.460510e+12
## 4 True 1.461283e+12
## 5 True 1.463098e+12
## 6 True 1.460938e+12
colnames(weight)
## [1] "Id" "Date" "WeightKg" "WeightPounds"
## [5] "Fat" "BMI" "IsManualReport" "LogId"
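The integrity claims above can be spot-checked in code. The following is a minimal sketch, assuming base R only: check_integrity and weight_sample are hypothetical names that are not part of the original analysis, and the sample rows are copied from the head(weight) output rather than read from the CSV file. It counts fully duplicated rows and missing values per column.

```r
# Sample rows copied from the head(weight) output above (assumption: the
# full analysis would run the same checks on the imported `weight` data).
weight_sample <- data.frame(
  Id = c(1503960366, 1503960366, 1927972279,
         2873212765, 2873212765, 4319703577),
  Date = c("5/2/2016 11:59:59 PM", "5/3/2016 11:59:59 PM",
           "4/13/2016 1:08:52 AM", "4/21/2016 11:59:59 PM",
           "5/12/2016 11:59:59 PM", "4/17/2016 11:59:59 PM"),
  WeightKg = c(52.6, 52.6, 133.5, 56.7, 57.3, 72.4),
  Fat = c(22, NA, NA, NA, NA, 25),
  BMI = c(22.65, 22.65, 47.54, 21.45, 21.69, 27.45),
  stringsAsFactors = FALSE
)

# Hypothetical helper: counts fully duplicated rows and the number of
# missing values in each column of a data frame.
check_integrity <- function(df) {
  list(
    duplicate_rows = sum(duplicated(df)),
    missing_by_column = colSums(is.na(df))
  )
}

weight_check <- check_integrity(weight_sample)
print(weight_check)
```

In the six weight rows shown, Fat is missing in four of them, so completeness is worth verifying on the full weight data before relying on that field.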
Analyze
The following code shows how many participants and how many records each dataset contains.
Number of participants per dataset:
n_distinct(activity$Id)
## [1] 33
n_distinct(sleep$Id)
## [1] 24
n_distinct(weight$Id)
## [1] 8
Number of records per dataset:
nrow(activity)
## [1] 940
nrow(sleep)
## [1] 413
nrow(weight)
## [1] 67
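The counts above can be placed side by side to compare the datasets. The following is a small sketch, assuming base R and using only the figures reported by n_distinct() and nrow() above; the coverage data frame and its column names are illustrative and do not come from the original analysis. It adds the average number of records per participant.

```r
# Counts reported by n_distinct() and nrow() in the output above.
coverage <- data.frame(
  dataset = c("activity", "sleep", "weight"),
  participants = c(33, 24, 8),
  records = c(940, 413, 67)
)

# Average number of records logged per participant in each dataset.
coverage$records_per_participant <-
  round(coverage$records / coverage$participants, 1)

print(coverage)
```

On these figures, activity averages about 28.5 records per participant, sleep about 17.2, and weight about 8.4.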
Below are summary statistics on all activity for participants:
summary(activity)
## Id ActivityDay TotalSteps TotalDistance
## Min. :1.504e+09 Length:940 Min. : 0 Min. : 0.000
## 1st Qu.:2.320e+09 Class :character 1st Qu.: 3790 1st Qu.: 2.620
## Median :4.445e+09 Mode :character Median : 7406 Median : 5.245
## Mean :4.855e+09 Mean : 7638 Mean : 5.490
## 3rd Qu.:6.962e+09 3rd Qu.:10727 3rd Qu.: 7.713
## Max. :8.878e+09 Max. :36019 Max. :28.030
## TrackerDistance LoggedActivitiesDistance VeryActiveDistance
## Min. : 0.000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 2.620 1st Qu.:0.0000 1st Qu.: 0.000
## Median : 5.245 Median :0.0000 Median : 0.210
## Mean : 5.475 Mean :0.1082 Mean : 1.503
## 3rd Qu.: 7.710 3rd Qu.:0.0000 3rd Qu.: 2.053
## Max. :28.030 Max. :4.9421 Max. :21.920
## ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
## Min. :0.0000 Min. : 0.000 Min. :0.000000
## 1st Qu.:0.0000 1st Qu.: 1.945 1st Qu.:0.000000
## Median :0.2400 Median : 3.365 Median :0.000000
## Mean :0.5675 Mean : 3.341 Mean :0.001606
## 3rd Qu.:0.8000 3rd Qu.: 4.782 3rd Qu.:0.000000
## Max. :6.4800 Max. :10.710 Max. :0.110000
## VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:127.0 1st Qu.: 729.8
## Median : 4.00 Median : 6.00 Median :199.0 Median :1057.5
## Mean : 21.16 Mean : 13.56 Mean :192.8 Mean : 991.2
## 3rd Qu.: 32.00 3rd Qu.: 19.00 3rd Qu.:264.0 3rd Qu.:1229.5
## Max. :210.00 Max. :143.00 Max. :518.0 Max. :1440.0
## Calories
## Min. : 0
## 1st Qu.:1828
## Median :2134
## Mean :2304
## 3rd Qu.:2793
## Max. :4900
Below are summary statistics on all sleep activity for participants:
summary(sleep)
## Id SleepDay TotalSleepRecords TotalMinutesAsleep
## Min. :1.504e+09 Length:413 Min. :1.000 Min. : 58.0
## 1st Qu.:3.977e+09 Class :character 1st Qu.:1.000 1st Qu.:361.0
## Median :4.703e+09 Mode :character Median :1.000 Median :433.0
## Mean :5.001e+09 Mean :1.119 Mean :419.5
## 3rd Qu.:6.962e+09 3rd Qu.:1.000 3rd Qu.:490.0
## Max. :8.792e+09 Max. :3.000 Max. :796.0
## TotalTimeInBed
## Min. : 61.0
## 1st Qu.:403.0
## Median :463.0
## Mean :458.6
## 3rd Qu.:526.0
## Max. :961.0
The previous output gives further insight into how the data can help us make informed decisions. Checking the number of distinct participants (and the number of records) in each dataset shows that coverage is not consistent across datasets. Participants were most likely to track activity, then sleep, and least likely to track weight.
Participants tracked their daily activity the most. This could be due to how the data was collected or to how willing they were to participate. Participants were most likely to use their smart devices to track data during the day. The data supports this because steps, active minutes, and distance have been recorded with time stamps.
Participants also tracked their sleep, but with far fewer records than for daily activity. By comparing daily activity with daily sleep through their time stamps, it is possible to determine when participants were sleeping and for how long. One could also deduce that participants’ smart devices may not have had a full charge at bedtime, which suggests that they took their devices off at night. As a result, fewer sleep records were collected.
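The comparison of daily activity with daily sleep described above can be sketched as a join on participant and date. The following is a minimal example, assuming base R only: activity_sample, sleep_sample, and daily are hypothetical names, and the rows are copied from the head(activity) and head(sleep) output rather than read from the CSV files. Days with an activity record but no sleep record show up as NA after the join.

```r
# Sample rows copied from the head() output earlier in this report.
activity_sample <- data.frame(
  Id = rep(1503960366, 6),
  ActivityDay = c("4/12/2016", "4/13/2016", "4/14/2016",
                  "4/15/2016", "4/16/2016", "4/17/2016"),
  TotalSteps = c(13162, 10735, 10460, 9762, 12669, 9705),
  stringsAsFactors = FALSE
)
sleep_sample <- data.frame(
  Id = rep(1503960366, 6),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM",
               "4/15/2016 12:00:00 AM", "4/16/2016 12:00:00 AM",
               "4/17/2016 12:00:00 AM", "4/19/2016 12:00:00 AM"),
  TotalMinutesAsleep = c(327, 384, 412, 340, 700, 304),
  TotalTimeInBed = c(346, 407, 442, 367, 712, 320),
  stringsAsFactors = FALSE
)

# Parse both date columns to Date. Only the month/day/year part is read;
# the trailing time in SleepDay is always midnight in the rows shown.
activity_sample$date <- as.Date(activity_sample$ActivityDay, format = "%m/%d/%Y")
sleep_sample$date <- as.Date(sleep_sample$SleepDay, format = "%m/%d/%Y")

# Left join: keep every activity day and attach sleep data where it exists.
daily <- merge(
  activity_sample[, c("Id", "date", "TotalSteps")],
  sleep_sample[, c("Id", "date", "TotalMinutesAsleep", "TotalTimeInBed")],
  by = c("Id", "date"),
  all.x = TRUE
)

# Activity days on which no sleep record was logged.
days_without_sleep <- daily$date[is.na(daily$TotalMinutesAsleep)]
print(daily)
print(days_without_sleep)
```

For this one participant, the sample shows activity on 4/14/2016 with no matching sleep record, which is the kind of gap the paragraph above describes.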
Lastly, participants were able to track their weight data. It appears that tracking weight is less attractive to participants. Only 8 participants logged weight, versus 24 for sleep and 33 for activity. One factor could be that the technology to track weight is not as user-friendly or seamless as the technology for tracking daily activity.
Share
The following pie chart represents each dataset’s share of participants in relation to the whole. The pie chart is consistent with the distinct participant counts found for each dataset.
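A chart of this kind can be reproduced from the participant counts alone. The following is a minimal sketch, assuming base R graphics; the original chart may have been built differently (for example with ggplot2), and the object names here are illustrative. The shares are computed against the sum of the three counts, so a participant who appears in more than one dataset is counted once per dataset.

```r
# Distinct participants per dataset, as reported by n_distinct() above.
participants <- c(activity = 33, sleep = 24, weight = 8)

# Each dataset's share of the combined participant counts.
share <- participants / sum(participants)

# Slice labels showing the dataset name, count, and percentage.
slice_labels <- sprintf("%s: %d (%.1f%%)",
                        names(participants), participants, 100 * share)

pie(participants, labels = slice_labels, main = "Participants per dataset")
```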
Act
To further analyze daily activity, consulting another analyst on the time stamp data is necessary. The time stamp data can be used to determine when participants are tracking their data. Tracking time stamps will also show when a participant is sleeping and for how long, which will provide further insights into sleep health. Analyzing both sets of time stamps can give us further insights into sleep data because we will know when a participant is active or asleep. This knowledge is important because sleep data needs to be more accurate and more consistent with daily activity. One way to achieve this is to develop new technology that tracks the battery life of participants’ devices. Tracking battery life can show when participants are tracking their data and help determine whether removing the smart device before bed is a key factor in how many sleep records are collected. Lastly, Bellabeat should focus on new technologies that can better track weight data. This could go as far as a smart scale linked to any smart devices that participants already have. Seamless data tracking could make participants more apt to track their weight.
The marketing analytics team should recommend new marketing strategies to Bellabeat’s cofounders. First, find an analyst to provide further insights into the current data. Next, explain that marketing a longer battery life for participants’ smart devices should increase data collection. The analytics team can also look into introducing a new product line, a weight scale, that should help with the collection of weight data. Finally, further data could be gathered by prompting users to share feedback while using the application.