Welcome to my Google Data Analytics Professional Capstone Case Study. In this case study, I will act as a Data Analyst for Bellabeat and use data analytics to answer business questions posed by an executive stakeholder. I will work through the following steps to complete this analysis:
* Decide what metrics to measure
* Locate data in your database
* Create security measures to protect that data
* Use spreadsheet functions to find incorrectly entered data
* Use SQL functions to check for extra spaces
* Remove repeated entries
* Check as much as possible for bias in the data
* Perform calculations
* Combine data from multiple sources
* Create tables with your results
* Make better, more informed decisions that lead to stronger outcomes
* Successfully communicate your findings
Bellabeat is a wellness technology company that manufactures health-focused products that are marketed towards women.
What are the problems I am trying to solve?
Urska Srsen, the CEO of Bellabeat, has asked me to focus on a non-Bellabeat smart device and analyze that device’s data to see how consumers use it. Her hope is that analyzing consumer data will reveal opportunities for growth that can shape new marketing strategies.
I will import the dataset in the “Prepare Phase.” First, I will make sure the data is credible; then I will clean and organize it.
Srsen encourages me to use a specific dataset:
FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius)
This Kaggle dataset contains personal fitness tracker data from 30 FitBit users. These eligible FitBit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate and sleep monitoring. It includes information about daily activity, steps and heart rate that can be used to explore user habits.
About: the FitBit Fitness Tracker Data was generated by FitBit users who responded to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. It includes 18 CSV files.
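Before importing anything, here is an optional sanity check (a minimal sketch of my own; the folder path is assumed to match the one used in the read.csv() calls below) that lists every CSV in the unzipped Kaggle folder:
# List every CSV file in the unzipped Kaggle folder
# (path assumed from the read.csv() calls later in this notebook)
list.files("Desktop/Fitabase Data 4.12.16-5.12.16", pattern = "\\.csv$", recursive = TRUE)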
I’m installing the R packages that will assist with my data cleaning and analysis.
install.packages("tidyverse")
install.packages("lubridate")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("tidyr")
install.packages("here")
install.packages("skimr")
install.packages("janitor")
Now I load the packages I just installed.
library(tidyverse)
library(lubridate)
library(dplyr)
library(ggplot2)
library(tidyr)
library(here)
library(skimr)
library(janitor)
I am now going to import the dataset. After looking through the data, I decided to download only the CSV files that would be useful in my analysis.
Activity <- read.csv("Desktop/Fitabase Data 4.12.16-5.12.16/Files_used/dailyActivity_merged.csv")
head(Activity)
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 4/12/2016 13162 8.50 8.50
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 1.88 0.55
## 2 0 1.57 0.69
## 3 0 2.44 0.40
## 4 0 2.14 1.26
## 5 0 2.71 0.41
## 6 0 3.19 0.78
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 6.06 0 25
## 2 4.71 0 21
## 3 3.91 0 30
## 4 2.83 0 29
## 5 5.04 0 36
## 6 2.51 0 38
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 13 328 728 1985
## 2 19 217 776 1797
## 3 11 181 1218 1776
## 4 34 209 726 1745
## 5 10 221 773 1863
## 6 20 164 539 1728
colnames(Activity)
## [1] "Id" "ActivityDate"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories"
str(Activity)
## 'data.frame': 940 obs. of 15 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ TotalSteps : int 13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
## $ TotalDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : int 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : int 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : int 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : int 728 776 1218 726 773 539 1149 775 818 838 ...
## $ Calories : int 1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
Calories <- read.csv("Desktop/Fitabase Data 4.12.16-5.12.16/dailyCalories_merged.csv")
head(Calories)
## Id ActivityDay Calories
## 1 1503960366 4/12/2016 1985
## 2 1503960366 4/13/2016 1797
## 3 1503960366 4/14/2016 1776
## 4 1503960366 4/15/2016 1745
## 5 1503960366 4/16/2016 1863
## 6 1503960366 4/17/2016 1728
colnames(Calories)
## [1] "Id" "ActivityDay" "Calories"
str(Calories)
## 'data.frame': 940 obs. of 3 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDay: chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ Calories : int 1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
Intensities <- read.csv("Desktop/Fitabase Data 4.12.16-5.12.16/dailyIntensities_merged.csv")
head(Intensities)
## Id ActivityDay SedentaryMinutes LightlyActiveMinutes
## 1 1503960366 4/12/2016 728 328
## 2 1503960366 4/13/2016 776 217
## 3 1503960366 4/14/2016 1218 181
## 4 1503960366 4/15/2016 726 209
## 5 1503960366 4/16/2016 773 221
## 6 1503960366 4/17/2016 539 164
## FairlyActiveMinutes VeryActiveMinutes SedentaryActiveDistance
## 1 13 25 0
## 2 19 21 0
## 3 11 30 0
## 4 34 29 0
## 5 10 36 0
## 6 20 38 0
## LightActiveDistance ModeratelyActiveDistance VeryActiveDistance
## 1 6.06 0.55 1.88
## 2 4.71 0.69 1.57
## 3 3.91 0.40 2.44
## 4 2.83 1.26 2.14
## 5 5.04 0.41 2.71
## 6 2.51 0.78 3.19
colnames(Intensities)
## [1] "Id" "ActivityDay"
## [3] "SedentaryMinutes" "LightlyActiveMinutes"
## [5] "FairlyActiveMinutes" "VeryActiveMinutes"
## [7] "SedentaryActiveDistance" "LightActiveDistance"
## [9] "ModeratelyActiveDistance" "VeryActiveDistance"
str(Intensities)
## 'data.frame': 940 obs. of 10 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDay : chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ SedentaryMinutes : int 728 776 1218 726 773 539 1149 775 818 838 ...
## $ LightlyActiveMinutes : int 328 217 181 209 221 164 233 264 205 211 ...
## $ FairlyActiveMinutes : int 13 19 11 34 10 20 16 31 12 8 ...
## $ VeryActiveMinutes : int 25 21 30 29 36 38 42 50 28 19 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LightActiveDistance : num 6.06 4.71 3.91 2.83 5.04 ...
## $ ModeratelyActiveDistance: num 0.55 0.69 0.4 1.26 0.41 ...
## $ VeryActiveDistance : num 1.88 1.57 2.44 2.14 2.71 ...
Heartrate <- read.csv("Desktop/Fitabase Data 4.12.16-5.12.16/Files_used/heartrate_seconds_merged.csv")
head(Heartrate)
## Id Time Value
## 1 2022484408 4/12/2016 7:21:00 AM 97
## 2 2022484408 4/12/2016 7:21:05 AM 102
## 3 2022484408 4/12/2016 7:21:10 AM 105
## 4 2022484408 4/12/2016 7:21:20 AM 103
## 5 2022484408 4/12/2016 7:21:25 AM 101
## 6 2022484408 4/12/2016 7:22:05 AM 95
colnames(Heartrate)
## [1] "Id" "Time" "Value"
str(Heartrate)
## 'data.frame': 2483658 obs. of 3 variables:
## $ Id : num 2.02e+09 2.02e+09 2.02e+09 2.02e+09 2.02e+09 ...
## $ Time : chr "4/12/2016 7:21:00 AM" "4/12/2016 7:21:05 AM" "4/12/2016 7:21:10 AM" "4/12/2016 7:21:20 AM" ...
## $ Value: int 97 102 105 103 101 95 91 93 94 93 ...
Sleep <- read.csv("Desktop/Fitabase Data 4.12.16-5.12.16/Files_used/sleepDay_merged.csv")
head(Sleep)
## Id SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM 1 327
## 2 1503960366 4/13/2016 12:00:00 AM 2 384
## 3 1503960366 4/15/2016 12:00:00 AM 1 412
## 4 1503960366 4/16/2016 12:00:00 AM 2 340
## 5 1503960366 4/17/2016 12:00:00 AM 1 700
## 6 1503960366 4/19/2016 12:00:00 AM 1 304
## TotalTimeInBed
## 1 346
## 2 407
## 3 442
## 4 367
## 5 712
## 6 320
colnames(Sleep)
## [1] "Id" "SleepDay" "TotalSleepRecords"
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
str(Sleep)
## 'data.frame': 413 obs. of 5 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ SleepDay : chr "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
## $ TotalSleepRecords : int 1 2 1 2 1 1 1 1 1 1 ...
## $ TotalMinutesAsleep: int 327 384 412 340 700 304 360 325 361 430 ...
## $ TotalTimeInBed : int 346 407 442 367 712 320 377 364 384 449 ...
Weight <- read.csv("Desktop/Fitabase Data 4.12.16-5.12.16/Files_used/weightLogInfo_merged.csv")
head(Weight)
## Id Date WeightKg WeightPounds Fat BMI
## 1 1503960366 5/2/2016 11:59:59 PM 52.6 115.9631 22 22.65
## 2 1503960366 5/3/2016 11:59:59 PM 52.6 115.9631 NA 22.65
## 3 1927972279 4/13/2016 1:08:52 AM 133.5 294.3171 NA 47.54
## 4 2873212765 4/21/2016 11:59:59 PM 56.7 125.0021 NA 21.45
## 5 2873212765 5/12/2016 11:59:59 PM 57.3 126.3249 NA 21.69
## 6 4319703577 4/17/2016 11:59:59 PM 72.4 159.6147 25 27.45
## IsManualReport LogId
## 1 True 1.462234e+12
## 2 True 1.462320e+12
## 3 False 1.460510e+12
## 4 True 1.461283e+12
## 5 True 1.463098e+12
## 6 True 1.460938e+12
colnames(Weight)
## [1] "Id" "Date" "WeightKg" "WeightPounds"
## [5] "Fat" "BMI" "IsManualReport" "LogId"
str(Weight)
## 'data.frame': 67 obs. of 8 variables:
## $ Id : num 1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
## $ Date : chr "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
## $ WeightKg : num 52.6 52.6 133.5 56.7 57.3 ...
## $ WeightPounds : num 116 116 294 125 126 ...
## $ Fat : int 22 NA NA NA NA 25 NA NA NA NA ...
## $ BMI : num 22.6 22.6 47.5 21.5 21.7 ...
## $ IsManualReport: chr "True" "True" "False" "True" ...
## $ LogId : num 1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
We can now see that the CSV files were correctly imported into R.
I noticed some consistency issues with the dates and times across the datasets, so I reformatted them to be consistent throughout. I then reviewed and cleaned the data using functions like glimpse(), skim_without_charts(), distinct() and clean_names().
The steps I took when cleaning the data (a sketch of this cleaning is shown after the list):
* The Activity, Calories and Intensities data had no issues with duplicates, spelling, missing values or blank spaces.
* Converted the dates and times to a consistent format across all tables.
* Found 3 duplicate rows in the Sleep dataset and removed them.
* Removed the ‘Fat’ column from the Weight dataset because it had too many missing values.
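The cleaning code itself is not included in the knitted output above, so the following is a minimal sketch of how those steps could be carried out with the packages loaded earlier; it is an illustration under those assumptions, not the exact code I ran.
# Review the structure and completeness of a table
glimpse(Sleep)
skim_without_charts(Sleep)
clean_names(Sleep)   # preview snake_case names (not reassigned, so the CamelCase names used below still apply)
# Count and remove exact duplicate rows (3 were found in Sleep)
sum(duplicated(Sleep))
Sleep <- distinct(Sleep)
# Drop the mostly-empty Fat column from the Weight log
Weight <- select(Weight, -Fat)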
As noted above, there were issues with the date and time formats in some of the tables, so I reformatted the date/time columns in the Activity, Intensities and Sleep tables so they are now consistent.
# Activity: parse the m/d/Y strings, then keep both the original column and a plain Date column
Activity$ActivityDate <- as.POSIXct(Activity$ActivityDate, format = "%m/%d/%Y", tz = Sys.timezone())
Activity$date <- as.Date(Activity$ActivityDate, tz = Sys.timezone())
Activity$ActivityDate <- as.Date(Activity$ActivityDate, tz = Sys.timezone())
# Intensities: date only, no time component
Intensities$ActivityDay <- as.Date(Intensities$ActivityDay, format = "%m/%d/%Y")
# Sleep: SleepDay includes a time component, so parse it as POSIXct first
Sleep$SleepDay <- as.POSIXct(Sleep$SleepDay, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone())
Sleep$date <- as.Date(Sleep$SleepDay, tz = Sys.timezone())
Now that the dates and times are formatted consistently and the data is cleaned, I can start analyzing it.
How many participants contributed to each dataset?
#Activity
Activity %>%
  summarise(Activity_participants = n_distinct(Id))
## Activity_participants
## 1 33
n_distinct(Calories$Id)
## [1] 33
n_distinct(Intensities$Id)
## [1] 33
n_distinct(Heartrate$Id)
## [1] 14
n_distinct(Sleep$Id)
## [1] 24
n_distinct(Weight$Id)
## [1] 8
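As an optional aside (a sketch of my own, not part of the original output), the same participant counts can be collected in a single call:
# Count distinct participants in every imported table at once
sapply(list(Activity = Activity, Calories = Calories, Intensities = Intensities,
            Heartrate = Heartrate, Sleep = Sleep, Weight = Weight),
       function(df) n_distinct(df$Id))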
From these counts, there are 33 participants in the Activity, Calories and Intensities datasets, 24 in the Sleep dataset, 14 in the Heartrate dataset and 8 in the Weight dataset.
The Heartrate and Weight datasets do not have enough participants to draw useful, unbiased conclusions, so from here on I will focus only on the Activity, Calories, Intensities and Sleep datasets.
Statistics from each data frame:
# Activity
Activity %>%
select(TotalSteps,
TotalDistance,
SedentaryMinutes, Calories) %>%
summary()
## TotalSteps TotalDistance SedentaryMinutes Calories
## Min. : 0 Min. : 0.000 Min. : 0.0 Min. : 0
## 1st Qu.: 3790 1st Qu.: 2.620 1st Qu.: 729.8 1st Qu.:1828
## Median : 7406 Median : 5.245 Median :1057.5 Median :2134
## Mean : 7638 Mean : 5.490 Mean : 991.2 Mean :2304
## 3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.:1229.5 3rd Qu.:2793
## Max. :36019 Max. :28.030 Max. :1440.0 Max. :4900
# Intensities
Intensities %>%
select(VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, SedentaryMinutes) %>%
summary()
## VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:127.0 1st Qu.: 729.8
## Median : 4.00 Median : 6.00 Median :199.0 Median :1057.5
## Mean : 21.16 Mean : 13.56 Mean :192.8 Mean : 991.2
## 3rd Qu.: 32.00 3rd Qu.: 19.00 3rd Qu.:264.0 3rd Qu.:1229.5
## Max. :210.00 Max. :143.00 Max. :518.0 Max. :1440.0
# Calories
Calories %>%
select(Calories) %>%
summary()
## Calories
## Min. : 0
## 1st Qu.:1828
## Median :2134
## Mean :2304
## 3rd Qu.:2793
## Max. :4900
# Sleep
Sleep %>%
select(TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed) %>%
summary()
## TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## Min. :1.000 Min. : 58.0 Min. : 61.0
## 1st Qu.:1.000 1st Qu.:361.0 1st Qu.:403.0
## Median :1.000 Median :433.0 Median :463.0
## Mean :1.119 Mean :419.5 Mean :458.6
## 3rd Qu.:1.000 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :3.000 Max. :796.0 Max. :961.0
I need to merge some datasets before beginning to visualize the data. I will merge the Sleep and Activity tables on the Id column. There are more participants in the Activity dataset than in the Sleep dataset, so I will use a full (outer) join via merge() with all = TRUE, which keeps every participant from both tables.
Combined_data_outer <- merge(Sleep, Activity, by="Id", all = TRUE)
n_distinct(Combined_data_outer$Id)
## [1] 33
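A caveat worth noting (my own aside, not part of the original analysis): merging on Id alone pairs every sleep record with every activity record for the same participant, which inflates the row count. If the daily records need to line up one-to-one, a sketch of merging on both Id and the date columns created during cleaning might look like this (Combined_daily is a hypothetical name):
# Assumes both tables carry the Date-class 'date' column created earlier
Combined_daily <- merge(Sleep, Activity, by = c("Id", "date"), all = TRUE)
n_distinct(Combined_daily$Id)   # should still include all 33 participants from Activity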