This is a capstone project for Google Data Analytics Professional Certificate Case Study 2: How Can a Wellness Technology Company Play It Smart?
To explore and detect patterns in smart device usage to gain insight to comprehend consumer behavior with non-Bellabeat devices. Leverage the identified trends in smart device usage and apply them to Bellabeat customers.
To leverage the insights gained, and strategically incorporate them into Bellabeat’s marketing approach to unlock company growth.
Datasets were provided by Möbius
License: CC0: Public Domain
Privacy: Datasets were generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.
#Data Credibility Check Using ROCCC Method
Reliability The provided Kaggle dataset includes personal fitness tracker data from a limited group of 30 eligible Fitbit users. Due to the small sample size, it may not accurately represent the entire population, leading to potential bias. However, by expanding the sample size and incorporating additional data, it would be possible to mitigate the limitations associated with a small data size. Additionally, further investigation and exploration are necessary to understand the criteria that define users as “eligible” for submitting personal tracker data, as mentioned in the dataset’s content section.
Original The datasets are provided from the public domain by Möbius, instead of the original distributor, Amazon Mechanical Turk. These data sets are third-party.
Comprehensive Data sets are not comprehensive. They are missing important information on age, gender, and what type of device was used.
Current These datasets were generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016. It’s 7 year old dataset so a more updated dataset would be better.
The datasets included in this study are sourced from crowd-sourcing data collected through a distributed survey conducted on Amazon Mechanical Turk. As a result, the data source is appropriately acknowledged and cited.
I chose three datasets to help me:
dailyActivity_merged.csv file (summary of steps and calories burned) sleepDay_merged.csv file (participants’ sleep patterns.) weightLogInfo_merged file (contains weight data)
These files serve as excellent resources for analyzing participant usage and are commonly utilized to track overall health and weight.
I will be using Google Sheets to clean my data, R Script will transform and explore the data after cleaning, and Tableau for data visualizations.
**In Google Sheets*
#setting up the environment
#Load Libraries
library('tidyverse')
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.3
## Warning: package 'forcats' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
require('forcats')
library('openxlsx')
## Warning: package 'openxlsx' was built under R version 4.2.3
library('janitor')
## Warning: package 'janitor' was built under R version 4.2.3
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library('skimr')
## Warning: package 'skimr' was built under R version 4.2.3
library('here')
## Warning: package 'here' was built under R version 4.2.3
## here() starts at C:/Users/admin/OneDrive/Documents/Case Sudy 1
library('dplyr')
library('lubridate')
library('ggplot2')
#Start by Importing Data sets
activity <-read_csv("dailyActivity_merged.csv", show_col_types = FALSE)
sleep <-read_csv("sleepDay_merged.csv", show_col_types = FALSE)
weight <-read_csv("weightLogInfo_merged.csv", show_col_types = FALSE)
head(activity)
head(sleep)
head(weight)
#Convert Id to character data type
#Convert Day to date format
#Rename various dates to Day
activity <-activity %>%
mutate_at(vars(Id), as.character) %>%
mutate_at(vars(ActivityDate), as.Date, format = "%m/%d/%y") %>%
rename("Day"="ActivityDate")
sleep <-sleep %>%
mutate_at(vars(Id), as.character) %>%
mutate_at(vars(SleepDay), as.Date, format = "%m/%d/%y") %>%
rename("Day"="SleepDay")
weight <-weight %>%
mutate_at(vars(Id,LogId), as.character) %>%
mutate_at(vars(Date),as.Date, format = "%m/%d/%y") %>%
rename("Day"="Date")
#Combine data frames
#Add day of the the week to combined dataset
combined_data <-sleep %>%
right_join(activity, by=c('Id', 'Day')) %>%
left_join(weight, by=c('Id', 'Day')) %>%
mutate(Weekday = weekdays(as.Date(Day, "%m/%d/%Y")))
# Find and remove duplicate rows
#count NAs and distinct Ids
combined_data <-combined_data[!duplicated(combined_data), ]
sum(is.na(combined_data))
## [1] 6893
n_distinct(combined_data$Id)
## [1] 33
n_distinct(sleep$Id)
## [1] 24
n_distinct(weight$Id)
## [1] 8
# Add the days of the week to combined data
combined_data$Weekday <-factor(combined_data$Weekday, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
#Select summary data
combined_data %>%
select(TotalMinutesAsleep, TotalSteps, TotalDistance, VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, SedentaryMinutes, Calories, WeightKg, Fat, BMI, IsManualReport) %>%
summary()
## TotalMinutesAsleep TotalSteps TotalDistance VeryActiveMinutes
## Min. : 58.0 Min. : 0 Min. : 0.000 Min. : 0.00
## 1st Qu.:361.0 1st Qu.: 3790 1st Qu.: 2.620 1st Qu.: 0.00
## Median :432.5 Median : 7406 Median : 5.245 Median : 4.00
## Mean :419.2 Mean : 7638 Mean : 5.490 Mean : 21.16
## 3rd Qu.:490.0 3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.: 32.00
## Max. :796.0 Max. :36019 Max. :28.030 Max. :210.00
## NA's :530
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0
## 1st Qu.: 0.00 1st Qu.:127.0 1st Qu.: 729.8 1st Qu.:1828
## Median : 6.00 Median :199.0 Median :1057.5 Median :2134
## Mean : 13.56 Mean :192.8 Mean : 991.2 Mean :2304
## 3rd Qu.: 19.00 3rd Qu.:264.0 3rd Qu.:1229.5 3rd Qu.:2793
## Max. :143.00 Max. :518.0 Max. :1440.0 Max. :4900
##
## WeightKg Fat BMI IsManualReport
## Min. : 52.60 Min. :22.00 Min. :21.45 Mode :logical
## 1st Qu.: 61.40 1st Qu.:22.75 1st Qu.:23.96 FALSE:26
## Median : 62.50 Median :23.50 Median :24.39 TRUE :41
## Mean : 72.04 Mean :23.50 Mean :25.19 NA's :873
## 3rd Qu.: 85.05 3rd Qu.:24.25 3rd Qu.:25.56
## Max. :133.50 Max. :25.00 Max. :47.54
## NA's :873 NA's :938 NA's :873
#Total steps by day
ggplot(data=combined_data, aes(x=Weekday, y=TotalSteps)) + geom_bar(stat="identity", fill="#56B4E9")+ labs(title="Steps by Day", y="Total Steps")
#Minutes of moderate activity per day
ggplot(data=combined_data, aes(x=Weekday, y=FairlyActiveMinutes)) + geom_bar(stat="identity", fill="#E69F00")+ labs(title="Fairly Active Minutes by Day", y="Minutes")
#Logged Activities Distance
ggplot(data=combined_data, aes(x=Weekday, y=LoggedActivitiesDistance)) + geom_bar(stat="identity", fill="#56B4E9")+ labs(title="Logged Activity Distance by Day", y="Logged Activity Distance")
#Distribution of sleep time
ggplot(combined_data, aes(TotalMinutesAsleep)) + geom_histogram(bins=10, na.rm=TRUE,color = "#000000",fill="#fa8072" )+ labs(title="Distribution of Total Time Asleep", x="Total Time Asleep (minutes)")
#Total minutes Asleep vs Calories
ggplot(combined_data) + geom_point(mapping = aes(x=TotalMinutesAsleep/60, y=Calories), na.rm=TRUE, color="red") + labs(title="Calories vs Time Slept", x="Time Asleep (Hours)", y="Calories")
# Export data to excel file
write.xlsx(combined_data, file="Fitbit_Fitness_Data.xlsx")