This is a capstone project for Google Data Analytics Professional Certificate Case Study 2: How Can a Wellness Technology Company Play It Smart?
To explore and detect patterns in smart device usage to gain insight to comprehend consumer behavior with non-Bellabeat devices. Leverage the identified trends in smart device usage and apply them to Bellabeat customers.
To leverage the insights gained, and strategically incorporate them into Bellabeat’s marketing approach to unlock company growth.
Datasets were provided by Möbius
License: CC0: Public Domain
Privacy: Respondents generated Datasets to a distributed survey via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.
#Data Credibility Check Using ROCCC Method
Reliability The provided Kaggle dataset includes personal fitness tracker data from a limited group of 30 eligible Fitbit users. Due to the small sample size, it may not accurately represent the entire population, leading to potential bias. However, by expanding the sample size and incorporating additional data, it would be possible to mitigate the limitations associated with a small data size. Additionally, further investigation and exploration are necessary to understand the criteria that define users as “eligible” for submitting personal tracker data, as mentioned in the dataset’s content section.
Original The datasets are provided from the public domain by Möbius, instead of the original distributor, Amazon Mechanical Turk. These data sets are third-party.
Comprehensive Data sets are not comprehensive. They are missing important information on age, gender, and what type of device was used.
Current These datasets were generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016. It’s 7 a 7 year old dataset, so a more updated would be better.
Cited
The datasets included in this study are sourced from crowd-sourcing data collected through a distributed survey conducted on Amazon Mechanical Turk. As a result, the data source is appropriately acknowledged and cited.
I chose three datasets to help me:
dailyActivity_merged.csv file (summary of steps and calories burned) sleepDay_merged.csv file (participants’ sleep patterns.) weightLogInfo_merged file (contains weight data)
These files serve as excellent resources for analyzing participant usage and are commonly utilized to track overall health and weight.
I will use Google Sheets to clean my data, R Script will transform and explore the data after cleaning and creating visualizations, and Tableau will present data visualizations upon reporting.
In Google Sheets
#setting up the environment
#Load Libraries
library('tidyverse')
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.3
## Warning: package 'forcats' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## âś” dplyr 1.1.2 âś” readr 2.1.4
## âś” forcats 1.0.0 âś” stringr 1.5.0
## âś” ggplot2 3.4.2 âś” tibble 3.2.1
## âś” lubridate 1.9.2 âś” tidyr 1.3.0
## âś” purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## âś– dplyr::filter() masks stats::filter()
## âś– dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
require('forcats')
library('openxlsx')
## Warning: package 'openxlsx' was built under R version 4.2.3
library('janitor')
## Warning: package 'janitor' was built under R version 4.2.3
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library('skimr')
## Warning: package 'skimr' was built under R version 4.2.3
library('here')
## Warning: package 'here' was built under R version 4.2.3
## here() starts at C:/Users/admin/OneDrive/Documents/Case Sudy 1
library('dplyr')
library('lubridate')
library('ggplot2')
#Start by Importing Data sets
activity <-read_csv("dailyActivity_merged.csv", show_col_types = FALSE)
sleep <-read_csv("sleepDay_merged.csv", show_col_types = FALSE)
weight <-read_csv("weightLogInfo_merged.csv", show_col_types = FALSE)
head(activity)
head(sleep)
head(weight)
#Convert Id to character data type
#Convert Day to date format
#Rename various dates to Day
activity <-activity %>%
mutate_at(vars(Id), as.character) %>%
mutate_at(vars(ActivityDate), as.Date, format = "%m/%d/%y") %>%
rename("Day"="ActivityDate")
sleep <-sleep %>%
mutate_at(vars(Id), as.character) %>%
mutate_at(vars(SleepDay), as.Date, format = "%m/%d/%y") %>%
rename("Day"="SleepDay")
weight <-weight %>%
mutate_at(vars(Id,LogId), as.character) %>%
mutate_at(vars(Date),as.Date, format = "%m/%d/%y") %>%
rename("Day"="Date")
#Combine data frames
#Add day of the the week to combined dataset
combined_data <-sleep %>%
right_join(activity, by=c('Id', 'Day')) %>%
left_join(weight, by=c('Id', 'Day')) %>%
mutate(Weekday = weekdays(as.Date(Day, "%m/%d/%Y")))
# Find and remove duplicate rows
#count NAs and distinct Ids
combined_data <-combined_data[!duplicated(combined_data), ]
sum(is.na(combined_data))
## [1] 6893
n_distinct(combined_data$Id)
## [1] 33
n_distinct(sleep$Id)
## [1] 24
n_distinct(weight$Id)
## [1] 8
# Add the days of the week to combined data
combined_data$Weekday <-factor(combined_data$Weekday, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
#Select summary data
combined_data %>%
select(TotalMinutesAsleep, TotalSteps, TotalDistance, VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, SedentaryMinutes, Calories, WeightKg, Fat, BMI, IsManualReport) %>%
summary()
## TotalMinutesAsleep TotalSteps TotalDistance VeryActiveMinutes
## Min. : 58.0 Min. : 0 Min. : 0.000 Min. : 0.00
## 1st Qu.:361.0 1st Qu.: 3790 1st Qu.: 2.620 1st Qu.: 0.00
## Median :432.5 Median : 7406 Median : 5.245 Median : 4.00
## Mean :419.2 Mean : 7638 Mean : 5.490 Mean : 21.16
## 3rd Qu.:490.0 3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.: 32.00
## Max. :796.0 Max. :36019 Max. :28.030 Max. :210.00
## NA's :530
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0
## 1st Qu.: 0.00 1st Qu.:127.0 1st Qu.: 729.8 1st Qu.:1828
## Median : 6.00 Median :199.0 Median :1057.5 Median :2134
## Mean : 13.56 Mean :192.8 Mean : 991.2 Mean :2304
## 3rd Qu.: 19.00 3rd Qu.:264.0 3rd Qu.:1229.5 3rd Qu.:2793
## Max. :143.00 Max. :518.0 Max. :1440.0 Max. :4900
##
## WeightKg Fat BMI IsManualReport
## Min. : 52.60 Min. :22.00 Min. :21.45 Mode :logical
## 1st Qu.: 61.40 1st Qu.:22.75 1st Qu.:23.96 FALSE:26
## Median : 62.50 Median :23.50 Median :24.39 TRUE :41
## Mean : 72.04 Mean :23.50 Mean :25.19 NA's :873
## 3rd Qu.: 85.05 3rd Qu.:24.25 3rd Qu.:25.56
## Max. :133.50 Max. :25.00 Max. :47.54
## NA's :873 NA's :938 NA's :873
The typical individual has an average weight of 72.04 kg, a BMI of 25.19, and dedicates the majority of their time to engaging in light activities. On average, they enjoy approximately 6.9 hours of sleep, take around 7638 steps, and cover a distance of 5.49 km each day.
#Total steps by day
ggplot(data=combined_data, aes(x=Weekday, y=TotalSteps)) + geom_bar(stat="identity", fill="#56B4E9")+ labs(title="Steps by Day", y="Total Steps")
Sundays witnessed the highest step count among users, while Fridays recorded the lowest number of steps taken. Given the consistently significant values, the marketing team can deduce that users highly prioritize the step-tracking feature of health fitness devices. This observation further suggests that the said feature would be immensely beneficial to Bellabeat customers.
#Minutes of moderate activity per day
ggplot(data=combined_data, aes(x=Weekday, y=FairlyActiveMinutes)) + geom_bar(stat="identity", fill="#E69F00")+ labs(title="Fairly Active Minutes by Day", y="Minutes")
A noteworthy observation is a decline in the duration of moderate activity on Wednesdays, ut increases again on Thursdays. This pattern could be attributed to individuals returning to work on Mondays and potentially feeling demotivated or fatigued by the time Wednesday arrives.
#Logged Activities Distance
ggplot(data=combined_data, aes(x=Weekday, y=LoggedActivitiesDistance)) + geom_bar(stat="identity", fill="#56B4E9")+ labs(title="Logged Activity Distance by Day", y="Logged Activity Distance")
Based on the provided histogram, the majority of individuals slept for a duration ranging from 312 to 563 minutes (equivalent to 5.2 to 9.4 hours). It is important to note that this range represents the actual sleep time and does not encompass the total time spent in bed for rest purposes.
#Distribution of sleep time
ggplot(combined_data, aes(TotalMinutesAsleep)) + geom_histogram(bins=10, na.rm=TRUE,color = "#000000",fill="#fa8072" )+ labs(title="Distribution of Total Time Asleep", x="Total Time Asleep (minutes)")
Based on the provided histogram, the majority of individuals experienced a sleep duration ranging from 312 to 563 minutes (equivalent to 5.2 to 9.4 hours). It is important to note that this range specifically refers to the actual sleep time and does not encompass the total time spent in bed for resting purposes.
#Total minutes Asleep vs Calories
ggplot(combined_data) + geom_point(mapping = aes(x=TotalMinutesAsleep/60, y=Calories), na.rm=TRUE, color="red") + labs(title="Calories vs Time Slept", x="Time Asleep (Hours)", y="Calories")
Except for a few exceptional cases, individuals who slept for a duration between 5 and 7 hours demonstrated calorie expenditure. When focusing solely on weight loss and calories burned, this observation correlates with the sleep range of 5.2 to 9.4 hours previously mentioned. This correlation suggests that individuals who maintain their sleep duration within this range might potentially burn a higher amount of calories.
The daily steps seem to be lowest on Fridays, maybe due to the end of a work week, and the highest on Sunday. To ensure more use the marketing team can notify users on Thursday evenings, as well as Friday and Saturday mornings, to motivate users to maintain their physical activity levels throughout the day.
Based on the observation that a significant number of users did not utilize the Logged Distance feature on FitBit devices, it implies a preference among users for automatic data collection. Consequently, the Bellabeat marketing team may consider omitting the activity distance log function and collecting the data automatically.
In relation to the size of the data set, there weren’t many weight entries. It can be inferred that individuals might not have been focused on weight loss or lacked a device capable of automatically recording this data. Considering this information, the Bellabeat team could market weight devices like smart scales that offer automatic recording of weight information.
Maybe tracking other styles and brands of fitness devices like Apple watches will provide a larger sample size and more data.