Introduction - ASK PHASE

This is a capstone project for Google Data Analytics Professional Certificate Case Study 2: How Can a Wellness Technology Company Play It Smart?

Summary of Business Task

What is the problem you are trying to solve?

To explore and detect patterns in smart device usage to gain insight to comprehend consumer behavior with non-Bellabeat devices. Leverage the identified trends in smart device usage and apply them to Bellabeat customers.

How can your insights drive business decisions?

To leverage the insights gained, and strategically incorporate them into Bellabeat’s marketing approach to unlock company growth.

Data Sources used - PREPARE PHASE

Datasets were provided by Möbius

  1. License: CC0: Public Domain

  2. Source: https://zenodo.org/record/53894#.X9oeh3Uzaao

  3. Privacy: Respondents generated Datasets to a distributed survey via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.

#Data Credibility Check Using ROCCC Method

Reliability The provided Kaggle dataset includes personal fitness tracker data from a limited group of 30 eligible Fitbit users. Due to the small sample size, it may not accurately represent the entire population, leading to potential bias. However, by expanding the sample size and incorporating additional data, it would be possible to mitigate the limitations associated with a small data size. Additionally, further investigation and exploration are necessary to understand the criteria that define users as “eligible” for submitting personal tracker data, as mentioned in the dataset’s content section.

Original The datasets are provided from the public domain by Möbius, instead of the original distributor, Amazon Mechanical Turk. These data sets are third-party.

Comprehensive Data sets are not comprehensive. They are missing important information on age, gender, and what type of device was used.

Current These datasets were generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016. It’s 7 a 7 year old dataset, so a more updated would be better.

Cited

The datasets included in this study are sourced from crowd-sourcing data collected through a distributed survey conducted on Amazon Mechanical Turk. As a result, the data source is appropriately acknowledged and cited.


The PROCESS PHASE

Choosing dataset files

I chose three datasets to help me:

dailyActivity_merged.csv file (summary of steps and calories burned) sleepDay_merged.csv file (participants’ sleep patterns.) weightLogInfo_merged file (contains weight data)

These files serve as excellent resources for analyzing participant usage and are commonly utilized to track overall health and weight.

Tools Used

I will use Google Sheets to clean my data, R Script will transform and explore the data after cleaning and creating visualizations, and Tableau will present data visualizations upon reporting.

Cleaning the Data

In Google Sheets

  1. Imported datasets into Sheets
  2. Started by filtering out the zero in the dataset and deleting the rows with zero across all columns
  3. Changed the format of the Id columns to text format in all 3 datasets
  4. Converted ActivityDate, SleepDate, and Date columns from Datetime to Date type
  5. In the WeightLogInfo_merge fi,le I converted WeightKg and WeightPounds, and BMI column to two decimals spaces to make the data easier to read.
  6. In the WeightLogInfo_merge file, I deleted the Fat field because there are only two entries

Transforming and Exploring Data

#setting up the environment
#Load Libraries
library('tidyverse')
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.3
## Warning: package 'forcats' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## âś” dplyr     1.1.2     âś” readr     2.1.4
## âś” forcats   1.0.0     âś” stringr   1.5.0
## âś” ggplot2   3.4.2     âś” tibble    3.2.1
## âś” lubridate 1.9.2     âś” tidyr     1.3.0
## âś” purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## âś– dplyr::filter() masks stats::filter()
## âś– dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
require('forcats')
library('openxlsx')
## Warning: package 'openxlsx' was built under R version 4.2.3
library('janitor')
## Warning: package 'janitor' was built under R version 4.2.3
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library('skimr')
## Warning: package 'skimr' was built under R version 4.2.3
library('here')
## Warning: package 'here' was built under R version 4.2.3
## here() starts at C:/Users/admin/OneDrive/Documents/Case Sudy 1
library('dplyr')
library('lubridate')
library('ggplot2')
#Start by Importing Data sets
activity <-read_csv("dailyActivity_merged.csv", show_col_types = FALSE)
sleep <-read_csv("sleepDay_merged.csv", show_col_types = FALSE)
weight <-read_csv("weightLogInfo_merged.csv", show_col_types = FALSE)

head(activity)
head(sleep)
head(weight)
#Convert Id to character data type 
#Convert Day to date format 
#Rename various dates to Day

activity <-activity %>%
  mutate_at(vars(Id), as.character) %>%
  mutate_at(vars(ActivityDate), as.Date, format = "%m/%d/%y") %>%
  rename("Day"="ActivityDate") 


sleep <-sleep %>%
  mutate_at(vars(Id), as.character) %>%
  mutate_at(vars(SleepDay), as.Date, format = "%m/%d/%y") %>%
  rename("Day"="SleepDay")

weight <-weight %>%
  mutate_at(vars(Id,LogId), as.character) %>%
  mutate_at(vars(Date),as.Date, format = "%m/%d/%y") %>%
  rename("Day"="Date")
#Combine data frames
#Add day of the the week to combined dataset
combined_data <-sleep %>%
  right_join(activity, by=c('Id', 'Day')) %>%
  left_join(weight, by=c('Id', 'Day')) %>%
  mutate(Weekday = weekdays(as.Date(Day, "%m/%d/%Y")))
# Find and remove duplicate rows
#count NAs and distinct Ids
combined_data <-combined_data[!duplicated(combined_data), ]
sum(is.na(combined_data))
## [1] 6893
n_distinct(combined_data$Id)
## [1] 33
n_distinct(sleep$Id)
## [1] 24
n_distinct(weight$Id)
## [1] 8
# Add the days of the week to combined data
combined_data$Weekday <-factor(combined_data$Weekday, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

ANALYSIS PHASE

#Select summary data
combined_data %>%
  select(TotalMinutesAsleep, TotalSteps, TotalDistance, VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, SedentaryMinutes, Calories, WeightKg, Fat, BMI, IsManualReport) %>%
  summary()
##  TotalMinutesAsleep   TotalSteps    TotalDistance    VeryActiveMinutes
##  Min.   : 58.0      Min.   :    0   Min.   : 0.000   Min.   :  0.00   
##  1st Qu.:361.0      1st Qu.: 3790   1st Qu.: 2.620   1st Qu.:  0.00   
##  Median :432.5      Median : 7406   Median : 5.245   Median :  4.00   
##  Mean   :419.2      Mean   : 7638   Mean   : 5.490   Mean   : 21.16   
##  3rd Qu.:490.0      3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.: 32.00   
##  Max.   :796.0      Max.   :36019   Max.   :28.030   Max.   :210.00   
##  NA's   :530                                                          
##  FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes    Calories   
##  Min.   :  0.00      Min.   :  0.0        Min.   :   0.0   Min.   :   0  
##  1st Qu.:  0.00      1st Qu.:127.0        1st Qu.: 729.8   1st Qu.:1828  
##  Median :  6.00      Median :199.0        Median :1057.5   Median :2134  
##  Mean   : 13.56      Mean   :192.8        Mean   : 991.2   Mean   :2304  
##  3rd Qu.: 19.00      3rd Qu.:264.0        3rd Qu.:1229.5   3rd Qu.:2793  
##  Max.   :143.00      Max.   :518.0        Max.   :1440.0   Max.   :4900  
##                                                                          
##     WeightKg           Fat             BMI        IsManualReport 
##  Min.   : 52.60   Min.   :22.00   Min.   :21.45   Mode :logical  
##  1st Qu.: 61.40   1st Qu.:22.75   1st Qu.:23.96   FALSE:26       
##  Median : 62.50   Median :23.50   Median :24.39   TRUE :41       
##  Mean   : 72.04   Mean   :23.50   Mean   :25.19   NA's :873      
##  3rd Qu.: 85.05   3rd Qu.:24.25   3rd Qu.:25.56                  
##  Max.   :133.50   Max.   :25.00   Max.   :47.54                  
##  NA's   :873      NA's   :938     NA's   :873

The typical individual has an average weight of 72.04 kg, a BMI of 25.19, and dedicates the majority of their time to engaging in light activities. On average, they enjoy approximately 6.9 hours of sleep, take around 7638 steps, and cover a distance of 5.49 km each day.

#Total steps by day

ggplot(data=combined_data, aes(x=Weekday, y=TotalSteps)) + geom_bar(stat="identity", fill="#56B4E9")+ labs(title="Steps by Day", y="Total Steps")

Sundays witnessed the highest step count among users, while Fridays recorded the lowest number of steps taken. Given the consistently significant values, the marketing team can deduce that users highly prioritize the step-tracking feature of health fitness devices. This observation further suggests that the said feature would be immensely beneficial to Bellabeat customers.

#Minutes of moderate activity per day

ggplot(data=combined_data, aes(x=Weekday, y=FairlyActiveMinutes)) + geom_bar(stat="identity", fill="#E69F00")+ labs(title="Fairly Active Minutes by Day", y="Minutes")

A noteworthy observation is a decline in the duration of moderate activity on Wednesdays, ut increases again on Thursdays. This pattern could be attributed to individuals returning to work on Mondays and potentially feeling demotivated or fatigued by the time Wednesday arrives.

#Logged Activities Distance

ggplot(data=combined_data, aes(x=Weekday, y=LoggedActivitiesDistance)) + geom_bar(stat="identity", fill="#56B4E9")+ labs(title="Logged Activity Distance by Day", y="Logged Activity Distance")

Based on the provided histogram, the majority of individuals slept for a duration ranging from 312 to 563 minutes (equivalent to 5.2 to 9.4 hours). It is important to note that this range represents the actual sleep time and does not encompass the total time spent in bed for rest purposes.

#Distribution of sleep time

ggplot(combined_data, aes(TotalMinutesAsleep)) + geom_histogram(bins=10, na.rm=TRUE,color = "#000000",fill="#fa8072" )+ labs(title="Distribution of Total Time Asleep", x="Total Time Asleep (minutes)")

Based on the provided histogram, the majority of individuals experienced a sleep duration ranging from 312 to 563 minutes (equivalent to 5.2 to 9.4 hours). It is important to note that this range specifically refers to the actual sleep time and does not encompass the total time spent in bed for resting purposes.

#Total minutes Asleep vs Calories

ggplot(combined_data) + geom_point(mapping = aes(x=TotalMinutesAsleep/60, y=Calories), na.rm=TRUE, color="red") + labs(title="Calories vs Time Slept", x="Time Asleep (Hours)", y="Calories")

Except for a few exceptional cases, individuals who slept for a duration between 5 and 7 hours demonstrated calorie expenditure. When focusing solely on weight loss and calories burned, this observation correlates with the sleep range of 5.2 to 9.4 hours previously mentioned. This correlation suggests that individuals who maintain their sleep duration within this range might potentially burn a higher amount of calories.

ACT Phase

  • The daily steps seem to be lowest on Fridays, maybe due to the end of a work week, and the highest on Sunday. To ensure more use the marketing team can notify users on Thursday evenings, as well as Friday and Saturday mornings, to motivate users to maintain their physical activity levels throughout the day.

  • Based on the observation that a significant number of users did not utilize the Logged Distance feature on FitBit devices, it implies a preference among users for automatic data collection. Consequently, the Bellabeat marketing team may consider omitting the activity distance log function and collecting the data automatically.

  • In relation to the size of the data set, there weren’t many weight entries. It can be inferred that individuals might not have been focused on weight loss or lacked a device capable of automatically recording this data. Considering this information, the Bellabeat team could market weight devices like smart scales that offer automatic recording of weight information.

  • Maybe tracking other styles and brands of fitness devices like Apple watches will provide a larger sample size and more data.