Introduction

This is a capstone project for Google Data Analytics Professional Certificate Case Study 2: How Can a Wellness Technology Company Play It Smart?

Summary of Business Task

What is the problem you are trying to solve?

To explore and detect patterns in smart device usage to gain insight to comprehend consumer behavior with non-Bellabeat devices. Leverage the identified trends in smart device usage and apply them to Bellabeat customers.

How can your insights drive business decisions?

To leverage the insights gained, and strategically incorporate them into Bellabeat’s marketing approach to unlock company growth.

Data Sources used

Datasets were provided by Möbius

  1. License: CC0: Public Domain

  2. Source: https://zenodo.org/record/53894#.X9oeh3Uzaao

  3. Privacy: Datasets were generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.

#Data Credibility Check Using ROCCC Method

Reliability The provided Kaggle dataset includes personal fitness tracker data from a limited group of 30 eligible Fitbit users. Due to the small sample size, it may not accurately represent the entire population, leading to potential bias. However, by expanding the sample size and incorporating additional data, it would be possible to mitigate the limitations associated with a small data size. Additionally, further investigation and exploration are necessary to understand the criteria that define users as “eligible” for submitting personal tracker data, as mentioned in the dataset’s content section.

Original The datasets are provided from the public domain by Möbius, instead of the original distributor, Amazon Mechanical Turk. These data sets are third-party.

Comprehensive Data sets are not comprehensive. They are missing important information on age, gender, and what type of device was used.

Current These datasets were generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016. It’s 7 year old dataset so a more updated dataset would be better.

Cited

The datasets included in this study are sourced from crowd-sourcing data collected through a distributed survey conducted on Amazon Mechanical Turk. As a result, the data source is appropriately acknowledged and cited.


The Process

Choosing dataset files

I chose three datasets to help me:

dailyActivity_merged.csv file (summary of steps and calories burned) sleepDay_merged.csv file (participants’ sleep patterns.) weightLogInfo_merged file (contains weight data)

These files serve as excellent resources for analyzing participant usage and are commonly utilized to track overall health and weight.

Tools Used

I will be using Google Sheets to clean my data, R Script will transform and explore the data after cleaning, and Tableau for data visualizations.

Cleaning the Data

**In Google Sheets*

  1. Imported datasets into Sheets
  2. Started by filtering out the zero in the dataset and deleting the rows with zero across all columns
  3. Changed the format of the Id columns to text format in all 3 datasets
  4. Converted ActivityDate, SleepDate, and Date columns from Datetime to Date type
  5. In the WeightLogInfo_merge file I converted WeightKg and WeightPounds and BMI column to two decimals spaces to make the data easier to read.
  6. In the WeightLogInfo_merge file, I deleted the Fat field because there are only two entries

Transforming and Exploring Data

#setting up the environment
#Load Libraries
library('tidyverse')
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.3
## Warning: package 'forcats' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
require('forcats')
library('openxlsx')
## Warning: package 'openxlsx' was built under R version 4.2.3
library('janitor')
## Warning: package 'janitor' was built under R version 4.2.3
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library('skimr')
## Warning: package 'skimr' was built under R version 4.2.3
library('here')
## Warning: package 'here' was built under R version 4.2.3
## here() starts at C:/Users/admin/OneDrive/Documents/Case Sudy 1
library('dplyr')
library('lubridate')
library('ggplot2')
#Start by Importing Data sets
activity <-read_csv("dailyActivity_merged.csv", show_col_types = FALSE)
sleep <-read_csv("sleepDay_merged.csv", show_col_types = FALSE)
weight <-read_csv("weightLogInfo_merged.csv", show_col_types = FALSE)

head(activity)
head(sleep)
head(weight)
#Convert Id to character data type 
#Convert Day to date format 
#Rename various dates to Day

activity <-activity %>%
  mutate_at(vars(Id), as.character) %>%
  mutate_at(vars(ActivityDate), as.Date, format = "%m/%d/%y") %>%
  rename("Day"="ActivityDate") 


sleep <-sleep %>%
  mutate_at(vars(Id), as.character) %>%
  mutate_at(vars(SleepDay), as.Date, format = "%m/%d/%y") %>%
  rename("Day"="SleepDay")

weight <-weight %>%
  mutate_at(vars(Id,LogId), as.character) %>%
  mutate_at(vars(Date),as.Date, format = "%m/%d/%y") %>%
  rename("Day"="Date")
#Combine data frames
#Add day of the the week to combined dataset
combined_data <-sleep %>%
  right_join(activity, by=c('Id', 'Day')) %>%
  left_join(weight, by=c('Id', 'Day')) %>%
  mutate(Weekday = weekdays(as.Date(Day, "%m/%d/%Y")))
# Find and remove duplicate rows
#count NAs and distinct Ids
combined_data <-combined_data[!duplicated(combined_data), ]
sum(is.na(combined_data))
## [1] 6893
n_distinct(combined_data$Id)
## [1] 33
n_distinct(sleep$Id)
## [1] 24
n_distinct(weight$Id)
## [1] 8
# Add the days of the week to combined data
combined_data$Weekday <-factor(combined_data$Weekday, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
#Select summary data
combined_data %>%
  select(TotalMinutesAsleep, TotalSteps, TotalDistance, VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, SedentaryMinutes, Calories, WeightKg, Fat, BMI, IsManualReport) %>%
  summary()
##  TotalMinutesAsleep   TotalSteps    TotalDistance    VeryActiveMinutes
##  Min.   : 58.0      Min.   :    0   Min.   : 0.000   Min.   :  0.00   
##  1st Qu.:361.0      1st Qu.: 3790   1st Qu.: 2.620   1st Qu.:  0.00   
##  Median :432.5      Median : 7406   Median : 5.245   Median :  4.00   
##  Mean   :419.2      Mean   : 7638   Mean   : 5.490   Mean   : 21.16   
##  3rd Qu.:490.0      3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.: 32.00   
##  Max.   :796.0      Max.   :36019   Max.   :28.030   Max.   :210.00   
##  NA's   :530                                                          
##  FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes    Calories   
##  Min.   :  0.00      Min.   :  0.0        Min.   :   0.0   Min.   :   0  
##  1st Qu.:  0.00      1st Qu.:127.0        1st Qu.: 729.8   1st Qu.:1828  
##  Median :  6.00      Median :199.0        Median :1057.5   Median :2134  
##  Mean   : 13.56      Mean   :192.8        Mean   : 991.2   Mean   :2304  
##  3rd Qu.: 19.00      3rd Qu.:264.0        3rd Qu.:1229.5   3rd Qu.:2793  
##  Max.   :143.00      Max.   :518.0        Max.   :1440.0   Max.   :4900  
##                                                                          
##     WeightKg           Fat             BMI        IsManualReport 
##  Min.   : 52.60   Min.   :22.00   Min.   :21.45   Mode :logical  
##  1st Qu.: 61.40   1st Qu.:22.75   1st Qu.:23.96   FALSE:26       
##  Median : 62.50   Median :23.50   Median :24.39   TRUE :41       
##  Mean   : 72.04   Mean   :23.50   Mean   :25.19   NA's :873      
##  3rd Qu.: 85.05   3rd Qu.:24.25   3rd Qu.:25.56                  
##  Max.   :133.50   Max.   :25.00   Max.   :47.54                  
##  NA's   :873      NA's   :938     NA's   :873
#Total steps by day

ggplot(data=combined_data, aes(x=Weekday, y=TotalSteps)) + geom_bar(stat="identity", fill="#56B4E9")+ labs(title="Steps by Day", y="Total Steps")

#Minutes of moderate activity per day

ggplot(data=combined_data, aes(x=Weekday, y=FairlyActiveMinutes)) + geom_bar(stat="identity", fill="#E69F00")+ labs(title="Fairly Active Minutes by Day", y="Minutes")

#Logged Activities Distance

ggplot(data=combined_data, aes(x=Weekday, y=LoggedActivitiesDistance)) + geom_bar(stat="identity", fill="#56B4E9")+ labs(title="Logged Activity Distance by Day", y="Logged Activity Distance")

#Distribution of sleep time

ggplot(combined_data, aes(TotalMinutesAsleep)) + geom_histogram(bins=10, na.rm=TRUE,color = "#000000",fill="#fa8072" )+ labs(title="Distribution of Total Time Asleep", x="Total Time Asleep (minutes)")

#Total minutes Asleep vs Calories

ggplot(combined_data) + geom_point(mapping = aes(x=TotalMinutesAsleep/60, y=Calories), na.rm=TRUE, color="red") + labs(title="Calories vs Time Slept", x="Time Asleep (Hours)", y="Calories")

# Export data to excel file 
write.xlsx(combined_data, file="Fitbit_Fitness_Data.xlsx")