This is a case study for the Google Data Analytics Certificate and will be performed on FitBit data using the FitBit Fitness Tracker Data. This case study will demonstrate skills learned through the Google Data Analytics Certificate course, with a heavy focus on R.
This document should highlight skills acquired through the Google Data Analytics Certification course, along with trying to address the case study use-case. Some repetition is kept on this document as a reference to the troubleshooting and thought process.
Bellabeat needs to better understand smart device tracking data to gain further insight into how consumers are using their smart devices. These insights should drive improvements to BellaBeat marketing strategies.
Questions we will explore further in the data:
The Kaggle Dataset is a group of 29 CSV files, organized into two folders. Each folder appears to effectively be a month of data (labeled “3.12.16-4.11.16” and “4.12.16-5.12.16” respectively).
Of the 29 CSVs, I will be focusing primarily on the higher level summary data: hours and days. Second and minute data will not be used for this case study.
First I must install and load the tidyverse package.
## For Desktop Applications, we need to configure CRAN
r = getOption("repos")
r["CRAN"] = "http://cran.us.r-project.org"
options(repos = r)
install.packages("weatherData")
install.packages("tidyverse")
## package 'tidyverse' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\semar\AppData\Local\Temp\RtmpYFkol4\downloaded_packages
library(tidyverse)
Then, I utilize the read_csv function to prepare the data as dataframes
setwd("~/Data Analytics/R/RStudio/FitBit_CaseStudy/FitBit_CaseStudy_12022024/mturkfitbit_export_3.12.16-4.11.16/Fitabase Data 3.12.16-4.11.16")
dailyActivity_3.12 <- read_csv("dailyActivity_merged.csv")
hourlyCalories_3.12 <- read_csv("hourlyCalories_merged.csv")
hourlyIntensities_3.12 <- read_csv("hourlyIntensities_merged.csv")
hourlySteps_3.12 <- read_csv("hourlySteps_merged.csv")
weightLogInfo_3.12 <- read_csv("weightLogInfo_merged.csv")
minuteSleep_3.12 <- read_csv("minuteSleep_merged.csv")
minuteMETs_3.12 <- read_csv("minuteMETsNarrow_merged.csv")
setwd("~/Data Analytics/R/RStudio/FitBit_CaseStudy/FitBit_CaseStudy_12022024/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16")
dailyActivity_4.12 <- read_csv("dailyActivity_merged.csv")
hourlyCalories_4.12 <- read_csv("hourlyCalories_merged.csv")
hourlyIntensities_4.12 <- read_csv("hourlyIntensities_merged.csv")
hourlySteps_4.12 <- read_csv("hourlySteps_merged.csv")
weightLogInfo_4.12 <- read_csv("weightLogInfo_merged.csv")
minuteSleep_4.12 <- read_csv("minuteSleep_merged.csv")
minuteMETs_4.12 <- read_csv("minuteMETsNarrow_merged.csv")
There are 7 files in the 4.12 folder that do not exist in the 3.12 folder, listing those separately for organizational purposes. For variable labeling clarity, these will not have the date identifier at the end.
setwd("~/Data Analytics/R/RStudio/FitBit_CaseStudy/FitBit_CaseStudy_12022024/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16")
dailyCalories <- read_csv("dailyCalories_merged.csv")
dailyIntensities <- read_csv("dailyIntensities_merged.csv")
dailySteps <- read_csv("dailySteps_merged.csv")
sleepDay <- read_csv("sleepDay_merged.csv")
Data can be viewed outside of R utilizing spreadsheets or SQL, however this Case study will be making use of various R functions (view, summary, glimpse, colnames, head, etc.). These functions should be repeated as needed for any dataframes (examples below).
head(weightLogInfo)
summary(weightLogInfo)
colnames(dailyActivity)
glimpse(dailyActivity)
Having a peek at some of the tables, I am able to identify a few trends in the data:
I’ll finish organizing the data by combining the 5 pairs of similar datasets into singular dataframes (and cleaning up the unused single dataframes). This will make it easier to work with as we move on to process and analyze.
dailyActivity <- rbind2(dailyActivity_3.12, dailyActivity_4.12)
rm(dailyActivity_3.12, dailyActivity_4.12)
hourlyCalories <- rbind2(hourlyCalories_3.12, hourlyCalories_4.12)
rm(hourlyCalories_3.12, hourlyCalories_4.12)
hourlyIntensities <- rbind2(hourlyIntensities_3.12, hourlyIntensities_4.12)
rm(hourlyIntensities_3.12, hourlyIntensities_4.12)
hourlySteps <- rbind2(hourlySteps_3.12, hourlySteps_4.12)
rm(hourlySteps_3.12, hourlySteps_4.12)
weightLogInfo <- rbind2(weightLogInfo_3.12, weightLogInfo_4.12)
rm(weightLogInfo_3.12, weightLogInfo_4.12)
minuteSleep <- rbind2(minuteSleep_3.12, minuteSleep_4.12)
rm(minuteSleep_3.12, minuteSleep_4.12)
minuteMETs <- rbind2(minuteMETs_3.12, minuteMETs_4.12)
rm(minuteMETs_3.12, minuteMETs_4.12)
Now that our data is organized, I can explore cleaning up and transforming the data.
I’ll start by correcting the date formatting for our datasets. I’ll also add a weekday column for the data, as this may be helpful in analysis (perhaps users are more active on certain days of the week, etc.)
colnames(dailyActivity)[2] <- "ActivityDay"
dailyActivity$ActivityDay <- format(mdy(dailyActivity$ActivityDay), "%m/%d/%Y")
dailyActivity$ActivityDay <- as.Date(dailyActivity$ActivityDay, "%m/%d/%Y")
dailyActivity$Weekday <- wday(dailyActivity$ActivityDay, label = TRUE, abbr = FALSE)
dailyCalories$ActivityDay <- format(mdy(dailyCalories$ActivityDay), "%m/%d/%Y")
dailyCalories$ActivityDay <- as.Date(dailyCalories$ActivityDay, "%m/%d/%Y")
dailyCalories$Weekday <- wday(dailyCalories$ActivityDay, label = TRUE, abbr = FALSE)
dailyIntensities$ActivityDay <- format(mdy(dailyIntensities$ActivityDay), "%m/%d/%Y")
dailyIntensities$ActivityDay <- as.Date(dailyIntensities$ActivityDay, "%m/%d/%Y")
dailyIntensities$Weekday <- wday(dailyIntensities$ActivityDay, label = TRUE, abbr = FALSE)
dailySteps$ActivityDay <- format(mdy(dailySteps$ActivityDay), "%m/%d/%Y")
dailySteps$ActivityDay <- as.Date(dailySteps$ActivityDay, "%m/%d/%Y")
dailySteps$Weekday <- wday(dailySteps$ActivityDay, label = TRUE, abbr = FALSE)
hourlyCalories <- separate(hourlyCalories, ActivityHour, into=c('ActivityDay', 'ActivityHourOnly'), sep=' ', remove=FALSE)
hourlyCalories$ActivityHourOnly <- NULL
hourlyCalories$ActivityDay <- format(mdy(hourlyCalories$ActivityDay), "%m/%d/%Y")
hourlyCalories$ActivityDay <- as.Date(hourlyCalories$ActivityDay, "%m/%d/%Y")
hourlyCalories$ActivityHour <- format(mdy_hms(hourlyCalories$ActivityHour), "%m/%d/%Y %H:%M:%S")
hourlyCalories$ActivityHour <- mdy_hms(hourlyCalories$ActivityHour)
hourlyCalories$Weekday <- wday(hourlyCalories$ActivityDay, label = TRUE, abbr = FALSE)
hourlyIntensities <- separate(hourlyIntensities, ActivityHour, into=c('ActivityDay', 'ActivityHourOnly'), sep=' ', remove=FALSE)
hourlyIntensities$ActivityHourOnly <- NULL
hourlyIntensities$ActivityDay <- format(mdy(hourlyIntensities$ActivityDay), "%m/%d/%Y")
hourlyIntensities$ActivityDay <- as.Date(hourlyIntensities$ActivityDay, "%m/%d/%Y")
hourlyIntensities$ActivityHour <- format(mdy_hms(hourlyIntensities$ActivityHour), "%m/%d/%Y %H:%M:%S")
hourlyIntensities$ActivityHour <- mdy_hms(hourlyIntensities$ActivityHour)
hourlyIntensities$Weekday <- wday(hourlyIntensities$ActivityDay, label = TRUE, abbr = FALSE)
hourlySteps <- separate(hourlySteps, ActivityHour, into=c('ActivityDay', 'ActivityHourOnly'), sep=' ', remove=FALSE)
hourlySteps$ActivityHourOnly <- NULL
hourlySteps$ActivityDay <- format(mdy(hourlySteps$ActivityDay), "%m/%d/%Y")
hourlySteps$ActivityDay <- as.Date(hourlySteps$ActivityDay, "%m/%d/%Y")
hourlySteps$ActivityHour <- format(mdy_hms(hourlySteps$ActivityHour), "%m/%d/%Y %H:%M:%S")
hourlySteps$ActivityHour <- mdy_hms(hourlySteps$ActivityHour)
hourlySteps$Weekday <- wday(hourlySteps$ActivityDay, label = TRUE, abbr = FALSE)
sleepDay <- separate(sleepDay, SleepDay, into=c('ActivityDay', 'SleepDayOnly'), sep=' ', remove=FALSE)
sleepDay$SleepDayOnly <- NULL
sleepDay$sleepDay <- NULL
sleepDay$ActivityDay <- format(mdy(sleepDay$ActivityDay), "%m/%d/%Y")
sleepDay$ActivityDay <- as.Date(sleepDay$ActivityDay, "%m/%d/%Y")
sleepDay$SleepDay <- format(mdy_hms(sleepDay$SleepDay), "%m/%d/%Y %H:%M:%S")
sleepDay$SleepDay <- mdy_hms(sleepDay$SleepDay)
sleepDay$Weekday <- wday(sleepDay$ActivityDay, label = TRUE, abbr = FALSE)
weightLogInfo <- separate(weightLogInfo, Date, into=c('ActivityDay', 'DateOnly'), sep=' ', remove=FALSE)
weightLogInfo$DateOnly <- NULL
weightLogInfo$ActivityDay <- format(mdy(weightLogInfo$ActivityDay), "%m/%d/%Y")
weightLogInfo$ActivityDay <- as.Date(weightLogInfo$ActivityDay, "%m/%d/%Y")
weightLogInfo$Date <- format(mdy_hms(weightLogInfo$Date), "%m/%d/%Y %H:%M:%S")
weightLogInfo$Date <- mdy_hms(weightLogInfo$Date)
weightLogInfo$Weekday <- wday(weightLogInfo$ActivityDay, label = TRUE, abbr = FALSE)
colnames(weightLogInfo)[2] <- "ActivityHour"
minuteSleep <- separate(minuteSleep, date, into=c('ActivityDay', 'ActivityHourOnly'), sep=' ', remove=FALSE)
minuteSleep$ActivityHourOnly <- NULL
minuteSleep$ActivityDay <- format(mdy(minuteSleep$ActivityDay), "%m/%d/%Y")
minuteSleep$ActivityDay <- as.Date(minuteSleep$ActivityDay, "%m/%d/%Y")
minuteSleep$ActivityHour <- format(mdy_hms(minuteSleep$date), "%m/%d/%Y %H:%M:%S")
minuteSleep$ActivityHour <- mdy_hms(minuteSleep$date)
minuteSleep$Weekday <- wday(minuteSleep$ActivityDay, label = TRUE, abbr = FALSE)
minuteMETs <- separate(minuteMETs, ActivityMinute, into=c('ActivityDay', 'ActivityHourOnly'), sep=' ', remove=FALSE)
minuteMETs$ActivityHourOnly <- NULL
minuteMETs$ActivityDay <- format(mdy(minuteMETs$ActivityDay), "%m/%d/%Y")
minuteMETs$ActivityDay <- as.Date(minuteMETs$ActivityDay, "%m/%d/%Y")
minuteMETs$ActivityMinute <- format(mdy_hms(minuteMETs$ActivityMinute), "%m/%d/%Y %H:%M:%S")
minuteMETs$ActivityMinute <- mdy_hms(minuteMETs$ActivityMinute)
minuteMETs$Weekday <- wday(minuteMETs$ActivityDay, label = TRUE, abbr = FALSE)
Going through the data, it becomes apparent that the daily datasets are incomplete, such as Id 1624580081 not showing activity on 2016-03-12 even though there is activity in other tables, inconsistent numbers of objects between tables, etc.
I pull summary data from the hourly and minute datasets to create more complete daily datasets (and replace the datasets that were pre-populated from the .csv’s)
dailyCalories <- hourlyCalories %>% group_by(Id, ActivityDay, Weekday) %>% arrange(ActivityDay, Id) %>% summarize(Calories = sum(Calories))
dailySleep <- minuteSleep %>% group_by(Id, ActivityDay, Weekday) %>% summarize(TotalSleepRecords = n_distinct(logId), TotalMinutesAsleep =sum(value == 1), TotalMinutesInBed = n(), .groups = 'keep')
rm(sleepDay, minuteSleep)
dailySteps <- hourlySteps %>% group_by(Id, ActivityDay, Weekday) %>% arrange(ActivityDay, Id) %>% summarize(StepTotal= sum(StepTotal))
MinuteMETs was a particularly challenging table. METs are measured from 10-189 on the table, but when comparing to a quick Google search it is not super clear what the METs measurements are when compared to a more standard 1-6 (values exceeding 6 indicate vigorous intensity).
To accurately report this data would require consulting the database owner for further clarity around the measurement expectations, however I’ve gone through some numbers as an example on how this may look
minuteMETs <- minuteMETs %>% arrange(ActivityMinute) %>% group_by(Id) %>% filter(duplicated(ActivityMinute) ==FALSE)
dailyMETs <- minuteMETs %>% group_by(Id, ActivityDay, Weekday) %>% summarize(SedentaryMinutes = sum(METs >= 10 & METs <= 20), LightlyActiveMinutes = sum(METs >= 21 & METs <= 50), ModeratelyActiveMinutes = sum(METs >= 51 & METs <= 80), VeryActiveMinutes = sum(METs >=81), TotalMinutesMeasured = sum(SedentaryMinutes, LightlyActiveMinutes, ModeratelyActiveMinutes, VeryActiveMinutes))
rm(minuteMETs)
Now that I have consistent subsets of data, I’ll use them to create a new dailyActivity table. To potentially help when it comes to analysis, I’m breaking out the data in two primary ways:
dailyActivity_Full <- dailySteps %>% left_join(select(dailyActivity, Id, ActivityDay, Weekday, TrackerDistance), by=c("Id", "ActivityDay", "Weekday")) %>% left_join(dailyMETs, by=c("Id", "ActivityDay", "Weekday")) %>% left_join(dailyCalories, by=c("Id", "ActivityDay", "Weekday")) %>% left_join(dailySleep, by=c("Id", "ActivityDay", "Weekday"))
dailyActivity_NA <- dailyActivity_Full %>% mutate(TrackerDistance = replace_na(TrackerDistance, 0)) %>% mutate(TotalSleepRecords = replace_na(TotalSleepRecords, 0)) %>% mutate(TotalMinutesAsleep = replace_na(TotalMinutesAsleep, 0)) %>% mutate(TotalMinutesInBed = replace_na(TotalMinutesInBed, 0))
While reviewing the data, it became apparent that 4/12 has duplicate data. This is likely due to being stored in both the 3.12-4.12 and 4.12-5.12 datasets.
ggplot(data=dailyActivity_Full) + geom_bar(mapping=aes(x=ActivityDay, y=Calories, fill=Id), stat='identity') + labs(title="Calories by Day")
I’ll remove the duplicate data from the dailyActivity_Full, dailyActivity_NA, and weightLogInfo dataframes. Interestingly, the tracker distance is not repeated for the duplicate data, so we will exclude this from the distinct filter.
dailyActivity_Full <- dailyActivity_Full %>% distinct(Id, ActivityDay, Weekday, StepTotal, SedentaryMinutes, LightlyActiveMinutes, ModeratelyActiveMinutes, VeryActiveMinutes, TotalMinutesMeasured, Calories, TotalSleepRecords, TotalMinutesAsleep, TotalMinutesInBed, .keep_all=TRUE)
## Confirm unique calories and total calories are now the same
sum(dailyActivity_Full$ActivityDay == "2016-04-12")
## [1] 33
dailyActivity_Full %>% group_by(ActivityDay) %>% filter(ActivityDay == "2016-04-12") %>% summarize(Unique_Calories = n_distinct(Calories))
## # A tibble: 1 × 2
## ActivityDay Unique_Calories
## <date> <int>
## 1 2016-04-12 33
## Repeat for other dataframes
dailyActivity_NA <- dailyActivity_NA %>% distinct(Id, ActivityDay, Weekday, StepTotal, SedentaryMinutes, LightlyActiveMinutes, ModeratelyActiveMinutes, VeryActiveMinutes, TotalMinutesMeasured, Calories, TotalSleepRecords, TotalMinutesAsleep, TotalMinutesInBed, .keep_all=TRUE)
weightLogInfo <- weightLogInfo %>% distinct(Id, ActivityHour, ActivityDay, WeightKg, WeightPounds, Fat, BMI, IsManualReport, LogId, Weekday, .keep_all=TRUE)
## See the graphs again to confirm the issue is corrected. It's worth noting that 4/12 still shows higher calorie than other days, however unique count was confirmed to be correct with total rows at 33 of each.
ggplot(data=dailyActivity_Full) + geom_bar(mapping=aes(x=ActivityDay, y=Calories, fill=Id), stat='identity') + labs(title="Calories by Day")
To keep copies of the newly processed data, we can write the files as CSV for archiving or to migrate to other tools (such as to use with Tableau).
write.csv(weightLogInfo, "~/Data Analytics/R/RStudio/FitBit_CaseStudy/Exported_Data/weightLogInfo.csv")
write.csv(dailyActivity_Full, "~/Data Analytics/R/RStudio/FitBit_CaseStudy/Exported_Data/dailyActivity_Full.csv")
write.csv(dailyActivity_NA, "~/Data Analytics/R/RStudio/FitBit_CaseStudy/Exported_Data/dailyActivity_NA.csv")
I start with some bar charts to compare Weekday data to identify if there are any weekday trends as it correlates to smart device usage.
ggplot(data=dailyActivity_Full) + geom_bar(mapping=aes(x=Weekday, y=Calories, fill=Id), stat='identity') + labs(title="Calories by Day of Week")
ggplot(data=dailyActivity_Full) + geom_bar(mapping=aes(x=Weekday, y=TrackerDistance, fill=Id), stat='identity') + labs(title="Tracker Distance by Day of Week")
ggplot(data=dailyActivity_Full) + geom_bar(mapping=aes(x=Weekday, y=TotalMinutesAsleep, fill=Id), stat='identity') + labs(title="Sleep Time by Day of Week")
I then use Line (smooth/trend) to identify any trends in usage over time.
ggplot(data=weightLogInfo) + geom_smooth(mapping=aes(x=ActivityDay, y=WeightPounds)) + labs(title="Weight over time") + facet_wrap(~Id) + geom_point(mapping=aes(x=ActivityDay, y=WeightPounds))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(data=dailyActivity_Full) + geom_smooth(mapping=aes(x=ActivityDay, y=TotalMinutesAsleep)) + labs(title="Asleep Minutes - Trend Only")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(data=dailyActivity_Full) + geom_smooth(mapping=aes(x=ActivityDay, y=TotalMinutesAsleep)) + geom_point(mapping=aes(x=ActivityDay, y=TotalMinutesAsleep)) + labs(title="Asleep Minutes - with Points")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(data=dailyActivity_Full) + geom_smooth(mapping=aes(x=ActivityDay, y=TotalMinutesMeasured)) + facet_wrap(~Id) + theme(axis.text.x = element_text(angle = 45)) + labs(title="Minutes Measured by User ID", subtitle="By Day")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Finally, I’ll use point graphs to perform some correlation analysis.
ggplot(data=dailyActivity_Full) + geom_point(mapping=aes(x=Calories, y=StepTotal)) + geom_smooth(mapping=aes(x=Calories, y=StepTotal)) + theme(axis.text.x = element_text(angle = 45)) + labs(title="Step Total vs. Calories")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(data=dailyActivity_Full) + geom_point(mapping=aes(x=Calories, y=LightlyActiveMinutes + ModeratelyActiveMinutes + VeryActiveMinutes)) + geom_smooth(mapping=aes(x=Calories, y=LightlyActiveMinutes + ModeratelyActiveMinutes + VeryActiveMinutes)) + labs(title="Total Active Minutes vs. Calories", subtitle="Lightly, Moderately, and Very Tracking Combined", y="Total Active Minutes")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(data=dailyActivity_Full) + geom_point(mapping=aes(x=StepTotal, y=TotalMinutesAsleep)) + geom_smooth(mapping=aes(x=StepTotal, y=TotalMinutesAsleep)) + labs(title="Total Minutes Asleep vs. Step Total")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(data=dailyActivity_Full) + geom_point(mapping=aes(x=Calories, y=TotalMinutesAsleep)) + geom_smooth(mapping=aes(x=Calories, y=TotalMinutesAsleep)) + labs(title="Total Minutes Asleep vs. Calories")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
For the conclusion, I’ll go back to the three initial questions that were proposed:
Of the trends identified in this document, I believe the most notable findings to be: