Introduction

This is a case study for the Google Data Analytics Certificate and will be performed on FitBit data using the FitBit Fitness Tracker Data. This case study will demonstrate skills learned through the Google Data Analytics Certificate course, with a heavy focus on R.

This document should highlight skills acquired through the Google Data Analytics Certification course, along with trying to address the case study use-case. Some repetition is kept on this document as a reference to the troubleshooting and thought process.

Ask

Bellabeat needs to better understand smart device tracking data to gain further insight into how consumers are using their smart devices. These insights should drive improvements to BellaBeat marketing strategies.

Questions we will explore further in the data:

Prepare

The Kaggle Dataset is a group of 29 CSV files, organized into two folders. Each folder appears to effectively be a month of data (labeled “3.12.16-4.11.16” and “4.12.16-5.12.16” respectively).

Upload the data to RStudio:

  • Select Upload in the Files view, then select Choose File
  • Select the .zip file that was downloaded from Kaggle.
  • RStudio will extract the zipped contents, and contents will now be visible in the Files view.

Read data for analysis

Of the 29 CSVs, I will be focusing primarily on the higher level summary data: hours and days. Second and minute data will not be used for this case study.

First I must install and load the tidyverse package.

## For Desktop Applications, we need to configure CRAN
r = getOption("repos")
r["CRAN"] = "http://cran.us.r-project.org"
options(repos = r)
install.packages("weatherData")

install.packages("tidyverse")
## package 'tidyverse' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\semar\AppData\Local\Temp\RtmpYFkol4\downloaded_packages
library(tidyverse)

Then, I utilize the read_csv function to prepare the data as dataframes

setwd("~/Data Analytics/R/RStudio/FitBit_CaseStudy/FitBit_CaseStudy_12022024/mturkfitbit_export_3.12.16-4.11.16/Fitabase Data 3.12.16-4.11.16")
dailyActivity_3.12 <- read_csv("dailyActivity_merged.csv")
hourlyCalories_3.12 <- read_csv("hourlyCalories_merged.csv")
hourlyIntensities_3.12 <- read_csv("hourlyIntensities_merged.csv")
hourlySteps_3.12 <- read_csv("hourlySteps_merged.csv")
weightLogInfo_3.12 <- read_csv("weightLogInfo_merged.csv")
minuteSleep_3.12 <- read_csv("minuteSleep_merged.csv")
minuteMETs_3.12 <- read_csv("minuteMETsNarrow_merged.csv")

setwd("~/Data Analytics/R/RStudio/FitBit_CaseStudy/FitBit_CaseStudy_12022024/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16")
dailyActivity_4.12 <- read_csv("dailyActivity_merged.csv")
hourlyCalories_4.12 <- read_csv("hourlyCalories_merged.csv")
hourlyIntensities_4.12 <- read_csv("hourlyIntensities_merged.csv")
hourlySteps_4.12 <- read_csv("hourlySteps_merged.csv")
weightLogInfo_4.12 <- read_csv("weightLogInfo_merged.csv")
minuteSleep_4.12 <- read_csv("minuteSleep_merged.csv")
minuteMETs_4.12 <- read_csv("minuteMETsNarrow_merged.csv")

There are 7 files in the 4.12 folder that do not exist in the 3.12 folder, listing those separately for organizational purposes. For variable labeling clarity, these will not have the date identifier at the end.

setwd("~/Data Analytics/R/RStudio/FitBit_CaseStudy/FitBit_CaseStudy_12022024/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16")
dailyCalories <- read_csv("dailyCalories_merged.csv")
dailyIntensities <- read_csv("dailyIntensities_merged.csv")
dailySteps <- read_csv("dailySteps_merged.csv")
sleepDay <- read_csv("sleepDay_merged.csv")

Get familiar with the data

Data can be viewed outside of R utilizing spreadsheets or SQL, however this Case study will be making use of various R functions (view, summary, glimpse, colnames, head, etc.). These functions should be repeated as needed for any dataframes (examples below).

head(weightLogInfo)
summary(weightLogInfo)
colnames(dailyActivity)
glimpse(dailyActivity)

Having a peek at some of the tables, I am able to identify a few trends in the data:

  • ID appears to be common in all of the tables. This can be helpful as a join field when we get to processing the data
  • Data is primarily sorted into time chunks: seconds, minute, hour, and day. With particular note that heartRate is the only seconds measurement, and the weightLogInfo tables are effectively daily (albeit, not recorded every day or could be taken multiple times in a day).
  • Since the data is separated by two months of similar data, it may make sense to combine these into a single table (such as combining dailyActivity_3.12 and dailyActivity_4.12).
  • Dates are not formatted properly and will need to be cleaned.

I’ll finish organizing the data by combining the 5 pairs of similar datasets into singular dataframes (and cleaning up the unused single dataframes). This will make it easier to work with as we move on to process and analyze.

dailyActivity <- rbind2(dailyActivity_3.12, dailyActivity_4.12)
rm(dailyActivity_3.12, dailyActivity_4.12)

hourlyCalories <- rbind2(hourlyCalories_3.12, hourlyCalories_4.12)
rm(hourlyCalories_3.12, hourlyCalories_4.12)

hourlyIntensities <- rbind2(hourlyIntensities_3.12, hourlyIntensities_4.12)
rm(hourlyIntensities_3.12, hourlyIntensities_4.12)

hourlySteps <- rbind2(hourlySteps_3.12, hourlySteps_4.12)
rm(hourlySteps_3.12, hourlySteps_4.12)

weightLogInfo <- rbind2(weightLogInfo_3.12, weightLogInfo_4.12)
rm(weightLogInfo_3.12, weightLogInfo_4.12)

minuteSleep <- rbind2(minuteSleep_3.12, minuteSleep_4.12)
rm(minuteSleep_3.12, minuteSleep_4.12)

minuteMETs <- rbind2(minuteMETs_3.12, minuteMETs_4.12)
rm(minuteMETs_3.12, minuteMETs_4.12)

Process

Now that our data is organized, I can explore cleaning up and transforming the data.

Date Formatting

I’ll start by correcting the date formatting for our datasets. I’ll also add a weekday column for the data, as this may be helpful in analysis (perhaps users are more active on certain days of the week, etc.)

colnames(dailyActivity)[2] <- "ActivityDay"
dailyActivity$ActivityDay <- format(mdy(dailyActivity$ActivityDay), "%m/%d/%Y")
dailyActivity$ActivityDay <- as.Date(dailyActivity$ActivityDay, "%m/%d/%Y")
dailyActivity$Weekday <- wday(dailyActivity$ActivityDay, label = TRUE, abbr = FALSE)


dailyCalories$ActivityDay <- format(mdy(dailyCalories$ActivityDay), "%m/%d/%Y")
dailyCalories$ActivityDay <- as.Date(dailyCalories$ActivityDay, "%m/%d/%Y")
dailyCalories$Weekday <- wday(dailyCalories$ActivityDay, label = TRUE, abbr = FALSE)

dailyIntensities$ActivityDay <- format(mdy(dailyIntensities$ActivityDay), "%m/%d/%Y")
dailyIntensities$ActivityDay <- as.Date(dailyIntensities$ActivityDay, "%m/%d/%Y")
dailyIntensities$Weekday <- wday(dailyIntensities$ActivityDay, label = TRUE, abbr = FALSE)

dailySteps$ActivityDay <- format(mdy(dailySteps$ActivityDay), "%m/%d/%Y")
dailySteps$ActivityDay <- as.Date(dailySteps$ActivityDay, "%m/%d/%Y")
dailySteps$Weekday <- wday(dailySteps$ActivityDay, label = TRUE, abbr = FALSE)

hourlyCalories <- separate(hourlyCalories, ActivityHour, into=c('ActivityDay', 'ActivityHourOnly'), sep=' ', remove=FALSE)
hourlyCalories$ActivityHourOnly <- NULL
hourlyCalories$ActivityDay <- format(mdy(hourlyCalories$ActivityDay), "%m/%d/%Y")
hourlyCalories$ActivityDay <- as.Date(hourlyCalories$ActivityDay, "%m/%d/%Y")
hourlyCalories$ActivityHour <- format(mdy_hms(hourlyCalories$ActivityHour), "%m/%d/%Y %H:%M:%S")
hourlyCalories$ActivityHour <- mdy_hms(hourlyCalories$ActivityHour)
hourlyCalories$Weekday <- wday(hourlyCalories$ActivityDay, label = TRUE, abbr = FALSE)

hourlyIntensities <- separate(hourlyIntensities, ActivityHour, into=c('ActivityDay', 'ActivityHourOnly'), sep=' ', remove=FALSE)
hourlyIntensities$ActivityHourOnly <- NULL
hourlyIntensities$ActivityDay <- format(mdy(hourlyIntensities$ActivityDay), "%m/%d/%Y")
hourlyIntensities$ActivityDay <- as.Date(hourlyIntensities$ActivityDay, "%m/%d/%Y")
hourlyIntensities$ActivityHour <- format(mdy_hms(hourlyIntensities$ActivityHour), "%m/%d/%Y %H:%M:%S")
hourlyIntensities$ActivityHour <- mdy_hms(hourlyIntensities$ActivityHour)
hourlyIntensities$Weekday <- wday(hourlyIntensities$ActivityDay, label = TRUE, abbr = FALSE)

hourlySteps <- separate(hourlySteps, ActivityHour, into=c('ActivityDay', 'ActivityHourOnly'), sep=' ', remove=FALSE)
hourlySteps$ActivityHourOnly <- NULL
hourlySteps$ActivityDay <- format(mdy(hourlySteps$ActivityDay), "%m/%d/%Y")
hourlySteps$ActivityDay <- as.Date(hourlySteps$ActivityDay, "%m/%d/%Y")
hourlySteps$ActivityHour <- format(mdy_hms(hourlySteps$ActivityHour), "%m/%d/%Y %H:%M:%S")
hourlySteps$ActivityHour <- mdy_hms(hourlySteps$ActivityHour)
hourlySteps$Weekday <- wday(hourlySteps$ActivityDay, label = TRUE, abbr = FALSE)

sleepDay <- separate(sleepDay, SleepDay, into=c('ActivityDay', 'SleepDayOnly'), sep=' ', remove=FALSE)
sleepDay$SleepDayOnly <- NULL
sleepDay$sleepDay <- NULL
sleepDay$ActivityDay <- format(mdy(sleepDay$ActivityDay), "%m/%d/%Y")
sleepDay$ActivityDay <- as.Date(sleepDay$ActivityDay, "%m/%d/%Y")
sleepDay$SleepDay <- format(mdy_hms(sleepDay$SleepDay), "%m/%d/%Y %H:%M:%S")
sleepDay$SleepDay <- mdy_hms(sleepDay$SleepDay)
sleepDay$Weekday <- wday(sleepDay$ActivityDay, label = TRUE, abbr = FALSE)

weightLogInfo <- separate(weightLogInfo, Date, into=c('ActivityDay', 'DateOnly'), sep=' ', remove=FALSE)
weightLogInfo$DateOnly <- NULL
weightLogInfo$ActivityDay <- format(mdy(weightLogInfo$ActivityDay), "%m/%d/%Y")
weightLogInfo$ActivityDay <- as.Date(weightLogInfo$ActivityDay, "%m/%d/%Y")
weightLogInfo$Date <- format(mdy_hms(weightLogInfo$Date), "%m/%d/%Y %H:%M:%S")
weightLogInfo$Date <- mdy_hms(weightLogInfo$Date)
weightLogInfo$Weekday <- wday(weightLogInfo$ActivityDay, label = TRUE, abbr = FALSE)
colnames(weightLogInfo)[2] <- "ActivityHour"

minuteSleep <- separate(minuteSleep, date, into=c('ActivityDay', 'ActivityHourOnly'), sep=' ', remove=FALSE)
minuteSleep$ActivityHourOnly <- NULL
minuteSleep$ActivityDay <- format(mdy(minuteSleep$ActivityDay), "%m/%d/%Y")
minuteSleep$ActivityDay <- as.Date(minuteSleep$ActivityDay, "%m/%d/%Y")
minuteSleep$ActivityHour <- format(mdy_hms(minuteSleep$date), "%m/%d/%Y %H:%M:%S")
minuteSleep$ActivityHour <- mdy_hms(minuteSleep$date)
minuteSleep$Weekday <- wday(minuteSleep$ActivityDay, label = TRUE, abbr = FALSE)

minuteMETs <- separate(minuteMETs, ActivityMinute, into=c('ActivityDay', 'ActivityHourOnly'), sep=' ', remove=FALSE)
minuteMETs$ActivityHourOnly <- NULL
minuteMETs$ActivityDay <- format(mdy(minuteMETs$ActivityDay), "%m/%d/%Y")
minuteMETs$ActivityDay <- as.Date(minuteMETs$ActivityDay, "%m/%d/%Y")
minuteMETs$ActivityMinute <- format(mdy_hms(minuteMETs$ActivityMinute), "%m/%d/%Y %H:%M:%S")
minuteMETs$ActivityMinute <- mdy_hms(minuteMETs$ActivityMinute)
minuteMETs$Weekday <- wday(minuteMETs$ActivityDay, label = TRUE, abbr = FALSE)

Summary DataFrames

Going through the data, it becomes apparent that the daily datasets are incomplete, such as Id 1624580081 not showing activity on 2016-03-12 even though there is activity in other tables, inconsistent numbers of objects between tables, etc.

I pull summary data from the hourly and minute datasets to create more complete daily datasets (and replace the datasets that were pre-populated from the .csv’s)

dailyCalories <- hourlyCalories %>% group_by(Id, ActivityDay, Weekday) %>% arrange(ActivityDay, Id) %>% summarize(Calories = sum(Calories))

dailySleep <- minuteSleep %>% group_by(Id, ActivityDay, Weekday) %>% summarize(TotalSleepRecords = n_distinct(logId), TotalMinutesAsleep =sum(value == 1), TotalMinutesInBed = n(), .groups = 'keep')
rm(sleepDay, minuteSleep)

dailySteps <- hourlySteps %>% group_by(Id, ActivityDay, Weekday) %>% arrange(ActivityDay, Id) %>% summarize(StepTotal= sum(StepTotal))

MinuteMETs was a particularly challenging table. METs are measured from 10-189 on the table, but when comparing to a quick Google search it is not super clear what the METs measurements are when compared to a more standard 1-6 (values exceeding 6 indicate vigorous intensity).

To accurately report this data would require consulting the database owner for further clarity around the measurement expectations, however I’ve gone through some numbers as an example on how this may look

  • I remove duplicate measures (some results pulled back more than 1440 minutes in a day)
  • I create Summaries based on the count of MET’s within a range
minuteMETs <- minuteMETs %>% arrange(ActivityMinute) %>% group_by(Id) %>% filter(duplicated(ActivityMinute) ==FALSE)

dailyMETs <- minuteMETs %>% group_by(Id, ActivityDay, Weekday) %>% summarize(SedentaryMinutes = sum(METs >= 10 & METs <= 20), LightlyActiveMinutes = sum(METs >= 21 & METs <= 50), ModeratelyActiveMinutes = sum(METs >= 51 & METs <= 80), VeryActiveMinutes = sum(METs >=81), TotalMinutesMeasured = sum(SedentaryMinutes, LightlyActiveMinutes, ModeratelyActiveMinutes, VeryActiveMinutes))
rm(minuteMETs)

Create Final Daily Activity

Now that I have consistent subsets of data, I’ll use them to create a new dailyActivity table. To potentially help when it comes to analysis, I’m breaking out the data in two primary ways:

  • dailyActivity_Full: Contains NA values where certain activity wasn’t tracked by a user on a day.
    • drop_na can be used on this table to perform other views into the data
  • dailyActivity_NA: NA is changed to 0.
    • This can be helpful for time trends (to see if device wasn’t used, or perhaps wasn’t working)
dailyActivity_Full <- dailySteps %>%  left_join(select(dailyActivity, Id, ActivityDay, Weekday, TrackerDistance), by=c("Id", "ActivityDay", "Weekday")) %>% left_join(dailyMETs, by=c("Id", "ActivityDay", "Weekday")) %>% left_join(dailyCalories, by=c("Id", "ActivityDay", "Weekday")) %>% left_join(dailySleep, by=c("Id", "ActivityDay", "Weekday"))

dailyActivity_NA <- dailyActivity_Full %>% mutate(TrackerDistance = replace_na(TrackerDistance, 0)) %>% mutate(TotalSleepRecords = replace_na(TotalSleepRecords, 0)) %>% mutate(TotalMinutesAsleep = replace_na(TotalMinutesAsleep, 0)) %>% mutate(TotalMinutesInBed = replace_na(TotalMinutesInBed, 0))

Remove Duplicates

While reviewing the data, it became apparent that 4/12 has duplicate data. This is likely due to being stored in both the 3.12-4.12 and 4.12-5.12 datasets.

ggplot(data=dailyActivity_Full) + geom_bar(mapping=aes(x=ActivityDay, y=Calories, fill=Id), stat='identity') + labs(title="Calories by Day")


I’ll remove the duplicate data from the dailyActivity_Full, dailyActivity_NA, and weightLogInfo dataframes. Interestingly, the tracker distance is not repeated for the duplicate data, so we will exclude this from the distinct filter.

dailyActivity_Full <- dailyActivity_Full %>% distinct(Id, ActivityDay, Weekday, StepTotal, SedentaryMinutes, LightlyActiveMinutes, ModeratelyActiveMinutes, VeryActiveMinutes, TotalMinutesMeasured, Calories, TotalSleepRecords, TotalMinutesAsleep, TotalMinutesInBed, .keep_all=TRUE)

## Confirm unique calories and total calories are now the same
sum(dailyActivity_Full$ActivityDay == "2016-04-12")
## [1] 33
dailyActivity_Full %>% group_by(ActivityDay) %>% filter(ActivityDay == "2016-04-12") %>% summarize(Unique_Calories = n_distinct(Calories))
## # A tibble: 1 × 2
##   ActivityDay Unique_Calories
##   <date>                <int>
## 1 2016-04-12               33
## Repeat for other dataframes
dailyActivity_NA <- dailyActivity_NA %>% distinct(Id, ActivityDay, Weekday, StepTotal, SedentaryMinutes, LightlyActiveMinutes, ModeratelyActiveMinutes, VeryActiveMinutes, TotalMinutesMeasured, Calories, TotalSleepRecords, TotalMinutesAsleep, TotalMinutesInBed, .keep_all=TRUE)

weightLogInfo <- weightLogInfo %>% distinct(Id, ActivityHour, ActivityDay, WeightKg, WeightPounds, Fat, BMI, IsManualReport, LogId, Weekday, .keep_all=TRUE)

## See the graphs again to confirm the issue is corrected.  It's worth noting that 4/12 still shows higher calorie than other days, however unique count was confirmed to be correct with total rows at 33 of each.

ggplot(data=dailyActivity_Full) + geom_bar(mapping=aes(x=ActivityDay, y=Calories, fill=Id), stat='identity') + labs(title="Calories by Day")


Export Processed data as new CSV files

To keep copies of the newly processed data, we can write the files as CSV for archiving or to migrate to other tools (such as to use with Tableau).

write.csv(weightLogInfo, "~/Data Analytics/R/RStudio/FitBit_CaseStudy/Exported_Data/weightLogInfo.csv")

write.csv(dailyActivity_Full, "~/Data Analytics/R/RStudio/FitBit_CaseStudy/Exported_Data/dailyActivity_Full.csv")

write.csv(dailyActivity_NA, "~/Data Analytics/R/RStudio/FitBit_CaseStudy/Exported_Data/dailyActivity_NA.csv")

Analyze

Weekday Data

I start with some bar charts to compare Weekday data to identify if there are any weekday trends as it correlates to smart device usage.

ggplot(data=dailyActivity_Full) + geom_bar(mapping=aes(x=Weekday, y=Calories, fill=Id), stat='identity') + labs(title="Calories by Day of Week")

ggplot(data=dailyActivity_Full) + geom_bar(mapping=aes(x=Weekday, y=TrackerDistance, fill=Id), stat='identity') + labs(title="Tracker Distance by Day of Week")

ggplot(data=dailyActivity_Full) + geom_bar(mapping=aes(x=Weekday, y=TotalMinutesAsleep, fill=Id), stat='identity') + labs(title="Sleep Time by Day of Week")


Weekday Data Findings

  • Calorie activity is pretty consistent regardless of day of week, with a note that Tuesdays have a higher average calorie count.
  • Weekends have a higher average sleep time, with particular note that Friday’s have the lowest total sleep recorded.

Timeline data

I then use Line (smooth/trend) to identify any trends in usage over time.

ggplot(data=weightLogInfo) + geom_smooth(mapping=aes(x=ActivityDay, y=WeightPounds)) + labs(title="Weight over time") + facet_wrap(~Id) + geom_point(mapping=aes(x=ActivityDay, y=WeightPounds))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(data=dailyActivity_Full) + geom_smooth(mapping=aes(x=ActivityDay, y=TotalMinutesAsleep)) + labs(title="Asleep Minutes - Trend Only")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ggplot(data=dailyActivity_Full) + geom_smooth(mapping=aes(x=ActivityDay, y=TotalMinutesAsleep)) + geom_point(mapping=aes(x=ActivityDay, y=TotalMinutesAsleep)) + labs(title="Asleep Minutes - with Points") 
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ggplot(data=dailyActivity_Full) + geom_smooth(mapping=aes(x=ActivityDay, y=TotalMinutesMeasured)) + facet_wrap(~Id) + theme(axis.text.x = element_text(angle = 45)) + labs(title="Minutes Measured by User ID", subtitle="By Day")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'


Timeline Data Findings

  • Looking at minutes measured by ID indicates 7 users that went extended periods without time tracked. This is likely due to devices not being used.
  • Small trends can be seen in various of the charts, however these trends become less apparent when zoomed out along with a plot graph. Given the trend change over the duration of two months is quiet small, there likely isn’t enough data to make a solid conclusion (in the range of a 2% trend).
  • Only 13 users recorded weight data. Of those, 8 users had 2 or less measurements.

Correlate Data

Finally, I’ll use point graphs to perform some correlation analysis.

ggplot(data=dailyActivity_Full) + geom_point(mapping=aes(x=Calories, y=StepTotal)) + geom_smooth(mapping=aes(x=Calories, y=StepTotal)) + theme(axis.text.x = element_text(angle = 45)) + labs(title="Step Total vs. Calories")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ggplot(data=dailyActivity_Full) + geom_point(mapping=aes(x=Calories, y=LightlyActiveMinutes + ModeratelyActiveMinutes + VeryActiveMinutes)) + geom_smooth(mapping=aes(x=Calories, y=LightlyActiveMinutes + ModeratelyActiveMinutes + VeryActiveMinutes)) + labs(title="Total Active Minutes vs. Calories", subtitle="Lightly, Moderately, and Very Tracking Combined", y="Total Active Minutes")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ggplot(data=dailyActivity_Full) + geom_point(mapping=aes(x=StepTotal, y=TotalMinutesAsleep)) + geom_smooth(mapping=aes(x=StepTotal, y=TotalMinutesAsleep)) + labs(title="Total Minutes Asleep vs. Step Total")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ggplot(data=dailyActivity_Full) + geom_point(mapping=aes(x=Calories, y=TotalMinutesAsleep)) + geom_smooth(mapping=aes(x=Calories, y=TotalMinutesAsleep)) + labs(title="Total Minutes Asleep vs. Calories")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'


Correlate Data Findings

  • As expected, there appears to be a correlation between User activity and calories, as well as total steps and calories. This is indication that a majority of user activity comes from step related activity. Smaller correlation between these values would indicate that calories are being spend on other exercises (such as weights, biking, etc.)
  • Sleep doesn’t appear to have a direct correlation to user activity (StepTotal or Calories)

Conclusion

For the conclusion, I’ll go back to the three initial questions that were proposed: