Bellabeat is a succesful small company that manufactures health-focused smart products. Bellabeat products empowered women with knowledge regarding their own health and habits. They have the potential to become a larger player in the global smart device market. In this case study, I analyzed FitBit Fitness Tracked Data to help find insights and recommendations that might help unlock new growth opportunities.
***
This case study used the FitBit Fitness Tracked Data posted in Kaggle. By looking into the data we could get several takeaways:
#fundamental building blocks to support the entire end-to-end workflow
library(tidyverse)
#to provide summary statistics
library(skimr)
#examining and cleaning dirty data
library(janitor)
#Lubridate makes it easier to do the things R does with date-times
library(lubridate)
#Scales to better state the numbers shown in visualization
library(scales)
setwd("~/R/Case Study 2/Fitabase Data 4.12.16-5.12.16")
dailyActivity_merged <- read_csv("dailyActivity_merged.csv")
sleepDay_merged <- read_csv("sleepDay_merged.csv")
daily_activity <- clean_names(dailyActivity_merged)
sleepday <- clean_names(sleepDay_merged)
skim_without_charts(daily_activity)
| Name | daily_activity |
| Number of rows | 940 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 14 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| activity_date | 0 | 1 | 8 | 9 | 0 | 31 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 4.855407e+09 | 2.424805e+09 | 1503960366 | 2.320127e+09 | 4.445115e+09 | 6.962181e+09 | 8.877689e+09 |
| total_steps | 0 | 1 | 7.637910e+03 | 5.087150e+03 | 0 | 3.789750e+03 | 7.405500e+03 | 1.072700e+04 | 3.601900e+04 |
| total_distance | 0 | 1 | 5.490000e+00 | 3.920000e+00 | 0 | 2.620000e+00 | 5.240000e+00 | 7.710000e+00 | 2.803000e+01 |
| tracker_distance | 0 | 1 | 5.480000e+00 | 3.910000e+00 | 0 | 2.620000e+00 | 5.240000e+00 | 7.710000e+00 | 2.803000e+01 |
| logged_activities_distance | 0 | 1 | 1.100000e-01 | 6.200000e-01 | 0 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 4.940000e+00 |
| very_active_distance | 0 | 1 | 1.500000e+00 | 2.660000e+00 | 0 | 0.000000e+00 | 2.100000e-01 | 2.050000e+00 | 2.192000e+01 |
| moderately_active_distance | 0 | 1 | 5.700000e-01 | 8.800000e-01 | 0 | 0.000000e+00 | 2.400000e-01 | 8.000000e-01 | 6.480000e+00 |
| light_active_distance | 0 | 1 | 3.340000e+00 | 2.040000e+00 | 0 | 1.950000e+00 | 3.360000e+00 | 4.780000e+00 | 1.071000e+01 |
| sedentary_active_distance | 0 | 1 | 0.000000e+00 | 1.000000e-02 | 0 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.100000e-01 |
| very_active_minutes | 0 | 1 | 2.116000e+01 | 3.284000e+01 | 0 | 0.000000e+00 | 4.000000e+00 | 3.200000e+01 | 2.100000e+02 |
| fairly_active_minutes | 0 | 1 | 1.356000e+01 | 1.999000e+01 | 0 | 0.000000e+00 | 6.000000e+00 | 1.900000e+01 | 1.430000e+02 |
| lightly_active_minutes | 0 | 1 | 1.928100e+02 | 1.091700e+02 | 0 | 1.270000e+02 | 1.990000e+02 | 2.640000e+02 | 5.180000e+02 |
| sedentary_minutes | 0 | 1 | 9.912100e+02 | 3.012700e+02 | 0 | 7.297500e+02 | 1.057500e+03 | 1.229500e+03 | 1.440000e+03 |
| calories | 0 | 1 | 2.303610e+03 | 7.181700e+02 | 0 | 1.828500e+03 | 2.134000e+03 | 2.793250e+03 | 4.900000e+03 |
skim_without_charts(sleepday)
| Name | sleepday |
| Number of rows | 413 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| sleep_day | 0 | 1 | 20 | 21 | 0 | 31 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 5.000979e+09 | 2.06036e+09 | 1503960366 | 3977333714 | 4702921684 | 6962181067 | 8792009665 |
| total_sleep_records | 0 | 1 | 1.120000e+00 | 3.50000e-01 | 1 | 1 | 1 | 1 | 3 |
| total_minutes_asleep | 0 | 1 | 4.194700e+02 | 1.18340e+02 | 58 | 361 | 433 | 490 | 796 |
| total_time_in_bed | 0 | 1 | 4.586400e+02 | 1.27100e+02 | 61 | 403 | 463 | 526 | 961 |
#separate date time in sleepday merged
sleepday <- sleepday %>%
separate(sleep_day,c("sleep_date","time")," ")
#remove time column since it's only showing 12 am
sleepday$time <- NULL
#convert activity_date from chr to date
daily_activity$activity_date <- mdy(daily_activity$activity_date)
sleepday$sleep_date<- mdy(sleepday$sleep_date)
#convert id from numeric to character
daily_activity$id <- as.character(daily_activity$id)
sleepday$id <- as.character(sleepday$id)
n_distinct(daily_activity$id)
## [1] 33
n_distinct(sleepday$id)
## [1] 24
I decided to use ID as the basis to perform the analysis. This is related to the bussiness task thatinquire how people are already using their smart devices. First I looked into the daily_activity.
#group by id to observe what each id doing
da.id <- group_by(daily_activity, id)
#first we look at how many days the FitBit users participated in the data collection process
da.id.datelength <- summarize(da.id, min(activity_date), max(activity_date), length(activity_date))
#rename the column in da.id.datelength to make more sense
da.id.datelength <- rename(da.id.datelength, start_date = "min(activity_date)", end_date = "max(activity_date)", num_of_days = "length(activity_date)")
#count the number of IDs participated until 12th of May
da.id.fullparticipationcount <- count(da.id.datelength, num_of_days == 31)
#Visualize the participation graph
ggplot(data = da.id.datelength)+
geom_bar(mapping = aes(x=num_of_days, fill=id))+
ylim(0,30)+
labs(title="Numbers of IDs by Length of Participations", subtitle="Sample of 33 FitBit Tracker IDs", y="IDs Count", x="Length of Participations", legend="ID")+
annotate("text", x=28,y=29, label="21 FitBit IDs tracked data", size=2.5)+
annotate("text", x=28,y=27, label="Since the 12th of April", size=2.5)+
annotate("text", x=28,y=25, label="Up until the 12th of May", size=2.5)+
scale_fill_discrete(name="ID")
#21 IDs tracked data up until 12th of may
#while 12 IDs stopped tracking data before the end of data collection process
#I decided to filter out the 12 IDs that didn't track data until the 12th of May
da.id.complete <- da.id.datelength[,-(2:3)] %>% filter(num_of_days==31)
#I created table for the 21 IDs as the base of further analysis
da.id.base <- da.id.complete [,-2]
Armed with 21 IDs as Primary Keys, I continued my data journey into the data about daily activity. I wanted to see how the devices tracked time and distance related to calories burned.
First, I looked into the aspect of time.
# I filtered for the active minutes
da.id.active_min_cal <- da.id %>%
select(-(2:10)) %>%
summarize(sum(very_active_minutes),sum(fairly_active_minutes),sum(lightly_active_minutes),sum(sedentary_minutes),sum(calories))
#Next I joined the 21 IDs as the base to to the active min see how these devices are used by their users
da.id.complete.active_min_cal <- left_join(da.id.base, da.id.active_min_cal, by="id")
#I needed to rename the column header to ease the process in the next step
names(da.id.complete.active_min_cal)[2:6] <- c("very_am_sum","fairly_am_sum","lightly_am_sum","sedentary_m_sum","calories_sum")
#I thought it was necessary to add one column of total_active_minutes consists of summary of all activity classifications minutes
da.id.complete.active_min_cal <- mutate(da.id.complete.active_min_cal, total_minutes = very_am_sum + fairly_am_sum + lightly_am_sum + sedentary_m_sum)
#Then I visualized the general relationship between the minutes spent and the calories burned
ggplot(data = da.id.complete.active_min_cal)+
geom_point(mapping = aes(x=total_minutes,y=calories_sum, color=id))+
scale_y_continuous(name = "Calories Burned", labels=comma)+
scale_x_continuous(name = "Total Minutes Spent", labels=comma)+
labs(title="Calories Burned by Total Minutes Spent", subtitle="Sample of 21 FitBit Tracker IDs")
#there are two column with similar content, total distance and tracker distance
#checking whether or not they're filled with the same numerical data
dist.total_minus_tracker <- transmute(da.id, total_minus_tracker = total_distance - tracker_distance)
arrange(dist.total_minus_tracker, desc(total_minus_tracker))
## # A tibble: 940 x 2
## # Groups: id [33]
## id total_minus_tracker
## <chr> <dbl>
## 1 6962181067 1.83
## 2 7007744171 1.16
## 3 7007744171 1.16
## 4 7007744171 1.14
## 5 7007744171 1.07
## 6 7007744171 1.06
## 7 7007744171 1.05
## 8 7007744171 0.980
## 9 7007744171 0.900
## 10 7007744171 0.880
## # ... with 930 more rows
# Filtered for the active distance
# By this time I had learned to immediately name the new column so I didn't have to rename it later
# I also realized that I don't have to use select(-) to filter out other columns since the summarize function aready doing that by itself
da.id.active_dist_cal <- da.id %>%
summarize(very_ad_sum = sum(very_active_distance), fairly_ad_sum = sum(moderately_active_distance), lightly_ad_sum = sum(light_active_distance), sedentary_d_sum = sum(sedentary_active_distance), calories_sum = sum(calories))
#Did they add up?
transmute(da.id, wrong_if_not_zero = (sum(total_distance) - sum(very_active_distance) - sum(moderately_active_distance) - sum(light_active_distance) - sum(sedentary_active_distance)))
## # A tibble: 940 x 2
## # Groups: id [33]
## id wrong_if_not_zero
## <chr> <dbl>
## 1 1503960366 0.130
## 2 1503960366 0.130
## 3 1503960366 0.130
## 4 1503960366 0.130
## 5 1503960366 0.130
## 6 1503960366 0.130
## 7 1503960366 0.130
## 8 1503960366 0.130
## 9 1503960366 0.130
## 10 1503960366 0.130
## # ... with 930 more rows
#Because the total calculation of very_ad_sum + fairly_ad_sum + lightly_ad_sum + sedentary_d_sum in some of the IDs didn't add up
#I decided to not use either the value in total_distance column or tracker_distance column
#Instead, I calculated the sum of each distance column to a new colum called total distance
da.id.active_dist_cal <- mutate(da.id.active_dist_cal, total_distance = very_ad_sum + fairly_ad_sum + lightly_ad_sum + sedentary_d_sum)
#Then I joined it with the 21 base ID to filter out IDs that stopped tracking data before the 12th of May
da.id.complete.active_dist_cal <- left_join(da.id.base, da.id.active_dist_cal, by="id")
#I visualized the data to look at the relationship between distance tracked and calories burned
ggplot(data = da.id.complete.active_dist_cal)+
geom_point(mapping = aes(x=total_distance,y=calories_sum, color=id))+
scale_y_continuous(name = "Calories Burned", labels=comma)+
scale_x_continuous(name = "Total Distance Reached", labels=comma)+
labs(title="Calories Burned by Total Distance Reached", subtitle="Sample of 21 FitBit Tracker IDs")
From this general look of relationship between total distance tracked and calories burned we could see that the plots are lined up in a loose manner, as distance increased so did the calories burned.
Next, I delved into the sleepday data, but there was a finding that affected my analysis decision.
sleepday.id <- sleepday %>%
group_by(id) %>%
summarise(total_time_in_bed = sum(total_time_in_bed),
total_sleep_time = sum(total_minutes_asleep))
#let's join the 21 base id to the sleep data with 24 id
sleepday.id.complete <- left_join(da.id.base, sleepday.id, by="id")
Before recommending actions, let us review the products already covered by Bellabeat:
Tracking sleep data is not as attractive as tracking activity data. This is probably due to the nature of wearables that might cause discomfort when users are trying to sleep. Bellabeat might unlock new great opportinity in this path by creating a product that focus on tracking sleep data without causing discomfort or disturb the attention of the user who are trying to sleep. I suggest something along smart mattress or bed or bed frame to track sleep data.
Users were able to burn high amount of calories in either longer or shorter time of activities. But those users are the minority. Meanwhile most users burned relatively low or decent amount of calories in eitther longer or shorter time of activities. Ideally, it would be a great if users can burn high amount of calories in the shortest of time, push them forward to the upper left corner of the scatterplot. Bellabeat can tap into this notion of ideal world by: