The idea for this project came up while I was doing my Bellabeat project for the Google course. I ran into a problem on my first attempt to plot a line chart from the heart rate data frame: the recorded data is very inconsistent, with uneven time intervals between recordings for each user.
Ideally, each user would have a steady time interval, let’s say every five seconds from the very first day to the last day of participation, with heart rate values filled in and “NA” wherever the user was absent. Instead, the data frame simply leaves out the time intervals with missing values.
When the time intervals with missing values are skipped like this, the line chart automatically connects the recorded data across the absent interval instead of leaving a gap. The result is a misleading line chart, with a line drawn over intervals that are supposed to be empty.
This project fixes the problem by creating a data frame with a steady time interval from the first day to the last day, then left outer joining it with the heart rate data frame. The time intervals with no data are automatically filled with “NA”. Then we plot the line chart from this merged data frame.
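To make the idea concrete before touching the real data, here is a minimal sketch on made-up readings. The names `observed`, `grid` and `regular` and all the values are invented for illustration; only base R is used.

```r
# Toy heart rate readings with a gap between 00:00:10 and 00:00:25 (made-up values)
observed <- data.frame(
  Time  = as.POSIXct(c("2016-04-12 00:00:00", "2016-04-12 00:00:05",
                       "2016-04-12 00:00:10", "2016-04-12 00:00:25"),
                     tz = "UTC"),
  Value = c(70, 72, 71, 80)
)

# Regular 5-second grid covering the whole period
grid <- data.frame(
  Time = seq(min(observed$Time), max(observed$Time), by = "5 sec")
)

# Keep every grid row; intervals without a reading get Value = NA
regular <- merge(grid, observed, by = "Time", all.x = TRUE)
print(regular)
```

The two grid rows at 00:00:15 and 00:00:20 have no reading, so their `Value` is `NA`, and a line chart drawn from `regular` has something to break on.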
library("tidyverse")
library("lubridate")
library("viridis")
library("ggpubr")
library("ggrepel")
We only use the heart rate data frame, plus the daily activity data frame to categorize all users by activity level and participation level.
dailyActivity_merged <- read.csv("C:/Users/Family/Downloads/dailyActivity_merged.csv")
heartrate_seconds_merged <- read.csv("C:/Users/Family/Downloads/heartrate_seconds_merged.csv/heartrate_seconds_merged.csv")
Convert the recorded time in the “heartrate_seconds_merged” table from character class to date-time class.
heartrate_seconds_merged$Time <- parse_date_time(
heartrate_seconds_merged$Time,
"%m/%d/%y %I:%M:%S %p"
)
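A quick way to check the format string is to parse a couple of sample values first. This is only a sketch: the two strings are assumed examples of the raw “Time” format, not rows copied from the file.

```r
library(lubridate)

# Sample strings in the assumed raw format (month/day/year, 12-hour clock)
sample_times <- c("4/12/2016 7:21:00 AM", "4/12/2016 11:59:55 PM")

parsed <- parse_date_time(sample_times, "%m/%d/%y %I:%M:%S %p")

# Any NA here means the format string did not match that row
sum(is.na(parsed))

# parse_date_time() returns UTC unless a tz argument is given
tz(parsed)
```

Running the same `sum(is.na(...))` check on the real column after conversion is a cheap way to confirm that no row failed to parse.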
Then do the same for the “ActivityDate” column in the “dailyActivity_merged” data frame.
dailyActivity_merged$ActivityDate <-
mdy(dailyActivity_merged$ActivityDate)
Prepare the list of user numbers, one for each Id, and create a two-column data frame called “users_id”.
thirtythree <- c("01", "02", "03", "04", "05", "06",
"07", "08", "09", 10:33)
Users <- paste("User", thirtythree, sep = " ")
Id <- unlist(distinct(dailyActivity_merged, Id))
users_id <- tibble(Users, Id)
Then left join “users_id” to “dailyActivity_merged” in order to label each Id in the daily activity table with a user number.
dailyActivity_merged <- dailyActivity_merged %>%
left_join(users_id, by = "Id")
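If the number of participants is not known in advance, the labels can be generated from the Ids instead of being typed out. A small sketch, where `ids` is a hypothetical stand-in for the distinct Ids in the daily activity table and `users_id_sketch` is an illustrative name:

```r
# Hypothetical Ids standing in for distinct(dailyActivity_merged, Id)
ids <- c(1503960366, 1624580081, 1644430081)

users_id_sketch <- data.frame(
  Users = sprintf("User %02d", seq_along(ids)),  # zero-padded: "User 01", "User 02", ...
  Id    = ids,
  stringsAsFactors = FALSE
)
print(users_id_sketch)
```

The zero padding keeps the labels in the right order when they are sorted as text, which is why “User 01” is used rather than “User 1”.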
Create a data frame called “ALPL”, which stands for Activity Level and Participation Level. Besides “Users” and “Id”, it has two label columns. This data frame can be attached to any data frame in this study to tag each user with an activity level, showing how much time they spend on intense activity, and a participation level, showing how often they participated in the program.
ALPL <- dailyActivity_merged %>%
group_by(Users, Id) %>%
summarize(
MedVeryActive = median(VeryActiveMinutes),
UsageRecords = n_distinct(ActivityDate),
.groups='drop'
) %>%
mutate(
ActivityLevel = case_when(
MedVeryActive < 4 ~ "Low",
MedVeryActive <= 32 ~ "Med",
MedVeryActive < 211 ~ "High"
),
ParticipationLevel = case_when(
UsageRecords < 14 ~ "Low Usage",
UsageRecords < 21 ~ "Moderate Usage",
UsageRecords < 31 ~ "High Usage",
UsageRecords == 31 ~ "Daily Usage",
),
ActivityLevel = factor(
ActivityLevel,
levels=c('Low', 'Med', 'High')
),
ParticipationLevel = factor(
ParticipationLevel,
levels=c('Low Usage','Moderate Usage',
'High Usage','Daily Usage')
)
) %>%
subset(select = -c(MedVeryActive,UsageRecords))
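The `case_when()` conditions are evaluated in order, so each value gets the label of the first condition it satisfies. The sketch below applies the same activity thresholds to a few made-up medians (`toy` is an illustrative name) to show where the boundaries fall.

```r
library(dplyr)

# Made-up median values placed around the thresholds 4, 32 and 211
toy <- tibble(MedVeryActive = c(0, 3.5, 4, 32, 33, 210, 250))

toy <- toy %>%
  mutate(ActivityLevel = case_when(
    MedVeryActive < 4   ~ "Low",
    MedVeryActive <= 32 ~ "Med",
    MedVeryActive < 211 ~ "High"
  ))
print(toy)
```

Note that a value matching none of the conditions, such as 250 here, comes back as `NA`, so the last threshold has to sit above the largest median in the data.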
Create a new data frame called “heartrate_seconds_merged_ALPL” by attaching “ALPL” to “heartrate_seconds_merged”.
heartrate_seconds_merged_ALPL <- heartrate_seconds_merged %>%
  left_join(ALPL, by = "Id") %>%
  mutate(ActivityDay = date(Time),
         Weekday = wday(Time,
                        label = TRUE, week_start = 1),
         Hour = hour(Time))
In this newly created data frame, users are categorized by the time they spend on intense activity each day.
Now, let’s plot a line chart from “heartrate_seconds_merged_ALPL” to see what each user’s heart rate trend looks like.
heartrate_seconds_merged_ALPL %>%
ggplot(aes(Time, Value,
color=Users
)) +
geom_line() +
theme(axis.text.x = element_text(angle = 90,vjust=0.5),
strip.background = element_rect(fill = "palegoldenrod")) +
facet_wrap(~Users~ActivityLevel) +
labs(x='Time',
y='Heart Rate Value',
title = "Heart Rate Value vs Time",
subtitle = "By Users, Activity Level")
As we can see in some users’ line charts, the line automatically connects across the parts where data is absent. Let’s take a closer look at User 07, for instance.
heartrate_seconds_merged_ALPL %>%
filter(Users == "User 07") %>%
distinct(Users, ActivityLevel,
ParticipationLevel,
Days_of_Participation=n_distinct(ActivityDay))
## Users ActivityLevel ParticipationLevel Days_of_Participation
## 1 User 07 Low Daily Usage 4
User 07 has heart rate records on only 4 days. If it seems confusing that this user’s participation level is “Daily Usage”, that is because participation level is based on the number of days in the daily activity data frame, not this heart rate data frame. Yet the time range on this user’s chart runs from about April 17th to May 9th, far longer than the 4 days on record.
If we look at his chart individually, we can see that it automatically connects across periods whose times are not even included in the data frame.
heartrate_seconds_merged_ALPL %>%
filter(Users=="User 07") %>%
ggplot(aes(Time, Value,
color=Users
)) +
geom_line() +
theme(axis.text.x = element_text(angle = 90,vjust=0.5),
strip.background = element_rect(fill = "palegoldenrod")) +
facet_wrap(~Users~ActivityLevel) +
labs(x='Time',
y='Heart Rate Value',
title = "Heart Rate Value vs Time",
subtitle = "User 07") +
annotate(geom = "segment",
linewidth = 1,
x = as.POSIXct(c("2016-04-21", "2016-04-29", "2016-05-06"),tz="GMT"),
y = c(83, 90, 80),
xend = as.POSIXct(c("2016-04-21", "2016-04-29", "2016-05-06"),tz="GMT"),
yend = c(76, 83, 73),
arrow = arrow(type = "closed")) +
annotate("text",
as.POSIXct(c("2016-04-21", "2016-04-29", "2016-05-06"),tz="GMT"),
y=c(85, 92, 82),
label = "missing part",
hjust = 0.5, size = 5, fontface = "bold")
The sloped lines annotated with arrows are the missing parts. The heart rate data frame simply skips these time periods for this user. It is not that the “Time” column is recorded every 5 seconds as it is supposed to be, with “NA” in the “Value” column; these time periods are not present in the data frame at all.
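Rather than reading the missing parts off the chart, they can also be located directly by looking at the time difference between consecutive records. The sketch below uses made-up time stamps, and the 60-second threshold is an assumption chosen for illustration, not a value taken from the data.

```r
# Made-up time stamps for one user, with a multi-day jump in the middle
times <- as.POSIXct(c("2016-04-17 10:00:00", "2016-04-17 10:00:05",
                      "2016-04-17 10:00:10", "2016-04-21 08:30:00",
                      "2016-04-21 08:30:05"), tz = "UTC")

# Seconds between each record and the next one
step_secs <- as.numeric(diff(times), units = "secs")

# Assumed threshold: anything longer than a minute counts as a gap
gap_threshold <- 60
is_gap <- step_secs > gap_threshold

gaps <- data.frame(
  gap_start = times[-length(times)][is_gap],  # last record before the gap
  gap_end   = times[-1][is_gap]               # first record after the gap
)
print(gaps)
```

Each row of `gaps` is one stretch that a line chart would bridge with a straight segment.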
When we plot a line chart from this data frame, ggplot automatically connects across the missing parts instead of leaving them blank. The chart can therefore be misleading, because it suggests heart rate values were recorded during these periods.
To fix this and have ggplot2 leave the missing parts blank, the data frame needs a regular time record that is consistent for every user: every participant gets exactly the same period from beginning to end, with a time stamp every 5 seconds. For periods when a user is absent, the value is filled with “NA”.
To add that constant time record to our heart rate data frame, we first create a tibble called “TimeSeries”, covering the period from the very first day of the heart rate data frame to the last day.
We also create another tibble called “Users_heartrate”. This tibble contains all the user numbers that appear in the heart rate data frame.
# parse_date_time() returns UTC by default, so build the grid in UTC as well
TimeSeries <- tibble(
  Time = seq(as.POSIXct('2016-04-12 00:00:00', tz = "UTC"),
             as.POSIXct('2016-05-12 16:20:00', tz = "UTC"),
             by = '5 sec'))
Users_heartrate <- tibble(Users = unlist(
  distinct(heartrate_seconds_merged_ALPL, Users)))
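It helps to know how large this grid is before merging it with anything. The sketch below builds a one-minute grid to show how `seq()` behaves, then computes the row count for the full period without building it. The variable names are illustrative, and UTC is assumed for the time zone.

```r
# One minute of 5-second steps; both end points are included
one_minute <- seq(as.POSIXct("2016-04-12 00:00:00", tz = "UTC"),
                  as.POSIXct("2016-04-12 00:01:00", tz = "UTC"),
                  by = "5 sec")
length(one_minute)  # 13 time stamps: 0, 5, ..., 60 seconds

# Row count of the full grid, computed from its end points
full_start <- as.POSIXct("2016-04-12 00:00:00", tz = "UTC")
full_end   <- as.POSIXct("2016-05-12 16:20:00", tz = "UTC")
n_full <- as.numeric(difftime(full_end, full_start, units = "secs")) / 5 + 1
n_full  # 530161 rows per user
```

Since the grid is repeated for every user, the final data frame has this many rows times the number of users, so it is worth checking that it fits in memory.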
To complete it, we left join the “Users_heartrate” tibble with the ALPL data frame. The resulting “Users_heartrate_ALPL” holds the activity and participation level of every user in the heart rate data frame.
Users_heartrate_ALPL <- Users_heartrate %>%
left_join(ALPL, by = "Users")
And now we merge “Users_heartrate_ALPL” and “TimeSeries” into a new data frame called “UIT”. In this data frame, every user has exactly the same period of participation, recorded every 5 seconds.
UIT <-
merge(x = TimeSeries,
y = Users_heartrate_ALPL,
by = NULL)
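Calling `merge()` with `by = NULL` produces a cross join: every row of the first data frame is paired with every row of the second. A toy version with three time stamps and two users (all names and values invented) shows the shape of the result.

```r
time_toy <- data.frame(
  Time = seq(as.POSIXct("2016-04-12 00:00:00", tz = "UTC"),
             by = "5 sec", length.out = 3)
)
users_toy <- data.frame(Users = c("User 01", "User 02"),
                        stringsAsFactors = FALSE)

# by = NULL means no key columns, so the result is the Cartesian product
uit_toy <- merge(x = time_toy, y = users_toy, by = NULL)
print(uit_toy)
nrow(uit_toy)  # 3 time stamps x 2 users = 6 rows
```

With a recent dplyr (1.1.0 or later), `cross_join()` should give the same pairing if a tidyverse-only pipeline is preferred.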
Then we create a new data frame called “Regular_heartrate” by merging “UIT” with “heartrate_seconds_merged_ALPL”. This new data frame is a modified version of our original heart rate data frame, with a consistent time record for every user.
Regular_heartrate <-
merge(UIT,
heartrate_seconds_merged_ALPL,
by = c("Users","Id","ActivityLevel",
"ParticipationLevel","Time"),
all.x = TRUE)
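One caveat with `all.x = TRUE`: a reading is kept only if its time stamp matches a grid time stamp exactly. If some readings do not fall on a 5-second boundary (an assumption about the raw data that is worth checking), they would be dropped by the merge. A possible workaround, sketched here on made-up time stamps, is to snap each reading down to the grid before merging.

```r
# Made-up readings; the second and third are off the 5-second grid
raw <- as.POSIXct(c("2016-04-12 07:21:00", "2016-04-12 07:21:03",
                    "2016-04-12 07:21:12"), tz = "UTC")

# Round each time stamp down to the nearest multiple of 5 seconds
snapped <- as.POSIXct(floor(as.numeric(raw) / 5) * 5,
                      origin = "1970-01-01", tz = "UTC")

format(snapped, "%H:%M:%S")
# "07:21:00" "07:21:00" "07:21:10"
```

Snapping can put two readings in the same slot, as the first two do here, so the values would then need to be averaged (or otherwise reduced) per user and slot before the merge.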
Now we can plot the line chart from the regular heart rate data frame and see whether we got rid of those unwanted lines over the missing periods.
Regular_heartrate %>%
ggplot(aes(Time, Value,
color=Users
)) +
geom_line() +
theme(axis.text.x = element_text(angle = 90,vjust=0.5),
strip.background = element_rect(fill = "palegoldenrod")) +
facet_wrap(~Users~ActivityLevel) +
labs(x='Time',
y='Heart Rate Value',
title = "Heart Rate Value vs Time",
subtitle = "By Users, Activity Level")
The chart from the regular heart rate data frame renders the correct plot, with blank parts during the periods where data is absent.
We can use this method for any data frame with inconsistently recorded time: create a data frame with consistent time stamps from beginning to end for every Id, then merge it with the original data frame to make the timeline consistent.
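As a sketch of how the method could be wrapped up for reuse, the helper below builds the grid and does the merge in one step. `regularize_time()` is a hypothetical function written for this post, not part of any package, and the toy data is invented.

```r
# Hypothetical helper: give every id the same regular time grid,
# with NA wherever the original data has no record.
regularize_time <- function(df, id_col, time_col, by = "5 sec") {
  ids   <- setNames(data.frame(unique(df[[id_col]]),
                               stringsAsFactors = FALSE), id_col)
  times <- setNames(data.frame(seq(min(df[[time_col]]),
                                   max(df[[time_col]]), by = by)), time_col)
  grid  <- merge(times, ids, by = NULL)  # cross join: every id x every time stamp
  merge(grid, df, by = c(id_col, time_col), all.x = TRUE)
}

# Toy data: user "a" misses the middle slot, user "b" has only the middle slot
t0 <- as.POSIXct("2016-04-12 00:00:00", tz = "UTC")
toy <- data.frame(Id    = c("a", "a", "b"),
                  Time  = c(t0, t0 + 10, t0 + 5),
                  Value = c(1, 2, 3),
                  stringsAsFactors = FALSE)

out <- regularize_time(toy, id_col = "Id", time_col = "Time")
print(out)
```

The grid here runs from the earliest to the latest time stamp in the whole data frame, which matches the approach above of giving every user the same period.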