What does a data geek do on his day off? Play around with a dataset that’s in need of inspection. This is a history of my weight, which I’ve been measuring every morning since I got home from the hospital after heart surgery in July 2015.
First I need to load some extra libraries on top of base R.
library(tidyverse)
library(lubridate)
library(plotly)
library(ggbeeswarm)
Now I need to read the data and convert the date, which was read in as a string, to a real date.
DailyWt <- read_csv("DailyWt.csv", col_types = cols(Date = col_character()))
DailyWt %>%
mutate(dateDt = dmy(Date)) -> DailyWt
That’s enough to have a first look at the data using ggplot2.
str(DailyWt)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1516 obs. of 3 variables:
## $ Date : chr "15-Jul-15" "16-Jul-15" "17-Jul-15" "18-Jul-15" ...
## $ Weight: num 219 218 218 215 212 ...
## $ dateDt: Date, format: "2015-07-15" "2015-07-16" ...
DailyWt %>% ggplot(aes(x=dateDt,y=Weight)) + geom_point()
## Warning: Removed 279 rows containing missing values (geom_point).
That rapid decrease at the beginning is the dumping of excess fluid left over from the surgery. I want to remove that spike because the part above 200 pounds is taking up half of the vertical space and is not useful for the questions I want to ask of the data. Getting rid of the spike is easy with the dplyr filter command. Note that we have a lot more vertical space for the important data. There’s also a lot of empty space at the end where the spreadsheet has a lot of rows empty with nothing but dates.
DailyWt %>% filter(Weight < 205) -> DailyWt
DailyWt %>% ggplot(aes(x=dateDt,y=Weight)) + geom_point()
Now I want to add some additional variables for the questions I want to answer.
DailyWt %>%
mutate(mo = month(dateDt),
yr = year(dateDt),
wday = weekdays(dateDt),
lagwt = lag(Weight),
gain = Weight - lagwt
) %>%
filter(!is.na(diff)) -> DailyWt
## Warning in is.na(diff): is.na() applied to non-(list or vector) of type
## 'closure'
Now let’s look at a revised version of the first plot with some enhancements.
I reduced the size of the points.
I added a loess smoothing curve. The span of .05 is the result of trial and error to give me the degree of smoothing I wanted.
I made the graph interactive using plotly.
wtPlot = DailyWt %>%
ggplot(aes(x = dateDt,y = Weight)) +
geom_point(size=.3) +
geom_smooth(method = "loess",span=.05)
ggplotly(wtPlot)
This plot suggest that there may be some seasonality. To look at this, we can repeat the same plot separately for each year, stacking the graphs in a column.
wtPlot = DailyWt %>% filter(yr >= 2016) %>%
ggplot(aes(x = dateDt,y = Weight)) +
geom_point(size=.3) +
geom_smooth(method = "loess",span=.05) +
facet_wrap(~yr,scales = "free",ncol=1)
ggplotly(wtPlot)
There is a hint of seasonality with an upturn in the fall.
Now I want to ask some more detailed questions about the ups and downs I’ve been experiencing since the beginning of 2016. I’ll call this subset of the data “recent.”
recent = filter(DailyWt,yr > 2015 )
summary(recent)
## Date Weight dateDt mo
## Length:1067 Min. :186.5 Min. :2016-01-01 Min. : 1.000
## Class :character 1st Qu.:192.7 1st Qu.:2016-09-23 1st Qu.: 3.000
## Mode :character Median :196.0 Median :2017-06-17 Median : 6.000
## Mean :195.0 Mean :2017-06-17 Mean : 6.373
## 3rd Qu.:197.3 3rd Qu.:2018-03-10 3rd Qu.: 9.000
## Max. :200.8 Max. :2018-12-02 Max. :12.000
## yr wday lagwt gain
## Min. :2016 Length:1067 Min. :186.5 Min. :-3.100000
## 1st Qu.:2016 Class :character 1st Qu.:192.7 1st Qu.:-0.700000
## Median :2017 Mode :character Median :196.0 Median : 0.000000
## Mean :2017 Mean :195.0 Mean : 0.009934
## 3rd Qu.:2018 3rd Qu.:197.3 3rd Qu.: 0.700000
## Max. :2018 Max. :200.8 Max. : 3.700000
The variable gain is the amount of weight I have gained (or lost if negative) each day. Is the mean or median gain different in different months? Do the months differ in other ways. There are a few different plots that would be useful. The first is a simple side-by-side boxplot, with a horizontal line at zero for reference.
recent %>% ggplot(aes(x=factor(mo),y=gain)) + geom_boxplot() + geom_hline(yintercept = 0,color="purple")
My observations:
1.In most months, the median gain is close to zero.
March and April have the greatest median values of gain, while September has the lowest.
March and October have a larger mid-range than the other months.
November and December have median gain values of nearly zero. I find this result curious.
The beeswarm and violin plots are also useful for comparing distributions.
recent %>% ggplot(aes(x=factor(mo),y=gain)) + geom_quasirandom(size=.2)
recent %>%
ggplot(aes(x=factor(mo),y=gain)) + geom_violin()