What does a data geek do on his day off? Play around with a dataset that’s in need of inspection. This is a history of my weight, which I’ve been measuring every morning since I got home from the hospital after heart surgery in July 2015.

First I need to load some extra libraries on top of base R.

library(tidyverse)
library(lubridate)
library(plotly)
library(ggbeeswarm)

Now I need to read the data and convert the date, which was read in as a string, to a real date.

DailyWt <- read_csv("DailyWt.csv", col_types = cols(Date = col_character()))
DailyWt %>% 
  mutate(dateDt = dmy(Date)) -> DailyWt

That’s enough to have a first look at the data using ggplot2.

str(DailyWt)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1516 obs. of  3 variables:
##  $ Date  : chr  "15-Jul-15" "16-Jul-15" "17-Jul-15" "18-Jul-15" ...
##  $ Weight: num  219 218 218 215 212 ...
##  $ dateDt: Date, format: "2015-07-15" "2015-07-16" ...
DailyWt %>% ggplot(aes(x=dateDt,y=Weight)) + geom_point() 
## Warning: Removed 279 rows containing missing values (geom_point).

That rapid decrease at the beginning is the dumping of excess fluid left over from the surgery. I want to remove that spike because the part above 200 pounds is taking up half of the vertical space and is not useful for the questions I want to ask of the data. Getting rid of the spike is easy with the dplyr filter command. Note that we have a lot more vertical space for the important data. There’s also a lot of empty space at the end where the spreadsheet has a lot of rows empty with nothing but dates.

DailyWt %>% filter(Weight < 205) -> DailyWt
DailyWt %>% ggplot(aes(x=dateDt,y=Weight)) + geom_point() 

Now I want to add some additional variables for the questions I want to answer.

DailyWt %>% 
  mutate(mo = month(dateDt),
         yr = year(dateDt),
         wday = weekdays(dateDt),
         lagwt = lag(Weight),
         gain = Weight - lagwt
         ) %>% 
  filter(!is.na(diff)) -> DailyWt
## Warning in is.na(diff): is.na() applied to non-(list or vector) of type
## 'closure'

Now let’s look at a revised version of the first plot with some enhancements.

  1. I reduced the size of the points.

  2. I added a loess smoothing curve. The span of .05 is the result of trial and error to give me the degree of smoothing I wanted.

  3. I made the graph interactive using plotly.

wtPlot = DailyWt %>% 
  ggplot(aes(x = dateDt,y = Weight)) + 
  geom_point(size=.3) +
  geom_smooth(method = "loess",span=.05)
ggplotly(wtPlot)

This plot suggest that there may be some seasonality. To look at this, we can repeat the same plot separately for each year, stacking the graphs in a column.

wtPlot = DailyWt %>% filter(yr >= 2016) %>% 
  ggplot(aes(x = dateDt,y = Weight)) + 
  geom_point(size=.3) +
  geom_smooth(method = "loess",span=.05) +
  facet_wrap(~yr,scales = "free",ncol=1)
ggplotly(wtPlot)

There is a hint of seasonality with an upturn in the fall.

Now I want to ask some more detailed questions about the ups and downs I’ve been experiencing since the beginning of 2016. I’ll call this subset of the data “recent.”

recent = filter(DailyWt,yr > 2015  )
summary(recent)
##      Date               Weight          dateDt                 mo        
##  Length:1067        Min.   :186.5   Min.   :2016-01-01   Min.   : 1.000  
##  Class :character   1st Qu.:192.7   1st Qu.:2016-09-23   1st Qu.: 3.000  
##  Mode  :character   Median :196.0   Median :2017-06-17   Median : 6.000  
##                     Mean   :195.0   Mean   :2017-06-17   Mean   : 6.373  
##                     3rd Qu.:197.3   3rd Qu.:2018-03-10   3rd Qu.: 9.000  
##                     Max.   :200.8   Max.   :2018-12-02   Max.   :12.000  
##        yr           wday               lagwt            gain          
##  Min.   :2016   Length:1067        Min.   :186.5   Min.   :-3.100000  
##  1st Qu.:2016   Class :character   1st Qu.:192.7   1st Qu.:-0.700000  
##  Median :2017   Mode  :character   Median :196.0   Median : 0.000000  
##  Mean   :2017                      Mean   :195.0   Mean   : 0.009934  
##  3rd Qu.:2018                      3rd Qu.:197.3   3rd Qu.: 0.700000  
##  Max.   :2018                      Max.   :200.8   Max.   : 3.700000

The variable gain is the amount of weight I have gained (or lost if negative) each day. Is the mean or median gain different in different months? Do the months differ in other ways. There are a few different plots that would be useful. The first is a simple side-by-side boxplot, with a horizontal line at zero for reference.

recent %>% ggplot(aes(x=factor(mo),y=gain)) + geom_boxplot() + geom_hline(yintercept = 0,color="purple")

My observations:

1.In most months, the median gain is close to zero.

  1. March and April have the greatest median values of gain, while September has the lowest.

  2. March and October have a larger mid-range than the other months.

  3. November and December have median gain values of nearly zero. I find this result curious.

The beeswarm and violin plots are also useful for comparing distributions.

recent %>%  ggplot(aes(x=factor(mo),y=gain)) + geom_quasirandom(size=.2)

recent %>% 
  ggplot(aes(x=factor(mo),y=gain)) + geom_violin()