Data Cleaning/Prep

We will start with a little data cleaning to get the data into the form we need for this example. You might not have seen these tools before. Don’t worry, you’re not expected to know them yet, but you’ll have seen at least most of them by the end of the semester!

Data description

The bike_full data set has 8 columns:

  1. date, month, day: The year-month-day, month, and day of the month
  2. trips: The number of biking trips recorded that day
  3. distance: Total distance traveled by bike on that day in miles
  4. move_time: Total amount of time that day on bike in minutes
  5. cumul_dist: Running total of distance traveled by bike that month, up to and including that day, in miles
  6. cumul_time: Running total of time spent on bike that month, up to and including that day, in minutes
## # A tibble: 62 × 8
##    date       month   day trips distance move_time cumul_dist cumul_time
##    <date>     <ord> <int> <int>    <dbl>     <int>      <dbl>      <int>
##  1 2023-07-01 July      1     1     2.18        13       2.18         13
##  2 2023-07-02 July      2     2     7.36        46       9.54         59
##  3 2023-07-03 July      3     0     0            0       9.54         59
##  4 2023-07-04 July      4     1    11.2         67      20.7         126
##  5 2023-07-05 July      5     0     0            0      20.7         126
##  6 2023-07-06 July      6     0     0            0      20.7         126
##  7 2023-07-07 July      7     0     0            0      20.7         126
##  8 2023-07-08 July      8     0     0            0      20.7         126
##  9 2023-07-09 July      9     0     0            0      20.7         126
## 10 2023-07-10 July     10     0     0            0      20.7         126
## # ℹ 52 more rows
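
The two cumulative columns are running totals that restart each month. Here is a minimal sketch of how columns like these could be computed with dplyr, using the first four July rows as printed above. It is only an illustration and not necessarily how bike_full was actually built; the name demo is made up for this example.

```r
library(dplyr)

# First four July rows, with values as printed above (so 11.2 is rounded)
demo <- data.frame(
  month     = c("July", "July", "July", "July"),
  day       = 1:4,
  distance  = c(2.18, 7.36, 0, 11.2),
  move_time = c(13L, 46L, 0L, 67L)
)

# group_by(month) makes the running totals restart at the start of each month
demo <- demo %>%
  group_by(month) %>%
  mutate(
    cumul_dist = cumsum(distance),
    cumul_time = cumsum(move_time)
  ) %>%
  ungroup()

demo
```

Notice that the running total on day 1 equals that day's own distance, which matches the first row of the printed data.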

We want to make a graph that displays the cumulative distance traveled by each day of the month, comparing July and August, like the graph seen in Brightspace.

Line graphs

A line graph is a type of graph that uses a line to play “connect the dots” with the data points. It’s not required, but the x-axis on most line graphs represents time in some way.

Let’s create a blank graph below with day on the x-axis and cumul_dist on the y-axis. Have the x-axis label state “Day of the month” and the y-axis “Total distance traveled (mi)”. Uncomment the code below to change the tick marks on the x-axis.

Save the graph as gg_traveled.
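
One way the blank graph could look is sketched below. Since the real bike_full data isn't reproduced here, the sketch uses a stand-in called bike_demo with made-up totals. The tick positions passed to breaks are also an assumption, so the original chunk may use different ones.

```r
library(ggplot2)

# Stand-in for bike_full with made-up totals (not the real data)
bike_demo <- data.frame(
  month      = factor(rep(c("July", "August"), each = 31), levels = c("July", "August")),
  day        = rep(1:31, times = 2),
  cumul_dist = c(2 * (1:31), 3 * (1:31))
)

# No geom yet, so this draws only the axes and labels
gg_traveled <- ggplot(data = bike_demo, mapping = aes(x = day, y = cumul_dist)) +
  labs(x = "Day of the month", y = "Total distance traveled (mi)") +
  # Assumed tick positions; adjust to taste
  scale_x_continuous(breaks = seq(1, 31, by = 5))

gg_traveled
```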

Now that we have a blank graph, how do we add a line or lines? Can we just use geom_smooth()?

Not quite. geom_smooth() fits a smooth trend line across the graph. We want a geom that will connect the left-most dot with the next left-most dot, and so on. So which geom should we use?

It’s not much of a surprise that we should use geom_line()! Add it to the gg_traveled graph created in the previous code chunk:
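
A self-contained sketch of that step, again with a made-up stand-in (bike_demo) in place of bike_full:

```r
library(ggplot2)

# Stand-in for bike_full with made-up totals (not the real data)
bike_demo <- data.frame(
  month      = factor(rep(c("July", "August"), each = 31), levels = c("July", "August")),
  day        = rep(1:31, times = 2),
  cumul_dist = c(2 * (1:31), 3 * (1:31))
)

gg_traveled <- ggplot(data = bike_demo, mapping = aes(x = day, y = cumul_dist)) +
  labs(x = "Day of the month", y = "Total distance traveled (mi)")

# With no grouping, all 62 points are joined into one single line
gg_line <- gg_traveled + geom_line()

gg_line
```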

Uh, that’s not quite what we wanted. Why are there so many vertical lines?

The way geom_line() works is it will connect the dots from left to right. If there are 2 dots with the same x-value, it will draw a vertical line to connect them!

But is that what is going on in our graph?

Instead of geom_line(), add geom_point() to our graph.

We get a better look at the dots being connected now! Since there is a dot for each day in both July and August, by default geom_line() will connect the 1st of July with the 1st of August, then repeat for each day. In the code chunk above, color the dots by month to paint a clearer picture.
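
Here is a sketch of the dots colored by month, using a made-up stand-in (bike_demo) for bike_full. Each day has two dots stacked at the same x-value, one per month, which is exactly where the vertical lines come from.

```r
library(ggplot2)

# Stand-in for bike_full with made-up totals (not the real data)
bike_demo <- data.frame(
  month      = factor(rep(c("July", "August"), each = 31), levels = c("July", "August")),
  day        = rep(1:31, times = 2),
  cumul_dist = c(2 * (1:31), 3 * (1:31))
)

gg_traveled <- ggplot(data = bike_demo, mapping = aes(x = day, y = cumul_dist)) +
  labs(x = "Day of the month", y = "Total distance traveled (mi)")

# Mapping color inside aes() colors each dot by its month
gg_points <- gg_traveled + geom_point(mapping = aes(color = month))

gg_points
```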

So how do we fix it?

Multiple lines

From what we saw when we were making bar charts, ggplot() will form groups in the data whenever an aesthetic is mapped to a categorical variable. So how does that help us here? geom_line() will only play connect-the-dots with points in the same group. So if we have ggplot() form groups in the data, it will draw multiple lines with a single geom_line() call. We just need to map an appropriate aesthetic to the column(s) that form the groups!

Some choices are:

  • color: the most popular aesthetic to use
  • linetype: changes how the line is drawn - solid, dashed, dotted, etc…
  • group: won’t change the color or linetype of the line, but will draw a separate line for each group.
    • Useful when constructing “spaghetti plots”. See the pdf in the internet usage file in Brightspace for a more detailed description.

Let’s map color to month and see what happens!
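
A sketch of this mapping with a made-up stand-in (bike_demo) for bike_full. Because month is categorical, mapping it to color should split the data into two groups, so geom_line() draws one line per month.

```r
library(ggplot2)

# Stand-in for bike_full with made-up totals (not the real data)
bike_demo <- data.frame(
  month      = factor(rep(c("July", "August"), each = 31), levels = c("July", "August")),
  day        = rep(1:31, times = 2),
  cumul_dist = c(2 * (1:31), 3 * (1:31))
)

# color = month forms one group per month, so we get two separate lines
gg_color <- ggplot(data = bike_demo,
                   mapping = aes(x = day, y = cumul_dist, color = month)) +
  labs(x = "Day of the month", y = "Total distance traveled (mi)") +
  geom_line()

gg_color
```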

Try changing color to linetype and see what changes!
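
The same idea with linetype, sketched on a made-up stand-in (bike_demo) for bike_full. The groups are formed the same way; only the visual cue that separates the two lines changes.

```r
library(ggplot2)

# Stand-in for bike_full with made-up totals (not the real data)
bike_demo <- data.frame(
  month      = factor(rep(c("July", "August"), each = 31), levels = c("July", "August")),
  day        = rep(1:31, times = 2),
  cumul_dist = c(2 * (1:31), 3 * (1:31))
)

# linetype = month still forms two groups, but both lines share one color
gg_linetype <- ggplot(data = bike_demo,
                      mapping = aes(x = day, y = cumul_dist, linetype = month)) +
  labs(x = "Day of the month", y = "Total distance traveled (mi)") +
  geom_line()

gg_linetype
```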

To make our graph look a little more like the one created by Strava, let’s change the line colors to black for July and dark orange for August, using the correct scale_{aesthetic}_{type}() function.
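
Since the aesthetic is color and we are picking the colors by hand, one function that fits the pattern is scale_color_manual(). Below is a sketch with a made-up stand-in (bike_demo) for bike_full. The color name "darkorange" is R's built-in spelling and is assumed here to be the intended shade.

```r
library(ggplot2)

# Stand-in for bike_full with made-up totals (not the real data)
bike_demo <- data.frame(
  month      = factor(rep(c("July", "August"), each = 31), levels = c("July", "August")),
  day        = rep(1:31, times = 2),
  cumul_dist = c(2 * (1:31), 3 * (1:31))
)

gg_strava <- ggplot(data = bike_demo,
                    mapping = aes(x = day, y = cumul_dist, color = month)) +
  labs(x = "Day of the month", y = "Total distance traveled (mi)") +
  geom_line() +
  # Naming the values ties each color to a month regardless of level order
  scale_color_manual(values = c("July" = "black", "August" = "darkorange"))

gg_strava
```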

Caution about geom_line() and groups

One caution when working with lines: make sure there aren’t too many groups. Otherwise, geom_line() may not draw the lines you expect!

Let’s look over the example below:

What if I wanted all 31 days to appear on the x-axis? One quick way I could do that is by mapping the x-axis to factor(day) instead of day:
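
A sketch of that change on a made-up stand-in (bike_demo) for bike_full. Running it should print a message like the one below, though the exact wording may vary with your ggplot2 version.

```r
library(ggplot2)

# Stand-in for bike_full with made-up totals (not the real data)
bike_demo <- data.frame(
  month      = factor(rep(c("July", "August"), each = 31), levels = c("July", "August")),
  day        = rep(1:31, times = 2),
  cumul_dist = c(2 * (1:31), 3 * (1:31))
)

# x and color are now BOTH categorical, so a group is formed for every
# month-day combination: 62 groups with a single row each
gg_factor <- ggplot(data = bike_demo,
                    mapping = aes(x = factor(day), y = cumul_dist, color = month)) +
  labs(x = "Day of the month", y = "Total distance traveled (mi)") +
  geom_line() +
  geom_point()

gg_factor
```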

## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?

Uh oh, what happened?

The warning message gives a hint: “Each group consists of only one observation. Do you need to adjust the group aesthetic?”

What that means is that, since both x and color are now mapped to categorical columns, groups are formed for each combination of month and day. And since each combination of month and day has only 1 row, geom_line() doesn’t have 2 points to play connect-the-dots with! geom_point() is unaffected since it just places a dot at each (x, y) coordinate. But geom_line() needs at least 2 points in the same group before it can connect anything!

So what could we do? We can use the group aesthetic to try and fix it!

Including group = month inside the aes() of either ggplot() or geom_line() overrides the default groups, so that groups are defined by month only.
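
A sketch of the fix on a made-up stand-in (bike_demo) for bike_full, with group = month placed inside ggplot():

```r
library(ggplot2)

# Stand-in for bike_full with made-up totals (not the real data)
bike_demo <- data.frame(
  month      = factor(rep(c("July", "August"), each = 31), levels = c("July", "August")),
  day        = rep(1:31, times = 2),
  cumul_dist = c(2 * (1:31), 3 * (1:31))
)

# group = month replaces the default month-by-day grouping,
# so there are only two groups again and each one gets its own line
gg_fixed <- ggplot(data = bike_demo,
                   mapping = aes(x = factor(day), y = cumul_dist,
                                 color = month, group = month)) +
  labs(x = "Day of the month", y = "Total distance traveled (mi)") +
  geom_line() +
  geom_point()

gg_fixed
```

Putting aes(group = month) inside geom_line() instead would change the grouping for the line layer only, which is all we need here since the points don't care about groups.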