Combine Data

Joining the Datasets

We want one master data file that contains data from all of the three sources. We can combine these data files as long as we have a unique, common element between the three. And we do: date. So we will use date as the key to pull our three datasets together.

First read in all your data. Change any file names if necessary.

library(tidyverse)

pge <- read_csv("pge_full_data.csv")
noaa <- read_csv("pdx_daily_noaa.csv")
extreme <- read_csv("pdx_climate_toolbox.csv")

Examine each dataset. Is the date column labeled the same way across all three? Is it the same data format, meaning numerical, character, etc.? We can join datasets so long as the column type is the same, even if they are named differently.

In the pge dataset the date is called period. It is a <date> format. In the noaa dataset the date is called date. It is a <date> format. In the extreme dataset the date is called date. It is a <date> format.

Next, we want to have some expectation of what our data will look like when combined, in order to know if our combine is successful.

See how many rows each dataset has. Are the dates all unique? Use the unique() function nested inside the length() function to find out.

This will look like length(unique(dataset$date_column)).

From using glimpse() or looking at the Environment pane you should see: pge: 2579 rows (3675 after we add the data from 2016-2018); length(unique(pge$period)) = 3675 noaa: 3493 rows; length(unique(noaa$date)) = 3493 extreme: 85891 rows; length(unique(extreme$date)) = 44195

Think

Do these numbers make sense? What length do we expect our combined dataset to be? Using max() or min() on the date columns from each dataset might help you think about it.

The extreme dataset spans a much larger time frame, so both of the smaller datasets will fit into it and we should have a final dataset with 85891 rows.

Now let’s combine the data using the full_join() function. This will keep all data from both sets. We can only join two things at a time, so we’ll have to do two joins. For more about joins, see this workshop. The gist is that we use the common value as a key to match up the data. For us, we’ll use the date columns.

full <- pge |>
  full_join(noaa, by = join_by("period" == "date")) |>
  rename(date = period) |>    # date makes more sense than period, so we'll change the name
  full_join(extreme, by = "date")

Examine your dataset. You can use the function select() if you want to move columns around or remove any and rename() if you want to change any column names.

full <- full |>
  select(-Year) |>           # we don't need two year columns
  select(date, year, Month, Day, everything()) |> # move things around so all the date stuff is at beginning
  rename(month = Month,      # consistent capitalization
         day = Day,          # consistent capitalization
         energy = value)     # energy usage column

Note the use of everything() to grab all remaining columns.

Now let’s make a plot to see if our data makes sense. The first one we can do is just a scatterplot to see if temp falls between tmin and tmax. Note how we plot multiple columns on one plot:

ggplot(full, aes(x = date)) +
  geom_point(aes(y = tmax), color = "red", alpha = 0.2) +
  geom_point(aes(y = tmin), color = "blue",  alpha = 0.2) +
  geom_point(aes(y = temp), color = "yellow",  alpha = 0.2) 

So much overlapping data obscures the seasonal patternes, but looks good overall. Our temp data shows up where it should and we can see that our timeline looks correct.

Save the Comined Data

In the same way we saved previous data, save the combined data to a dataset called full_data.csv.

write_csv(full, “full_data.csv”)