Shaping Data for Visuals

Note: In RStudio, I recommend viewing this notebook in the “Visual” style. You can view this notebook here if the images are not loading properly for you.

Reading and cleaning the data

What we’re going to go for here is a graph showing how generation of wind has changed over time by state. We want each state to have its own line, year to be on the x axis and generation to be on the y.

I downloaded the data here, from the U.S. Energy Information Administration.

I am first going to do one analysis piece which is to filter for just those states who are generating at least 1,000 thousand megawatt hours in 2024 (yes a thousand thousand..it’s confusing). This will allow us to narrow in just among those who have more substantial wind generation and make the graph less cluttered. I’ll do some other basic cleaning stuff here too

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(stringr)
library(tidyr)

df <- read.csv('Net_generation_for_all_sectors.csv', skip = 4)

# just fyi, i couldn't get read.csv to work at first. the read_csv from the readr package can sometimes help. it still didn't read it in well, but it helped me see what the problem was.
#library(readr)
#df <- read_csv('Net_generation_for_all_sectors.csv', skip = )

What does the data look like?

df

df2 <- df %>%
  #I skipped 4 rows above bc EIA puts some (potentially useful) metadata there. but there was still one more row BELOW the header row that I now need to filter out
  filter(description != "Pinned series") %>% 
  # I'm going to remove the X from the column names 
  rename_with(~str_remove(., "^X"), starts_with("X")) %>%
  # convert the columns to numeric 
  mutate(across(c(`2001`:`2024`), as.numeric)) %>%
  # filter for just bigger players
  filter(`2024` >= 1000) %>%
  # it looks like all of the NAs are in years leading up to a small amount of generation - I think it is safe to convert these NAs to 0 
  mutate(across(c(`2001`:`2024`), ~replace_na(., 0))) %>%
  #there was also a fair number of rows I don't need - like solar generation 
  # this gets it down to net wind generation 
  filter(str_starts(description, "Net generation : wind ")) %>%
  # finally, there are a few totaled regional or nationwide elements to get rid of 
  filter(!str_detect(description, "total")) %>%
  filter(!str_detect(description, "United")) %>%
  # finally I am extracting the state name 
  mutate(state = word(description, 3, sep = " : ")) %>%
  # since i used AI to help me get that previous line of code, i'm going to double check that it extracted the state names correctly 
  select(state, everything()) %>%
  # looks good so going to get rid of columns i don't need 
  select(state, `2001`:`2024`)

## Warning: There were 24 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `across(c(`2001`:`2024`), as.numeric)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 23 remaining warnings.

So how do we need to shape this data so that Flourish or Datawrapper can read it in correctly?

Let’s start with Flourish.

What I like about Flourish is that when you pick your chart type, you have the example data pre-loaded, which helps you understand how your data needs to be formatted.

For example, for the kind of plot we are going for, it looks like this:

And if we hit the Data tab we see:

So, we need years to be in one column, instead of a year for each column as we have now.

Since we are taking many columns and turning them into one long column, this is going to be a pivot longer, from tidyr.

df_flourish <- df2 %>%
  # I put the ` `  around the column name because the column name is only numbers. this is how R distinguishes it as a column name and not a number. 
  pivot_longer(cols = c(`2001`:`2024`), 
               # the next line tells it what I would like to call the new column where all of the current column names will go (i.e. the 2001-2024 headers)
               names_to = "Year",
               # the next line tells it what I would like to call the column where all of the data values will go. I often just call this value 
               values_to = "Value")

Let’s see how it looks now

df_flourish

We’ve got years in one column now, great! If we look back at how Flourish structures it, we see that each state needs to be its own column. So it’s going to be a pivot_wider situation.

df_flourish_2 <- df_flourish %>%
  # first we say where we want the new column headings to come from
  pivot_wider(names_from = "state", 
              # then we say which column we want the data to be taken from
              values_from = "Value")

Let’s take a look

df_flourish_2

Ok this is looking good ! I am going to export and plot in Flourish.

write.csv(df_flourish_2, 'wind_states.csv')

We will go over working in Flourish and Datawrapper next week. I’ve plotted it here to show you, but I haven’t done anything to make it a good graphic yet. It’s pretty terrible with how crowded it is. Here’s a screenshot.

Sometimes I plot graphics like this just to explore them, since that is harder to do in an interactive way in R at this point.

Ok so now Datawrapper

The difference with datawrapper is that you have to upload your data without seeing what kind of format it should be in.

Here’s what it looks like if I try to upload our dataset as it was in df2.

And if I proceed to graph it:

Datawrapper does have this handy transpose button, in two spots which I circled in purple.

Here’s what it looks like if I click that.

It’s closer… but the headers are in the data instead of being headers.

Let’s still see how it looks when we try to graph it.

Getting closer ! It does seem like from this exercise that the format we used for Flourish will also work in this case for datawrapper. I will upload that and see.

Ok now we are getting there! Again, this is not ready for publication, but we’ve got the basic workflow.

Note: Although you can use the transpose button in Datawrapper, I highly recommend getting your data completely into shape in R so you don’t have to do any manipulation in Datawrapper. This would include column names too when applicable. If you have to edit your data, it makes it so much simpler than trying to later remember what you did in R vs. Datawrapper. Plus, all of your manipulation is fully documented via the code.

Oh, one more thing. When you use write.csv, R includes an index column. This is just a column at the beginning that counts up from 1 for every row. It can be handy, but it can also get in the way. For Datawrapper in particular. To remove this index column, use the row.names = FALSE argument in the write function, like so:

write.csv(df_flourish_2, 'wind_states.csv', row.names = FALSE)

Next week we will work on making them pub ready.

What questions does this make you wonder about?