emissions in python

I have been learning python via the Posit Academy for 5 weeks now, and I am going to set myself a little challenge to check how much I have learned. I am going to pick a Tidy Tuesday dataset and see if I can reproduce, in Python, the process I would typically follow in R.

This data comes from the Tidy Tuesday challenge and contains information about the sources of carbon emissions. Let’s see if we can reproduce this plot from https://carbonmajors.org/

load packages

In R, you call library(nameofpackage) to load all the functions in that package. After that, you don’t need to tell R which package a function comes from; it just knows.

In python, loading packages is a bit different. In some cases you need to “namespace” a function, so to save on typing, it is a good idea to import a package with an alias. Here I am importing pandas as “pd” so that down the track I can use pd.read_csv (for example).

You can also import just specific functions from a package. Here I am importing just the subset of plotnine functions that I know I will need (well, let’s be honest, I come back and add to this list as I make my plot, because I don’t really know what I need in advance).

import pandas as pd

from plotnine import ggplot, aes, labs, geom_line, geom_area, scale_y_continuous, scale_x_continuous, theme_classic, scale_fill_manual, theme
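To see the two import styles side by side without needing any extra packages, here is a minimal sketch using only the standard library. The modules and variable names are just for illustration; they are not part of the emissions analysis.

```python
# Style 1: import the whole module under a short alias,
# then "namespace" every call with that alias.
import math as m

# Style 2: import only the names you need,
# then call them with no prefix at all.
from statistics import mean, median

values = [1, 4, 9, 16]

root_of_last = m.sqrt(values[-1])  # namespaced call via the alias
average = mean(values)             # bare call, no prefix needed
middle = median(values)

print(root_of_last, average, middle)
```

The pandas and plotnine imports above follow the same pattern: pd.read_csv is the namespaced style, and ggplot() or aes() with no prefix is the from-import style.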

read the data

Here I am reading in the emissions data from the Tidy Tuesday github and getting a sense for the variables in this dataset using the .info method. It is a bit like the glimpse function in R, but not quite as good because it doesn’t give you a preview of the values, but oh well.

emissions = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-05-21/emissions.csv')

emissions.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12551 entries, 0 to 12550
Data columns (total 7 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   year                    12551 non-null  int64  
 1   parent_entity           12551 non-null  object 
 2   parent_type             12551 non-null  object 
 3   commodity               12551 non-null  object 
 4   production_value        12551 non-null  float64
 5   production_unit         12551 non-null  object 
 6   total_emissions_MtCO2e  12551 non-null  float64
dtypes: float64(2), int64(1), object(4)
memory usage: 686.5+ KB
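If you miss the value preview that glimpse gives you, one possible workaround is to build a small summary table yourself. This is only a sketch on a made-up three-row dataframe (toy and glimpse_like are hypothetical names), not a built-in pandas feature.

```python
import pandas as pd

# Hypothetical toy data standing in for the emissions table.
toy = pd.DataFrame({
    "year": [1854, 1855, 1856],
    "commodity": ["Bituminous Coal", "Bituminous Coal", "Oil & NGL"],
    "total_emissions_MtCO2e": [0.099198, 0.128996, 0.158793],
})

# One row per column: its dtype plus the first few values as text,
# which is roughly what glimpse() shows in R.
glimpse_like = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "preview": [", ".join(map(str, toy[col].head(3))) for col in toy.columns],
})

print(glimpse_like)
```

Calling toy.head() is the quicker option when you only want to eyeball the first few rows.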

select

I think I only need year, commodity and emissions to make this plot, so I am going to select just those variables. There isn’t a select function in pandas per se, but you can use square brackets to select a set of variables. I am renaming the emissions variable at the same time to make it easier to type. The curly brackets in the .rename call are what python people call a dictionary, which stores key-value pairs (here, the old column name and the new one).

emissions_select = emissions[['year', 'commodity', 'total_emissions_MtCO2e']] 

emissions_renamed = emissions_select.rename(columns={'total_emissions_MtCO2e': 'total_emissions'})

emissions_renamed.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12551 entries, 0 to 12550
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   year             12551 non-null  int64  
 1   commodity        12551 non-null  object 
 2   total_emissions  12551 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 294.3+ KB
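To make the dictionary idea a bit more concrete, here is a small sketch on made-up data. The names toy, keep and new_names are hypothetical; the column names are copied from the emissions table.

```python
import pandas as pd

# A dictionary maps keys to values; here, old column names to new ones.
new_names = {"total_emissions_MtCO2e": "total_emissions"}
print(new_names["total_emissions_MtCO2e"])  # look up a value by its key

# Hypothetical toy data with the same column names as the emissions table.
toy = pd.DataFrame({
    "year": [1854, 1855],
    "parent_entity": ["A", "B"],
    "commodity": ["Bituminous Coal", "Oil & NGL"],
    "total_emissions_MtCO2e": [0.1, 0.2],
})

# The double square brackets are really two things: the outer pair indexes
# the dataframe, the inner pair is a list of the columns to keep.
keep = ["year", "commodity", "total_emissions_MtCO2e"]
selected = toy[keep]

# Columns that are not mentioned in the dictionary keep their existing names.
renamed = selected.rename(columns=new_names)
print(list(renamed.columns))
```

Neither step changes toy itself; each one returns a new dataframe, which is why the result needs to be assigned to a name.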

Now you can “pipe” these kinds of operations together with what python peeps call method chaining. Instead of creating a new dataframe for each step, you can put multiple operations together like this. You have to put parentheses around the whole chain so that python knows the expression continues from one line to the next.

emissions_select_rename = (emissions
                          [['year', 'commodity', 'total_emissions_MtCO2e']] 
                          .rename(columns={'total_emissions_MtCO2e': 'total_emissions'})
                          )
                          
emissions_select_rename.info()                         
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12551 entries, 0 to 12550
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   year             12551 non-null  int64  
 1   commodity        12551 non-null  object 
 2   total_emissions  12551 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 294.3+ KB
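The same chaining pattern extends to other dplyr-style steps. The sketch below runs on made-up data, and the pairing of each pandas method with a dplyr verb is my own rough analogy rather than an official mapping.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the emissions table.
toy = pd.DataFrame({
    "year": [2020, 2021, 2022, 2022],
    "commodity": ["Cement", "Oil & NGL", "Cement", "Natural Gas"],
    "total_emissions_MtCO2e": [10.0, 20.0, 30.0, 40.0],
})

chained = (
    toy
    [["year", "commodity", "total_emissions_MtCO2e"]]                # roughly select()
    .rename(columns={"total_emissions_MtCO2e": "total_emissions"})   # roughly rename()
    .query("year >= 2021")                                           # roughly filter()
    .assign(total_emissions_Gt=lambda d: d["total_emissions"] / 1000)  # roughly mutate()
    .sort_values("total_emissions", ascending=False)                 # roughly arrange(desc())
    .reset_index(drop=True)                                          # renumber rows from 0
)

print(chained)
```

Each method returns a new dataframe, which is what lets the next method attach to it with a dot.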

summarise

I think the total emissions line in the plot will require me to sum the emissions across all the different sources. Here I am grouping the data by year, selecting the total_emissions column, and using .sum() to get the total emissions per year. Adding reset_index turns year from the index back into a regular column, numbers the rows from 0 again, and makes the output a useful dataframe. I like how the default behaviour when printing a dataframe is to give you the head and the tail.

emissions_by_year = (emissions_select_rename
  .groupby('year')
  ['total_emissions'].sum()
  .reset_index()
  
)

emissions_by_year
     year  total_emissions
0    1854         0.099198
1    1855         0.128996
2    1856         0.158793
3    1857         0.184580
4    1858         0.210367
..    ...              ...
164  2018     35731.881967
165  2019     36397.644590
166  2020     34926.134998
167  2021     36125.590504
168  2022     37733.456332

169 rows × 2 columns
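To see what reset_index is actually doing, it helps to look at the result with and without it. This is a sketch on made-up numbers; as_index=False is shown as one alternative I am aware of, not as what the code above does.

```python
import pandas as pd

# Hypothetical toy data: two rows for 1854 and one for 1855.
toy = pd.DataFrame({
    "year": [1854, 1854, 1855],
    "total_emissions": [1.0, 2.0, 4.0],
})

# Without reset_index(), the result is a Series whose index is the grouping column.
as_series = toy.groupby("year")["total_emissions"].sum()

# reset_index() moves 'year' back into an ordinary column and numbers the rows from 0.
as_frame = as_series.reset_index()

# Passing as_index=False to groupby is another way to keep 'year' as a column.
same_frame = toy.groupby("year", as_index=False)["total_emissions"].sum()

print(as_series)
print(as_frame)
```

plotnine wants year as a real column so that aes(x = 'year') can find it, which is why the dataframe version is the useful one here.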

plot geom line

OK now plotting…. plotnine is essentially ggplot converted into python so the syntax is very familiar.

A few plotnine specific differences:

  • you can’t pipe the data in, so need to use the data= argument
  • need to spell out mapping = aes(), rather than using the shortcut data %>% ggplot(aes())
  • x, y, fill variables need quotes
  • wrap the whole thing in () so python knows to keep executing line to line

Here I am plotting emissions by year, colouring the line red. I have changed the y axis labels from 0 - 30000 to 0 - 30k, by manually setting breaks and labels using a list. Also updating the x axis to display every 20 years, adding a theme, x and y axis labels.

The only thing I didn’t end up managing to work out here was the gridlines. In R you can add gridlines using theme(panel.grid.major.y = element_line etc etc). I think my problem here stemmed from not really getting which “functions” I need to import from plotnine in order to make that happen.

(
  ggplot(data = emissions_by_year,
         mapping = aes(x = 'year', y = 'total_emissions')) +
  geom_line(colour = "red") +
  theme_classic() +
  labs(y = "Emissions (MtCO2)", x = "Year") +
  scale_y_continuous(breaks = [10000, 20000, 30000], labels = ['10k', '20k', '30k']) +
  scale_x_continuous(breaks = range(1860, 2020, 20))
)

plot geom area

OK the line was easy, what about an area plot? I haven’t made an area plot before, so let’s try geom_area(). I am going to go back to the emissions_select_rename dataframe, because it still has information about commodity.

(

  ggplot(data = emissions_select_rename, mapping= aes(x = 'year', y = 'total_emissions', fill = 'commodity')) +
  geom_area()
  
)

Eeekkk, this plot is weird…

To do…

  • recode commodity categories into the coal, oil, natural gas and cement
  • summarise emissions by year for each source before plotting

Using unique() to print the list of commodity values, and then copying them into a list of commodities. Then working out how to recode the commodity variable into only 4 categories.

This recode resource was helpful.

The .replace method accepts lists, so I can make a list of commodities and a list of corresponding categories and replace one with the other. Handy!

emissions_select_rename['commodity'].unique()

commodities = ['Oil & NGL', 'Natural Gas', 'Sub-Bituminous Coal',
       'Metallurgical Coal', 'Bituminous Coal', 'Thermal Coal',
       'Anthracite Coal', 'Cement', 'Lignite Coal']
       
categories = ["Oil", "Natural Gas", "Coal", "Coal", "Coal", "Coal", "Coal", "Cement", "Coal"]

emissions_select_rename['category'] = (emissions_select_rename['commodity']
                                      .replace(commodities, categories)
                                  )
emissions_select_rename['category'].unique()
array(['Oil', 'Natural Gas', 'Coal', 'Cement'], dtype=object)
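The two parallel lists work, but they rely on both lists staying in exactly the same order. A possible alternative, sketched here on a made-up Series, is to keep each commodity next to its category in a dictionary and use .map. This is a swapped-in technique, not what the code above does, and the names commodity_to_category and unmapped are hypothetical.

```python
import pandas as pd

# Each commodity sits right next to its category, so the pairs can't drift apart.
commodity_to_category = {
    "Oil & NGL": "Oil",
    "Natural Gas": "Natural Gas",
    "Sub-Bituminous Coal": "Coal",
    "Metallurgical Coal": "Coal",
    "Bituminous Coal": "Coal",
    "Thermal Coal": "Coal",
    "Anthracite Coal": "Coal",
    "Lignite Coal": "Coal",
    "Cement": "Cement",
}

# Hypothetical toy column standing in for emissions_select_rename['commodity'].
commodity = pd.Series(["Oil & NGL", "Thermal Coal", "Cement", "Lignite Coal", "Natural Gas"])
category = commodity.map(commodity_to_category)
print(category.tolist())

# With .map, a value that is missing from the dictionary becomes NaN,
# which makes a forgotten commodity easy to spot.
unmapped = pd.Series(["Peat"]).map(commodity_to_category)
print(unmapped.isna().tolist())
```

One difference to be aware of: .replace leaves values it doesn’t recognise unchanged, whereas .map turns them into missing values.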

Now that we have 4 levels of emissions category, we can group by both year and category and use .sum to summarise the total emissions each year for each category.

emissions_year_category = (emissions_select_rename
  .groupby(['year','category'])
  ['total_emissions'].sum()
  .reset_index()

)

emissions_year_category
     year     category  total_emissions
0    1854         Coal         0.099198
1    1855         Coal         0.128996
2    1856         Coal         0.158793
3    1857         Coal         0.184580
4    1858         Coal         0.210367
..    ...          ...              ...
521  2021          Oil     10262.615104
522  2022       Cement      1197.790637
523  2022         Coal     18024.757438
524  2022  Natural Gas      7768.151843
525  2022          Oil     10742.756414

526 rows × 3 columns

OK try again… this time plot total emissions per year using geom_area with fill by category. This is better!

Things to change about this plot:

  • colour scheme
  • theme_grey is ugly
  • title, x and y axis labels
  • legend move to the bottom
  • order of the categories (coal should be on the bottom)

(

  ggplot(data = emissions_year_category, mapping= aes(x = 'year', y = 'total_emissions', fill = 'category')) +
  geom_area()
  
)

Start by fixing the x and y axis, adding a theme and labels, making the colours match.

Note: wherever you might have used c() in R to make a vector, in plotnine you need to supply a list in square brackets [].

(
  ggplot(data = emissions_year_category, mapping = aes(x = 'year', y = 'total_emissions', fill = 'category')) +
  geom_area() +
  theme_classic() +
  labs(y = "Emissions (MtCO2)", x = "Year") +
  scale_y_continuous(breaks = [10000, 20000, 30000], labels = ['10k', '20k', '30k']) +
  scale_x_continuous(breaks = range(1860, 2020, 20)) +
  scale_fill_manual(values = ["gray", "steelblue", "brown", "black"])
)

aside: how to control the order of a factor in python?

Now this plot looks good but it made me realise that the categories are out of order (I want coal to be on the bottom). I need to convert category to a factor and change the order.

This is confusing because, it turns out, factors are called categoricals (or categories) in python. I am going to change the name of my emissions category variable to “source” first to make this less difficult to get my head around.

emissions_year_source = (emissions_year_category
                          .rename(columns={'category': 'source'})
                        )
                        
emissions_year_source.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   year             526 non-null    int64  
 1   source           526 non-null    object 
 2   total_emissions  526 non-null    float64
dtypes: float64(1), int64(1), object(1)
memory usage: 12.5+ KB

This Stack Overflow thread was helpful…

pd.Categorical

Here I am checking the data type first by printing the source column. It is dtype: object. Then I am using pd.Categorical to convert the source variable into a categorical. This is pretty similar to as.factor().

# check type = object
emissions_year_source['source'] 

# make source a category
emissions_year_source['source'] = pd.Categorical(emissions_year_source['source'])

# check type = category
emissions_year_source['source'] 
0             Coal
1             Coal
2             Coal
3             Coal
4             Coal
          ...     
521            Oil
522         Cement
523           Coal
524    Natural Gas
525            Oil
Name: source, Length: 526, dtype: category
Categories (4, object): ['Cement', 'Coal', 'Natural Gas', 'Oil']

I want the levels to be ordered. You can give pd.Categorical additional arguments to spell out the levels (categories) and set ordered=True.

# make source a category with a particular order

emissions_year_source['source'] = pd.Categorical(emissions_year_source['source'],
                      categories=["Cement", "Natural Gas", "Oil", "Coal"],
                      ordered=True)
          

# check
emissions_year_source['source'] 
0             Coal
1             Coal
2             Coal
3             Coal
4             Coal
          ...     
521            Oil
522         Cement
523           Coal
524    Natural Gas
525            Oil
Name: source, Length: 526, dtype: category
Categories (4, object): ['Cement' < 'Natural Gas' < 'Oil' < 'Coal']
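To check what the ordering actually changes, here is a small sketch on a made-up Series. It also shows .cat.reorder_categories as another route to the same result; that part is an alternative I am adding, not what the code above uses.

```python
import pandas as pd

levels = ["Cement", "Natural Gas", "Oil", "Coal"]

# Hypothetical toy column standing in for emissions_year_source['source'].
source = pd.Series(["Coal", "Oil", "Cement", "Natural Gas", "Coal"])

ordered_source = pd.Series(pd.Categorical(source, categories=levels, ordered=True))

# Sorting now follows the order of the levels rather than the alphabet.
sorted_values = ordered_source.sort_values().tolist()
print(sorted_values)

# Under the hood each value is stored as an integer code pointing into the levels,
# much like the integer codes behind a factor in R.
codes = ordered_source.cat.codes.tolist()
print(codes)

# Alternative: convert to a category first, then reorder the existing levels.
reordered = source.astype("category").cat.reorder_categories(levels, ordered=True)
print(list(reordered.cat.categories))
```

One thing to watch with pd.Categorical: any value that is not in the categories list is turned into a missing value, so a typo in a level name shows up as NaN.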

Final Area Plot

Now that source is categorical and ordered, the geom_area colours match the plot!

(
  ggplot(data = emissions_year_source, mapping = aes(x = 'year', y = 'total_emissions', fill = 'source')) +
  geom_area() +
  theme_classic() +
  scale_y_continuous(breaks = [10000, 20000, 30000], labels = ['10k', '20k', '30k']) +
  scale_x_continuous(breaks = range(1860, 2020, 20)) +
  scale_fill_manual(values = ["gray", "steelblue", "brown", "black"]) +
  labs(y = "Emissions (MtCO2)", x = "Year",
       title = "Carbon Majors & Global Fossil Fuel and Cement Emissions, 1854 - 2022",
       subtitle = "This graph shows carbon dioxide emissions traced to the carbon fuels and cement produced by the Carbon Majors entities and \n compares them to total global fossil fuel and cement emissions.") +
  theme(legend_position = "bottom")
)

Adding the red total emissions line. Here it sits right on top of the geom_area, because it is the sum of the same data; the total line in the original plot must come from a different source.

(
  ggplot() +
  geom_line(data = emissions_by_year,
            mapping = aes(x = 'year', y = 'total_emissions'), colour = "red") +
  geom_area(data = emissions_year_source,
            mapping = aes(x = 'year', y = 'total_emissions', fill = 'source')) +
  theme_classic() +
  scale_y_continuous(breaks = [10000, 20000, 30000], labels = ['10k', '20k', '30k']) +
  scale_x_continuous(breaks = range(1860, 2020, 20)) +
  scale_fill_manual(values = ["gray", "steelblue", "brown", "black"]) +
  labs(y = "Emissions (MtCO2)", x = "Year",
       title = "Carbon Majors & Global Fossil Fuel and Cement Emissions, 1854 - 2022",
       subtitle = "This graph shows carbon dioxide emissions traced to the carbon fuels and cement produced by the Carbon Majors entities and \n compares them to total global fossil fuel and cement emissions.") +
  theme(legend_position = "bottom")
)
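The reason the red line traces the top edge of the area can be checked with a bit of arithmetic. The sketch below uses made-up numbers and hypothetical names (toy, stack_top); it assumes, as the plot suggests, that geom_area stacks the sources on top of each other.

```python
import pandas as pd

# Hypothetical toy data: two sources in each of two years.
toy = pd.DataFrame({
    "year": [2021, 2021, 2022, 2022],
    "source": ["Coal", "Oil", "Coal", "Oil"],
    "total_emissions": [10.0, 5.0, 12.0, 6.0],
})

# The data behind the red line: one total per year.
by_year = toy.groupby("year")["total_emissions"].sum().reset_index()

# The data behind the area: one total per year and source.
by_year_source = toy.groupby(["year", "source"])["total_emissions"].sum().reset_index()

# If the sources are stacked, the top of the stack in each year is the sum
# over sources for that year.
stack_top = by_year_source.groupby("year")["total_emissions"].sum().reset_index()

# Both routes start from the same rows, so they give the same numbers.
print(by_year["total_emissions"].tolist())
print(stack_top["total_emissions"].tolist())
```

So a line that sits above the area, like the one in the original plot, would need a separate dataset of global emissions rather than a sum of these rows.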