Data Visualization with ggplot2

The Grammar of Graphics

ggplot2 is a powerful R package for data visualization based on the Grammar of Graphics. This approach lets you build plots by combining separate building blocks, including data, aesthetic mappings, geometries, and more.

  • Philosophy: create separate building blocks of a diagram and combine these pieces to build any graph you want.
  • Graph: a mapping from the data space to the visual space.
  • Aesthetic: anything that we can “see” in the diagram (axes (position), size, shape, colour, fill, etc.).
  • Graphs are made up of layers: graph = dataset + aesthetics + geoms + facets + statistical transformations + themes

Functions and Parameters in ggplot2

ggplot() produces a graphical object. A plot requires three essential elements: data, aesthetics and geometries.

data = your_data_frame_name specifies a dataframe

aes() describes the aesthetics

geom_xxx() specifies which geometric object will be drawn

There are also four optional elements:

facet_xxx() Faceting splits the data into subsets and displays the same graph for every subset

theme_xxx() Themes influence the “non-ink” part of the figure. For example, you can change the background, the axis text size, or the title

stat_xxx() Statistics let you transform the data (for example, add the mean, median or quartiles)

coord_xxx() Coordinates transform the axes (change the spacing of the displayed data)
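As a rough sketch of how the essential and optional elements stack up, the code below uses the built-in mtcars data. The choice of variables (wt, mpg, cyl) and of the specific functions facet_wrap(), stat_smooth(), coord_cartesian() and theme_bw() is only illustrative; any other member of each family could be used instead.

```r
library(ggplot2)

# Three essential elements: data, aesthetics, geometry
p <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Optional elements, each added with "+"
p_full <- p +
  facet_wrap(~ cyl) +                                        # facets: one panel per cylinder count
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) +  # statistics: fitted linear trend
  coord_cartesian(ylim = c(10, 35)) +                        # coordinates: zoom the y axis
  theme_bw()                                                 # theme: non-data appearance

p_full
```

Because every element is added with "+", you can drop any optional line and the plot still works.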

Aesthetics (aes)

Aesthetics control how data variables map to visual properties, such as:

  • Position (x, y)
  • Color (color = variable_name)
  • Size (size = variable_name)
  • Shape (shape = variable_name)
  • Fill (for bar charts or area plots)

Example:

library(ggplot2)
ggplot(data = mtcars) +  
  aes(x = mpg, y = hp, color = cyl) +  
  geom_point()

# This creates a scatter plot of mpg (miles per gallon) against hp (horsepower), with points colored by the number of cylinders (cyl).
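One detail worth knowing about this example: cyl is stored as a number in mtcars, so ggplot2 colours the points with a continuous gradient. A possible variation, assuming you prefer one distinct colour per cylinder count, is to wrap the variable in factor(). The legend title "Cylinders" is just an illustrative choice.

```r
library(ggplot2)

# factor(cyl) turns the numeric variable into a categorical one,
# so each cylinder count gets its own colour instead of a gradient
p_discrete <- ggplot(data = mtcars) +
  aes(x = mpg, y = hp, color = factor(cyl)) +
  geom_point() +
  labs(color = "Cylinders")

p_discrete
```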

Geometric Objects (geoms)

Each graph type is created using a geom function. Some common ones:

  • Points: geom_point() (for scatter plots)
  • Lines: geom_line() (for line graphs)
  • Bars: geom_bar() (for bar charts)
  • Histograms: geom_histogram() (for distributions)

Example (bar chart):

ggplot(data = diamonds) +  
  aes(x = cut) +  
  geom_bar()

Layers in ggplot2

Plots can have multiple layers, adding elements like:

  • Facets: facet_wrap(~ variable_name) to split plots into subplots
  • Statistical transformations: stat_summary() to apply calculations
  • Themes: theme_minimal() to customize appearance

Example (multiple layers):

ggplot(data = mpg) +  
  aes(x = class, fill = drv) +  
  geom_bar() +  
  facet_wrap(~ year) +  
  theme_minimal()

Dataset for the seminar

For the seminar, you will use a dataset on greenhouse gas emissions from the OECD. You can access the data in the seminar folder on Minerva.

You can find updated emission data in this #link. Export the data as a CSV file and rename it “Greenhouse_gas_emissions_OECD.csv”.

You are asked to perform the following tasks:

Step 1: Import the data into the R environment (I called the object dataset)

(hint: use read.csv() and make sure that the file is in your working directory)

dataset <- read.csv("Greenhouse_gas_emissions_OECD.csv")

Step 2: Understand your dataset

You can run several commands to get a first idea of the dataset, such as:

  • dim(): retrieves the dimensions of an object
  • names(): returns the names of the variables
  • head(): returns the first part of a vector, matrix, table, data frame or function
  • str(): displays the internal structure of an R object
  • summary(): returns descriptive statistics of the variables

names(dataset) returns the following values: COU, Country, POL, Pollutant, VAR, Variable, Year, Unit.Code, Unit, PowerCode.Code, PowerCode, Reference.Period.Code, Reference.Period, Value, Flag.Codes, Flags
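If you want to try these commands before loading the real file, the sketch below applies them to a small made-up data frame. The object toy and all of its values are hypothetical and only stand in for the OECD data.

```r
# Hypothetical three-row data frame standing in for the OECD file
toy <- data.frame(
  Country = c("Belgium", "France", "Spain"),
  Year    = c(2020, 2020, 2020),
  Value   = c(100.5, 390.2, 270.1)
)

dim(toy)      # number of rows and columns
names(toy)    # column names
head(toy, 2)  # first two rows
str(toy)      # type of each column
summary(toy)  # descriptive statistics per column
```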

Step 3: Subset your Dataset

From the variables listed above, select only Country, Pollutant, Variable, Year and Value.

(hint: use dplyr and the function select(). Make sure that dplyr is installed and loaded in your R session. You can of course use alternative syntax or other functions to produce the same result)

library(dplyr)
dataset <- dataset %>% select(Country, Pollutant, Variable, Year, Value)
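As one example of the alternative syntax mentioned in the hint, the same kind of column selection can be done in base R without dplyr. This is only a sketch on a made-up data frame; toy, keep and the values are hypothetical.

```r
# Hypothetical data frame with one column we do not want to keep (COU)
toy <- data.frame(
  COU     = c("BEL", "FRA"),
  Country = c("Belgium", "France"),
  Year    = c(2020, 2020),
  Value   = c(1, 2)
)

# Option 1: index the columns by name
keep <- c("Country", "Year", "Value")
toy_subset <- toy[, keep]

# Option 2: subset() with the select argument
toy_subset2 <- subset(toy, select = c(Country, Year, Value))
```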

Step 4: Filter your Dataset

The column dataset$Variable contains values from different categories of emissions, as you can see in the table below.

Unique values
Total emissions excluding LULUCF
Total GHG excl. LULUCF, Index 1990=100
Total GHG excl. LULUCF per capita
5 - Waste
2- Industrial processes and product use
1 - Energy
3 - Agriculture
6 - Other
Total GHG excl. LULUCF per unit of GDP
1A1 - Energy Industries
Land use, land-use change and forestry (LULUCF)
1A4 - Residential and other sectors
1A5 - Energy - Other
1B - Fugitive Emissions from Fuels
1A2 - Manufacturing industries and construction
1A3 - Transport
Total GHG excl. LULUCF, Index 2000=100
Total emissions including LULUCF
1C - CO2 from Transport and Storage
Agriculture, Forestry and Other Land Use (AFOLU)
1A4 - Residential and other sectors

To avoid double counting, because some of the categories are aggregated measures, filter the dataset to keep only the category “Total emissions excluding LULUCF”. Make sure that you type the value exactly as it appears in the data (there is an extra space); otherwise the filter will return 0 observations.

(hint: use dplyr and the function filter(), and make sure that you type the value you want to keep correctly)

dataset <- dataset %>% filter(Variable == "Total emissions excluding LULUCF")
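If the filter returns 0 observations, one possible way to track down a stray space is to print the unique values and normalise the whitespace before filtering. The sketch below uses a made-up data frame; where exactly the extra space sits in the real data is an assumption here (a double space between two words).

```r
library(dplyr)

# Hypothetical data: the first label contains a double space
toy <- data.frame(
  Variable = c("Total  emissions excluding LULUCF", "1 - Energy"),
  Value    = c(10, 4)
)

# Inspect the labels exactly as stored
unique(toy$Variable)

# Collapse runs of whitespace into one space and trim both ends
toy <- toy %>%
  mutate(Variable = trimws(gsub("[[:space:]]+", " ", Variable)))

totals <- toy %>%
  filter(Variable == "Total emissions excluding LULUCF")
```

Cleaning the column once means that later comparisons can use the label as you would naturally type it.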

Step 5: Calculate and save the total emissions per year across all countries in the dataset. I called the new data frame dataset_year and the new variable total_emissions

(hint: use dplyr and the functions group_by() and summarize())

dataset_year <- dataset %>% group_by(Year) %>% summarize(total_emissions = sum(Value))
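One thing to watch for with sum(): a single missing Value makes the total for that year NA. The sketch below shows the effect on made-up numbers and one possible remedy, na.rm = TRUE. Whether the real dataset contains missing values is not known here, so treat this as a precaution rather than a required step.

```r
library(dplyr)

# Hypothetical data with one missing value in 2019
toy <- data.frame(
  Year  = c(2019, 2019, 2020, 2020),
  Value = c(10, NA, 7, 5)
)

# Without na.rm, the 2019 total is NA
toy_year_na <- toy %>%
  group_by(Year) %>%
  summarize(total_emissions = sum(Value))

# With na.rm = TRUE, missing values are skipped
toy_year <- toy %>%
  group_by(Year) %>%
  summarize(total_emissions = sum(Value, na.rm = TRUE))
```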

Step 5 (continued): Plot the total annual emissions

Use the ggplot() function to draw a scatterplot of total emissions (y axis) against year (x axis)

(hint: you need to define the data, the aesthetics and the correct geometry and, most importantly, you need to load the library ggplot2)

If you have done it correctly, you should see the following plot

ggplot(dataset_year, aes(Year,total_emissions)) + geom_point()

Step 6: Smooth line and labels

Repeat the previous plot, but this time also add a smooth line. Define a title for the plot (“Total Annual Emissions”), x (“Year”) and y (“Total emissions excluding LULUCF”) labels and a caption (“Source OECD”). Finally, change the theme of the plot to bw (theme_bw())

(hint: use the function labs() and define the arguments title, x, y and caption. As before, define the data, the aesthetics and the correct geometries, and load the library ggplot2)

ggplot(dataset_year, aes(Year, total_emissions)) + geom_point() + geom_smooth() + labs(title = "Total Annual Emissions", x = "Year", y = "Total emissions excluding LULUCF", caption = "Source OECD") + theme_bw()

Step 7: Combine Several plots

The par() function with its mfrow argument can be used to combine multiple plots from the graphics package. We used mfrow in one of our first seminars. It is not suitable for ggplot graphs. There are several packages and functions that you can use instead, such as the function grid.arrange() from the library gridExtra

Your task is to create the following two plots in ggplot:

  • plot1: the plot from the previous step (assign it to an object called plot1)
  • plot2: the distribution plot (histogram) of total emissions/1000 with the parameters fill = "blue" and color = "black". Also define a title for the plot (“Frequency of annual Emissions”), x (“total emissions expressed in 1000s”) and y (“Frequency”) labels.

plot2 <- ggplot(dataset_year, aes(total_emissions/1000)) + geom_histogram(fill = "blue", color = "black") + labs(title = "Frequency of annual Emissions", x = "total emissions expressed in 1000s", y = "Frequency")

Combine the two plots using the function grid.arrange()

(hint: you first need to install and load the library gridExtra)

grid.arrange(plot1, plot2, ncol = 2, nrow = 1)
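For reference, here is a self-contained sketch of the whole step on made-up numbers, showing that both plots have to be stored in objects before they can be passed to grid.arrange(). The data frame toy_year and its values are hypothetical, and bins = 5 is chosen only because the toy data has ten rows.

```r
library(ggplot2)
library(gridExtra)

# Hypothetical yearly totals standing in for dataset_year
toy_year <- data.frame(
  Year = 2010:2019,
  total_emissions = c(15200, 15050, 14900, 14950, 14700,
                      14600, 14450, 14500, 14300, 13900)
)

# Store each plot in an object instead of printing it
plot1 <- ggplot(toy_year, aes(Year, total_emissions)) +
  geom_point()

plot2 <- ggplot(toy_year, aes(total_emissions / 1000)) +
  geom_histogram(bins = 5, fill = "blue", color = "black")

# arrangeGrob() builds the side-by-side layout without drawing it
combined <- arrangeGrob(plot1, plot2, ncol = 2, nrow = 1)

# grid.arrange() builds the same layout and draws it
grid.arrange(plot1, plot2, ncol = 2, nrow = 1)
```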

Step 8: Challenge – Visualizing Emissions Data on a World Map

Now that you have summarized the emissions data by year, let’s take it a step further and visualize emissions by country on a world map. This will help us understand the geographic distribution of greenhouse gas emissions.

To plot a world map, you need to install and load the maps package.

library(maps)
library(viridis) # provides the colour scale used in the map
# Select the latest available year
latest_year <- max(dataset$Year)

# Summarize emissions by country
dataset_country <- dataset %>%
  filter(Year == latest_year) %>%
  group_by(Country) %>%
  summarize(total_emissions = sum(Value, na.rm = TRUE))

Task: Create the Map Visualization

Now we create a choropleth map where countries are shaded based on their total emissions. The map_data("world") function provides country boundaries, but country names in the emissions dataset may not perfectly match those in the map, so some manual adjustments may be needed.

# Load world map data
world_map <- map_data("world")

# Rename columns for consistency
colnames(world_map)[colnames(world_map) == "region"] <- "Country"

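# Merge emissions data with map data
# (stored as map_emissions so that the object does not share a name with the map_data() function)
map_emissions <- left_join(world_map, dataset_country, by = "Country")

ggplot(map_emissions, aes(x = long, y = lat, group = group, fill = total_emissions)) +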
  geom_polygon(color = "gray") +
  scale_fill_viridis(option = "magma", na.value = "white") +  # Use a color scale
  labs(
    title = paste("Total Greenhouse Gas Emissions by Country in", latest_year),
    fill = "Emissions"
  ) +
  theme_minimal()
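The manual adjustments mentioned above could be approached as in the sketch below: list the country names that have no match in the map, then rename them before joining. All names and values are made up for illustration. Check the real spellings with unique(world_map$Country) and unique(dataset_country$Country) before writing your own lookup.

```r
# Hypothetical names; the real ones come from the two data frames
emission_names <- c("United States", "United Kingdom", "France")
map_names      <- c("USA", "UK", "France", "Spain")

# Countries in the emissions data with no counterpart in the map
unmatched <- setdiff(emission_names, map_names)
unmatched

# Hypothetical emissions table and a lookup of old name -> map name
toy_country <- data.frame(
  Country         = emission_names,
  total_emissions = c(6000, 450, 440)
)
fixes <- c("United States" = "USA", "United Kingdom" = "UK")

# Replace only the names that appear in the lookup
idx <- toy_country$Country %in% names(fixes)
toy_country$Country[idx] <- unname(fixes[toy_country$Country[idx]])
```

Running setdiff() again after the renaming is a quick way to confirm that nothing is left unmatched.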

Second Exercise (if there is time available; no answers are provided this time)

For this exercise you will use diamonds, a built-in dataset that comes with ggplot2. It contains the prices and other attributes of almost 54,000 diamonds. The variables in the dataset are described below:

  • price : price in US dollars ($326–$18,823)
  • carat : weight of the diamond (0.2–5.01)
  • cut : quality of the cut (Fair, Good, Very Good, Premium, Ideal)
  • color : diamond colour, from D (best) to J (worst)
  • clarity : a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
  • x : length in mm (0–10.74)
  • y : width in mm (0–58.9)
  • z : depth in mm (0–31.8)
  • depth : total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
  • table : width of top of diamond relative to widest point (43–95)

Step 1: Understand your dataset

Execute the commands below to have a clear picture of the variables, dimensions and observations of the dataset:

dim(diamonds)
names(diamonds)
head(diamonds)
str(diamonds)
summary(diamonds)

Step 2: Declare data and aesthetics: We will use the cut variable. Both chunks of code below produce the same results:

ggplot(aes(x = cut), data = diamonds)

or

ggplot(diamonds, aes(cut))

At this point ggplot knows which data to use and has already mapped the cut variable, but it still does not know what to draw. To produce the plot you need to define a geometry.
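You can check this for yourself by counting the layers stored in the plot object before and after adding a geometry. This is a small sketch; the object names p_empty and p_bar are arbitrary.

```r
library(ggplot2)

# Data and aesthetics only: the object exists but has no layers to draw
p_empty <- ggplot(diamonds, aes(cut))
length(p_empty$layers)

# Adding a geometry creates the first layer
p_bar <- p_empty + geom_bar()
length(p_bar$layers)
```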

Step 3: Create a bar graph for the cut variable (the geometry is geom_bar())

  • Create a histogram for the same variable (the geometry is geom_histogram())

  • Create a histogram for the variable carat

(hint: you can change the bin width with the following syntax: geom_histogram(binwidth = 0.1))

  • Repeat the last step, but now limit the values on the x axis to the range (0, 3). You can do this by adding the following to your existing code: your previous code + xlim(0, 3)
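A side note on xlim(): it sets the limits of the scale, so observations outside the range are removed before the histogram is computed, and ggplot2 warns about the removed rows when the plot is drawn. If you only want to zoom in without discarding data, coord_cartesian() is one alternative. The sketch below builds both versions so you can compare them.

```r
library(ggplot2)

base_hist <- ggplot(diamonds, aes(carat)) +
  geom_histogram(binwidth = 0.1)

# xlim() changes the scale: rows outside (0, 3) are removed before binning
hist_xlim <- base_hist + xlim(0, 3)

# coord_cartesian() only zooms the view: every row still enters the histogram
hist_zoom <- base_hist + coord_cartesian(xlim = c(0, 3))

# Number of diamonds heavier than 3 carats, i.e. rows that xlim(0, 3) removes
sum(diamonds$carat > 3)
```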

Step 4: Plot the histogram for the variable depth

  • Set the binwidth to 0.2. Execute the code.

  • Repeat, but now limit the values on the x axis to the range (55, 70). Execute the code.

  • Do the same, but now map the variable cut in the aesthetics using the argument fill = cut. Execute the code.

  • Remove cut from the aesthetics and present the variable using facet_wrap() instead. Execute the code.

  • Create a scatterplot mapping carat to the x axis and price to the y axis (the geometry is geom_point()). Execute the code.

  • Change the colour of the points to blue (this should be done at the geometry level with the argument colour = "blue"). Also add a smooth line. Execute the code.

  • Change the colour of the smooth line to red (the argument should go inside the function you used to plot the smooth line). Execute the code.

Step 5: Follow the example

The following code plots points and a smooth line for x = carat and y = price, where the colour of the points is a function of the variable cut. As you can see, the colour is defined at the geometry level:

ggplot(diamonds, aes(x = carat, y = price)) + geom_point(aes(colour= cut)) + 
geom_smooth()

  • Follow the example, but this time define the colour as a function of cut in the aesthetics at the ggplot() level. Can you spot any difference?

  • Include a title, a caption, and x and y labels. Choose whatever text you want.

Seminar Conclusion

These were just some of the basic functions that you can use to produce plots in R. There are many more features and opportunities to build very interesting plots. For example, packages such as gganimate or plotly can be used to build animated plots like the one in the code below. However, the more complex the code, the more likely you are to receive error messages because something is missing from your computer. Always search online for the error message you receive to find solutions.

library(plotly)
library(gapminder)

df <- gapminder 
fig <- df %>%
  plot_ly(
    x = ~gdpPercap, 
    y = ~lifeExp, 
    size = ~pop, 
    color = ~continent, 
    frame = ~year, 
    text = ~country, 
    hoverinfo = "text",
    type = 'scatter',
    mode = 'markers'
  )
fig <- fig %>% layout(
    xaxis = list(
      type = "log"
    )
  )

fig