This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. We will cover this in the next lecture.
ggplot() produces a graphical object and requires three essential elements: data, aesthetics and geometries.
- data = your_data_frame_name specifies a dataframe
- aes() describes the aesthetics
- geom_xxx() specifies which geometric object will be drawn
There are also four optional elements. You have already seen the first two:
- facet_xxx(): faceting splits the data into subsets and displays the same graph for every subset
- theme() and theme_xxx(): themes influence the “non-ink” part of the figure. For example, you can change the graphics background, axis size, or header
- Statistics: let you transform the data (add mean, median, quartile)
- Coordinates: transform the axes (change the spacing of the displayed data)
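As a minimal sketch of how these elements fit together, the example below combines the three essential elements with one facet and one theme. It uses the mtcars data frame that ships with R, which is an assumption made only so that the code runs without the seminar file.

```r
library(ggplot2)

# mtcars is a built-in data frame; it stands in for your own data here
p <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +  # data + aesthetics
  geom_point() +                                    # geometry: one point per car
  facet_wrap(~ cyl) +                               # optional: one panel per cylinder count
  theme_bw()                                        # optional: change the "non-ink" look

p  # printing the object draws the plot
```

Removing the geom_point() line leaves an empty panel grid, because ggplot would know the data and the mapping but not what to draw.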
For the seminar you will use a dataset on greenhouse gas emissions from the OECD. You can access the data in the seminar folder on Minerva.
Alternatively, you can find the data at this link. Export the data as a csv file and rename it “Greenhouse_gas_emmisions_OECD.csv”.
You are asked to perform the following tasks:
Step 1: Import the data into the R environment under the name dataset
(hint: use read.csv() and make sure that the file is in your working directory)
dataset <- read.csv("Greenhouse_gas_emmisions_OECD.csv")
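If read.csv() complains that it cannot open the file, the file is usually not in the working directory. The sketch below shows one way to check this before importing. The variable name csv_file is introduced here only for illustration.

```r
# File name as used in this seminar; it must sit in the working directory
csv_file <- "Greenhouse_gas_emmisions_OECD.csv"

getwd()  # shows the folder R is currently reading from

if (file.exists(csv_file)) {
  dataset <- read.csv(csv_file)
} else {
  message("File not found in ", getwd(),
          ". Move the csv there or change the folder with setwd().")
}
```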
Step 2: Understand your dataset
You can execute several commands to get a first idea of the dataset. For example, names(dataset) returns the following values: COU, Country, POL, Pollutant, VAR, Variable, Year, Unit.Code, Unit, PowerCode.Code, PowerCode, Reference.Period.Code, Reference.Period, Value, Flag.Codes, Flags
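The commands typically used for a first look can be tried on a small stand-in data frame. The data frame toy and its values are invented for illustration and only mimic three of the columns listed above.

```r
# Toy stand-in for the OECD file; the values are invented
toy <- data.frame(
  Country = c("A", "A", "B", "B"),
  Year    = c(1990, 1991, 1990, 1991),
  Value   = c(10, 12, 20, 19)
)

names(toy)    # column names
dim(toy)      # number of rows and columns
head(toy, 3)  # first three rows
str(toy)      # type of each column
summary(toy)  # basic descriptive statistics per column
```

The same calls work on dataset once the csv file has been imported.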
Step 3: Subset your Dataset
From the variables described above, select only Country, Pollutant, Variable, Year and Value
(hint: use dplyr and the function select(). Make sure that dplyr is installed and loaded in your R session. You can of course use alternative syntax or other functions to produce the same result)
dataset <- dataset %>% select(Country, Pollutant, Variable, Year, Value)
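To illustrate the remark about alternative syntax, the sketch below selects the same columns once with dplyr and once with base R indexing. The data frame toy and its values are invented for illustration.

```r
library(dplyr)

# Toy data frame with invented values and one column we do not need
toy <- data.frame(
  Country    = c("A", "B"),
  Year       = c(1990, 1990),
  Value      = c(10, 20),
  Flag.Codes = c(NA, NA)
)

kept_dplyr <- toy %>% select(Country, Year, Value)  # dplyr syntax
kept_base  <- toy[, c("Country", "Year", "Value")]  # base R alternative

# Both approaches keep the same three columns
identical(names(kept_dplyr), names(kept_base))
```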
Step 4: Filter your Dataset
The dataset$Variable column contains values from different categories of emissions, as you can see in the table below.
Unique Values |
---|
Total emissions excluding LULUCF |
Total GHG excl. LULUCF, Index 1990=100 |
Total GHG excl. LULUCF per capita |
5 - Waste |
2- Industrial processes and product use |
1 - Energy |
3 - Agriculture |
6 - Other |
Total GHG excl. LULUCF per unit of GDP |
1A1 - Energy Industries |
Land use, land-use change and forestry (LULUCF) |
1A4 - Residential and other sectors |
1A5 - Energy - Other |
1B - Fugitive Emissions from Fuels |
1A2 - Manufacturing industries and construction |
1A3 - Transport |
Total GHG excl. LULUCF, Index 2000=100 |
Total emissions including LULUCF |
1C - CO2 from Transport and Storage |
Agriculture, Forestry and Other Land Use (AFOLU) |
1A4 - Residential and other sectors |
To avoid double counting, since some of the categories are aggregated measures, filter the dataset to keep only the category “Total emissions excluding LULUCF”
(hint: use dplyr and the function filter(), and make sure that you type the value you want to keep correctly)
dataset <- dataset %>% filter(Variable == "Total emissions excluding LULUCF")
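The warning about typing the value correctly matters because filter() does not raise an error when nothing matches. It simply returns zero rows. The sketch below shows this on an invented data frame called toy.

```r
library(dplyr)

# Invented values; only the category labels are taken from the table above
toy <- data.frame(
  Variable = c("Total emissions excluding LULUCF", "1 - Energy",
               "Total emissions excluding LULUCF"),
  Value    = c(100, 80, 110)
)

unique(toy$Variable)  # copy the exact spelling from this output

kept <- toy %>% filter(Variable == "Total emissions excluding LULUCF")
typo <- toy %>% filter(Variable == "Total emissions excluding lulucf")

nrow(kept)  # 2 rows match
nrow(typo)  # 0 rows: the comparison is case sensitive and fails silently
```

Checking nrow() after a filter is a cheap way to catch a misspelled category.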
Step 5: Let’s calculate and save the total emissions per year across all countries in the dataset. The new dataframe should be called dataset_year and the new variable total_emissions
(hint: use dplyr and the functions group_by() and summarize())
dataset_year <- dataset %>% group_by(Year) %>% summarize(total_emissions = sum(Value))
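The sketch below runs the same group_by() and summarize() pattern on an invented data frame so that the result can be checked by hand. The na.rm = TRUE argument is an addition that is not part of the seminar code. It is included because sum() returns NA as soon as one value is missing, and the seminar text does not say whether the OECD file contains missing values.

```r
library(dplyr)

# Invented values, including one missing observation
toy <- data.frame(
  Country = c("A", "B", "A", "B"),
  Year    = c(1990, 1990, 1991, 1991),
  Value   = c(10, 20, 12, NA)
)

toy_year <- toy %>%
  group_by(Year) %>%                                     # one group per year
  summarize(total_emissions = sum(Value, na.rm = TRUE))  # add up within each group

toy_year  # 1990 gives 30, 1991 gives 12
```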
Step 5: Let’s plot the total annual emissions
Use the ggplot() function to draw a scatterplot of the total emissions (y axis) against the years (x axis)
(hint: you need to define the data, the aesthetics and the correct geometry, and most importantly you need to load the library ggplot2)
If you have done it correctly, you should have produced the following plot:
ggplot(dataset_year, aes(Year,total_emissions)) + geom_point()
Step 6: Smooth line and Label
Repeat the previous plot but this time also add a smooth line. Also define a title for the plot (“Total Annual Emissions”), x (“Year”) and y (“Total emissions excluding LULUCF”) labels and a caption (“Source OECD”). Finally, change the theme of the plot to bw (theme_bw())
(hint: you need the function labs() with the arguments title, x, y and caption, in addition to the data, aesthetics and geometries, and the library ggplot2 must be loaded)
ggplot(dataset_year, aes(Year, total_emissions)) + geom_point() + geom_smooth() + labs(title = "Total Annual Emissions", x = "Year", y = "Total emissions excluding LULUCF", caption = "Source OECD") + theme_bw()
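Because ggplot() returns an object, the plot can also be built up in stages instead of in one long line. The sketch below does this with invented yearly totals in a data frame called toy_year. The explicit method and formula arguments in geom_smooth() are an assumption added here to avoid the informational message. They correspond to what ggplot2 picks by default for small datasets.

```r
library(ggplot2)

# Invented yearly totals, standing in for dataset_year
toy_year <- data.frame(
  Year            = 1990:2009,
  total_emissions = 50 + 0.8 * (0:19) + rep(c(1, -1), 10)
)

# Stage 1: data, aesthetics and the point geometry
base_plot <- ggplot(toy_year, aes(Year, total_emissions)) + geom_point()

# Stage 2: add the smooth line, the labels and the theme to the stored object
final_plot <- base_plot +
  geom_smooth(method = "loess", formula = y ~ x) +
  labs(title = "Total Annual Emissions", x = "Year",
       y = "Total emissions excluding LULUCF", caption = "Source OECD") +
  theme_bw()

final_plot
```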
Step 7: Combine Several plots
In the lecture you saw how par() and its mfrow argument combine several plots from the graphics package. This does not work for ggplot graphs. Several packages offer alternatives, such as the function grid.arrange() from the library gridExtra.
Your task here is to create the following two plots in ggplot:
ggplot(dataset_year, aes(total_emissions/1000)) + geom_histogram(fill = "blue", color = "black") + labs(title = "Frequency of annual Emissions", x = "total emissions expressed in 1000s", y = "Frequency")
Combine the two plots using the function grid.arrange()
(hint: you first need to install and load the library gridExtra)
grid.arrange(plot1, plot2, ncol = 2, nrow = 1)
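grid.arrange() expects plot objects, so each plot has to be saved under a name first. The sketch below shows the full pattern with invented data. Using the scatterplot as plot1 and the histogram as plot2 is an assumption, as is the bins = 10 setting, which is added only to suit the small toy dataset.

```r
library(ggplot2)
library(gridExtra)

# Invented yearly totals, standing in for dataset_year
toy_year <- data.frame(
  Year            = 1990:2009,
  total_emissions = 50 + 0.8 * (0:19) + rep(c(1, -1), 10)
)

# Save each plot under a name instead of printing it directly
plot1 <- ggplot(toy_year, aes(Year, total_emissions)) +
  geom_point()
plot2 <- ggplot(toy_year, aes(total_emissions)) +
  geom_histogram(bins = 10, fill = "blue", color = "black")

# Draw both plots side by side: one row, two columns
grid.arrange(plot1, plot2, ncol = 2, nrow = 1)
```

Setting ncol = 1 and nrow = 2 stacks the plots vertically instead.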
The R code solutions for both parts will be available on Minerva after the end of the face-to-face seminars
For this exercise you will use diamonds, a dataset built into ggplot2. It contains the prices and other attributes of almost 54,000 diamonds. The variables in the dataset are described below:
Step 1: Understand your dataset
Execute the commands below to have a clear picture of the variables, dimensions and observations of the dataset:
dim(diamonds)
names(diamonds)
head(diamonds)
str(diamonds)
summary(diamonds)
Step 2: Declare data and aesthetics: We will use the cut variable. Both chunks of code below produce the same results:
ggplot(aes(x = cut), data = diamonds)
or
ggplot(diamonds, aes(cut))
At this point ggplot knows which data to use and has already mapped the cut variable but still does not know what to plot. To produce the plot you need to define the geometries.
Step 3: Create a bar graph for the cut variable (the geometry is geom_bar())
Create a histogram for the same variable (the geometry is geom_histogram())
Create a histogram for the variable carat
(hint: you can change the binwidth with the following syntax: geom_histogram(binwidth = 0.1))
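The sketch below shows one possible reading of the bar graph and the carat histogram. It also illustrates why the two geometries differ: geom_bar() counts the rows in each category, while geom_histogram() first cuts a continuous variable into bins. In recent ggplot2 versions, geom_histogram() applied to a discrete variable such as cut typically stops with a message that a continuous x variable is required. The object names are introduced here for illustration.

```r
library(ggplot2)

# cut is categorical, so the counts per category are drawn as bars
bar_cut <- ggplot(diamonds, aes(cut)) +
  geom_bar()

# carat is continuous, so it is binned; binwidth sets the width of each bin
hist_carat <- ggplot(diamonds, aes(carat)) +
  geom_histogram(binwidth = 0.1)

bar_cut
hist_carat
```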
Step 4: Plot the histogram for the variable depth
Set the binwidth to 0.2. Execute the code.
Repeat, but now limit the values on the x-axis to the range (55, 70). Execute the code.
Do the same, but now map the variable cut in the aesthetics using the argument fill = cut. Execute the code.
Remove cut from the aesthetics and present the variable using facet_wrap(). Execute the code.
Create a scatterplot mapping carat to the x axis and price to the y axis (the geometry is geom_point()). Execute the code.
Change the colour to blue (this should be done at the geometry level with the argument colour = "blue"). Also add a smooth line. Execute the code.
Change the colour of the smooth line to red (the argument should go inside the function you used to plot the smooth line). Execute the code.
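One possible way to work through these tasks is sketched below. It is not the official solution. The object names are introduced here for illustration, and xlim() is only one of several ways to restrict the x-axis.

```r
library(ggplot2)

# Histogram of depth with a bin width of 0.2
hist_depth <- ggplot(diamonds, aes(depth)) +
  geom_histogram(binwidth = 0.2)

# Same plot, x-axis restricted to 55-70 (rows outside the range are dropped)
hist_limited <- hist_depth + xlim(55, 70)

# cut mapped to the fill colour inside the aesthetics
hist_fill <- ggplot(diamonds, aes(depth, fill = cut)) +
  geom_histogram(binwidth = 0.2) +
  xlim(55, 70)

# cut shown as separate panels instead of colours
hist_facet <- hist_limited + facet_wrap(~ cut)

# Scatterplot with fixed colours set inside each geometry, not inside aes()
scatter <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(colour = "blue") +
  geom_smooth(colour = "red")

hist_facet  # type the name of any object above to draw it
```

A colour written outside aes() is a fixed setting for that layer, whereas a colour written inside aes() is mapped to a variable.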
Step 5: Follow the example
The following code plots points and a smooth line for x = carat, y = price, where the colour of the points is a function of the variable cut. As you can see, the colour is defined at the geometry level:
ggplot(diamonds, aes(x = carat, y = price)) + geom_point(aes(colour= cut)) +
geom_smooth()
Follow the example, but this time define the colour as a function of cut in the aesthetics at the ggplot() level. Can you spot any difference?
Include a title, a caption, and x and y labels. Choose whatever titles you want.
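The sketch below places the two variants next to each other. Aesthetics declared inside ggplot() are inherited by every layer, so the smooth line is also split by cut. Aesthetics declared inside a geometry apply to that layer only. The object names and the label texts are placeholders chosen for illustration.

```r
library(ggplot2)

# Colour mapped at the geometry level: coloured points, one overall smooth line
geom_level <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(aes(colour = cut)) +
  geom_smooth()

# Colour mapped at the ggplot() level: coloured points and one smooth line per cut
plot_level <- ggplot(diamonds, aes(x = carat, y = price, colour = cut)) +
  geom_point() +
  geom_smooth() +
  labs(title = "Price by carat", x = "Carat", y = "Price",
       caption = "Source: diamonds dataset")

geom_level
plot_level
```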
These were just some of the basic functions that you can use to produce plots in R. There are many more functionalities and opportunities to build very interesting plots. For example, packages such as gganimate or plotly can be used to build animated plots like the one in the code below. However, the more complex the code, the more likely you are to receive error messages because something is missing from your computer. Always search online for the error message you receive to find solutions.
library(plotly)
library(gapminder)
df <- gapminder
fig <- df %>%
  plot_ly(
    x = ~gdpPercap,
    y = ~lifeExp,
    size = ~pop,
    color = ~continent,
    frame = ~year,
    text = ~country,
    hoverinfo = "text",
    type = 'scatter',
    mode = 'markers'
  )
fig <- fig %>% layout(
  xaxis = list(
    type = "log"
  )
)
fig