This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This document is an effort to gain insight into Copenhagen’s CO2 emission data.
The data is open data from Open Data of Denmark, provided as a spreadsheet in ODS format. The data can be found here.
The visualizations available in the spreadsheet file are very limited and thus do not offer much insight into the data. Specifically, all of the suggested visualizations are either uni- or bi-dimensional, which means each one considers only a very small portion of the dataset. A systemic approach to the dataset suggests looking at the big picture: keeping the components (here, the variables) and their relations together in order to find possible emergent properties. I know of no tool for doing so other than high-dimensional data visualization.
Using high-D data visualization is not a rejection of simple graphs; it is a complementary approach. With high-D visualization techniques, we may or may not see anything beyond what we have already seen, but looking at the big picture and finding nothing special is better than ignoring the holistic view. Nevertheless, one may be surprised by the results of the high-D visualizations in this study, as I was on first encounter!
In the spreadsheet file, there are some graphs, including segmented bar charts, a time-series line chart of the total measure, and some multiple-bar plots. While these graphs convey some information and provide some insight, this study tries to gain insight into the data using other visualization methods, such as the PCA bi-plot.
These graphs are definitely useful: they summarize the numbers into nice, easily digestible graphs. However, each graph is limited to one or two variables. We know that emergent properties arise in the presence of all components and their relations. Is it possible to keep all the variables? The next section is dedicated to answering this question.
If we consider every emission section as a variable, and these variables together describe each year since 2005, then we can investigate two important things.
The correlation between variables. In other words, how are the CO2 emissions of different sections related to each other? Have Electricity consumption and Heating Consumption been related over the time period? What about road traffic and air traffic?
How does the behaviour of Copenhagen change over the time period, based on its CO2 emission profile? This question is the more important one. Through a dimension-reduction method we can plot each year as a point whose characteristics are ideally defined by all variables simultaneously.
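library(readr)   # read_csv()
library(knitr)   # kable()
# Read the CSV export: the first column holds the section names, the rest one column per year
data <- read_csv(file = "/Users/Shaahin/Downloads/co2.csv",
                 col_types = "cnnnnnnnn")
# Transpose so that years become rows (observations) and sections become columns (variables)
data_t <- t(as.matrix(data[,-1]))
data_t <- data.frame(data_t)
# Renamed from 'colnames' to avoid shadowing the base function of the same name
section_names <- data.frame(data)[,1]
colnames(data_t) <- section_names
kable(data_t)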
| Year | Electricity consumption | Heating Consumption | Individual heating, Trade and Service and households | Individual heating and process heat, Industry | Individual heating, agriculture and horticulture | Built for cooking | road traffic | Train traffic (including electric trains) | Air traffic | Ship traffic | fishing | Non-road industry | Non-road agriculture and forestry | Non-road garden / household | Process emissions, industry | solvents | Agriculture and forestry | land use | Landfill | Sewage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2005 | 1314013 | 611315 | 31232 | 190 | 0 | 19188 | 396529 | 48783 | 11918 | 43670 | 0 | 30797 | 0 | 4153 | 3743 | 3082 | 0 | 387 | 1377 | 15360 |
| 2008 | 1257798 | 569458 | 29811 | 181 | 0 | 17923 | 406741 | 57648 | 10039 | 42678 | 0 | 36723 | 0 | 3244 | 3095 | 8193 | 0 | 260 | 710 | 15982 |
| 2009 | 1323397 | 622666 | 37294 | 165 | 0 | 16202 | 387279 | 62762 | 15258 | 42146 | 0 | 50856 | 0 | 3223 | 2447 | 6233 | 0 | 220 | 550 | 16502 |
| 2010 | 1281291 | 611830 | 26602 | 2682 | 0 | 15718 | 378217 | 44197 | 16141 | 44640 | 0 | 62880 | 0 | 3320 | 205 | 8421 | 0 | 135 | 700 | 16725 |
| 2011 | 1049348 | 477725 | 22179 | 3132 | 0 | 16441 | 374672 | 36438 | 15353 | 58360 | 0 | 49024 | 0 | 3345 | 5063 | 6102 | 0 | 292 | 830 | 15355 |
| 2012 | 870498 | 486358 | 19894 | 2440 | 0 | 20600 | 359801 | 31157 | 14579 | 35148 | 0 | 73839 | 0 | 5551 | 3969 | 15078 | 0 | 37 | 800 | 19137 |
| 2013 | 979544 | 475198 | 22239 | 11939 | 0 | 14958 | 355814 | 36635 | 13500 | 29300 | 0 | 47900 | 0 | 3355 | 4696 | 14600 | 0 | 44 | 750 | 19193 |
| 2014 | 806715 | 384200 | 17616 | 10672 | 0 | 13947 | 349674 | 29794 | 14400 | 40300 | 0 | 59100 | 0 | 3370 | 3733 | 19500 | 0 | 25 | 750 | 16588 |
years <- rownames(data_t)
data_biplot <- data_t[,!colnames(data_t) %in%
c("fishing",
"Individual heating, agriculture and horticulture",
"Non-road agriculture and forestry",
"Agriculture and forestry")]
Now the data is ready for our purpose. It is often difficult to decide what the variables in a dataset are. Here, I consider the years as observations and the different consumption sectors as variables. The reason lies in the purpose of the visualization: I want to explore the relation between the years, considering all consumption sectors simultaneously.
In the first step, I remove the variables, i.e. emission sections, whose values are all zero. This includes four variables: fishing; Individual heating, agriculture and horticulture; Non-road agriculture and forestry; and Agriculture and forestry.
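The all-zero sections can also be detected programmatically instead of being listed by name. The following is a minimal sketch on a hypothetical toy data frame (`toy` and its column values are illustrative, not the Copenhagen data). It also shows why such columns have to go before standardizing: a constant column has zero standard deviation, so `scale()` turns it into `NaN`.

```r
# Hypothetical toy data: rows are years, columns are emission sections
toy <- data.frame(
  electricity = c(10, 12, 9),
  fishing     = c(0, 0, 0),   # an all-zero section
  road        = c(5, 4, 6)
)

# A constant column has zero variance, so standardizing it yields NaN
scaled_all <- scale(toy)
any(is.nan(scaled_all[, "fishing"]))

# Keep only the columns that contain at least one non-zero value
keep <- colSums(toy != 0) > 0
toy_nonzero <- toy[, keep, drop = FALSE]
names(toy_nonzero)
```

Selecting by a computed mask rather than by name keeps working if the set of empty sections changes in a later release of the data.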
Now it is time to test the correlations among the variables. To do so, I use the corrplot package.
library(corrplot)
#chart.Correlation(as.matrix(data_biplot), histogram=TRUE, pch = 19)
# Lower-triangle correlation plot of the remaining (non-zero) emission sections
corrplot(cor(as.matrix(data_biplot)), order = "original", tl.col = "black", tl.srt = 15 , type = "lower",tl.cex = 0.4)
The findings are interesting! For instance, Sewage has a very high negative correlation with Ship traffic! And land use is strongly and positively correlated with road traffic. This one is understandable at least, since when land use increases, road traffic can increase as well. We need more domain knowledge to go deeper into possible interpretations of these correlations.
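As a quick check of a single pair, the Sewage and Ship traffic columns can be typed in from the table above and correlated directly. This is only a sketch: the vector names `sewage` and `ship` are mine, and `cor.test()` is added to show how uncertain an estimate from eight yearly observations is likely to be.

```r
# Values copied from the table above (years 2005, 2008-2014)
sewage <- c(15360, 15982, 16502, 16725, 15355, 19137, 19193, 16588)
ship   <- c(43670, 42678, 42146, 44640, 58360, 35148, 29300, 40300)

# Pearson correlation between the two sections
r <- cor(sewage, ship)

# With n = 8 the estimate is uncertain; cor.test() reports a confidence interval
ct <- cor.test(sewage, ship)
ci <- ct$conf.int

print(round(r, 2))
print(round(ci, 2))
```

A strongly negative coefficient together with a wide interval is a reminder to treat such findings as leads for further inspection rather than as established relations.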
Moreover, we need to look at the scatterplots in order to gain further insight into these findings. The correlation coefficient, while very popular and common, can be very deceptive, and we should not trust it without further investigation of the possible relation.
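A classic illustration of how deceptive the coefficient can be is Anscombe's quartet, which ships with base R as the `anscombe` dataset. It is not part of the Copenhagen data; it is used here only as a sketch of the point above. All four x/y pairs have a correlation of about 0.82, yet their scatterplots look completely different.

```r
# Anscombe's quartet ships with base R (datasets package)
data(anscombe)

# Correlation of each x/y pair: x1-y1, x2-y2, x3-y3, x4-y4
r <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
round(r, 2)

# To see how different the four relations are, plot each pair, e.g.:
# plot(anscombe$x2, anscombe$y2)
```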
Now let’s answer the second question, the ambitious one! Is it possible to show the behaviour of Copenhagen in CO2 emission over time? Is there in fact any change to be shown? By behaviour, I mean the profile of CO2 emission across the different categories. Has this profile changed over time, according to the dataset? Let’s make a bi-plot of the data.
#summary(data_t)
#apply(data_biplot,2 , sum)
#data_biplot_prop <- data_biplot[,-17]/(apply(data_biplot[,-17], MARGIN = 1 , FUN = sum))
#apply(data_biplot_prop, MARGIN = 1 , FUN = sum)
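library(ggplot2)
library(ggrepel)   # geom_text_repel()
# Standardize each section (zero mean, unit variance) so large sections do not dominate
data_scaled <- scale(data_biplot)
data_pca <- prcomp(data_scaled)
# Proportion of variance explained by the first two components (used in the axis labels)
components_load <- summary(data_pca)$importance[2,1:2]
components_load <- round(components_load,2)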
coordinates <- data.frame(years = years, data_pca$x[,1:2])
vectors <- data.frame(Variables = colnames(data_biplot),
data_pca$rotation[, 1:2])
ggplot() +
geom_point(data = coordinates, aes(x = PC1 , y = PC2),
size = 4,
color = "blue",
alpha = 0.5) +
geom_text_repel(data = coordinates,
aes(x = PC1 , y = PC2, label = years)) +
geom_segment(data = vectors , aes(x = PC1*4 , y = PC2*4, xend=0, yend=0),
color = "red",
alpha =1) +
geom_text_repel(data = vectors ,
aes(PC1*4 , PC2*4 , label = Variables ),
color = "red",
alpha = 0.5,
size = 2)+
xlab(label = paste0("PC1(",components_load[1]*100,"%)"))+
ylab(label = paste0("PC2(",components_load[2]*100,"%)"))+
theme_linedraw() +
coord_fixed(ratio = 1) +
ggtitle("Absolute CO2 Emission of Copenhagen Since 2005 to 2014")
Figure 2 is a bi-plot. It includes the years (solid circles) and the vectors of the variables. Each vector shows the direction in which the value of that variable increases. For instance, if we project the points, i.e. the years, onto the Air traffic vector, then 2010 has the highest value and 2012 and 2005 the lowest. As another example, based on the land use vector and the projection of the points onto it, 2005 has the highest land use value, followed by 2008, while 2013 and 2014 have the lowest. In general, the closeness of two years is a sign of the similarity of their CO2 emissions across the corresponding sections. The important point here is that I have used the absolute values of the CO2 emissions, not the proportional values. We will come back to this point soon.
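Reading a bi-plot by projection rests on the fact that the standardized data is approximately the product of the point coordinates (scores) and the transposed variable vectors (loadings). A minimal sketch of that relation follows, using the built-in `USArrests` dataset as a stand-in (an assumption for illustration, not the Copenhagen data):

```r
# Illustrative data only: the built-in USArrests set, not the Copenhagen data
X <- scale(USArrests)
pca <- prcomp(X)

scores   <- pca$x         # coordinates of the observations (the points)
loadings <- pca$rotation  # directions of the variables (the vectors)

# Using all components reproduces the standardized data exactly
full <- scores %*% t(loadings)
max(abs(full - X))

# Using only PC1 and PC2 gives the approximation that a bi-plot displays
approx2 <- scores[, 1:2] %*% t(loadings[, 1:2])

# How well does projecting onto the "Murder" vector recover the real values?
cor(approx2[, "Murder"], X[, "Murder"])
```

The projection therefore ranks observations by an approximation of the variable, which is why a ranking read from the map can differ from the ranking in the raw table.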
Is this map 100% precise? No. Bi-plots are based on Principal Component Analysis (PCA), and dimension reduction costs precision. Here, the original dimension of each year equals the number of variables that define the years, i.e. 16. To show a point from a 16-dimensional space in a two-dimensional space, I have used PCA, and the cost is 33% of the precision of the map. There are lots of discussions about PCA, and I am not a big fan of it, but for now it serves my purpose well. There are some non-linear alternatives, such as MDS and SOM, that I use in parallel in my studies. Here, to keep the report concise, PCA should suffice.
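Assuming that "precision" here refers to the share of variance captured by the two plotted components (which is what the axis labels of the bi-plot report), the loss can be computed from the component standard deviations. The sketch below uses the built-in `USArrests` dataset as a stand-in, and the names `retained_2d` and `lost_2d` are mine:

```r
# Illustrative data only: the built-in USArrests set, not the Copenhagen data
pca <- prcomp(USArrests, scale. = TRUE)

# Each component's share of the total variance
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Share retained by a two-dimensional map, and the share that is lost
retained_2d <- sum(var_explained[1:2])
lost_2d     <- 1 - retained_2d

round(c(retained = retained_2d, lost = lost_2d), 2)
```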
Beside the technical aspects of bi-plots, the interpretation of Figure 2 is very interesting. Clearly, the behaviour of Copenhagen has changed over the years: from 2005 at the top-left of the map, to 2010 at the center-bottom, to 2011 at the center, to 2012 at the top-right, and finally to 2013 and 2014 at the right end. What does this change mean? It can be interpreted based on the vectors, i.e. the variables. For instance, in 2005 the emphasis of the city was on Built for cooking, Landfill, Process emissions, industry, and Non-road garden / household. In 2010, the city is on the opposite side with respect to these variables! All of the mentioned variables were lower in 2010, while variables such as Air traffic and Non-road industry had increased. Other variables should be taken into account as well. One good next step would be to cluster the variables, so that we deal with fewer of them.
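One possible way to cluster variables, sketched here under my own assumptions, is hierarchical clustering on a correlation-based distance, so that strongly correlated sections (of either sign) end up in the same group. The built-in `mtcars` dataset stands in for the emission data, and the choice of three groups is arbitrary.

```r
# Illustrative data only: the built-in mtcars set, not the Copenhagen data
X <- mtcars

# Distance between variables: strongly correlated ones (either sign) are close
d <- as.dist(1 - abs(cor(X)))

# Agglomerative clustering of the variables
hc <- hclust(d, method = "average")

# Cut the tree into a hypothetical number of groups (3 is arbitrary here)
groups <- cutree(hc, k = 3)
split(names(groups), groups)
```

Each group could then be represented by one member or by a summed section before rebuilding the bi-plot, which would reduce the number of overlapping vectors.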
As mentioned earlier, the values in Figure 2 are absolute values. However, we may want to measure the similarity of the years based on proportional values. In other words, what percentage of each year’s CO2 emission is due to, for example, non-road industry? These percentages compose a percentage profile for each year. If we build a bi-plot based on this data, the result will be different. Figure 3 shows such a bi-plot.
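The percentage profile is a row-wise normalization: every value is divided by the total of its own year. A minimal sketch on a hypothetical toy data frame (the numbers are invented for illustration) shows the mechanics and that each profile sums to 1:

```r
# Hypothetical toy data: rows are years, columns are emission sections
toy <- data.frame(
  electricity = c(60, 20),
  heating     = c(30, 20),
  road        = c(10, 10),
  row.names   = c("2005", "2014")
)

# Divide every row by its own total; the vector of row sums is recycled
# down the columns, so row i is divided by total i
shares <- toy / rowSums(toy)

shares
rowSums(shares)  # every year's profile sums to 1
```

In this toy case road emissions are identical in both years, yet their share doubles because the yearly total halves.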
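# Convert each year's emissions to shares of that year's total (rows sum to 1)
data_biplot_prop <- data_biplot/(apply(data_biplot, MARGIN = 1 , FUN = sum))
#data_scaled <- scale(data_biplot)
# Note: the shares go into PCA centered but not standardized, so the largest sections dominate
data_scaled <- data_biplot_prop
data_pca <- prcomp(data_scaled)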
components_load<- summary(data_pca)$importance[2,1:2]
components_load <- round(components_load,2)
coordinates <- data.frame(years = years, data_pca$x[,1:2])
vectors <- data.frame(Variables = colnames(data_biplot),
data_pca$rotation[, 1:2])
ggplot() +
geom_point(data = coordinates, aes(x = PC1 , y = PC2),
size = 4,
color = "blue",
alpha = 0.5) +
geom_text_repel(data = coordinates,
aes(x = PC1 , y = PC2, label = years)) +
geom_segment(data = vectors , aes(x = PC1*0.02 , y = PC2*0.02,
xend=0, yend=0),
color = "red",
alpha =1) +
geom_text_repel(data = vectors ,
aes(PC1*0.02 , PC2*0.02 , label = Variables ),
color = "red",
alpha = 0.5,
size = 2)+
xlab(label = paste0("PC1(",components_load[1]*100,"%)"))+
ylab(label = paste0("PC2(",components_load[2]*100,"%)"))+
theme_linedraw() +
coord_fixed(ratio = 1) +
ggtitle("Proportional CO2 Emission of Copenhagen Since 2005 to 2014")
Here, it can be seen that, based on the contribution percentages of the sections, the behaviour of the city is more erratic. While a high percentage of the CO2 emission was due to Electricity consumption in 2008, and heating consumption then rose in 2010, in 2012 the road traffic vector seems to be the defining factor of CO2 emission.
It is reasonable to ask now: which data should be used, absolute or proportional? My answer, unfortunately, is: it depends. It depends on the goal of the study. What do we want to see? In my view, here we want to see the shift in consumption across the different sections, so the percentage is more appropriate. Moreover, I have used PCA here because it is very well known in statistics; however, in a more serious study I would use Multidimensional Scaling (MDS) or Self-Organizing Maps (SOM) due to their non-linear characteristics.
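For readers who want to try MDS, base R provides `cmdscale()` for classical MDS. One caveat, offered as a hedged note: classical MDS on Euclidean distances is itself linear and reproduces the PCA map up to sign, so the non-linear behaviour would come from a non-metric variant such as `isoMDS()` in the MASS package. The sketch below uses the built-in `USArrests` dataset as a stand-in, not the Copenhagen data.

```r
# Illustrative data only: the built-in USArrests set, not the Copenhagen data
X <- scale(USArrests)
d <- dist(X)  # Euclidean distances between observations

# Classical (metric) MDS: a two-dimensional configuration of the points
mds <- cmdscale(d, k = 2)

# With Euclidean distances, classical MDS matches PCA up to sign
pca <- prcomp(X)
abs(cor(mds[, 1], pca$x[, 1]))

# Non-metric MDS only preserves the rank order of distances (needs MASS)
if (requireNamespace("MASS", quietly = TRUE)) {
  nmds <- MASS::isoMDS(d, k = 2, trace = FALSE)
  head(nmds$points)
}
```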
Finally, the graphs that I would like to see are time series of the absolute and proportional values of the different sections. We need to prepare our data first and then plot the variables over the time span.
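The preparation step is a reshape from wide format (one column per section) to long format (one row per year and section), which is what a grouped line plot expects. As a minimal sketch of the idea in base R, on hypothetical toy values and with column names of my choosing:

```r
# Hypothetical toy data in wide format: one column per section
wide <- data.frame(
  year    = c("2005", "2008"),
  air     = c(1, 2),
  railway = c(3, 4)
)

# stack() puts all section columns into one 'values' column
# and records the source column name in 'ind'
stacked <- stack(wide[, c("air", "railway")])

# Long format: one row per (year, section) pair
long <- data.frame(
  year    = rep(wide$year, times = 2),
  section = stacked$ind,
  CO2     = stacked$values
)
long
```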
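library(dplyr)       # provides %>%
library(tidyr)       # provides gather()
library(colorRamps)  # provides primary.colors()
data_time <- data_biplot
data_time$year <- rownames(data_biplot)
#colnames(data_time)
# Reshape to long format, keeping 'year' as the identifier column
time_series <- data_time %>%
  gather(key = "section", value = "CO2", -year)
time_series$section <- as.factor(time_series$section)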
ggplot(data = time_series) +
geom_line(aes(x = year , y = CO2,
group = section, color = factor(section) ),
alpha = 1) +
theme_linedraw() +
scale_color_manual(values = primary.colors(16) )+
ggtitle("Absolute CO2 Emission of Copenhagen")
It is clearly seen that the top three sections are Electricity consumption, Heating Consumption, and road traffic. The remaining sections have relatively small values, so it is difficult to read them from this graph. We need to zoom in. In other words, we have to follow the visualization mantra:
Visual Data Exploration usually follows a three step process: Overview first, zoom and filter, and then details-on-demand (which has been called the Information Seeking Mantra) (Keim, 2002)
So we may like to consider the bottom part of Figure 4 independently. However, I would rather see the percentage time series. How has the percentage share of each section changed over time since 2005?
# Shares of each year's total (rows sum to 1)
data_biplot_prop <- data_biplot/(apply(data_biplot, MARGIN = 1 , FUN = sum))
data_time <- data_biplot_prop
data_time$year <- rownames(data_biplot)
#colnames(data_time)
# Reshape to long format, keeping 'year' as the identifier column
time_series <- data_time %>%
  gather(key = "section", value = "CO2", -year)
time_series$section <- as.factor(time_series$section)
ggplot(data = time_series) +
geom_line(aes(x = year , y = CO2,
group = section, color = factor(section) ),
alpha = 1) +
theme_linedraw() +
scale_color_manual(values = primary.colors(16) )+
ggtitle("Proportional CO2 Emission of Copenhagen")
It is very interesting to see that, while road traffic is lower in 2014 in Figure 4, its contribution to the total CO2 emission of 2014 has increased. Now let’s zoom in to the bottom of the graph.
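Before zooming in, the road traffic observation can be checked with plain arithmetic on the 2013 and 2014 rows of the table above. This is a sketch: the vector names are mine, and 2013 is chosen simply as the adjacent year for comparison.

```r
# Rows copied from the table above: all 20 sections for 2013 and 2014
y2013 <- c(979544, 475198, 22239, 11939, 0, 14958, 355814, 36635, 13500, 29300,
           0, 47900, 0, 3355, 4696, 14600, 0, 44, 750, 19193)
y2014 <- c(806715, 384200, 17616, 10672, 0, 13947, 349674, 29794, 14400, 40300,
           0, 59100, 0, 3370, 3733, 19500, 0, 25, 750, 16588)

road <- 7  # position of "road traffic" among the 20 sections

# Absolute road-traffic emission falls from 2013 to 2014 ...
y2014[road] - y2013[road]

# ... but its share of the yearly total rises, because the total falls faster
share_2013 <- y2013[road] / sum(y2013)
share_2014 <- y2014[road] / sum(y2014)
round(100 * c(share_2013, share_2014), 1)
```

A rising share therefore does not by itself mean that road traffic emits more; it can equally reflect larger reductions in the other sections.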
data_biplot_prop <- data_biplot/(apply(data_biplot, MARGIN = 1 , FUN = sum))
data_time <- data_biplot_prop
data_time$year <- rownames(data_biplot)
#colnames(data_time)
# Drop the three largest sections (columns 1, 2 and 6: Electricity consumption,
# Heating Consumption, road traffic) so the smaller ones become visible
data_time <- data_time[,-c(1,2,6)]
#colnames(data_time)
time_series <- data_time %>%
  gather(key = "section", value = "CO2", -year)
time_series$section <- as.factor(time_series$section)
ggplot(data = time_series) +
geom_line(aes(x = year , y = CO2,
group = section, color = factor(section) ),
alpha = 1) +
theme_linedraw() +
scale_color_manual(values = primary.colors(16) )+
ggtitle("Proportional CO2 Emission of Copenhagen")
To me, the striking points of Figure 6 are the rapid increase of Air traffic, the decline in Train traffic, and the rapid increase of solvents, whatever it is!
It is possible to go further and use other tools such as parallel coordinates. I postpone using them to another report.
The goal of this report was to take a Danish open dataset and use high-dimensional data visualization to augment the simple conventional plots. Through the new visualizations, I tried to consider all variables simultaneously and to take a holistic approach to the problem.
In line with the study’s goals, PCA bi-plots and multiple time series were used for two variations of the dataset. Through the bi-plots, it became possible to investigate the dynamic behaviour of Copenhagen in CO2 emission, something that would have remained concealed in the conventional bi-variate plots. The time series also showed the change in the proportional and absolute values of the different sections in the emission of CO2.
I believe that a systemic approach to data visualization is the way to exploit its potential, a potential that would remain mostly untouched through reductionist approaches. Reductionist approaches only consider one or two variables in a visualization; however, we are all aware of the meaning of emergent properties.
Shahin Ashkiani Contact@shahin-ashkiani.com Oct-2017