This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

1.Introduction

This document is an effort to gain insight into Copenhagen’s CO2 emission data.

The data is open data published by Open Data of Denmark as an ODS spreadsheet. The data can be found here.

The visualizations available in the spreadsheet file are very limited and thus do not offer much insight into the data. Specifically, all of the suggested visualizations are either uni- or bi-dimensional, which means each one considers only a very small portion of the dataset. A systemic approach to the dataset suggests looking at the big picture: keeping the components (here, the variables) and their relations together in order to find possible emergent properties. I know of no tool for doing so other than high-dimensional data visualization.

Using high-dimensional (high-D) data visualization is not a rejection of simple graphs; it is a complementary approach. Using high-D visualization techniques, we may or may not see anything beyond what we have already seen, but looking at the big picture and not seeing anything special is better than ignoring the holistic view. Nevertheless, one may be surprised by the results of the high-D visualizations in this study, as I was at first encounter!

2.Original Visualization

In the spreadsheet file, there are some graphs, including segmented bar charts, a time-series line chart of the total measure, and some multiple-bar plots. While these graphs convey some information and provide some insight, this study tries to gain insight into the data using other visualization methods, such as the PCA bi-plot.

Figure 1: original plots

These graphs are definitely useful: they summarize the numbers into nice, easily digestible graphs. However, each graph is limited to one or two variables. We know that emergent properties appear only in the presence of all components and their relations. Is it possible to keep all the variables? The next section is dedicated to answering this question.

3.Suggested visualizations

If we consider every emission section as a variable, and each year since 2005 as an observation defined by these variables, then we can investigate two important things.

First, the correlation between variables. In other words, how are the CO2 emissions of different sections related to each other? Have Electricity consumption and Heating Consumption been related over the time period? What about road traffic and Air traffic?
Second, how does the behaviour of Copenhagen change over the time period, based on its CO2 emission behaviour? This question is more important. Through a dimension-reduction method we can plot each year as a point whose character is ideally defined by all variables simultaneously.

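# Packages are assumed from the functions called in this report
library(readr); library(knitr); library(corrplot)        # read_csv, kable, corrplot
library(tidyr); library(dplyr)                           # gather, %>%
library(ggplot2); library(ggrepel); library(colorRamps)  # plots, geom_text_repel, primary.colors

data <- read_csv(file = "/Users/Shaahin/Downloads/co2.csv", 
                 col_types = "cnnnnnnnn")

# Transpose so that years are rows and emission sections are columns
data_t <- t(as.matrix(data[,-1]))
data_t <- data.frame(data_t)
section_names <- data.frame(data)[,1]
colnames(data_t) <- section_names

kable(data_t)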

	Electricity consumption	Heating Consumption	Individual heating, Trade and Service and households	Individual heating and process heat, Industry	Built for cooking	road traffic	Train traffic (including electric trains)	Air traffic	Ship traffic	Non-road industry	Non-road garden / household	Process emissions, industry	solvents	land use	Landfill	Sewage
2005	1314013	611315	31232	190	19188	396529	48783	11918	43670	30797	4153	3743	3082	387	1377	15360
2008	1257798	569458	29811	181	17923	406741	57648	10039	42678	36723	3244	3095	8193	260	710	15982
2009	1323397	622666	37294	165	16202	387279	62762	15258	42146	50856	3223	2447	6233	220	550	16502
2010	1281291	611830	26602	2682	15718	378217	44197	16141	44640	62880	3320	205	8421	135	700	16725
2011	1049348	477725	22179	3132	16441	374672	36438	15353	58360	49024	3345	5063	6102	292	830	15355
2012	870498	486358	19894	2440	20600	359801	31157	14579	35148	73839	5551	3969	15078	37	800	19137
2013	979544	475198	22239	11939	14958	355814	36635	13500	29300	47900	3355	4696	14600	44	750	19193
2014	806715	384200	17616	10672	13947	349674	29794	14400	40300	59100	3370	3733	19500	25	750	16588

years <- rownames(data_t)
data_biplot <- data_t[,!colnames(data_t) %in%
               c("fishing",
                 "Individual heating, agriculture and horticulture",
                 "Non-road agriculture and forestry",
                 "Agriculture and forestry")]

Now the data is ready for our purpose. In a dataset like this, it is not always obvious which dimension should be treated as the variables. Here, I consider the years as observations and the different consumption sectors as variables. The reason lies in the purpose of the visualization: I want to explore the relation between years, considering all consumption sectors simultaneously.

As a first step, I remove the variables, i.e. emission sections, whose values are all zero. These are four variables: fishing; Individual heating, agriculture and horticulture; Non-road agriculture and forestry; and Agriculture and forestry.

Now it is time to test the correlations among the variables. To do so, I use the corrplot package.
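The four names are typed by hand in the code above. As a minimal sketch of the same step, assuming a data frame whose columns are all numeric, the all-zero sections could also be found programmatically. The toy data frame `toy` and its values are hypothetical, not the Copenhagen data.

```r
# Toy stand-in for data_t: rows are years, columns are sections (hypothetical values)
toy <- data.frame(
  road    = c(10, 12, 11),
  fishing = c(0, 0, 0),
  sewage  = c(3, 0, 4),
  row.names = c("2005", "2008", "2009")
)

# A column is all-zero when none of its entries differs from zero
all_zero <- colSums(toy != 0) == 0

# Keep only the sections with at least one non-zero value
toy_kept <- toy[, !all_zero, drop = FALSE]

names(toy)[all_zero]   # sections that would be dropped
names(toy_kept)        # sections that remain
```

A name-based filter, as used in this report, has the advantage of documenting exactly which sections were removed; the programmatic version avoids typos in long section names.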

#chart.Correlation(as.matrix(data_biplot), histogram=TRUE, pch = 19)

corrplot(cor(as.matrix(data_biplot)), order = "original", tl.col = "black", tl.srt = 15 , type = "lower",tl.cex = 0.4)

The findings are interesting! For instance, Sewage has a very high negative correlation with Ship traffic! And land use is strongly and positively correlated with road traffic. This one, at least, is understandable: when land use increases, road traffic can increase as well. We need more domain knowledge to go deeper into possible interpretations of these correlations.

Moreover, we need to look at the scatterplots in order to get further insight into these findings. The correlation coefficient, while very popular and common, can be very deceptive, and we should not trust it without further investigation of the possible relation.

Now let’s answer the second question, the ambitious one! Is it possible to show the behaviour of Copenhagen in CO2 emission over time? Is there in fact any change to be shown? By behaviour, I mean the profile of CO2 emission across the different categories. Has this profile changed over time, based on the dataset? Let’s make a bi-plot of the data.
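As a minimal sketch of such a check, the Sewage and Ship traffic columns from the table above can be plotted against each other, and the correlation can be reported together with a confidence interval. With only eight yearly observations, that interval can be wide. The object names `sewage`, `ship` and `ct` are mine; the values are copied from the table.

```r
# Values copied from the table above (years 2005, 2008-2014)
years  <- c("2005", "2008", "2009", "2010", "2011", "2012", "2013", "2014")
sewage <- c(15360, 15982, 16502, 16725, 15355, 19137, 19193, 16588)
ship   <- c(43670, 42678, 42146, 44640, 58360, 35148, 29300, 40300)

# Scatterplot with year labels, to see which years drive the correlation
plot(ship, sewage, pch = 19, xlab = "Ship traffic", ylab = "Sewage")
text(ship, sewage, labels = years, pos = 3, cex = 0.7)

# Pearson correlation with a 95% confidence interval (n = 8)
ct <- cor.test(ship, sewage)
ct$estimate
ct$conf.int
```

If one or two labelled years sit far from the rest in the scatterplot, the coefficient is mostly describing those years rather than a general relation.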

#summary(data_t)
#apply(data_biplot,2 , sum)
#data_biplot_prop <- data_biplot[,-17]/(apply(data_biplot[,-17], MARGIN = 1 , FUN = sum))
#apply(data_biplot_prop, MARGIN = 1 , FUN = sum)



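# Standardize each section (mean 0, sd 1) before PCA
data_scaled <- scale(data_biplot)
data_pca <- prcomp(data_scaled)

# Proportion of variance explained by PC1 and PC2 (used in the axis labels)
components_load<- summary(data_pca)$importance[2,1:2]
components_load <- round(components_load,2)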


coordinates <- data.frame(years = years, data_pca$x[,1:2]) 
vectors <- data.frame(Variables = colnames(data_biplot),
                      data_pca$rotation[, 1:2])

ggplot() + 
        geom_point(data = coordinates, aes(x = PC1 , y = PC2),
                   size = 4,
                   color = "blue", 
                   alpha = 0.5) + 
         geom_text_repel(data = coordinates,
                        aes(x = PC1 , y = PC2, label = years)) +
        geom_segment(data = vectors , aes(x = PC1*4 , y = PC2*4, xend=0, yend=0), 
                     color = "red",
                     alpha =1) +
        geom_text_repel(data = vectors ,
                        aes(PC1*4 , PC2*4 , label = Variables ),
                        color = "red", 
                        alpha = 0.5, 
                        size = 2)+
        xlab(label = paste0("PC1(",components_load[1]*100,"%)"))+
        ylab(label = paste0("PC2(",components_load[2]*100,"%)"))+
        theme_linedraw() + 
        coord_fixed(ratio = 1) + 
        ggtitle("Absolute CO2 Emission of Copenhagen from 2005 to 2014")

Figure 2 is a bi-plot. It shows the years as solid circles and the variables as vectors. Each vector shows the direction in which the value of that variable increases. For instance, if we project the points, i.e. years, onto the Air traffic vector, then 2010 would have the highest amount, and 2012 and 2005 the lowest. As another example, based on the land use vector and the projection of the points onto it, 2005 has the highest amount of land use, followed by 2008, while 2013 and 2014 have the lowest amounts. Generally, the closeness of two years is a sign of the similarity of their CO2 emissions in the corresponding sections. The important point here is that I have used the absolute values of the CO2 emission, not the proportional values. We come back to this point soon.

Is this map 100% precise? No. Bi-plots are based on Principal Component Analysis (PCA), and dimension reduction costs precision. Here, the original dimension of each year equals the number of variables that define the years, i.e. 16. To show a point from 16-dimensional space in a two-dimensional space, I have used PCA, and the cost is 33% of the precision of the map, i.e. the share of variance that the first two components do not capture. There are lots of discussions about PCA, and I am not a big fan of it, but for now it serves my purpose well. There are some non-linear alternatives, such as MDS and SOM, that I use in parallel in my studies. Here, to keep the report concise, PCA should suffice.

Besides the technical aspects of bi-plots, the interpretation of Figure 2 is very interesting. Clearly, the behaviour of Copenhagen has changed over the years: from 2005 at the top-left of the map, to 2010 at the center-bottom, to 2011 at the center, to 2012 at the top-right, and finally to 2013 and 2014 at the right end. What does this change mean? It can be interpreted based on the vectors, i.e. variables. For instance, in 2005 the emphasis of the city was on Built for cooking; Landfill; Process emissions, industry; and Non-road garden / household. In 2010, the city is on the opposite side with respect to these variables! All the mentioned variables were lower in 2010, and variables such as Air traffic and Non-road industry had increased. Other variables should be taken into account as well. One good next step would be to cluster the variables, so that we deal with fewer of them.

As said earlier, the values in Figure 2 are absolute values. However, we may want to measure the similarity of the years based on proportional values. In other words, what percentage of each year’s CO2 emission is due to, for example, Non-road industry? These percentages compose a percentage profile for each year. If we build a bi-plot from this data, the result is different. Figure 3 shows such a bi-plot.
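The projection reading described above can also be sketched numerically. The example below uses only four of the sixteen sections (values copied from the table) to stay short, so its numbers will not match Figure 2; the short column names and the object names are mine. It projects each year's (PC1, PC2) point onto the direction of one variable's vector and sorts the years by that projection.

```r
# Four sections copied from the table above; column names are shortened
emis <- data.frame(
  electricity = c(1314013, 1257798, 1323397, 1281291, 1049348, 870498, 979544, 806715),
  heating     = c(611315, 569458, 622666, 611830, 477725, 486358, 475198, 384200),
  road        = c(396529, 406741, 387279, 378217, 374672, 359801, 355814, 349674),
  air         = c(11918, 10039, 15258, 16141, 15353, 14579, 13500, 14400),
  row.names   = c("2005", "2008", "2009", "2010", "2011", "2012", "2013", "2014")
)

# PCA on standardized variables, as in the bi-plot code
pca <- prcomp(emis, scale. = TRUE)

scores  <- pca$x[, 1:2]               # year coordinates in the bi-plot plane
loading <- pca$rotation["air", 1:2]   # direction of the Air traffic vector

# Scalar projection of each year onto the unit vector along the loading
proj <- as.vector(scores %*% loading) / sqrt(sum(loading^2))
names(proj) <- rownames(emis)

sort(proj, decreasing = TRUE)
```

Because two components do not capture all of the variance, the ordering obtained from the projection can differ from the ordering of the raw values of that variable.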
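The 33% figure can be read as the share of total variance left out by the first two components. As a sketch, assuming that \lambda_j denotes the variance of the j-th principal component (the square of `data_pca$sdev[j]` in the code above) and p the number of components:

```latex
\text{retained}_{2} = \frac{\lambda_1 + \lambda_2}{\sum_{j=1}^{p} \lambda_j},
\qquad
\text{lost}_{2} = 1 - \text{retained}_{2}
```

With a lost share of about 0.33, the two plotted axes together would account for about 0.67 of the variance, which should correspond to the sum of the percentages printed on the PC1 and PC2 axis labels.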

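# Divide each row (year) by its total so that every year's shares sum to 1
data_biplot_prop <- data_biplot/(apply(data_biplot, MARGIN = 1 , FUN = sum))

# The proportions are not standardized here; prcomp() only centers them
#data_scaled <- scale(data_biplot)
data_scaled <- data_biplot_prop
data_pca <- prcomp(data_scaled)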

components_load<- summary(data_pca)$importance[2,1:2]
components_load <- round(components_load,2)


coordinates <- data.frame(years = years, data_pca$x[,1:2]) 
vectors <- data.frame(Variables = colnames(data_biplot),
                      data_pca$rotation[, 1:2])

ggplot() + 
        geom_point(data = coordinates, aes(x = PC1 , y = PC2),
                   size = 4,
                   color = "blue", 
                   alpha = 0.5) + 
         geom_text_repel(data = coordinates,
                        aes(x = PC1 , y = PC2, label = years)) +
        geom_segment(data = vectors , aes(x = PC1*0.02 , y = PC2*0.02,
                                          xend=0, yend=0), 
                     color = "red",
                     alpha =1) +
        geom_text_repel(data = vectors ,
                        aes(PC1*0.02 , PC2*0.02 , label = Variables ),
                        color = "red", 
                        alpha = 0.5, 
                        size = 2)+
        
        xlab(label = paste0("PC1(",components_load[1]*100,"%)"))+
        ylab(label = paste0("PC2(",components_load[2]*100,"%)"))+
        theme_linedraw() + 
        coord_fixed(ratio = 1) + 
        ggtitle("Proportional CO2 Emission of Copenhagen from 2005 to 2014")

Here, it is seen that, based on the contribution percentages of the sections, the behaviour of the city is more erratic. While a high percentage of the CO2 emission was due to Electricity consumption in 2008, and Heating Consumption then rose in 2010, in 2012 the road traffic vector seems to be the defining factor of CO2 emission.

It is reasonable to ask now which data should be used: absolute or proportional? My answer, unfortunately, is: it depends. It depends on the goal of the study. What do we want to see? In my view, here we want to see the shift in consumption across the different sections, so the percentages are more appropriate. Moreover, I have used PCA here because it is very well known in statistics; in a more serious study, however, I would use Multidimensional Scaling (MDS) or Self-Organizing Maps (SOM) because of their non-linear characteristics.

Finally, the graphs that I would like to see are time series of the absolute and proportional values of the different sections. We need to prepare our data first and then plot the variables over the time span.
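As a rough sketch of what the MDS alternative could look like, the block below runs non-metric MDS with `isoMDS` from the MASS package on four of the sixteen sections (values copied from the table, with shortened column names of my choosing). The choice of the non-metric variant, of Euclidean distances on standardized values, and of this subset are all assumptions made for illustration; this is not the analysis behind the figures in this report.

```r
library(MASS)  # isoMDS

# Four sections copied from the table above; column names are shortened
emis <- data.frame(
  electricity = c(1314013, 1257798, 1323397, 1281291, 1049348, 870498, 979544, 806715),
  heating     = c(611315, 569458, 622666, 611830, 477725, 486358, 475198, 384200),
  road        = c(396529, 406741, 387279, 378217, 374672, 359801, 355814, 349674),
  air         = c(11918, 10039, 15258, 16141, 15353, 14579, 13500, 14400),
  row.names   = c("2005", "2008", "2009", "2010", "2011", "2012", "2013", "2014")
)

# Standardize the sections, then compute Euclidean distances between years
d <- dist(scale(emis))

# Non-metric MDS into two dimensions; trace = FALSE silences iteration output
mds <- isoMDS(d, k = 2, trace = FALSE)

mds$points   # 2-D coordinates, one row per year
mds$stress   # final stress in percent; lower means a better fit to the distances
```

The rows of `mds$points` can be plotted in the same way as the PCA coordinates, with the stress value playing a role similar to the lost variance in the bi-plot.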

data_time <- data_biplot
data_time$year <- rownames(data_biplot)
#colnames(data_time)
# Column 17 is year; gather the 16 section columns into long format
time_series <- data_time %>%
        gather(colnames(data_time[,-17]),key = "section", value = "CO2")
time_series$section <- as.factor(time_series$section)


ggplot(data = time_series) +
        geom_line(aes(x = year , y = CO2, 
                      group = section, color = factor(section) ), 
                  alpha = 1) + 
        theme_linedraw()  +
        scale_color_manual(values = primary.colors(16) )+
        ggtitle("Absolute CO2 Emission of Copenhagen")

It is clearly seen that the top three sections are Electricity consumption, Heating Consumption, and road traffic. The remaining sections have relatively small values, so it is difficult to read them from this graph. We need to zoom in. In other words, we have to follow the visualization mantra:

Visual Data Exploration usually follows a three step process: Overview first, zoom and filter, and then details-on-demand (which has been called the Information Seeking Mantra) (Keim, 2002)

So we might consider the bottom part of Figure 4 separately. However, I would rather see the percentage time series. How has the percentage share of each section changed over time since 2005?

data_biplot_prop <- data_biplot/(apply(data_biplot, MARGIN = 1 , FUN = sum))

data_time <- data_biplot_prop
data_time$year <- rownames(data_biplot)
#colnames(data_time)
time_series <- data_time %>%
        gather(colnames(data_time[,-17]),key = "section", value = "CO2")
time_series$section <- as.factor(time_series$section)


ggplot(data = time_series) +
        geom_line(aes(x = year , y = CO2, 
                      group = section, color = factor(section) ), 
                  alpha = 1) + 
        theme_linedraw()  +
        scale_color_manual(values = primary.colors(16) )+
        ggtitle("Proportional CO2 Emission of Copenhagen")

It is very interesting to see that, while road traffic declined in 2014 in Figure 4, its contribution to the total CO2 emission of 2014 increased. Now let’s zoom in to the bottom of the graph.
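This can be checked directly from the 2005 and 2014 rows of the table above. The sketch below copies those two rows (in the table's column order, where road traffic is the sixth column) and compares the absolute value with the share of the yearly total over the sixteen sections shown in the table. The object names are mine.

```r
# Rows copied from the table above, in the table's column order
y2005 <- c(1314013, 611315, 31232, 190, 19188, 396529, 48783, 11918,
           43670, 30797, 4153, 3743, 3082, 387, 1377, 15360)
y2014 <- c(806715, 384200, 17616, 10672, 13947, 349674, 29794, 14400,
           40300, 59100, 3370, 3733, 19500, 25, 750, 16588)

road <- 6  # position of "road traffic" in the table

# Absolute road-traffic emission in each year
abs_road <- c("2005" = y2005[road], "2014" = y2014[road])

# Share of each year's total that comes from road traffic
share_road <- c("2005" = y2005[road] / sum(y2005),
                "2014" = y2014[road] / sum(y2014))

abs_road
round(100 * share_road, 1)
```

Road traffic falls in absolute terms while its share rises, from about 16% to about 20%, because the yearly total falls faster than road traffic does.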

data_biplot_prop <- data_biplot/(apply(data_biplot, MARGIN = 1 , FUN = sum))

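data_time <- data_biplot_prop
data_time$year <- rownames(data_biplot)
#colnames(data_time)
# Drop the three dominant sections (columns 1, 2 and 6: Electricity
# consumption, Heating Consumption, road traffic) to zoom in on the rest
data_time <- data_time[,-c(1,2,6)]
#colnames(data_time)
# Column 14 is now year; gather the remaining 13 section columns
time_series <- data_time %>%
        gather(colnames(data_time[,-14]),key = "section", value = "CO2")
time_series$section <- as.factor(time_series$section)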


ggplot(data = time_series) +
        geom_line(aes(x = year , y = CO2, 
                      group = section, color = factor(section) ), 
                  alpha = 1) + 
        theme_linedraw()  +
        scale_color_manual(values = primary.colors(16) )+
        ggtitle("Proportional CO2 Emission of Copenhagen")

To me, the striking points of Figure 6 are the rapid increase of Air traffic, the decline in Train traffic, and the rapid increase of solvents, whatever they are!

It is possible to go further and use other tools such as parallel coordinates. I postpone using them to another report.

4.Conclusion

The goal of this report was to take a Danish open dataset and use high-dimensional data visualization to augment the simple conventional plots. Through the new visualizations, I tried to consider all variables simultaneously and to take a holistic approach to the problem.

In line with the study’s goals, PCA bi-plots and multiple time series were used for two variations of the dataset. Through the bi-plots, it became possible to investigate the dynamic behaviour of Copenhagen in CO2 emission, something that would have remained concealed in the conventional bi-variate plots. The time series also showed the change in the proportional and absolute values of the different sections in CO2 emission.

I believe that a systemic approach to data visualization is the way to exploit its potential, a potential that would remain mostly untouched through reductionist approaches. Reductionist approaches consider only one or two variables in a visualization; however, we are all aware of the meaning of emergent properties.

Shahin Ashkiani Contact@shahin-ashkiani.com Oct-2017

Visual Exploration of Copenhagen CO2 Emission

Shahin Ashkiani: Contact@Shahin-Ashkiani.com

10/23/2017
