General Problems

The most problematic aspect of current publishing practices is the lack of versioning and archiving. In some cases, this makes scientific or high-quality policy evaluation work impossible. For example, when Eurostat finished the transition from NACE Rev2 to NACE, it simply deleted from the warehouse some of its earlier products, which may have been referenced in many policy evaluations and scientific texts. Removing statistics that are not even available in a non-machine-readable format, such as books, is the worst possible practice. This is such a serious breach of basic statistical standards that it should be avoided, and if current regulation allows it, the regulation should be amended.

Partial data removal

One problem that plagues regional statistics is partial removal, which, in my understanding, is due to a lack of versioning and incomplete metadata, or a lack of consistency in the data table. This happens when the statistics refer to a statistical level, for example NUTS2, whose definition in fact changes on average every 3 years. For example, French, Polish and Lithuanian national accounts data prior to 2015 is missing, because these countries changed boundaries in NUTS2016. This is not simply a conversion problem.

  • One solution could be to publish the archived NUTS2013 data in a separate table from the NUTS2016 data

  • Another solution would be the addition of a column that clarifies for each row whether it follows the NUTS2016, NUTS2013, NUTS2010 or NUTS2006 definitions.

  • The deletion of NUTS2013 data from a defined statistical product is the worst possible solution. Even keeping archived versions would help.

Inconsistent labelling and metadata

Sometimes the labelling is inconsistent, and the actual data table does not match its description. For example, some data products labelled as NUTS3 data also contain NUTS2, NUTS1 and NUTS0 level data, while others do not. Of course, anybody can aggregate the data upwards, but this is still not a good practice.

However, when only one level is present, for example the NUTS2 level, the smaller countries are treated inconsistently. In these countries, NUTS0 = NUTS1 = NUTS2 because of the country size. Sometimes the data is correctly presented for Cyprus, Estonia, Luxembourg and Malta, and sometimes not. I will show an example where the data clearly exists at the NUTS0 = NUTS2 level for Estonia, but it can only be found in a different product.

A separate issue is mixed-level data, where the underlying microdata does not permit a NUTS2 level breakdown for larger countries, for example, for Germany. I think that in this case Eurostat generally presents the data correctly, although adding a NUTS_LEVEL variable would avoid confusion among less experienced users. The issue gets tricky when the NUTS1 level boundaries change, as in the case of France.

Boundary changes

Without proper metadata, i.e. additional data columns such as NUTS_LEVEL or NUTS_DEF = 2016, 2013, the boundary changes cannot be resolved by the user, especially because we have no idea when the data is backfilled with the new boundaries and when it is not. The ambiguity is created by the fact that sometimes the NUTS labels change, but the boundaries do not. For example, when France redefined its regions, it gave a new ISO code to all of them, but some of these regions are the same as in the previous definition. In this case, the NUTS2013 data can be perfectly presented in the NUTS2016 framework. Again, there are two choices of data presentation: Eurostat may backfill data following the NUTS2016 boundaries, or Eurostat may simply present the NUTS2013 data as history. (Or, in the worst case scenario, delete earlier published data.)

The NUTS Converter is an open, web-based tool allowing the conversion of European regional statistical data between different versions of the Nomenclature of Territorial Units for Statistics (NUTS) classification.

It has serious limitations for the problems that I am going to show. The NUTS Converter cannot create better data for you when the original data is erroneous. In most of the cases I will show, the original Eurostat product has some problem that cannot be solved with the converter.

Another limitation is that the NUTS Converter has no documentation and its source code is not available. So it is a black box that you can hardly use in scientific research or policy evaluation. I would like to create similar conversion tools and release them on rOpenGov, but it would require sorting out the missing data / metadata issues with the Eurostat products first.

The Correspondence Table that shows boundary changes does not contain Slovenia and Greece. The table can be relatively easily modelled in most cases, except for Lithuania, where the changes were very unfortunate. The country has only one NUTS1 level region, and the changes were between the NUTS2 and NUTS3 levels. Eurostat has very little NUTS3 level data, so adjustments by the analyst are not possible. It is likely that the Lithuanian statistical authority could help here if it wanted, because the change is relatively easy (Vilnius was removed from its region). Without backfilling a lot of data about Lithuania’s new territorial units, Lithuania will not be present in regional analysis for a long time to come.

Acknowledgements

I do not show how the data was downloaded, processed and put on a map or data animation. It was always done with the eurostat R package, which creates an excellent framework to retrieve and analyze European statistical data in a reproducible manner. See more details here: Leo Lahti, Janne Huovari, Markus Kainu and Przemysław Biecek: Retrieval and Analysis of Eurostat Open Data with the eurostat Package. The package is not in any way associated with Eurostat, but it provides a very consistent approach to using Eurostat’s statistical products.

All the examples below use the 2016 EuroGeographics boundaries from the Eurostat website.

All examples relate to the year 2013.

All the rest of the work, and any possible mistakes, are mine.

Land Area

Area by NUTS 3 region is currently a very problematic Eurostat product. Despite its name, it is not a table of NUTS3 total and land area only: it contains NUTS3, NUTS2, NUTS1, and NUTS0 level data.
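Because a NUTS code is a two-letter country code followed by one character per level, the level of each row can be derived from the length of the code. The following is a minimal sketch in base R; the `nuts_level()` helper and the toy values are my own illustration and not part of the eurostat package. Aggregates such as EU28 are not NUTS codes and would have to be dropped before using it.

```r
# Toy rows in the shape of a Eurostat table: a `geo` code and a value.
# The values are invented for illustration only.
area <- data.frame(
  geo    = c("FR", "FR1", "FR10", "FR101", "EE", "EE001"),
  values = c(6, 5, 4, 3, 2, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: NUTS level = code length minus the two-letter
# country prefix. Anything outside NUTS0..NUTS3 is set to NA.
nuts_level <- function(geo) {
  level <- nchar(geo) - 2L
  level[level < 0L | level > 3L] <- NA_integer_
  level
}

area$nuts_level <- nuts_level(area$geo)

# Keep only the rows that match the name of the product (NUTS3).
area_nuts3 <- area[which(area$nuts_level == 3L), ]
```

A column like this is what an explicit NUTS_LEVEL variable in the published table would make unnecessary to derive.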

Cyprus, Estonia, Luxembourg and Malta are missing on NUTS1, NUTS2 & NUTS0 levels - this is an inconsistency, because in these countries NUTS0 = NUTS1 = NUTS2. You can however impute the data by aggregating all NUTS3 data from these countries.
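A minimal sketch of this imputation in base R follows. The NUTS3 values are invented, and summing is only appropriate for additive quantities such as area; it would not be correct for rates or averages.

```r
# Invented NUTS3 land areas (km2) for two small countries; illustration only.
nuts3 <- data.frame(
  geo    = c("MT001", "MT002", "LU000"),
  values = c(250, 70, 2600),
  stringsAsFactors = FALSE
)

small_countries <- c("CY", "EE", "LU", "MT")
nuts3$country <- substr(nuts3$geo, 1, 2)
nuts3 <- nuts3[nuts3$country %in% small_countries, ]

# Sum the NUTS3 rows per country.
totals <- aggregate(values ~ country, data = nuts3, FUN = sum)

# In these countries NUTS0 = NUTS1 = NUTS2, so the same total is repeated
# under the three higher-level codes (for example MT, MT0, MT00).
imputed <- rbind(
  data.frame(geo = totals$country,               values = totals$values),
  data.frame(geo = paste0(totals$country, "0"),  values = totals$values),
  data.frame(geo = paste0(totals$country, "00"), values = totals$values)
)
```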

A more serious error is the lack of data for regions that changed boundaries or codes between NUTS2013 and NUTS2016.

The following regions can be corrected: FRB, FRC, FRD, FRE, FRF, FRG, FRH, FRI, FRJ, FRL, FRM, PL7, PL9, FRC1, FRC2, FRD1, FRD2, FRE1, FRE2, FRF1, FRF2, FRF3, FRG0, FRH0, FRI1, FRI2, FRI3, FRJ1, FRJ2, FRK1, FRK2, FRL0, FRM0, FRY1, FRY2, FRY3, FRY4, FRY5, PL71, PL72, PL81, PL82, PL84

## Warning in correct_nuts_labelling(.): Some of the data has obsolete NUTS
## 2013 codes.

Population

The Population on 1 January by NUTS 2 region after the 13/09/19 update contains data for all NUTS2 units, and it is ready to use without further imputation.

Disposable income

In this case, Eurostat publishes data only in NUTS2016 format. This means that French and Polish data is not available prior to 2015 because of boundary changes, and Estonian data is not available because Estonia has no NUTS2 regions due to its small size.

Disposable income of private households by NUTS 2 regions - tgs00026

The treatment of small countries is not consistent. Luxembourg and Cyprus are correctly treated as NUTS2 regions (LU00, CY00), but Estonia and Malta are not. However, the national level data is available for Estonia (but not for Malta), as you can see in tec00113.

The problems and the treatments, except for Estonia, are the same as with the next product.

GDP

Gross domestic product (GDP) at current market prices by NUTS 3 regions - nama_10r_3gdp suffers from similar problems as Disposable income of private households by NUTS 2 regions - tgs00026. Again, contrary to what the name suggests, this is not a NUTS3 data table: it contains data on all NUTS0, NUTS1, NUTS2 and NUTS3 levels. In this case, even Estonia is there.

What should Eurostat do?

It should decide if this is a NUTS2 product, or a NUTS2016-2 product.

  • If it is a NUTS2016-2 product, it should keep the NUTS2013-2, NUTS2010-2 and NUTS2006-2 products somewhere in a public archive.

  • If it is a NUTS2 product, as the name suggests, it should add separate data columns, i.e. geo16, geo13, geo10, geo06, for each table row.

Both approaches would make conversion possible. Furthermore, in cases when boundaries did not change, far more data would be available.

Removing earlier data and replacing it with different data under the same product code and web page is the worst possible practice. It makes any reproduction impossible in scientific and policy evaluation applications.

What You Can Do?

Nothing really - for France, neither NUTS2016 nor NUTS2013 data is available. It is not that the data does not exist: you can find it, for example, in the OECD Regional database. I am pretty sure that it was available on Eurostat earlier, too, but got deleted.

Tertiary educational attainment

Tertiary educational attainment, age group 25-64 by sex and NUTS 2 regions - tgs00109. It is important to know that data until 2013 are classified according to ISCED 1997 and data from 2014 onwards according to ISCED 2011 (coding of educational attainment), so the data table is not fully usable as a panel. (The Metadata description contains out-of-date URLs for the documents that explain the differences between the ISCED codings.)

This is, by the way, an excellent quality dataset.

What Eurostat should do?

Correct the Metadata file.

Human resources in science and technology (HRST)

Human resources in science and technology (HRST) by NUTS 2 regions - tgs00038 is a complete indicator as of October 2019.

Intramural R&D expenditure (GERD)

Intramural R&D expenditure (GERD) by NUTS 2 regions - tgs00042. The small countries are correctly labelled. However, countries affected by the change from NUTS2013 to NUTS2016 are missing for the year 2013.

## Warning in classInt::classIntervals(x, n = n, style = style): var has
## missing values, omitted in finding classes
## Warning in classIntervals(x, n = n, style = style): var has missing values,
## omitted in finding classes

What Should Eurostat Do?

Again, Eurostat should either publish separate NUTS2013 and NUTS2016 tables, or create geo16 and geo13 regional labels. Furthermore, it should add Slovenia and Greece to the currently published correspondence tables.

In this case, much of the missing data is present, just wrongly labelled.

What You Can Do?

My correct_nuts_labelling() function, which can be found at the end of this article, corrects the errors on an as-is basis. As soon as there is a bit more clarity, I will release the function in a more proper way, possibly as a tiny separate package, or as part of eurostat.

## Warning in correct_nuts_labelling(.): Some of the data has obsolete NUTS
## 2013 codes.
## Warning in classInt::classIntervals(x, n = n, style = style): var has
## missing values, omitted in finding classes
## Warning in classIntervals(x, n = n, style = style): var has missing values,
## omitted in finding classes

Researchers

Researchers, all sectors by NUTS 2 regions - tgs00043

## Warning in classInt::classIntervals(x, n = n, style = style): var has
## missing values, omitted in finding classes
## Warning in classIntervals(x, n = n, style = style): var has missing values,
## omitted in finding classes

This product requires very similar treatment to the previous one.

## Warning in correct_nuts_labelling(.): Some of the data has obsolete NUTS
## 2013 codes.
## Warning in classInt::classIntervals(x, n = n, style = style): var has
## missing values, omitted in finding classes
## Warning in classIntervals(x, n = n, style = style): var has missing values,
## omitted in finding classes

Internet Use & Internet Activities

Individuals who used the internet, frequency of use and activities - isoc_r_iuse_i is a mixed-level regional statistic. It contains mainly NUTS2 but sometimes NUTS1 level statistics.

What You Can Do

This is a more complex problem. Firstly, you need to decide what to do with areas that do not have NUTS2 level data because the microdata does not permit such a resolution.

One partial solution that you can use (after testing I will release this code on rOpenGov) is simply imputing the NUTS1 values to the NUTS2 regions, since the actual NUTS1 data is in fact a weighted average of the NUTS2 data. Of course, the weights are unknown, but this is a reasonably good simple imputation algorithm. A more difficult solution would be modelling the variance of the data with common underlying variables, such as population density, GDP density, etc., and estimating the values of the NUTS2 data.
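The simple imputation can be sketched in a few lines of base R. This is only an illustration under my own assumptions, not the code that will be released: the `impute_nuts1_sketch()` function, the list of target regions and all values are hypothetical. It relies on the fact that the parent NUTS1 code is the first three characters of a NUTS2 code, and it records which rows were imputed so that they can be flagged on a map or excluded later.

```r
# Invented example: one country reported at NUTS1 only, another at NUTS2.
iuse <- data.frame(
  geo    = c("DE1", "DE2", "AT11", "AT12"),
  values = c(80, 85, 78, 82),
  stringsAsFactors = FALSE
)

# NUTS2 regions expected by the map (a small subset, for illustration).
target_nuts2 <- c("DE11", "DE12", "DE21", "AT11", "AT12")

# Hypothetical helper: copy the parent NUTS1 value to NUTS2 regions
# that have no observation of their own.
impute_nuts1_sketch <- function(dat, target_nuts2) {
  out <- data.frame(geo = target_nuts2, stringsAsFactors = FALSE)
  # Actual NUTS2 observations, where they exist.
  out$values <- dat$values[match(out$geo, dat$geo)]
  out$method <- ifelse(is.na(out$values), NA_character_, "actual")
  # The parent NUTS1 code is the first three characters of the NUTS2 code.
  parent <- substr(out$geo, 1, 3)
  parent_value <- dat$values[match(parent, dat$geo)]
  fill <- is.na(out$values) & !is.na(parent_value)
  out$values[fill] <- parent_value[fill]
  out$method[fill] <- "imputed from NUTS1"
  out
}

iuse_nuts2 <- impute_nuts1_sketch(iuse, target_nuts2)
```

Keeping the `method` column is a deliberate choice: an imputed value has no within-NUTS1 variance, so any analysis of regional dispersion should be able to tell the two kinds of rows apart.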

This is certainly not a role for Eurostat but for the analyst.

What Eurostat Should Do?

The addition of a clear NUTS_LEVEL variable, harmonized with the maps released by Eurostat, would create clarity and would make data joining and validation easier. It would be a good practice with all regional statistical products. This would be an enhancement, not a bug fix, because the current data appears to be fine.

The re-addition of earlier French data would be necessary. As mentioned earlier, Eurostat has two clear choices: creating separate NUTS2016 and NUTS2013 tables, or creating joined tables with an additional column clearly defining whether each data element follows NUTS2016, NUTS2013 or potentially NUTS2010, NUTS2006 or NUTS2003.

A simpler, but less satisfactory solution would be the re-addition of the earlier French data where only the NUTS labels changed and the NUTS boundaries did not. In these cases, the history of French data is no different from Dutch or Belgian or any other data.

What You Can Do

Exactly the same as with the previous ICT data.

Internet Use & Internet Activities

Individuals who ordered goods or services over the internet for private use - isoc_r_blt12_i is a mixed-level regional statistic. It contains mainly NUTS2 but sometimes NUTS1 level statistics.

Conversion Tools

Read In Correspondence Tables

The following code is not run now. Please save the latest correspondence table to your own system.

Function to correct NUTS labelling

This code is not evaluated now. You need the tidyverse and eurostat packages from CRAN to make it work.
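To illustrate the idea of relabelling without those dependencies, here is a minimal base R sketch. It is not the correct_nuts_labelling() function itself: the `relabel_nuts_sketch()` name, the three-row excerpt of code pairs and the data values are hypothetical, and every pair should be verified against the official correspondence table before use. It only handles regions where the code changed between NUTS2013 and NUTS2016 but the boundaries did not.

```r
# Hypothetical excerpt of a correspondence table: NUTS2013 codes whose
# boundaries did not change but whose code did in NUTS2016.
# Verify every pair against the official correspondence table before use.
recoded <- data.frame(
  code13 = c("FR24", "FR30", "FR22"),
  code16 = c("FRB0", "FRE1", "FRE2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace obsolete NUTS2013 codes with their
# NUTS2016 equivalents and keep a record of what was changed.
relabel_nuts_sketch <- function(dat, recoded) {
  idx <- match(dat$geo, recoded$code13)
  obsolete <- !is.na(idx)
  if (any(obsolete)) {
    warning("Some rows use obsolete NUTS2013 codes and were relabelled.",
            call. = FALSE)
  }
  dat$geo_original <- dat$geo
  dat$geo[obsolete] <- recoded$code16[idx[obsolete]]
  dat$relabelled <- obsolete
  dat
}

# Invented values, for illustration only.
gerd <- data.frame(
  geo    = c("FR10", "FR24", "FR30"),
  time   = 2013,
  values = c(1, 2, 3),
  stringsAsFactors = FALSE
)

gerd16 <- relabel_nuts_sketch(gerd, recoded)
```

Regions whose boundaries really changed cannot be handled this way; they need either backfilled data from the statistical authority or an explicit model.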

Function to impute NUTS1 data to NUTS2