Reprex: Reproducible Consulting

Daniel Antal, CFA

3/22/2021

Reprex Reproducible Consulting

You can jump between our main topics, (1) Data Analytics & Consulting, (2) Focus on Material Costs and Non-Billable Hours, with three use cases, and (3) Shared Data Assets, with the ⯇ ⯈ arrows on your keyboard or on the screen, or by clicking the highlighted blue text. In each topic you can go deeper by pressing ⯆ on your keyboard, touchpad or on the screen, where we offer more arguments, examples and visualizations. You can try it out to see a few definitions. The visual elements in the Use Case ⯈ Part 1, ⯈ Part 2, ⯈ Part 3 are linked to documented data, reproducible documents and app examples. Make sure to ↗ click out of this presentation!

Definitions

Open Data is data that is freely available to everyone to use and republish without legal or other restrictions. The most important sources of open data are open science data connected to scientific activities that allow the replication of scientific achievements. In Europe, the rules on the re-use of public sector information and, in other jurisdictions, freedom of information regulations make the datasets of various public institutions and other taxpayer-funded datasets available for reuse. Open data is a very important source of information for business, scientific and policy uses.
Reproducible research: The quality control of open data focuses on reviewable, reproducible and confirmable findings. Auditability is a requirement in most high-level business, scientific or policy applications.
Open Source: In most cases, when the data processing code and procedure are not a well-documented, open-source algorithm, reproducibility and confirmability are limited or impossible.
Press ⯅ to go back to the opening slide or ⯈ to jump to the first topic.

Data Analytics & Consulting

Everybody has realized the need to adopt data science and AI in the consulting portfolio, but there are no clear business models.
The largest consultancies have acqui-hired analytics companies and are trying to build them into their billable-hour model, which is problematic: research automation by its nature saves billable hours.
Large tech firms (Microsoft, Google, IBM) are offering competing products.
AI consulting firms are competing with management/policy consulting, and startups have various partial new offerings that may kill consulting projects.
Innovative ⯆ incentives should be calibrated.

Incentives

Data analytics usually requires a one-time or flat investment and has exponentially growing benefits, whereas policy/management consulting is a linear business with a drive to sell more expert hours or days.
Even very experienced (and expensive!) senior data scientists and engineers cannot exactly forecast the number of hours needed to solve particular problems in production.
Data scientists do not create data tables, visualizations and models for a single project, but code that can be re-used any number of times. Their value added is not related to a single project, but is spread over many repeated projects.
Because data science cannot be treated as a profit center in the billable-hours model of management consulting, it is either a cost center or an outsourced activity.

Proposal

Service	Business Model	👉🏼 Internal or ↗ External Link
Trustworthy Data
access to well processed open data	Data-as-service	👉🏾 Use Case 1 data processing
well-processed API/big data	Data-as-service	↗ Example (330m documents processed)
automated data integration with in-house data	Solution-as-service	👉🏼 Use Case 3 demo app
Software & Automation
eliminating non-billable hours in projects	Exclusive support, hybrid licensing, e.g. [iotables]	👉🏻 Focus on Material Costs and Non-Billable Hours
use for modeling, AI		↗ iotables package
use for data gathering	Leave elements open-source, e.g. [regions] or [retroharmonize]	↗ regions package
Data Ecosystems
share data, competitive edge with data access	Co-founding or sponsoring our data observatories	↗ Net Zero Data Observatory
find new consulting clients in the ecosystem	Starting new data observatories or joining one of the 70 existing ones	👉🏾 Shared Data Assets
R&D, Sales, Marketing
non-billable hours, pre-sales	Solution-as-service (partly project-based)	👉🏻 Focus on Material Costs and Non-Billable Hours
non-billable hours, after-sales	Solution-as-service (partly project-based)	👉🏻 Use Case 2 reproducible research project

Focus on Material Costs and Non-Billable Hours

Among material costs in a research product, data acquisition is often the largest item, and it does not need to be converted into billable hours. This creates an incentive to outsource data work.

Open data is an untapped resource: it has zero acquisition cost, but very high processing and documentation costs. Acquiring well-processed open data can almost certainly lead to better value for money in material costs.

Non-billable hours of a project include data curation, data cleaning, processing, validation, documentation, citation management, and archiving. These are very time-consuming non-billable activities that are often neglected.

Our solutions can handle these tasks in a higher-quality, automated, and far more efficient way. Reproducibility makes the data inflow constant, and instantaneous processing and documentation support re-use and save further non-billable hours after the project.

After-sales, project generation, and renewal are typical non-billable R&D and marketing costs.

Between projects, re-usability (the constant data inflow, the automated processing and documentation, and the self-refreshing nature of the research product) supports both renewed sales and new sales. We can make a research report refresh automatically, so that a re-cast is available at any time at very little cost.

Use Case for Open Source Software: Systematically Wrong Eurostat Datasets

There is growing demand for, and supply of, provincial, regional, and metropolitan area level data. Both the OECD and Eurostat are trying to keep up with the demand. Working at sub-national levels is only feasible on a continuous basis, because within the European Union alone, up to 1000 changes occur every year.

Eurostat has no mandate to follow this process and re-cast data, and member states have no obligation to do so.
The problem with the systematically wrong Eurostat data is a typical software problem: developing and commercializing proprietary software, and providing a product warranty for it, would make the investment nonviable. Open-source statistical software development may be a good answer to this problem, but only with the correct monetary incentives.
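
The re-casting problem described above can be illustrated with a minimal Python sketch. It is not the code of the open-source software referred to in this presentation: the function name `recast_codes`, the `Observation` type, and the correspondence table are hypothetical, and the region codes are made-up placeholders rather than real statistical codes. The idea is only that observations coded in an older regional typology are relabelled to the current one, and anything that cannot be resolved is reported instead of being silently kept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    geo: str      # regional code as found in the source dataset
    year: int
    value: float


# Hypothetical correspondence table: old code -> code in the target typology.
CORRESPONDENCE = {"XX11": "XXA1"}

# Hypothetical set of codes that are valid in the target typology.
VALID_CODES = {"XXA1", "XX20"}


def recast_codes(observations, correspondence, valid_codes):
    """Relabel observations to the target typology.

    Returns (recast, unresolved): observations whose code is valid after
    relabelling, and observations that need manual or model-based treatment.
    """
    recast, unresolved = [], []
    for obs in observations:
        new_code = correspondence.get(obs.geo, obs.geo)
        if new_code in valid_codes:
            recast.append(Observation(new_code, obs.year, obs.value))
        else:
            unresolved.append(obs)
    return recast, unresolved
```

A simple relabelling like this only covers recoded regions; boundary changes that split or merge regions would need additional rules, which the sketch deliberately leaves in the unresolved list.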

Click through to see the data release ↗ in a blogpost
or in the EU ↗ Zenodo data repository


↗ pdf download
↗ see the code

Monetary incentives for open-source software development

Connected to a ⯈ ⯈ research product for the UvA IVIR, we created ↗ open-source software that handles this problem. The labor input in hours was greater than that of the entire research project, but the results can be used in many further research products.
Our solution, which kept the software open-source, allowed very high quality (it is a peer-reviewed scientific project that meets the highest documentation standards), but its development costs are not covered by any consulting project fee or scientific grant. It found about 800 users globally in the first month, and we are planning a new release with a PR campaign via ↗ rOpenGov that will bring thousands of new users.
We would like to pilot a hybrid consulting and open-source model in which we do not sell software but data-as-service: we make sure that the software is faultless and gains new features on demand, and this work is remunerated from the use of the data. For example, ↗ Github, where we store the software code and documentation, offers a donation-based funding option for software products. Ecorys could be a sponsor of this software development.
The approach can be fine-tuned with more or less permissive software licenses, and with exclusivity for certain forms of exploitation, as long as it bridges the need for remuneration of software development and documentation and the business case for exploitation: for example, exclusive client support for Ecorys, first exploitation of new uses for new research products, etc.

Use Case for Data-as-Service

We created a ↗ demo application that solves several problems related to our first use case. We wanted to find a way to take poorly processed Eurostat statistical information and match it with Google’s almost real-time social distancing data.
Eurostat’s datasets are not consistently geocoded. Google uses a global geocoding standard, ISO 3166-2, which is far less useful for statistical purposes. Our solution is an application that constantly reads in and re-processes these two difficult data sources, and provides map, chart or Excel table outputs. (In this example we only show the acquisition, reprocessing, visualization and citation of Eurostat’s regional data.)
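
The matching step between the two geocoding systems can be sketched as a join through a crosswalk table. This is a minimal illustration in Python, not the code of the demo application: the function name `join_on_crosswalk` is hypothetical, the single code pair in the crosswalk (statistical code `BE22` and ISO 3166-2 code `BE-VLI`) is given for illustration only and should be checked against the official code lists, and all numeric values are invented.

```python
def join_on_crosswalk(stat_rows, mobility_rows, crosswalk):
    """Join two datasets keyed by different regional codes.

    stat_rows:     {statistical_code: value}
    mobility_rows: {iso_3166_2_code: value}
    crosswalk:     {statistical_code: iso_3166_2_code}

    Returns (matched, unmatched): matched pairs keyed by statistical code,
    and the statistical codes that could not be linked.
    """
    matched, unmatched = {}, []
    for stat_code, stat_value in stat_rows.items():
        iso_code = crosswalk.get(stat_code)
        if iso_code is None or iso_code not in mobility_rows:
            unmatched.append(stat_code)
            continue
        matched[stat_code] = (stat_value, mobility_rows[iso_code])
    return matched, unmatched


# Illustrative inputs only; the numbers are invented.
crosswalk = {"BE22": "BE-VLI"}
stat_rows = {"BE22": 10.5, "XX99": 3.2}
mobility_rows = {"BE-VLI": -35.0}

matched, unmatched = join_on_crosswalk(stat_rows, mobility_rows, crosswalk)
```

Returning the unmatched codes explicitly is a design choice: with inconsistently geocoded inputs, a silent inner join would hide exactly the regions that need correction.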

↗ get data and citations in app
Allow time for the app to initialize with fresh data. [Highlighted: Arrondissement of Tongeren (border region with Dutch South Limburg)]

↗ In-App Downloads
- the baseline data
- the excess death (calculated) data
- the pre-defined graphic plot
- the bibliographic citation

Ideal Workflow
1. Programmatically download, process, and correct the data.
2. Document the data.
3. Create visualizations, tables, and models.
4. Programmatically place and validate tables, visualizations, and citations in the .docx, PDF, and other research documents, or in Excel, SPSS or Stata files.
↗ Make sure to click through to the app
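
The four workflow steps above can be sketched as one small pipeline. This is a minimal Python illustration under stated assumptions, not the code behind the app: every function name is hypothetical, the "download" is replaced by an in-memory CSV string with placeholder region codes and invented values, and plain CSV and JSON files stand in for the .docx, PDF, Excel, SPSS or Stata outputs.

```python
import csv
import io
import json
from datetime import date
from pathlib import Path

# Stand-in for a downloaded file; codes and values are invented.
RAW_CSV = "geo,year,value\nXX01,2019,100\nXX01,2020,\nXX02,2020,80\n"


def download():
    """Step 1a: placeholder for a programmatic download."""
    return RAW_CSV


def process(raw):
    """Step 1b: parse the raw text and drop rows with missing values."""
    rows = []
    for rec in csv.DictReader(io.StringIO(raw)):
        if rec["value"] == "":
            continue
        rows.append({"geo": rec["geo"], "year": int(rec["year"]),
                     "value": float(rec["value"])})
    return rows


def document(rows, source):
    """Step 2: record provenance metadata next to the data."""
    return {"source": source, "retrieved": date.today().isoformat(),
            "n_obs": len(rows), "regions": sorted({r["geo"] for r in rows})}


def summarise(rows):
    """Step 3: build a simple table (total value per region)."""
    totals = {}
    for r in rows:
        totals[r["geo"]] = totals.get(r["geo"], 0.0) + r["value"]
    return totals


def publish(table, metadata, out_dir):
    """Step 4: write the outputs, then re-read the table to validate it."""
    out_dir = Path(out_dir)
    table_path = out_dir / "table.csv"
    with table_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["geo", "total"])
        for geo, total in sorted(table.items()):
            writer.writerow([geo, total])
    meta_path = out_dir / "metadata.json"
    meta_path.write_text(json.dumps(metadata, indent=2))
    with table_path.open(newline="") as f:
        if len(list(csv.DictReader(f))) != len(table):
            raise ValueError("written table does not match computed table")
    return table_path, meta_path


def run(out_dir):
    rows = process(download())
    metadata = document(rows, source="hypothetical example source")
    return publish(summarise(rows), metadata, out_dir)
```

Calling `run("some/existing/directory")` executes all four steps in order; because no step depends on manual intervention, the same call can be repeated whenever the source data changes.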

Monetary incentives for data-as-service

The creation of an application like our example requires a very significant investment; however, the application can be very easily modified and adapted to new data sources.
A project-based remuneration would lead to under-investment. Various options should be explored: a framework contract, a statement of intent for the next project, fixed retainers, or some claw-in on later uses.

Use Case for a Reproducible Research Project

We created the ↗ Central European Music Industry Report in a fully reproducible manner: all analysis, visualizations, and charts are created by our software.
Our software downloads and reads in the data; it creates new visualizations with the latest data, and places them in ↗ PDF, ↗ EPUB, ↗ HTML and Word formats. It also creates accompanying presentation slides. The ⯆ table below contains the same links. We are releasing a similar, bilingual research product next week.
The prototype of this product was created with Kantar to present evidence in a very lengthy competition litigation. Courtroom, scientific or regulatory evidence must be presented in consistent, repeatable, and faultless ways.

browse the report
get in touch


pdf download
epub download

Monetary incentives for reproducible research products

Pre-sales and after-sales can be incentivized with success fees, or fixed fees for pre-defined efforts. The highest ROI is expected on this element, because reproducible research techniques aim for effortless re-use.

After the completion of a complex research product, we can make sure that all its data tables, visualizations, in-text references to calculated model values, and citations are constantly maintained.
We can do this against a small maintenance fee, or an appropriate success fee in case a re-cast or re-use of the research product or its components is achieved with a paying client.
We can even create a small subset of research products on a monthly or quarterly basis in the form of a policy journal, newsletter or corporate blogposts to maintain interest in the topic that was already covered in a paid project.
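
Keeping in-text references to calculated values up to date, as described above, can be illustrated with a small templating sketch. It is a hedged example only: the sentence template, the function name `render_sentence`, and the numbers are invented for illustration, and the standard-library `string.Template` stands in for whatever document toolchain is actually used.

```python
from string import Template

# Hypothetical report sentence with placeholders for calculated values.
REPORT_TEMPLATE = Template(
    "Revenue grew by $growth_pct% between $first_year and $last_year."
)


def render_sentence(series):
    """Fill the sentence from a {year: value} series.

    The growth rate is recomputed from the data on every call, so the
    text cannot drift away from the numbers behind it.
    """
    years = sorted(series)
    first, last = years[0], years[-1]
    growth = (series[last] / series[first] - 1) * 100
    return REPORT_TEMPLATE.substitute(
        growth_pct=f"{growth:.1f}", first_year=first, last_year=last
    )


# Invented numbers; adding a new year changes the sentence automatically.
print(render_sentence({2019: 200.0, 2020: 250.0}))
```

When a new observation is appended to the series, re-running the same function yields the updated sentence without any manual editing of the report text.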

Shared Data Assets

Observatories are permanent observation and data collection points. In the past centuries, they usually referred to a brick-and-mortar building where observation data was collected. In the Internet age, data observatories are online platforms that systematically and permanently collect research data.

⯆ Data observatories have been recognized and promoted by the EU, OECD and UNESCO as a good way to foster business-science-policy cooperation, to maintain long-term data collection programs, and to ensure the best utilization of the data. We believe that reproducible science and research automation can make data observatories better, cheaper and more reliable. Research-driven consultancies like Ecorys often participate in the creation of data observatories, because it gives instant access to data and to potential business and policy clients who can put the data to use. Examples of our data observatories: ↗ Demo Music Observatory and ↗ Net Zero Data Observatory. Other data observatories: ↗ European Alternative Fuels Observatory, the ↗ EUIPO Observatory, the ↗ European Observatory on Homelessness and the ↗ Wine Market Observatory.

The EU wants to encourage data altruism under the Data Governance Act: the use, for purposes of general interest, of data made available voluntarily by data subjects based on their consent or, where it concerns non-personal data, made available by legal persons.

Data Observatories: Big Data & Small Organizations

Permanent data collection: in social and natural sciences alike, many scientific discoveries, hypothesis tests, or scientific proofs require consistent data collected over a longer period. Only the largest, usually public institutions have the organizational capacity and budget to organize such a data collection program.
Funding cooperation: a long-term data collection program has many advantages for all scientific, policy or business uses, and offers many cost savings, but eventually requires some basic funding. Almost all the data observatories that we have reviewed receive some sort of public funding, and the ones that ceased to exist usually stopped their data collection program because further public funds were not available. Nevertheless, some sort of co-funding from participants or users is usually present.
Better value for money: the 17-year-old European open data regime recognized that the value for money of most data investments can be significantly improved by data sharing and reuse. Experience shows that only a small fraction of most publicly collected datasets is exploited in their primary, first use, but they can provide value for businesses, researchers or the public sector when reused.
Organization: in the social sciences and economics, even an ad hoc, large cross-sectional data collection, such as an international comparative data collection, often requires significant organizational investment, and this is even more true of longitudinal data collection that repeats regularly or irregularly over time. A permanent observatory structure, as an institution or a partnership of institutions, is necessary for complex, longitudinal data collection. This can be provided by a multi-disciplinary team of domain experts, statisticians, data scientists, engineers and computer scientists.

Cooperation Proposal

⯆ Further information ⯆ Exclusive elements in an open collaboration

The open collaboration leaves project sponsors with a flexible reserve of developers and researchers, as well as data capacity, that they do not have to pay for, even on a retainer basis. By financing the core of the collaboration, for example in the form of sponsoring open-source specialist software or the maintenance of permanent data collection platforms, a high service level and exclusivity can be achieved without bringing basic research costs in house.

Open-source software development: ↗ Github Sponsorship or other forms of support for maintenance and further development. Software packages available for sponsoring: ↗ regions (sub-national statistical indicators, 342 average monthly downloads), ↗ retroharmonize (individual survey data harmonization, 309 monthly downloads) and ↗ iotables (economic and environmental impact assessment, multipliers, direct and indirect effects for all EU countries, 482 monthly users). Atypical licensing/support agreements are possible for first exploitation and exclusive servicing.
Open data observatories: Co-founding or sponsoring open data observatories, with opportunities for early, embargoed exploitation of new, valuable data assets. We have many data assets for a ↗ Music and a ↗ Climate Policy Data Observatory and other data observatories, and for planned Digital Media and Cultural Heritage EU projects, among others. Participation in observatories gives access to more data, and opens data sharing and consultancy sales opportunities within the domain-specific data ecosystem of the observatory.
Research contributions: on a project basis, in proportion to consulting days.
Pre- and after-sales: on a success fee or flat fee basis.

Dilemmas of in-house, exclusive or open collaboration ⯆ below

Open Collaboration & Exclusive Services

Open collaboration is an agile project management method originating in open-source software development and reproducible scientific practice. The open collaboration leaves the project sponsors with a flexible reserve of developer, researcher and data capacity that they do not have to pay for, even on a retainer basis. It is built on
- modularization: splitting up the tasks into independent components,
- information commons that are well structured and make contributions easy,
- incentives to participate, even with small contributions on an individual level.

While more and more activities are coded and automated, software development is an activity that is best kept outside of a consultancy due to product liability and business model considerations. This element is best kept open-source.
Data collection is ongoing and expensive, and a hybrid model that encourages the exploitation of open data, data sharing and data altruism is best combined with exclusive, early access rights. Our triangular observatory concept allows scientific, policy and business research assets to be combined, because these are usually non-competing interests.
Consultancy is not about data but insight. Any work that requires exclusive access to the insights of data analysts, statisticians, geographers or data scientists can be internalized on a project (external consultant) or internal (part- or full-time employment) basis. Probably the best value for money is an external service contract with certain levels of non-competition clauses.
Pre-sales and after-sales can be incentivized with success fees, or fixed fees for pre-defined efforts. The highest ROI is expected on this element, because reproducible research techniques aim for effortless re-use.

Our Team

Click through for LinkedIn, Github profiles.

Contact: Daniel keybase - linkedin - github - twitter