I applied for only one project (this one) in GSoC-22, confident that I understood the problem and could work out an automated solution for it.
The main idea was to handle LaTeX components and generate equivalent HTML components for them. I had been writing articles and documents in Markdown and LaTeX, so I had a clear picture of the challenges I was going to face.
I had to write some custom LaTeX commands like a command to renew the title as an author name. This way, I could fetch most of the metadata in the LaTeX source file. I had some help and inspiration from Mitchell O’Hara-Wild’s previous work. It saved me a lot of time and work on the other challenges.
R-journal defines various verbatim environments for code blocks, such as example, Sinput, and smallverbatim. My first idea was to write a Lua filter to solve this issue, but it was not the best solution. Hence, I implemented a stream editor to find and replace these custom environments with a standard verbatim environment.
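The find-and-replace step described above can be illustrated with a short sketch. It is written in Python for illustration only and is not the project's actual stream editor; the function name `normalize_verbatim` is hypothetical, and the environment list contains only the names mentioned in the text, so the real list is likely longer.

```python
import re

# Environment names taken from the text above; the real list may be longer.
CUSTOM_ENVS = ("example", "Sinput", "smallverbatim")


def normalize_verbatim(tex, envs=CUSTOM_ENVS):
    """Rewrite \\begin{env} and \\end{env} to use the standard verbatim environment."""
    names = "|".join(re.escape(env) for env in envs)
    pattern = re.compile(r"\\(begin|end)\{(?:" + names + r")\}")
    # Keep the begin/end keyword, swap only the environment name.
    return pattern.sub(lambda m: "\\" + m.group(1) + "{verbatim}", tex)


sample = "\\begin{Sinput}\nx <- 1\n\\end{Sinput}\n"
print(normalize_verbatim(sample))
```

Matching the full `\begin{...}` or `\end{...}` token, rather than the bare name, keeps unrelated environments and ordinary words such as "example" in running text untouched.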
I had to set the math issues aside and wait for MathJax v3, which brought many changes, such as support for nested math environments in equations. I also wrote a pandoc Lua filter to bring equation labelling and numbering to the web article using the bookdown feature.
Tables were also a challenge initially. They refused to convert to markdown tables using pandoc. After a bit of discussion with my mentors and Christophe Dervieux, I learnt about pandoc reader extensions for markdown and added the simple_tables and pipe_tables extensions. This solved the problem for most table environments.
For other table environments, I used the stream editor or other workarounds. While writing a pandoc Lua filter for table numbering, I received a helping hand from Albert Krewinkel with some of the logic behind the filter.
The problem with the figures was that they lacked numbering, proper sizing, and captions. I tried several methods to fix this. I used pandoc Lua filters to add figure numbering and adjust figure sizes. I also designed a pre-processor in R to generate an object model for images. This pre-processor helped me add support for tikz and algorithm2e images, which are converted to PDF first and then to PNG.
In LaTeX, we have multiple ways to declare bibliography entries, which include embedding bibliography entries in the document itself or isolating that section in an external file, commonly called a bbl file. Then there is the BibTeX format, which has its own structure and is far better suited to managing bibliographic entries.
LaTeX has no problem dealing with multiple formats and works well with all these bibliographic formats. However, some web publishing journals, like RJ web articles from RJournal, do not support embedded bibliographies. As a result, they need a BibTeX equivalent of the same.
The rebib package converts embedded LaTeX bibliographies into a close BibTeX equivalent. It spares authors the manual conversion of their documents and helps the texor project automate the conversion of past RJournal articles to the RJ-web-article format.
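To make the idea of converting an embedded bibliography concrete, here is a minimal sketch in Python. It is not rebib's algorithm or API: the functions `split_bibitems` and `to_bibtex` are hypothetical, the sample entry is invented, and the whole reference text is simply stored in a `note` field with a best-effort year, which is far cruder than a real converter.

```python
import re

# Matches \bibitem{key} and \bibitem[label]{key}, capturing the key.
BIBITEM = re.compile(r"\\bibitem(?:\[[^\]]*\])?\{([^}]*)\}")


def split_bibitems(bbl):
    """Return (key, text) pairs for each \\bibitem in an embedded bibliography."""
    body = re.sub(r"\\end\{thebibliography\}", "", bbl)
    # With one capture group, split yields [preamble, key1, text1, key2, text2, ...].
    parts = BIBITEM.split(body)
    return [(parts[i].strip(), " ".join(parts[i + 1].split()))
            for i in range(1, len(parts) - 1, 2)]


def to_bibtex(key, text):
    """Build a rough @misc entry: full text as a note, plus a year if one is found."""
    fields = {"note": text}
    year = re.search(r"\b(?:1[89]|20)\d{2}\b", text)
    if year:
        fields["year"] = year.group(0)
    lines = [f"  {name} = {{{value}}}" for name, value in fields.items()]
    return "@misc{" + key + ",\n" + ",\n".join(lines) + "\n}"


# Invented sample entry, for illustration only.
sample = r"""\begin{thebibliography}{1}
\bibitem[Doe(2020)]{doe2020}
J.~Doe. \newblock A sample title. \newblock \emph{Some Journal}, 2020.
\end{thebibliography}"""

for key, text in split_bibitems(sample):
    print(to_bibtex(key, text))
```

Keeping the citation key unchanged means existing `\cite` commands in the article continue to resolve against the generated BibTeX file.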
The last part of the project was to wrap up all these individual functions and create a workflow for automated conversion by calling these functions over a list of directories. There was also some logging using the logger package to identify failure points and better understand the issues.
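The batch workflow described above can be sketched as follows. This is an illustrative Python outline, not the package's code (which is written in R and uses the logger package): `convert_article` is a hypothetical stand-in for the real per-article steps, and here it only checks that a `.tex` file is present.

```python
import logging
from pathlib import Path

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("conversion")


def convert_article(article_dir):
    """Hypothetical stand-in for the per-article conversion steps."""
    if not any(article_dir.glob("*.tex")):
        raise FileNotFoundError(f"no .tex file in {article_dir}")


def convert_all(dirs):
    """Run convert_article on each directory and return the list of failures."""
    failures = []
    for article_dir in map(Path, dirs):
        try:
            convert_article(article_dir)
            log.info("converted %s", article_dir)
        except Exception as err:
            # Log and continue, so one bad article does not stop the batch.
            log.error("failed %s: %s", article_dir, err)
            failures.append((article_dir, err))
    return failures
```

Collecting failures instead of stopping at the first error lets a single run cover the whole archive, and the log then shows which articles need manual attention.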
I intend to continue maintaining the packages and improving their functionality. I wish to see all the legacy PDF RJournal articles converted to the new web format, as it brings accessibility and quality-of-life enhancements to readers.
This project marks a milestone in my open source journey and remains close to my heart for the memorable experience I had, which I cannot express. I am grateful to my mentors, friends, and everyone who was involved in this project directly or indirectly. I learnt a lot and enjoyed the whole process.
All work and packages are open source under the MIT license unless otherwise specified.