Converting scientific reviews to EPUB

May 7 2018 pub thesis

First, check on PubMed Central
Convert the HTML page to EPUB
Clean up the HTML before conversion
Other EPUB resources
Loose Ends by Sidney Brenner

Edit Apr 6 2019: Added the compilation of Loose Ends columns by Sidney Brenner.

Third post in the series “Things I did instead of writing my thesis to help me write my thesis”: how to find or convert reviews to the EPUB format to read on an ebook reader.

To motivate myself to write the thesis introduction and get some “style” inspiration, I read a few reviews. Reviews are longer and often wordier than original research articles, so I sometimes prefer to use my e-reader. It’s more compact and e-ink is more comfortable to read, especially outside. But I find PDFs a bit awkward to use on an e-reader, so it would be better to have the reviews in an e-reader format like EPUB.

In my experience, this is the best strategy to get a review in EPUB format:

Check if it is available on PubMed Central.
If not, download the HTML page and convert it with Pandoc.
- For Nature Reviews or Annual Reviews, clean up the HTML file before conversion.

For Nature Reviews and Annual Reviews, the script I wrote can also merge several reviews into one EPUB ebook.

First, check on PubMed Central

In the top-right corner of the article page there is an ePub option (example). This feature has been in beta for years but the EPUB files are really good.

Unfortunately, many articles/reviews are not available through PMC, including the majority of reviews from dedicated journals like Nature Reviews Genetics and Annual Review of Genetics. For example, there are more than a thousand reviews from Nature Reviews Genetics on PubMed, but only 200-300 available on PubMed Central. Same for Annual Review of Genetics which has ~70 reviews available on PMC but almost 1K indexed on PubMed.

If it’s not available in PMC (and the journal doesn’t provide an EPUB file), no choice but to manually convert to EPUB.

Convert the HTML page to EPUB

Rather than trying to convert the PDF, I found it easier to convert the HTML page. EPUB is an XML-based format (essentially XHTML files packaged in a ZIP archive), so it is quite close to HTML anyway and this should be the way to go.

After saving the webpage locally, the HTML file can be converted with Pandoc:

pandoc webpage-i-manually-downloaded.html -o webpage-i-manually-downloaded.epub
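When several pages have been saved, the same command can be wrapped in a small shell function. This is only a sketch: the names `epub_name` and `html_to_epub` are made up for illustration, and the only Pandoc options used are `-o` and `--metadata title=...`, which sets the title stored in the ebook.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Pandoc or of the scripts in this post).

# Derive the output name: review.html -> review.epub
epub_name() {
    printf '%s\n' "${1%.html}.epub"
}

# Convert one saved page; the second argument is the ebook title.
html_to_epub() {
    pandoc "$1" -o "$(epub_name "$1")" --metadata title="$2"
}

# Usage (assumes pandoc is installed and the files exist):
#   for f in *.html; do html_to_epub "$f" "${f%.html}"; done
```

Nothing is converted until `html_to_epub` is called, so the functions can be pasted into a terminal first and tried on a single file.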

Usually, the result is readable but not very practical because the interface of the website is polluting the ebook, many links are external (pointing at webpages), and content is sometimes missing. It’s annoying not being able to click on the reference/figure links (or clicking inadvertently and opening a web browser), having to skip dozens of pages to get to the content, or missing information because it needed to be “expanded” on the website.

There are exceptions. For example, “Opportunities and obstacles for deep learning in biology and medicine”, a cool (and long) review published in the Journal of the Royal Society Interface, is also available as a GitHub HTML page. The manuscript is actually written and built using Markdown/GitHub (I want to learn how to do that at some point). Anyway, the GitHub webpage was straightforward to convert with Pandoc and produced a perfect EPUB version of the review.

wget -p https://greenelab.github.io/deep-review/
cd greenelab.github.io/deep-review
pandoc index.html -o DeepLearningBioMed.epub

Clean up the HTML before conversion

For Nature Reviews and Annual Reviews, I wrote a small R script that downloads an HTML page and:

Keeps only the article content (remove the rest of the webpage).
Fixes the table of contents.
Fixes links to reference/figure/glossary in the text.
Makes the figures not clickable.
Removes external links around references and images.
Simplifies complex tables to allow for EPUB display.
Shows the content of a box even if hidden in the webpage.
Replaces low-resolution figures with high-resolution versions (for Annual Reviews).

I put the two R scripts on GitHub. Using the URL of the review, run this to clean up the HTML and call Pandoc:

Rscript epubify-annualreviews.R -i https://www.annualreviews.org/doi/full/10.1146/annurev-genet-120215-034854
## or
Rscript epubify-naturereviews.R -i https://www.nature.com/articles/nrg.2015.16

The scripts rely on the rvest and knitr packages (run install.packages(c('rvest', 'knitr')) in R to install both). They also assume that pandoc is installed.
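A quick way to check these requirements before running the scripts is sketched below. The helper names `missing_tools` and `has_r_package` are hypothetical; they only rely on the shell's `command -v` and on R's `requireNamespace`.

```shell
#!/bin/sh
# Hypothetical preflight helpers, not part of the scripts on GitHub.

# Print each command given as argument that is NOT found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Exit status 0 if the named R package can be loaded, non-zero otherwise.
has_r_package() {
    Rscript -e "if (!requireNamespace('$1', quietly = TRUE)) quit(status = 1)" >/dev/null 2>&1
}

# Usage:
#   missing_tools pandoc Rscript
#   has_r_package rvest || echo "rvest is missing"
#   has_r_package knitr || echo "knitr is missing"
```

If `missing_tools pandoc Rscript` prints nothing, both programs were found.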

Compiling several reviews into one EPUB document

Some reviews or articles are part of collections, like the “Points of Significance” series in Nature Methods. It would be nice to convert these articles into one EPUB.

To do this, both scripts can loop over a list of URLs and produce one EPUB. The -i argument should be a text file with one URL per line, and the -list argument should be added to switch on the “list” mode. Optionally, you can specify a title using the -title argument.
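As a rough sketch of the list mode described above: the file name `reviews.txt` and the title are placeholders, and the second URL must be replaced with a real article address. The exact commands I used are in the instructions on GitHub.

```shell
#!/bin/sh
# Sketch of "list" mode; the file name and the title are placeholders.
urls=reviews.txt

# One URL per line. The first is the example used earlier in this post; the
# second is a placeholder to replace with a real article URL.
cat > "$urls" <<'EOF'
https://www.nature.com/articles/nrg.2015.16
https://www.nature.com/articles/REPLACE-WITH-ARTICLE-ID
EOF

# Run the script with the -i, -list and -title arguments described above,
# but only if Rscript and the script file are actually available.
if command -v Rscript >/dev/null 2>&1 && [ -f epubify-naturereviews.R ]; then
    Rscript epubify-naturereviews.R -i "$urls" -list -title "My reviews"
fi
```

Keeping the URLs in a separate file makes it easy to rerun the conversion later, for example after the scripts have been adapted to a website change.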

I don’t think I’m allowed to share publicly the “Points of Significance” EPUB but I can share the commands I used (see instructions on GitHub). There are also commands to download collections of Nature Reviews (e.g. Genome Editing or Computational Tools) or search results for Annual Reviews (e.g. Microbiome & Sequencing).

VPN, paywall and Pandoc

The script and Pandoc will try to download the text/images so they should be used behind the university’s VPN (or at the university). For Nature Reviews, the page can also be downloaded locally and used instead of the URL argument. That might be useful if the articles are accessed by logging in through the web browser (e.g. McGill’s EZproxy system). For Annual Reviews however, the Pandoc command will try to download high-res figures so it’s better to run with a VPN.

Limitations

Tables are “simplified”, which makes them uglier but compatible with EPUB. For example, a label spanning multiple rows is repeated in each row.
If/when the websites change, the scripts will need to be adapted.
This is tinkering so it’s possible that some pages won’t work properly. The ones I tried worked but I didn’t test that many.

Methods

The scripts use functions from the rvest package to download the HTML code of a webpage. Then they use three different strategies to clean up the HTML code. Sometimes rvest’s functions can directly select the relevant parts. Many times, the HTML code is modified using gsub and regular expressions. A few times the script writes new HTML code based on information retrieved using rvest.

Tables are converted into data.frames with html_table, then converted back to HTML using knitr’s kable function. In list mode, the script loops through the URLs from the input file and concatenates the cleaned HTML code. At the end, the final HTML code is written to a file which is converted to EPUB by Pandoc.
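The regular-expression strategy can be illustrated outside R. The sketch below is not taken from the scripts: it uses sed where the scripts use gsub, the function name `unwrap_figures` is made up, and the pattern is a simplified guess at what “making the figures not clickable” could look like on a page where each image sits inside a single-line link.

```shell
#!/bin/sh
# Illustration only: sed stands in for R's gsub, and the pattern assumes the
# <a ...><img ...></a> wrapper sits on one line (real pages are messier).
unwrap_figures() {
    sed -E 's#<a [^>]*>(<img [^>]*>)</a>#\1#g'
}

# Usage: the link around the image is dropped, the image itself is kept.
printf '%s\n' '<p><a href="https://example.org/fig1"><img src="fig1.png"></a></p>' | unwrap_figures
# prints: <p><img src="fig1.png"></p>
```

Links that do not wrap an image, such as reference links, are left untouched by this pattern.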

Other EPUB resources

Books from the Methods in Molecular Biology series can be downloaded in the EPUB format.
The Translational Bioinformatics collection in PLoS Computational Biology has an ePub option.

Loose Ends by Sidney Brenner

Added Apr 6 2019.

Sidney Brenner passed away yesterday. I knew his name and that he is a legend in the field but never had the chance to really read about him. I saw this tweet from Yaniv Erlich:

His loose ends book was instrumental for me to get through grad school. You can read it online here: https://t.co/d4Ni1vosYN
— Yaniv Erlich 💔 (@erlichya) April 5, 2019

So I decided to compile these monthly columns from 1994-2000 that Sidney Brenner wrote into EPUB/MOBI files to read on my ebook reader. I put the files on GitHub if anyone else is interested.

Hippocamplus My Second Memory