PdfMasher

PdfMasher is a tool to convert PDF articles (newspaper, academic) to MOBI or EPUB documents. Most ebook readers support PDF files natively, but reading those documents is often a real pain because we don't have the font size control that we have with native ebooks. In many cases, we have to fall back on the zooming feature, which is awkward at best. Another drawback of PDFs on ebook readers is that annotations are not supported.

There are already tools that convert PDFs to ebooks, such as Calibre, but they try to guess the role of each piece of text in the PDF (and that's if you're lucky). I think that in all but the simplest cases, it's a mistake to think that anything short of an AI can do that kind of guessing.

Enter PdfMasher. PdfMasher asks the user about the role of each piece of text, and does it in an efficient manner. Your PDF has a header on each page and you don't want those headers to litter your text? Sort text elements by Y-position (thus grouping the headers together), shift-select the elements and flag them as ignored. They will not appear in your final HTML. Your PDF has footnotes on many pages? Sort your elements by text content (thus grouping together all elements whose text starts with a number) and flag them as footnotes. They will be moved to the end of the document, and PdfMasher will try to create hyperlinks to footnote references.
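To make the idea concrete, here is a minimal Python sketch of that sort-then-flag workflow. It is not PdfMasher's actual code: the `TextElement` class, the `flag` and `build_text` helpers and the sample data are all hypothetical, and it assumes that a larger Y value means higher up on the page. The manual shift-select step is imitated by taking a slice of the sorted list.

```python
from dataclasses import dataclass


@dataclass
class TextElement:
    """Hypothetical stand-in for one piece of text extracted from a PDF."""
    page: int
    y: float  # vertical position; assumed here to grow towards the top of the page
    text: str
    state: str = "normal"  # "normal", "ignored" or "footnote"


def flag(selection, state):
    """Apply one state to every selected element, like flagging a shift-selection."""
    for element in selection:
        element.state = state


def build_text(elements):
    """Drop ignored elements and move footnotes to the end of the output."""
    # Reading order: page by page, top of the page first.
    ordered = sorted(elements, key=lambda e: (e.page, -e.y))
    body = [e.text for e in ordered if e.state == "normal"]
    notes = [e.text for e in ordered if e.state == "footnote"]
    return "\n".join(body + notes)


elements = [
    TextElement(1, 780.0, "The Daily Example"),
    TextElement(1, 700.0, "First paragraph, with a note.1"),
    TextElement(1, 60.0, "1 Text of the first footnote."),
    TextElement(2, 780.0, "The Daily Example"),
    TextElement(2, 700.0, "Second paragraph."),
]

# Sorting by Y-position puts the repeated page headers next to each other,
# so they can be selected as one contiguous run and flagged as ignored.
by_y = sorted(elements, key=lambda e: e.y, reverse=True)
flag(by_y[:2], "ignored")

# Sorting by text content groups the elements that start with a number,
# which in this sample is the single footnote.
by_text = sorted(elements, key=lambda e: e.text)
flag(by_text[:1], "footnote")

print(build_text(elements))
```

In this sketch the two headers disappear from the output and the footnote ends up after the body text. Linking footnote references to their footnotes is left out.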

(Sorry about the glitch. It's YouTube's fault, and it lasts 11 seconds.)

There are more screencasts on this page.

PdfMasher doesn't preserve style or images. PDF is evil, and just the task of extracting text from it while preserving the flow is daunting. I receive e-mails from people disappointed that their styling is lost in the process. Sorry, PdfMasher's focus is not on style preservation (hence the "masher" part of the name) and if it's something you need, PdfMasher is probably not for you.

Early Development

Although it's quite capable already, PdfMasher is still in early development. The ultimate goal is to be able to mash the Monde Diplomatique with it. Since v0.4.0, it's actually possible, but it's not as convenient as it should be, and there are still glitches here and there. In any case, feedback and bug reports are appreciated, even more so if you have sample PDFs to send. Please let me know by contacting support or posting in the forums.

Requirements

  • Mac OS X: 10.6 and up (Snow Leopard, Lion or Mountain Lion).
  • Windows (64-bit): Win7/Win8.
  • Windows (32-bit): XP/Vista/Win7.
  • Linux: Ubuntu 12.04 and 12.10.

Old Versions
