Ebook Conversion
A word before we start
Last Christmas (2011) I bought an eBook reader: a PocketBook Pro 912 with an open source Linux operating system and firmware and an e-Ink display. By the way:
- this article is based on the prerequisites of a reader with an e-Ink display
- this article uses Linux (ubuntu 11.10) and a lot of bash shell scripting, so be prepared
You might be able to adjust some things for other devices, but this article is mainly intended for working with devices with e-Ink displays, and mainly the 10″ ones!
I didn't know how much work it would cost me to dive deep into the topic. The e-Ink technology was what kept my interest. What I didn't know: this technology also needs a special way of preparing your documents. It is not possible to simply throw any PDF at your device and expect a readable result.
Years ago, when I began dealing with LaTeX - my preferred text processing method - I was already aware of the problem that the information about the semantics of a given text, the "meta information", is low in a PDF, if it exists at all. In technical terms one could say the entropy of a PDF document is - compared to HTML, SGML, XML, or even LaTeX - very low. This is why PDF is only a medium for printing, but not a document format meant to be processed by programs and thereby viewed on a screen or an e-reader.
Concretely:
- you cannot alter the font size
- a program cannot guess the text flow
- on a reader with an e-Ink display, it might be impossible to read the document because the fonts are too small
That's why you might want to convert your documents.
How would you like to do this?
- unattended
- batch-like
- automatically
- repeatedly
- with open source software ;)
Let's have a look at the general workflow.
Test it
If you want to test, you should have the following documents at hand:
- Ideally a LaTeX or LyX document in a separate directory with
- titlepage
- table of contents
- footnotes
- pictures and all that other fancy stuff ;)
- Convert this document to HTML, RTF, PDF
- Test all the conversion programs
- Test whether your reader software can show
- titlepage
- table of contents
- footnotes
- pictures and all that other fancy stuff ;)
If you like, you can use this LyX file: File:Lyx whysiwym editor.zip. Be aware that you need LyX to open this file.
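To produce the HTML, RTF, and PDF versions of such a test document on the command line, you can use LyX's export function. This is only a minimal sketch; the exact export format names (xhtml, rtf, pdf2) may differ between LyX versions, so check lyx --help:

lyx --export xhtml mybook.lyx   # HTML
lyx --export rtf mybook.lyx     # RTF
lyx --export pdf2 mybook.lyx    # PDF via pdflatex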
The process
Reading plain PDF
There are mainly three points of interest when it comes to reading a PDF on an e-Ink device:
- Keep all meta information - like the table of contents - after conversion
- Cut away as much unnecessary space as possible (title, white borders, ...)
- Batch processing
Cropping the pages is the most effective solution.
Remember ... we are talking about e-Ink devices!
Briss
I tried different approaches to crop the pages of my PDFs, from the commercial Acrobat Writer to ImageMagick and other "crop" tools. In the end, keeping the three points from above in mind, the only tool I can recommend is Briss.
Wrapper for the Java program Briss
Download and extract it somewhere. Since Briss is written in Java, you should write a little wrapper and put it in your path (e.g. ~/bin, /usr/local/bin, /usr/bin) so that you can use it like any other program:
$> sudo youreditor /usr/local/bin/briss

#!/bin/bash
version="0.0.13"
java -jar /wherever/is/installed/briss-${version}/briss-${version}.jar "${@}"

$> sudo chmod 755 /usr/local/bin/briss
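Afterwards Briss can be started like any other command, e.g.:

briss mybook.pdf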
Nautilus script
Wouldn't it be nice to open a PDF - or even a symbolic link to one - directly from Nautilus?
Here we go: put this script into
gedit "$HOME/.gnome2/nautilus-scripts/Open with briss ..."
and make it executable
chmod +x "$HOME/.gnome2/nautilus-scripts/Open with briss ..."
This is the content:
#!/bin/bash
# Nautilus script: open the selected PDF (or a symbolic link to one) with Briss.
# Nautilus passes the names of the selected files as arguments and starts the
# script in the folder being viewed.

MYTYPE="pdf"
MYFILE="${1}"

if file -L "${MYFILE}" | grep -q "PDF document"
then
    briss "${MYFILE}"
else
    zenity --info --title "Error" --text "${MYFILE} does not seem to be a ${MYTYPE} file! Please check."
    exit 0
fi
Batch conversion
Prepared this way, I wrote a batch conversion program. The main problem when writing a conversion script is that many ebook titles contain characters like "[" or "&". This is something the shell does not like at all! In short: I know this script contains a lot of duplicated code. But believe me when I say: I tried more than once to change this.
The script (assuming you name it crop_with_briss.sh) mainly does the following:
- General: all found PDFs are cropped, and the copies are suffixed with "_cropped.pdf" (this is the default way Briss works in batch mode "-s")
- crop_with_briss.sh myPDF.pdf: the given PDF will be automatically cropped to myPDF_cropped.pdf
- crop_with_briss.sh -l : scan for all PDF files in the local directory. Files which were cropped before (a file named *_cropped.pdf exists) will not be cropped again!
- crop_with_briss.sh -lf : scan for all PDF files in the local directory. All (!) PDFs are cropped again.
- crop_with_briss.sh -r : same as -l, but the script recurses into all (!) subdirectories.
- crop_with_briss.sh -rf : same as -lf, but the script recurses into all (!) subdirectories.
So, here we go:
#!/bin/bash
# version 0.0.1, 12-03-2012, Axel Pospischil, http://blue-it.org
# version 0.0.2
# - added quoting when doing the filetype check for single mode: file "${1}"

which briss > /dev/null || { echo "Briss must be installed to run this script."; exit 1; }

[ "${1}" == "" ] && echo "Please specify -l (local path only) or -r (recursive) as parameter." && exit 1

MODE=""
FORCE="false"
[ "${1}" == "-r" ]  && MODE="recursive"
[ "${1}" == "-rf" ] && MODE="recursive"
[ "${1}" == "-rf" ] && FORCE="true"
[ "${1}" == "-l" ]  && MODE="local"
[ "${1}" == "-lf" ] && MODE="local"
[ "${1}" == "-lf" ] && FORCE="true"
file "${1}" | grep -q "PDF document" && MODE="single"

if [ "${MODE}" == "single" ]
then
    briss -s "${1}"
    exit 0
fi

# The sed chains escape all characters that would otherwise break the shell
# command assembled by awk: ( ) ' [ ] & and spaces.
if [ "${MODE}" == "recursive" ]
then
    if [ "${FORCE}" == "true" ]
    then
        # First scan the local dir, then recurse
        find '.' -name "*.pdf" | grep -v "cropped" \
            | sed -e 's/^\.\///g' -e 's/\.pdf$//g' \
                  -e 's/(/\\(/g' -e 's/)/\\)/g' -e "s/'/\\\'/g" \
                  -e 's/\[/\\[/g' -e 's/\]/\\]/g' -e 's/\&/\\&/g' -e 's/\ /\\ /g' \
            | awk '{system("briss -s " $0 ".pdf");}'
        exit 0
    else
        find '.' -name "*.pdf" | grep -v "cropped" \
            | sed -e 's/^\.\///g' -e 's/\.pdf$//g' \
                  -e 's/(/\\(/g' -e 's/)/\\)/g' -e "s/'/\\\'/g" \
                  -e 's/\[/\\[/g' -e 's/\]/\\]/g' -e 's/\&/\\&/g' -e 's/\ /\\ /g' \
            | awk '{system("[ -f " $0 "_cropped.pdf ] || briss -s " $0 ".pdf");}'
        exit 0
    fi
fi

if [ "${MODE}" == "local" ]
then
    if [ "${FORCE}" == "true" ]
    then
        ls -b *.pdf | grep -v "cropped" \
            | sed -e 's/\.pdf$//g' \
                  -e 's/(/\\(/g' -e 's/)/\\)/g' -e "s/'/\\\'/g" \
                  -e 's/\[/\\[/g' -e 's/\]/\\]/g' -e 's/\&/\\&/g' -e 's/\ /\\ /g' \
            | awk '{system("briss -s " $0 ".pdf");}'
        exit 0
    else
        ls -b *.pdf | grep -v "cropped" \
            | sed -e 's/\.pdf$//g' \
                  -e 's/(/\\(/g' -e 's/)/\\)/g' -e "s/'/\\\'/g" \
                  -e 's/\[/\\[/g' -e 's/\]/\\]/g' -e 's/\&/\\&/g' -e 's/\ /\\ /g' \
            | awk '{system("[ -f " $0 "_cropped.pdf ] || briss -s " $0 ".pdf");}'
        exit 0
    fi
fi
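For example, to crop every not-yet-cropped PDF below the current directory in one go:

./crop_with_briss.sh -r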
Contrast enhancement
Briss does a very good job cropping the documents. But e-Ink devices normally can only show 16 grayscale "colors". So it would be handy to convert a document to grayscale and thereby enhance the contrast ;)
The solutions:
- convert the PDF with ImageMagick ("convert" or "mogrify" are the corresponding commands)
- use reader software with the capability to enhance the gamma or contrast of the content
The problems:
- ImageMagick
- does not preserve the meta information of our PDF (table of contents)
- the document is no longer searchable, and TTS no longer works
- the document becomes significantly bigger
- the result on my PocketBook Pro 912 is not what I expected when it comes to quality and contrast enhancement. The PDFs do not seem that crisp and clear.
- reader software
- I did not find any software that satisfied me
- There is mainly one that can handle PDFs: a fork of FBReader called fbreader-bw; you will find it when you search the http://www.mobileread.com forum.
- Just to note: CoolReader cannot (!) display PDF files.
Do it with ImageMagick:
convert -density 600 -contrast -gamma 0.1 -colorspace GRAY input.pdf output.pdf
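Building on this command, a small loop converts all cropped PDFs in a directory in one pass (a minimal sketch using the same options as above; the _gray suffix is my own naming choice):

for f in *_cropped.pdf; do
    [ -e "$f" ] || continue   # skip if the glob matched nothing
    convert -density 600 -contrast -gamma 0.1 -colorspace GRAY "$f" "${f%.pdf}_gray.pdf"
done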
A word on DjVu and rescanning of PDFs
DjVu is a very good format for keeping your scanned documents. It is NOT a good format for reading text on an e-Ink device. It has the same disadvantages as PDF.
There are a lot of converters out there. Mainly:
- pdf2djvu (included in most Linux distributions)
- djvudigital, which uses Ghostscript. Because of licence issues, you have to compile it yourself, which isn't very much fun.
If you are interested, please read the corresponding web pages or the online manuals ;)
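Just to give an impression, a minimal pdf2djvu call looks like this (-o names the output file; treat this as a sketch and consult the man page for the full option set):

pdf2djvu -o output.djvu input.pdf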
From LaTeX to PDF or HTML
My LaTeX-documents can easily be altered to produce appropriate output for an e-Ink device.
But generating a PDF will only be suitable for one particular e-Ink device (when it comes to the font size). By the way: there is no direct way to create an epub or mobi document from LaTeX (as far as I know at the moment).
So my preferred output format is HTML! There is not much more to say about it:
- The table of contents is preserved
- No problems with font-sizing
- Easy conversion to other ebook formats (epub, mobi, ...)
Elyxer
I am using eLyXer to convert my LyX files. I have to admit that I work - exclusively (!) - with LyX. So everything here (eLyXer, scripts) is only suitable if you are working with LyX. You can alter the scripts for use with plain LaTeX; there should not be any problem.
The next script converts all LyX files, either locally ( -l ) or recursively ( -r ):
#!/bin/bash
[ ! -f /usr/bin/elyxer.py ] && echo "Elyxer must be installed to run this script." && exit 1
[ "${1}" == "" ] && echo "Please specify -l (local path only) or -r (recursive) as parameter." && exit 1

if [ "${1}" == "-r" ]
then
    for myfile in "$(find '.' -name "*.lyx" | sed -e 's/ /\\ /g' -e 's/\.lyx$//g')"
    do
        echo "${myfile}" | awk '{system("elyxer.py " $0 ".lyx > " $0 ".html");}'
    done
fi

if [ "${1}" == "-l" ]
then
    for myfile in "$(ls *.lyx | sed -e 's/ /\\ /g' -e 's/\.lyx$//g')"
    do
        echo "${myfile}" | awk '{system("elyxer.py " $0 ".lyx > " $0 ".html");}'
    done
fi
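For a single file you can call eLyXer directly; it takes the input and output file as arguments:

elyxer.py mybook.lyx mybook.html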
The same script for producing PDF files using pdflatex ( lyx --export pdf2 ). You can easily adapt this:
#!/bin/bash
[ "${1}" == "" ] && echo "Please specify -l (local path only) or -r (recursive) as parameter." && exit 1

if [ "${1}" == "-r" ]
then
    for myfile in "$(find '.' -name "*.lyx" | sed -e 's/ /\\ /g')"
    do
        echo "${myfile}" | awk '{system("lyx --export pdf2 -f " $0);}'
    done
fi

if [ "${1}" == "-l" ]
then
    for myfile in "$(ls *.lyx | sed -e 's/ /\\ /g')"
    do
        echo "${myfile}" | awk '{system("lyx --export pdf2 -f " $0);}'
    done
fi
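To export a single file by hand, the same LyX call works directly:

lyx --export pdf2 -f mybook.lyx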
A word on MHT
There would be an ideal solution for archiving web pages in one single file: the MHT format. There are plugins for Firefox to view these files.
Unfortunately, none of the existing readers on the PocketBook is able to read MHT files. One exception: CoolReader, but there are errors displaying complex files and also no table of contents.
Articles about using and creating MHT files - perhaps future software versions will be able to handle this format:
- MHT - How to save web pages as an mht file in Ubuntu. | Ubuntu Manual
- Firefox addon UnMHT
- batch convert html to mht with html2mht.pl and dir2html.pl
- SourceForge.net: HTML to MHT converter
From PDF to epub, reflow and rescanning of PDF
Generally: not a good idea, but possible. Why? The reason is simple: PDF has almost no meta information about the document structure left. So all tools more or less have to guess - admittedly a very clever guess - at the document structure: what is a heading, what is body text, which level of heading do we have, are there two or more columns per page? What should I say: I leave it and read my PDF files - if they are too big for my screen - in landscape format. Any 10″ device is capable of turning the pages 90° to the left or right so you can read them.
It doesn't matter how you generate your epub out of a PDF or whether you use reflow software - either standalone or integrated in your reader: the more complex your PDF document is, the more disappointing the result will be. So better don't waste your time. There might come a time when these converters are so smart that they can produce acceptable results out of complex documents, but my approach would be to buy an appropriate format (like HTML or EPUB) before running into this trouble!
The reflow software of most readers out there seems to do quite a good job. But there are some problems when it comes to complex documents.
For those, who like to try it out:
- pdfreflow (included in many Linux distributions)
- soPDF at mobileread.com and the soPDF project page. A clever piece of software which does not extract the text, but cuts out the complete PDF content and reassembles it. A really good approach.
- k2pdfopt from Mr. Willus
- pdfr from the same author. Almost like k2pdfopt.
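As an example, k2pdfopt can be run non-interactively like this (a sketch: -ui- suppresses the interactive menu, and by default the output lands next to the input as input_k2opt.pdf; options may change between versions, so check the built-in help):

k2pdfopt -ui- input.pdf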
From HTML to epub or mobi
One of the most sophisticated formats when it comes to eBooks is epub.
HTML is - from my point of view - the best starting point for conversion. As described above, it can easily be created using LaTeX or even other word processors. There are a lot of tools out there converting from e.g. HTML. But almost none is capable of keeping the table of contents. And this is - when it comes to e-reading - one of the most important parts.
My prerequisites for a conversion tool
- should be open source
- cross-platform (Windows, Linux, Mac, ...)
- command-line batch processing
- should preserve most of the text structure (table of contents, footnotes, ...)
So here are the candidates:
- The free tool kindlegen from Amazon.
- Free for private use is eCub, a simpler version of the next tool.
- The shareware Jutoh, which you can test for demonstration purposes.
Prepare the HTML
Before you go any further, you should make sure you are working with a so-called "well-formed" HTML document. You can do this by using the software tidy:
tidy -m -asxhtml -utf8 <yourfile>.html
You can also use the free online tidy service. But be aware that sending confidential content to a web service might not be a good idea!
If you are on Windows, this site might be of interest to you.
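In keeping with the batch theme of this article, all HTML files in a directory can be cleaned in one pass (a small sketch using the same tidy options as above):

for f in *.html; do
    tidy -m -asxhtml -utf8 "$f"
done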
Kindlegen
Kindlegen is cross-platform and does a very good job. I have nothing to complain about. It produces files in the "*.mobi" format.
The only problem I have:
- the "table of contents" gets lost!
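For reference, a typical kindlegen call looks like this (standard usage; the -o option sets the output file name):

kindlegen mybook.html -o mybook.mobi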
eCub
Did not pass the "table of contents" test, but was quite handy. It is only free for private use and has the same limitations as Jutoh. It seems to be a simpler version of that program.
For this and the next program, and for more complex documents, it is important to edit the settings and add a correct CSS file. Why it is not possible to use the style links in the HTML document is a mystery to me.
Jutoh
This is not free software, but it is a full-featured editor for ebook generation.
Advantages:
- Support of almost any ebook format
- Fully WYSIWYG editor
But I don't need the latter, because I want to edit my documents with the processor of my choice.
Disadvantages:
- Project based: you first have to import a document via the GUI into a project file.
- This is a GUI tool, not a command-line tool. But it has a batch mode if you have previously created a project file.
- Generation of the "table of contents" will not always succeed with my LaTeX documents.