Difference between revisions of "Ebook Conversion"

From Blue-IT.org Wiki

(Contrast enhancement)
Line 1: Line 1:
= A word before we start =
+
== From HTML to epub ==
Last Christmas (2011) I bought an eBook reader: a PocketBook Pro 912 with an e-Ink display and an open-source Linux operating system and firmware. By the way:
+
One of the most sophisticated formats when it comes to eBooks is epub.
* this article assumes a reader with an e-Ink display

* this article uses Linux (Ubuntu 11.10) and a lot of bash shell scripting, so be prepared

You might be able to adapt some things for other devices, but this is mainly intended for devices with e-Ink displays, and mainly those with a 10'' screen!
 
  
I didn't know how much work it would take to dive deep into the topic. The e-Ink technology was what kept my interest. What I also didn't know: this technology needs a special way of preparing your documents. It is not possible to simply throw a PDF at your device and be satisfied.
+
There are a lot of tools out there that convert from, e.g., '''HTML'''. But almost none of them is capable of keeping the "table of contents". And this is - when it comes to e-reading - one of the most important parts.
  
Years ago, when I began dealing with LaTeX - my preferred text processing method - I was already aware of the problem that in a PDF the information about the semantics of the text, the "meta information", is sparse, if it exists at all. In technical terms one would say that the ''entropy'' of a PDF document is - compared to HTML, SGML, XML, or even LaTeX - very low. This is why PDF is only a medium for printing, not a document format suited to processing by computer programs or to viewing on a screen or an e-reader.
+
My requirements for a conversion tool
 +
* should be open source
 +
* cross-platform (Windows, Linux, Mac, ...)
 +
* command-line batch processing
  
Concretely:
+
So here are the candidates:
* you cannot alter the font size
+
# The free tool [http://www.amazon.com/gp/feature.html?ie=UTF8&docId=1000234621 kindlegen] from Amazon.
* a program cannot guess the text flow
+
# The shareware [http://www.jutoh.com Jutoh]; it is not free, but you can test it for demonstration purposes.
* on a reader with an e-Ink display, it might be impossible to read the document, because the fonts are too small
 
  
That's why you might want to convert your documents.
+
=== Kindlegen ===
 +
It is cross-platform and does a very good job. I have nothing to complain about. The only problem I have:
 +
* the "table of contents" gets lost!
  
How would you like to do this?
+
=== Jutoh ===
* unattended
+
This is not free software. But it is a full-featured editor for ebooks.
* batch-like
 
* automatically
 
* repeatedly
 
* with open source software ;)
 
  
Let's have a look at the general workflow.
+
== From HTML to epub ==
 
+
One of the most sophisticated formats when it comes to eBooks is epub.
= The process =
 
== Reading plain PDF ==
 
There are three main points of interest when it comes to reading a PDF on an e-Ink device:

# Keep all meta information - like the table of contents - after conversion

# Cut as much unnecessary space as possible (title, white borders, ...)
 
# Batch processing
 
 
 
=== Cropping the pages is the most sophisticated solution ===
 
 
 
I tried different approaches to cropping the pages of my PDFs, from the commercial Acrobat writer to imagemagick and other "crop" tools. In the end, keeping the three points above in mind, the only tool I can recommend is [http://sourceforge.net/projects/briss briss].
 
 
 
Download and extract it somewhere. It is written in Java, so to use it like any other program you should write a little wrapper and put it in your path (e.g. ~/bin, /usr/local/bin, /usr/bin):
 
#!/bin/bash
 
version="0.0.13"
 
# Run the briss jar from the current directory and pass all arguments through unchanged
 
java -jar /local/share/pdfr/briss-${version}/briss-${version}.jar "${@}"
 
 
 
With the wrapper in place, I wrote a batch conversion program. The main problem when writing a conversion script is that many ebook titles contain characters like "[" or "&", which bash treats specially unless they are quoted or escaped. In short: I know this script has a lot of duplicate code in it. But believe me when I say: I tried more than once to change this.
 
 
 
The script (assuming you name it crop_with_briss.sh) mainly does the following:

* General: all PDFs found are cropped and the result gets the suffix "_cropped.pdf" (this is the default way ''briss'' works in batch mode "-s")

* crop_with_briss.sh myPDF.pdf: The given PDF is automatically cropped to myPDF_cropped.pdf

* crop_with_briss.sh -l : Scan for all PDF files in the local directory. Files that have already been cropped (a file with the name "*_cropped.pdf" exists) will not be cropped again!

* crop_with_briss.sh -lf : Scan for all PDF files in the local directory. All (!!!) PDFs are cropped again.

* crop_with_briss.sh -r : Same as -l, but the script recurses into all (!) subdirectories.

* crop_with_briss.sh -rf : Same as -lf, but the script recurses into all (!) subdirectories.
 
 
 
So, here we go:
 
 
 
#!/bin/bash
 
# Abort with a non-zero exit status if the briss wrapper is not in the path
which briss > /dev/null || { echo "Briss must be installed to run this script."; exit 1; }
 
 
[ "${1}" == "" ] && echo "Please specify -l (local path only) or -r (recursive) as parameter." && exit 1
 
MODE=""
 
FORCE="false"
 
[ "${1}" == "-r" ]  && MODE="recursive"
 
[ "${1}" == "-rf" ] && MODE="recursive"
 
[ "${1}" == "-rf" ] && FORCE="true"
 
[ "${1}" == "-l" ]  && MODE="local"
 
[ "${1}" == "-lf" ] && MODE="local"
 
[ "${1}" == "-lf" ] && FORCE="true"
 
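# A single existing PDF file as parameter; the -f test keeps options like "-l" away from file(1)
[ -f "${1}" ] && file -- "${1}" | grep -q "PDF document" && MODE="single"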
 
 
if [ "${MODE}" == "single" ]
 
then
 
briss -s "${1}"

# Pass on the exit status of briss
exit $?
 
 
fi
 
 
 
if [ "${MODE}" == "recursive" ]
 
then
 
 
if [ "${FORCE}" == "true" ]
 
then
 
# First scan the local dir, then recursive
 
find '.' -name "*.pdf" | grep -v "cropped" | awk '{print $0}' | sed -e 's/^\.\///g' | sed -e 's/\.pdf$//g' | sed -e 's/(/\\(/g' | sed -e 's/)/\\)/g' | sed -e "s/'/\\\'/g" | sed -e 's/\[/\\[/g' | sed -e 's/\]/\\]/g' | sed -e 's/\&/\\&/g' | sed -e 's/\ /\\ /g' | awk '{system("briss -s " $0 ".pdf");}'
 
exit 0
 
 
else
 
# Skip files that already have a *_cropped.pdf next to them ("[", "]" and "||" need no backslashes inside the awk string)
find '.' -name "*.pdf" | grep -v "cropped" | awk '{print $0}' | sed -e 's/^\.\///g' | sed -e 's/\.pdf$//g' | sed -e 's/(/\\(/g' | sed -e 's/)/\\)/g' | sed -e "s/'/\\\'/g" | sed -e 's/\[/\\[/g' | sed -e 's/\]/\\]/g' | sed -e 's/\&/\\&/g' | sed -e 's/\ /\\ /g' | awk '{system("[ -f " $0 "_cropped.pdf ] || briss -s " $0 ".pdf");}'

exit 0
 
 
fi
 
 
fi
 
 
if [ "${MODE}" == "local" ]
 
then
 
 
if [ "${FORCE}" == "true" ]
 
then
 
# Plain ls: with -b, ls would already escape spaces, and the sed chain below would escape them a second time
ls *.pdf | grep -v "cropped" | awk '{print $0}' | sed -e 's/\.pdf$//g' | sed -e 's/(/\\(/g' | sed -e 's/)/\\)/g' | sed -e "s/'/\\\'/g" | sed -e 's/\[/\\[/g' | sed -e 's/\]/\\]/g' | sed -e 's/\&/\\&/g' | sed -e 's/\ /\\ /g' | awk '{system("briss -s " $0 ".pdf");}'

exit 0
 
 
else
 
# Plain ls (see above); skip files that already have a *_cropped.pdf next to them
ls *.pdf | grep -v "cropped" | awk '{print $0}' | sed -e 's/\.pdf$//g' | sed -e 's/(/\\(/g' | sed -e 's/)/\\)/g' | sed -e "s/'/\\\'/g" | sed -e 's/\[/\\[/g' | sed -e 's/\]/\\]/g' | sed -e 's/\&/\\&/g' | sed -e 's/\ /\\ /g' | awk '{system("[ -f " $0 "_cropped.pdf ] || briss -s " $0 ".pdf");}'

exit 0
 
 
fi
 
 
fi
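
Most of the escaping in the script above is needed only because the file names pass through a second shell inside awk's system(). The following is a minimal sketch of an alternative, assuming bash and GNU find: the names travel NUL-separated and reach briss as one quoted argument, so "[" or "&" need no special treatment. The function name crop_pdfs and the BRISS variable are made up for this illustration. Unlike the script above, it skips only files whose name ends in "_cropped.pdf", not every path that contains "cropped".

```bash
#!/bin/bash
# Sketch only. BRISS is a hypothetical override; by default the "briss" wrapper is called.
BRISS="${BRISS:-briss}"

# crop_pdfs DIR MODE FORCE
#   DIR    directory to scan
#   MODE   "local" (DIR only) or "recursive" (DIR and all subdirectories)
#   FORCE  "true" crops again even if NAME_cropped.pdf already exists
crop_pdfs() {
    local dir="$1" mode="$2" force="$3" pdf
    local depth=()
    [ "${mode}" == "local" ] && depth=(-maxdepth 1)

    # -print0 and read -d '' keep every file name intact, whatever characters it contains
    while IFS= read -r -d '' pdf
    do
        if [ "${force}" != "true" ] && [ -f "${pdf%.pdf}_cropped.pdf" ]
        then
            continue
        fi
        # stdin is redirected so that the called program cannot swallow the file list
        "${BRISS}" -s "${pdf}" < /dev/null
    done < <(find "${dir}" "${depth[@]}" -type f -name "*.pdf" ! -name "*_cropped.pdf" -print0)
}
```

With this, crop_pdfs . local false corresponds to -l, and crop_pdfs . recursive true corresponds to -rf.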
 
 
 
=== Contrast enhancement ===
 
Briss does a very good job cropping the documents. But e-Ink devices can normally show only 16 shades of gray. So it would be handy to convert a document to grayscale and enhance the contrast along the way ;)
 
 
 
The solutions:
 
# convert the PDF with ''imagemagick'' ("convert" or "mogrify" are the corresponding commands)
 
# use reader software that can enhance the gamma or contrast of the content
 
 
 
The problems:
 
# ''imagemagick''
 
## does not preserve the meta content of our PDF (table of contents)

## the document is no longer searchable, and TTS (text-to-speech) no longer works

## the document becomes significantly bigger

## the result on my Pocketbook 912 is not what I expected when it comes to quality and contrast enhancement. The PDFs do not seem that crisp and clear.

# reader software

## I did not find any software that satisfied me

## There is mainly one that can handle PDFs: a fork of ''fbreader'' called ''fbreader-bw''; you will find it when you search the http://www.mobileread.com forum.

## Just a note: ''coolreader'' cannot (!) display PDF files.
 
  
Do it with Imagemagick:
+
There are a lot of tools out there that convert from, e.g., '''HTML'''. But almost none of them is capable of keeping the "table of contents". And this is - when it comes to e-reading - one of the most important parts.
convert -density 600 -contrast -gamma 0.1 -colorspace GRAY input.pdf output.pdf
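
To run the convert command above over many files, a small loop is enough. This is only a sketch: the function name gray_pdfs, the CONVERT variable and the "_gray.pdf" suffix are assumptions made for this example, while the convert options are exactly the ones from the line above.

```bash
#!/bin/bash
# Sketch only. CONVERT is a hypothetical override; by default imagemagick's "convert" is called.
CONVERT="${CONVERT:-convert}"

# gray_pdfs FILE...
#   Writes NAME_gray.pdf next to every NAME.pdf given on the command line.
gray_pdfs() {
    local pdf out
    for pdf in "$@"
    do
        # Results of an earlier run are not converted a second time
        case "${pdf}" in
            *_gray.pdf) continue ;;
        esac
        out="${pdf%.pdf}_gray.pdf"
        [ -f "${out}" ] && continue
        # Quoting "${pdf}" and "${out}" is all that is needed for names with spaces, "[" or "&"
        "${CONVERT}" -density 600 -contrast -gamma 0.1 -colorspace GRAY "${pdf}" "${out}"
    done
}
```

Called as gray_pdfs *.pdf, this treats all PDFs in the local directory and leaves the originals untouched.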
 
  
== From LaTeX to PDF or HTML ==
+
My requirements for a conversion tool
My LaTeX documents can easily be altered to produce appropriate output for an e-Ink device.
+
* should be open source
 +
* cross-platform (Windows, Linux, Mac, ...)
 +
* command-line batch processing
  
But a generated PDF will only suit one particular e-Ink device (when it comes to the font size).
+
So here are the candidates:
By the way: there is no direct way to create an epub or mobi document from LaTeX (as far as I know at the moment).
+
# The free tool [http://www.amazon.com/gp/feature.html?ie=UTF8&docId=1000234621 kindlegen] from Amazon.
 +
# The shareware [http://www.jutoh.com Jutoh]; it is not free, but you can test it for demonstration purposes.
  
So my preferred output format is HTML! There is nothing more to say about it.
+
=== Kindlegen ===
* The table of contents is preserved
+
It is cross-platform and does a very good job. I have nothing to complain about. It produces files in the "*.mobi" format.
* No problems with font-sizing
 
* Easy conversion to other ebook formats (epub, mobi, ...)
 
  
I am using [http://www.nongnu.org/elyxer elyxer] to convert my LaTeX files. I have to admit '''that I am working - exclusively (!) - with [http://www.lyx.org LyX]'''. So everything here (elyxer, scripts) is only suitable if you are working with LyX. You can adapt the scripts for use with plain LaTeX; there should not be any problem.
+
The only problem I have:
 
+
* the "table of contents" gets lost!
The next script converts all lyx-files, either locally ( -l ) or recursively ( -r ):
 
 
 
#!/bin/bash
 
[ ! -f /usr/bin/elyxer.py ] && echo "Elyxer must be installed to run this script." && exit 1
 
 
[ "${1}" == "" ] && echo "Please specify -l (local path only) or -r (recursive) as parameter." && exit 1
 
 
if [ "${1}" == "-r" ]
 
then
 
for myfile in "$(find '.' -name "*.lyx" | awk '{print $0}' | sed -e 's/ /\\ /g' | sed -e 's/\.lyx$//')"
 
do
 
        echo "${myfile}" | awk '{system("elyxer.py " $0 ".lyx > " $0 ".html");}'
 
done
 
fi
 
 
if [ "${1}" == "-l" ]
 
then
 
for myfile in "$(ls *.lyx | awk '{print $0}' | sed -e 's/ /\\ /g' | sed -e 's/\.lyx$//')"
 
do
 
        echo "${myfile}" | awk '{system("elyxer.py " $0 ".lyx > " $0 ".html");}'
 
done
 
fi
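
The same job can also be done without parsing the output of ls or find. The following is a sketch, assuming bash 4 or newer: the shell expands the file names itself, and every name is passed on as one quoted word. The function name lyx_to_html and the ELYXER variable are made up for this illustration; elyxer is called in the same form as in the script above.

```bash
#!/bin/bash
# Sketch only. ELYXER is a hypothetical override; by default "elyxer.py" is called.
ELYXER="${ELYXER:-elyxer.py}"

# lyx_to_html -l | -r
#   -l  convert the *.lyx files of the current directory
#   -r  also convert the *.lyx files of all subdirectories
# The body runs in a subshell ( ... ), so the shopt settings do not leak into the caller.
lyx_to_html() (
    mode="$1"
    shopt -s nullglob              # a pattern without matches expands to nothing
    if [ "${mode}" == "-r" ]
    then
        shopt -s globstar          # bash >= 4: "**" also descends into subdirectories
        files=(**/*.lyx)
    else
        files=(*.lyx)
    fi
    for lyx in "${files[@]}"
    do
        # foo.lyx becomes foo.html in the same directory
        "${ELYXER}" "${lyx}" > "${lyx%.lyx}.html"
    done
)
```

The choice of globbing over ls is deliberate: a glob never splits a file name at spaces, so no escaping with sed is needed.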
 
 
 
 
 
The same script for producing PDF files using '''pdflatex''' ( lyx --export pdf2 ). You can easily adapt this:
 
#!/bin/bash
 
 
[ "${1}" == "" ] && echo "Please specify -l (local path only) or -r (recursive) as parameter." && exit 1
 
 
if [ "${1}" == "-r" ]
 
then
 
for myfile in "$(find '.' -name "*.lyx" | awk '{print $0}' | sed -e 's/ /\\ /g')"
 
do
 
        echo "${myfile}" | awk '{system("lyx --export pdf2 -f " $0);}'
 
done
 
fi
 
 
if [ "${1}" == "-l" ]
 
then
 
for myfile in "$(ls *.lyx | awk '{print $0}' | sed -e 's/ /\\ /g')"
 
do
 
        echo "${myfile}" | awk '{system("lyx --export pdf2 -f " $0);}'
 
done
 
fi
 
 
 
== From HTML to epub ==
 
One of the most sophisticated formats when it comes to eBooks is epub.
 
  
There are a lot of tools out there that convert from, e.g., HTML. But almost none of them is capable of keeping the "table of contents". And this is - when it comes to e-reading - one of the most important parts.
+
=== Jutoh ===
 +
This is not free software. But it is a full-featured editor for ebook generation.
  
= PocketBook Pro =
+
Advantages:
== Dictionary ==
+
* Support for almost any ebook format
 +
* Full WYSIWYG editor
 +
But I don't need the latter, because I want to edit my documents with the text processor of my choice.
  
== Fbreader ==
+
Disadvantages:
 +
* Project based. So you first have to import a document into a project file via the GUI.
 +
* This is a GUI tool, not a command-line tool. But it has a batch mode.
 +
* Generation of the "table of contents" does not always succeed with my LaTeX documents.

Revision as of 15:14, 4 January 2012

From HTML to epub

One of the most sophisticated formats when it comes to eBooks is epub.

There are a lot of tools out there that convert from, e.g., HTML. But almost none of them is capable of keeping the "table of contents". And this is - when it comes to e-reading - one of the most important parts.

My requirements for a conversion tool

  • should be open source
  • cross-platform (Windows, Linux, Mac, ...)
  • command-line batch processing

So here are the candidates:

  1. The free tool kindlegen from Amazon.
  2. The shareware Jutoh; it is not free, but you can test it for demonstration purposes.

Kindlegen

It is cross-platform and does a very good job. I have nothing to complain about. It produces files in the "*.mobi" format.

The only problem I have:

  • the "table of contents" gets lost!

Jutoh

This is not free software. But it is a full-featured editor for ebook generation.

Advantages:

  • Support for almost any ebook format
  • Full WYSIWYG editor

But I don't need the latter, because I want to edit my documents with the text processor of my choice.

Disadvantages:

  • Project based. So you first have to import a document into a project file via the GUI.
  • This is a GUI tool, not a command-line tool. But it has a batch mode.
  • Generation of the "table of contents" does not always succeed with my LaTeX documents.