Archive for December 2008
Scanning of the book “The Grammar of the Arabic Language”
I currently have on loan both parts of the book “Gramatika arapskog jezika” (The Grammar of the Arabic Language), written in Bosnian in 1936/1937. The books are quite old, but newer editions are still used today for studying Arabic in Bosnian/South-Slavic curricula.
As you can see from the links above, the books were scanned and posted on the web archive, ready for download. This post acts as a reminder of the tools/actions I used during scanning.
I used a Samsung SCX-4200 MFP for the scanning. I usually use Ubuntu as my OS, but the SCX-4200 has problematic proprietary drivers for it, so I did the scanning on Windows XP Home with the SmarThru application supplied by Samsung. I set it to scan at 24 bits and 300 dpi, and saved the images as JPEGs at the maximum available quality.
After scanning, I used the following Python script (fixpages.py, which uses PIL) to “cut” the pages:
import os
import datetime

from PIL import Image as Im

now = datetime.datetime.now

SRC_DIR = './raw'
DEST_DIR = './single'

if not os.path.exists(DEST_DIR):
    os.mkdir(DEST_DIR)

# Margins (in pixels) to cut off each edge of the scan
d_left = 50
d_right = 200
d_up = 20
d_down = 210

t1 = now()
for fname in sorted(os.listdir(SRC_DIR)):
    im = Im.open(os.path.join(SRC_DIR, fname))
    im = im.crop((d_left, d_up, im.size[0] - d_right, im.size[1] - d_down))
    # The file name (without extension) is the page number
    bare, ext = fname.rsplit('.', 1)
    pagenum = int(bare)
    if pagenum % 2:
        # odd page
        im = im.rotate(180)
    im.save(os.path.join(DEST_DIR, fname), quality=100)
t2 = now()
print(t2 - t1)
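The crop step in the script above turns four margins into the box that PIL's crop() expects. As a minimal sketch (the helper name crop_box and the image size in the example are assumptions for illustration, not part of the original script), the arithmetic can be isolated and checked before running it over a whole directory:

```python
def crop_box(size, left, up, right, down):
    """Return the (x0, y0, x1, y1) box used by PIL's Image.crop().

    size is (width, height) in pixels; the other arguments are the
    margins, in pixels, to cut off each edge.
    """
    width, height = size
    box = (left, up, width - right, height - down)
    # Refuse margins that would leave an empty or inverted box
    if box[0] >= box[2] or box[1] >= box[3]:
        raise ValueError("margins %r leave nothing of a %r image"
                         % ((left, up, right, down), size))
    return box


# Hypothetical scan size of 2480 x 3508 pixels, margins from fixpages.py
print(crop_box((2480, 3508), 50, 20, 200, 210))
```

With the margins from fixpages.py, a scan of that hypothetical size would be cropped to the box (50, 20, 2280, 3298).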
Of course, the margins will need to be different for images of other dimensions. After fixing the pages, I used the following script to generate the color DjVu file:
import os
import subprocess
import datetime

now = datetime.datetime.now

DJVUFILES = []        # single-page .djvu files, in page order
PROCESSEDFILES = []   # source images that already have a .djvu file

# Flag: skip the conversion of .jpg (etc.) images to .djvu files
SKIPCONVERSION = None
# Flag: remove old single-page .djvu files first
REMOVEOLD = None


def removeold(indir):
    """Remove old .djvu files from indir (subdirectories are not visited)."""
    for basename in sorted(os.listdir(indir)):
        fname = os.path.join(indir, basename)
        if os.path.isfile(fname):
            ext = os.path.splitext(fname)[1].lower()
            if ext == ".djvu":
                os.remove(fname)


def dir_iter(indir, outfile):
    """Convert all JPEG, PPM and PBM images in a directory to
    (separate) DjVu files, then bundle them into outfile.
    """
    for basename in sorted(os.listdir(indir)):
        fname_src = os.path.join(indir, basename)
        fname_dst = os.path.join(indir, basename + ".djvu")
        if os.path.isfile(fname_src):
            convert_one_img(fname_src, fname_dst)
    # Bundle the single pages: djvm -c <outfile> <page1.djvu> <page2.djvu> ...
    DJVUFILES.insert(0, outfile)
    DJVUFILES.insert(0, '-c')
    DJVUFILES.insert(0, 'djvm')
    t12 = now()
    print(t12 - t1)   # t1 is set in the main block below
    subprocess.call(DJVUFILES)


def convert_one_img(fname_src, fname_dst):
    ext = os.path.splitext(fname_src)[1].lower()
    if ext == ".djvu":
        # This covers the case of existing .djvu files.
        # They should be added to the document
        basename_src = os.path.splitext(fname_src)[0]
        if basename_src in PROCESSEDFILES:
            return
        elif SKIPCONVERSION:
            DJVUFILES.append(fname_src)
            return
    if SKIPCONVERSION:
        return
    if os.path.exists(fname_dst):
        # Already converted in an earlier run
        DJVUFILES.append(fname_dst)
        PROCESSEDFILES.append(fname_src)
        return
    args = [fname_src, fname_dst]
    if ext in (".jpg", ".jpeg", ".ppm"):
        # Color images go through c44
        args = "c44 -slice 74,89".split() + args
    elif ext == ".pbm":
        # Bitonal images go through cjb2
        args.insert(0, "cjb2")
    else:
        return
    print(args)
    subprocess.call(args)
    DJVUFILES.append(fname_dst)
    PROCESSEDFILES.append(fname_src)


if __name__ == '__main__':
    import optparse
    parser = optparse.OptionParser(
        usage='%prog [-s] [-r] <indir> <outfile>'
    )
    parser.add_option("-s", "--skipconversion",
                      action="store_true", dest="skipconversion", default=False,
                      help="Skip conversion of images to (single) .djvu files")
    parser.add_option("-r", "--removeold",
                      action="store_true", dest="removeold", default=False,
                      help="Remove old (single) .djvu files")
    options, args = parser.parse_args()
    SKIPCONVERSION = options.skipconversion
    REMOVEOLD = options.removeold
    if len(args) != 2:
        # Prints the usage line and exits instead of failing later
        parser.error("expected <indir> and <outfile>")
    indir, outfile = args
    t1 = now()
    if REMOVEOLD:
        removeold(indir)
    dir_iter(indir, outfile)
    t2 = now()
    print(t2 - t1)
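To see what the script above does for a given set of files without actually calling c44, cjb2 or djvm, the command lines can be assembled as plain lists first. This is only a simplified sketch: the function name plan_commands is an assumption, and it ignores the handling of already existing .djvu files and the -s/-r options.

```python
import os


def plan_commands(filenames, outfile, indir="."):
    """Return the commands that would be run, without running them."""
    commands = []
    pages = []
    for name in sorted(filenames):
        src = os.path.join(indir, name)
        dst = src + ".djvu"
        ext = os.path.splitext(name)[1].lower()
        if ext in (".jpg", ".jpeg", ".ppm"):
            # Color images: same c44 setting as in the script
            commands.append(["c44", "-slice", "74,89", src, dst])
        elif ext == ".pbm":
            # Bitonal images
            commands.append(["cjb2", src, dst])
        else:
            continue  # any other file type is ignored
        pages.append(dst)
    # Final step: bundle all single pages into one document
    commands.append(["djvm", "-c", outfile] + pages)
    return commands


for cmd in plan_commands(["002.jpg", "001.jpg", "notes.txt"], "book.djvu"):
    print(" ".join(cmd))
```

Because the pages are bundled in sorted file-name order, zero-padded names such as 001.jpg keep the pages in the right sequence.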
As can be seen, I have used only Open Source software for creating/manipulating the DjVu files.
To convert to PDF I used djview (more precisely, djview4). This application can even do the bitonal conversion, but I wasn’t satisfied with the result, because the covers came out completely black.
So, in order to generate a bitonal (black-and-white) DjVu with the desired properties, I had to convert the pictures to the PBM format required by the cjb2 program. For the JPEG-to-PBM conversion I used the following bash command:
anytopnm jpegs/image.jpg | ppmtopgm | pgmtopbm -threshold -value 0.15 > image.pbm
The “-threshold” option selects good old thresholding. I found a threshold value of 0.15 satisfactory. There are more black dots, but I don’t mind.
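The command above handles a single image. For a whole directory, the same pipeline could be driven from Python, in the spirit of the other scripts. This is a sketch only: the function names and the directory layout are assumptions, and the Netpbm tools must be installed for convert_dir to do anything useful.

```python
import os
import subprocess

try:
    from shlex import quote   # Python 3
except ImportError:
    from pipes import quote   # Python 2

THRESHOLD = 0.15  # the value used in the command above


def pbm_command(src, dst, threshold=THRESHOLD):
    """Build the anytopnm | ppmtopgm | pgmtopbm pipeline for one image."""
    return ("anytopnm %s | ppmtopgm | pgmtopbm -threshold -value %s > %s"
            % (quote(src), threshold, quote(dst)))


def convert_dir(src_dir, dst_dir):
    """Run the pipeline for every JPEG in src_dir, writing .pbm files."""
    if not os.path.exists(dst_dir):
        os.mkdir(dst_dir)
    for name in sorted(os.listdir(src_dir)):
        bare, ext = os.path.splitext(name)
        if ext.lower() not in (".jpg", ".jpeg"):
            continue
        cmd = pbm_command(os.path.join(src_dir, name),
                          os.path.join(dst_dir, bare + ".pbm"))
        # The pipes and the redirect need a shell
        subprocess.call(cmd, shell=True)
```

Calling convert_dir('./jpegs', './pbm') (hypothetical directory names) would leave one .pbm file per JPEG, ready to be passed to cjb2.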
I also tried generating the color DjVu with the following setting:
c44 -slice 74,89,99
but it produces a file twice as large.
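A comparison like this can be repeated on a single page before committing to a setting for the whole book. The sketch below encodes one image once per -slice setting and reports the resulting file sizes; the function names and the output naming scheme are assumptions, and c44 must be installed for compare_slices to run.

```python
import os
import subprocess


def c44_command(src, dst, slices):
    """Build the c44 call for one image and one -slice setting."""
    return ["c44", "-slice", slices, src, dst]


def compare_slices(src, settings=("74,89", "74,89,99")):
    """Encode src once per setting; return {setting: size in bytes}."""
    sizes = {}
    for slices in settings:
        # e.g. page.jpg -> page.jpg.74-89.djvu (naming is arbitrary)
        dst = "%s.%s.djvu" % (src, slices.replace(",", "-"))
        subprocess.call(c44_command(src, dst, slices))
        sizes[slices] = os.path.getsize(dst)
    return sizes
```

The ratio between the two sizes for a few representative pages gives a rough idea of how the whole document would grow.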
This is all that I can remember.