edin1’s blog

Scanning of the book “The Grammar of the Arabic Language”

I currently have on loan both parts of the book “Gramatika arapskog jezika” (The Grammar of the Arabic Language), written in Bosnian in 1936/1937. The books are quite old, but newer editions are still used today for studying the Arabic language in Bosnian/South-Slavic curricula.

As you can see from the links above, the books were scanned and posted on the web archive, ready for download. This post acts as a reminder of the tools/actions I used during scanning.

I used a Samsung SCX-4200 MFP for the scanning. I usually use Ubuntu as my OS, but the SCX-4200 has problematic proprietary drivers for it, so I did the scanning on Windows XP Home, with the SmarThru app supplied by Samsung. I set it to scan in 24-bit color at 300 dpi and saved the images as JPEGs at the maximum available quality.

After scanning, I used the following Python script (fixpages.py, which uses PIL) for “cutting” the pages:

import os
import datetime

from PIL import Image as Im

now = datetime.datetime.now

SRC_DIR = './raw'
DEST_DIR = './single'

if not os.path.exists(DEST_DIR):
    os.mkdir(DEST_DIR)

# Margins (in pixels) to cut off each side of the raw scan
d_left = 50
d_right = 200
d_up = 20
d_down = 210

t1 = now()

for fname in sorted(os.listdir(SRC_DIR)):
    im = Im.open(os.path.join(SRC_DIR, fname))
    # crop() takes a (left, upper, right, lower) box
    im = im.crop((d_left, d_up, im.size[0] - d_right, im.size[1] - d_down))

    # File names are expected to be page numbers, e.g. "12.jpg"
    bare, ext = fname.rsplit('.', 1)
    pagenum = int(bare)
    if pagenum % 2:
        # odd page: rotate by 180 degrees
        im = im.rotate(180)
    im.save(os.path.join(DEST_DIR, fname), quality=100)

t2 = now()
print(t2 - t1)
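
The crop box in the script above is just the image size minus the four margins. Here is a minimal sketch of that arithmetic as a stand-alone function. The name crop_box, the sanity check and the example page size are my own assumptions for illustration; they are not part of fixpages.py.

```python
def crop_box(size, left, up, right, down):
    """Return the (left, upper, right, lower) box that trims the given
    margins (in pixels) from an image of the given (width, height)."""
    width, height = size
    box = (left, up, width - right, height - down)
    # Refuse margins that would leave an empty or inverted box
    if box[2] <= box[0] or box[3] <= box[1]:
        raise ValueError("margins %r leave nothing of a %dx%d image"
                         % ((left, up, right, down), width, height))
    return box


# Hypothetical page size: A4 at 300 dpi is roughly 2480 x 3508 pixels
print(crop_box((2480, 3508), 50, 20, 200, 210))  # (50, 20, 2280, 3298)
```

The resulting tuple can be passed straight to im.crop(), and the check catches margins that are too large for a smaller scan before any file is written.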

Of course, the settings will differ for images of different dimensions. After fixing the pages, I used the following script for generating the color DjVu file:

import os
import sys
import subprocess
import datetime
now = datetime.datetime.now

DJVUFILES = []
PROCESSEDFILES = []
# A flag to indicate that conversion from .jpg (etc.) to .djvu files
# should be skipped
SKIPCONVERSION = None
# Remove old single .djvu files
REMOVEOLD = None

def removeold(indir):
    """Remove old .djvu files from a directory (not recursive)."""
    for basename in sorted(os.listdir(indir)):
        fname = os.path.join(indir, basename)
        if os.path.isfile(fname):
            ext = os.path.splitext(fname)[1].lower()
            if ext == ".djvu":
                os.remove(fname)

def dir_iter(indir, outfile):
    """Convert all JPEG, PPM and PBM images in a directory to
    (separate) DjVu files.
    """
    for basename in sorted(os.listdir(indir)):
        fname_src = os.path.join(indir, basename)
        fname_dst = os.path.join(indir, basename + ".djvu")
        if os.path.isfile(fname_src):
            convert_one_img(fname_src, fname_dst)
#     firstfile = DJVUFILES.pop(0)
#     shutil.copy(firstfile, outfile)
#     for djvufile in DJVUFILES:
#         pass
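    # Build the bundling command: djvm -c <outfile> <page1.djvu> ...
    DJVUFILES.insert(0, outfile)
    DJVUFILES.insert(0, '-c')
    DJVUFILES.insert(0, 'djvm')
    t12 = now()
    # t1 is set in the __main__ block below
    print(t12 - t1)
    subprocess.call(DJVUFILES)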

def convert_one_img(fname_src, fname_dst):
    ext = os.path.splitext(fname_src)[1].lower()
    if ext == ".djvu":
        # This covers the case of existing .djvu files.
        # They should be added to the document
        basename_src = os.path.splitext(fname_src)[0]
        if basename_src in PROCESSEDFILES:
            return
        elif SKIPCONVERSION:
            DJVUFILES.append(fname_src)
        return

    if SKIPCONVERSION:
        return

    if os.path.exists(fname_dst):
        DJVUFILES.append(fname_dst)
        PROCESSEDFILES.append(fname_src)
        return

    args = [fname_src, fname_dst]
    if ext in (".jpg", ".jpeg", ".ppm"):
        args = "c44 -slice 74,89".split() + args
    elif ext == ".pbm":
        args.insert(0,"cjb2")
    else:
        return
    print(args)
    subprocess.call(args)
    DJVUFILES.append(fname_dst)
    PROCESSEDFILES.append(fname_src)

if __name__ == '__main__':
    import optparse
    parser = optparse.OptionParser(
        usage='%prog [-s] [-r] <indir> <outfile>'
        )
    parser.add_option("-s", "--skipconversion",
                   action="store_true", dest="skipconversion", default=False,
                  help="Skip conversion of images to (single) .djvu files")
    parser.add_option("-r", "--removeold",
                   action="store_true", dest="removeold", default=False,
                  help="Remove old (single) .djvu files")
    options, args = parser.parse_args()
    SKIPCONVERSION = options.skipconversion
    REMOVEOLD = options.removeold
    try:
        indir = args[0]
        outfile = args[1]
    except IndexError:
        # Without both arguments there is nothing to do
        parser.print_usage()
        sys.exit(1)
    t1 = now()
    if REMOVEOLD:
        removeold(indir)
    dir_iter(indir, outfile)
    t2 = now()
    print(t2 - t1)

As can be seen, I used only open-source software for creating and manipulating the DjVu files.
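
To make it clearer which external commands the script ends up running, here is a small sketch that only builds the argument lists and runs nothing. The c44, cjb2 and djvm -c invocations are the ones used in the script above. The helper names (page_number, encoder_command, bundle_command) and the numeric page ordering are my own assumptions: the original script sorts the names as plain strings, which only gives the right page order when the numbers are zero-padded.

```python
import os


def page_number(path):
    """Numeric sort key for names like '12.jpg' or '12.jpg.djvu'
    (assumes the file name starts with the page number)."""
    return int(os.path.basename(path).split('.', 1)[0])


def encoder_command(src, dst):
    """Return the argument list for converting one image to DjVu,
    or None if the extension is not handled."""
    ext = os.path.splitext(src)[1].lower()
    if ext in ('.jpg', '.jpeg', '.ppm'):
        # color encoder, same slice setting as in the script
        return ['c44', '-slice', '74,89', src, dst]
    if ext == '.pbm':
        # bitonal encoder
        return ['cjb2', src, dst]
    return None


def bundle_command(outfile, pages):
    """Return the djvm call that bundles the single-page files,
    ordered by page number."""
    return ['djvm', '-c', outfile] + sorted(pages, key=page_number)


print(encoder_command('single/2.jpg', 'single/2.jpg.djvu'))
print(bundle_command('book.djvu',
                     ['single/10.jpg.djvu', 'single/2.jpg.djvu']))
```

Each returned list can be handed to subprocess.call() as it is. Keeping the command construction apart from the execution makes it easy to check the page order before the bundling step runs.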

For converting to PDF I used djview (djview4, to be precise). This application can even do the bitonal conversion, but I wasn’t satisfied with the result, because the covers came out completely black.

So, in order to generate a bitonal (black-and-white) DjVu with the desired properties, I had to convert the pictures to the PBM format, which the cjb2 program requires. For the JPEG-to-PBM conversion I used the following bash command:

anytopnm jpegs/image.jpg | ppmtopgm | pgmtopbm -threshold -value 0.15 > image.pbm

The “-threshold” option selects good old simple thresholding. I found a threshold of 0.15 to be satisfactory. There are more black dots, but I don’t mind.
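
For readers who have not met thresholding before, here is a minimal pure-Python sketch of the idea. It is not the pgmtopbm code. I am assuming that a pixel becomes white once its brightness reaches the given fraction of the maximum value, and I have not checked how pgmtopbm treats a pixel exactly on the boundary. The function name and the sample values are made up for illustration.

```python
def threshold(gray_rows, value=0.15, maxval=255):
    """Map rows of grayscale pixels to bitonal rows, using the PBM
    convention: 1 = black, 0 = white."""
    limit = value * maxval
    # Assumption: pixels at or above the limit turn white
    return [[0 if pixel >= limit else 1 for pixel in row]
            for row in gray_rows]


# Made-up 2x4 "page"; with value=0.15 the limit is 38.25
page = [[0, 30, 40, 255],
        [10, 38, 39, 200]]
for row in threshold(page):
    print(row)
```

Only the darkest pixels (below 38.25 out of 255 here) stay black, and everything else turns white.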

I also tried generating the color DjVu with the following setting:

c44 -slice 74,89,99

but it produces a file twice as large.

This is all that I can remember.

Written by edin1

December 19, 2008 at 9:29 pm