Deskewing Scanned Documents

Check out updates and new versions of the Deskew tool.

Some time ago I wrote a simple command line tool for deskewing scanned documents called Deskew. Technically, the correction is a rotation, since it preserves angles and a skew transformation doesn’t. However, deskewing is the commonly used term in this context.

Deskewing some smart paper

My approach is fairly common for this problem: the rotation angle is first determined using the Hough transform and then the image is rotated accordingly. The classical Hough transform is able to identify lines in an image, and it was later extended to detect arbitrary shapes.

Lines of text can be thought of as horizontal lines in the image. In a skewed scanned document all the lines will be rotated by some small angle. We can start with the equation of a line, y = k · x + q. Since we’re interested in the angle, we can substitute k = tan(α) and rewrite it as y = (sin(α) / cos(α)) · x + q. Multiplying by cos(α) and rearranging gives y · cos(α) − x · sin(α) = d, where d = q · cos(α). Every point [x, y] in the image has an infinite number of lines going through it, each defined by two parameters: the angle α and the distance from the origin d.
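As a quick sanity check of the rearranged form, here is a minimal Python sketch. The function name `line_distance` and the sample numbers are illustrative and not taken from Deskew. Every point on a line with slope tan(α) and intercept q should give the same d, namely q · cos(α).

```python
import math

def line_distance(x, y, alpha_deg):
    """Return d = y*cos(alpha) - x*sin(alpha) for a point and an angle in degrees."""
    a = math.radians(alpha_deg)
    return y * math.cos(a) - x * math.sin(a)

# Points on the line y = tan(alpha) * x + q (illustrative values).
alpha_deg, q = 3.0, 40.0
k = math.tan(math.radians(alpha_deg))
points = [(x, k * x + q) for x in (0, 100, 250, 800)]

# All distances should agree with q * cos(alpha) up to rounding error.
distances = [line_distance(x, y, alpha_deg) for x, y in points]
expected = q * math.cos(math.radians(alpha_deg))
```

Evaluating the same points with a different angle gives d values that drift apart, which is what the voting step relies on.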

We want to consider lines only for certain points of the input image. Ideally, those would be the points on the base lines on which the text is “sitting”. A simple way of finding these points is to check for black pixels that have white pixels just below them. For each of the classified points, we then determine the parameters α and d of the lines that go through it. To get a finite number of lines, we calculate d only for angles α from a certain range (I use an angle step of 0.1 degrees). We want to find a line that intersects as many classified points as possible, so an accumulator is used to store “votes” for each calculated line. For each point that is believed to be on a text base line, we add one vote for each line that intersects it. At the end, we find the top lines that have the most votes. Ideally, these are the base lines of all lines of text in the document. Finally, we get the rotation angle by averaging the angle α of the top lines and rotate the whole image accordingly.
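The voting scheme can be sketched in a few lines of Python. This is not Deskew’s actual code (which is Object Pascal), only an illustration under some assumptions: the name `estimate_skew` and its parameters are made up, d is rounded to the nearest integer to form accumulator bins, and the number of top lines to average (`top=5`) is an arbitrary choice. The 0.1 degree step and the ±10 degree range are the values mentioned in this post and its comments.

```python
import math
from collections import Counter

def estimate_skew(points, max_angle=10.0, step=0.1, top=5):
    """Estimate the text-line angle (degrees) from candidate base-line points."""
    steps = int(round(2 * max_angle / step)) + 1
    angles = [-max_angle + i * step for i in range(steps)]
    trig = [(math.cos(math.radians(a)), math.sin(math.radians(a))) for a in angles]

    # Accumulator: (angle index, rounded distance d) -> number of votes.
    votes = Counter()
    for x, y in points:
        for i, (c, s) in enumerate(trig):
            d = int(round(y * c - x * s))
            votes[(i, d)] += 1

    # Average the angles of the lines that collected the most votes.
    best = votes.most_common(top)
    return sum(angles[i] for (i, _d), _n in best) / len(best)

# Synthetic test data: five parallel "base lines" skewed by 2 degrees.
k = math.tan(math.radians(2.0))
points = [(x, k * x + q) for q in (50, 120, 190, 260, 330) for x in range(600)]
estimate = estimate_skew(points)
```

A dictionary-based accumulator keeps the sketch short. A real implementation would more likely use a preallocated 2D array indexed by angle and distance, which is faster but needs careful bounds handling.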

The important part is this one: “check for black pixels which have white pixels just below”. What counts as black and white is determined by comparing the value of the current pixel against a given threshold. For images where the background is plain white and the text is black, it’s easy to just use 0.5 as the threshold. But when the background/foreground distinction is not so sharp, calculating the threshold adaptively from the current image can be very useful. Deskew supports both adaptive threshold calculation and specifying a constant threshold as a command line parameter.
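To illustrate why the threshold matters, here is a small Python sketch. It uses the mean intensity of the image as a stand-in for an adaptive threshold, which is an assumption made for the example and not necessarily the method Deskew uses. The function names and the tiny test image are also made up.

```python
def mean_threshold(gray):
    """Use the mean intensity (values in 0..1) as a simple adaptive threshold."""
    total = sum(sum(row) for row in gray)
    count = sum(len(row) for row in gray)
    return total / count

def baseline_points(gray, threshold=0.5):
    """Return (x, y) of dark pixels that have a light pixel directly below."""
    points = []
    for y in range(len(gray) - 1):  # the last row has nothing below it
        for x, value in enumerate(gray[y]):
            if value < threshold and gray[y + 1][x] >= threshold:
                points.append((x, y))
    return points

# Low-contrast scan: dark grey "text" (0.2) on a mid-grey background (0.45).
image = [
    [0.45, 0.45, 0.45, 0.45],
    [0.45, 0.20, 0.20, 0.45],
    [0.45, 0.20, 0.20, 0.45],
    [0.45, 0.45, 0.45, 0.45],
]
t = mean_threshold(image)
adaptive_points = baseline_points(image, t)
fixed_points = baseline_points(image, 0.5)
```

With the fixed 0.5 threshold every pixel of this image counts as black, so no base-line points are found. With the mean-based threshold the bottom edge of the dark bar is picked up.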

Deskewing some math exercise

The implementation is written in Object Pascal and uses the Imaging library for reading and writing various image file formats. There are precompiled binaries for a few platforms; others can be built from the sources using the Free Pascal compiler. The archive also contains a few test images.

  Deskew 1.20
» 4.1 MiB - 5,962 hits - January 5, 2011 (last update November 1, 2016)
Command line tool for deskewing scanned documents. Binaries for several platforms, test images, and Object Pascal source code included.

17 thoughts on “Deskewing Scanned Documents”

  1. This is great! Would be nice to have the option of selecting the white color to fill in the “voids” created by the rotation.

    • Not a problem, I’ll update the program when I have some time. Some autocrop feature might be useful too, to remove the “voids” from rotated images directly.

  2. Could you also describe variable “-a angle: Maximal skew angle in degrees (default: 10)”?
    Is this the maximum skew by which to rotate the image to its final de-skewed position?

    • Basically, yes. Since it’s quite a demanding operation, skew detection is limited to a certain range (-maxangle to +maxangle). Documents are usually skewed by only a few degrees, so it’s reasonable to limit the max angle to get a considerable speed-up. If you have some highly skewed scans, you can use the “-a” parameter to widen the detection range.

  3. Great. I was looking for a tool to correct scanned text pages automatically. Even if I align the pages in the scanner very carefully, small errors remain. I assume the print of the scanned pages already contains these errors. The misalignment is within the range of +/- 0.3 degrees.
    Until now I made the corrections manually using Photoshop. The results are very good, but the operation is tremendously time-consuming.
    I found your program and made some tests. In most cases the automatic rotation works perfectly. This is even true in the case of bad documents containing rectangles as text boxes with angles different from 90 degrees. In this case the horizontal lines become rotated in the correct direction, which is the best option in such cases (of course true deskewing would be ideal 😉 ).
    May I have a wish? There are a lot of input/output options, but unfortunately none for my purpose. Due to the very high scan resolution I use 1-bit TIF at 1200 dpi. The trick is to first convert to greyscale PNG, do the autorotation, and convert back to 1-bit TIF. Otherwise the documents become too large. 1-bit TIF in and 1-bit TIF out would be very nice.

    Thanks for this nice tool.

    • I’m glad you like it.
      Support for 1-bit TIFFs can be added, as I implemented it some months ago in the Imaging library that is used by Deskew. What’s the pixel resolution of your scans? It would be easiest to convert to 8-bit greyscale internally in the program, but if your scans are hundreds of megabytes large it would be better to keep them 1-bit for the deskew operation to save memory.
      Also, which OS are you using? TIFF is supported by Deskew only in Windows. 1-bit support is also possible for PNG (which works in every supported OS).

    • No easy way unfortunately (using the current image rotation method). For images with alpha channel blending over empty image filled with “void” color could be an option (but taking time and memory).

  4. This program is very helpful! I’m doing a university project to detect features of a scanned text (e.g. headers and footers, page numbers, chapter breaks, paragraphs, etc). This program means that I can spend much less time on sanitizing the input images. You rock!

  5. Very nice tool. But one problem: the Linux version does not preserve the case of the output file name. Output file names are always lowercase.

  6. After beating my head against the wall trying to use other publicly available code for a Hough Transform, I ran across yours. It’s been 30 years since I wrote any Pascal code and it was not Object Pascal back then, but I could follow your work without difficulty which says a lot about your work. I code in C so that I can use valgrind and GDB to debug my optimizations. I discovered that there are edge conditions when the sine or cosine approach zero that cause access violations. I’d suggest making the accumulator array at least one full row larger than the diagonal of the image.
    rays = ceil(search_degrees / degrees_per_step) + 1;
    start_angle = search_center - (search_degrees / 2.0);
    hough_rows = (int)(sqrt ((img_width * img_width) + (img_length * img_length)));
    buffsize = hough_rows * (rays + 1);

    Then when you compute the index into the accumulator, offset the index by at least half the diagonal so you don’t get an out-of-bounds array access violation.
    hough_offset = (ray * hough_rows) + (int)ray_length + (int)(hough_rows / 2);
    if ((hough_offset < 0) || (hough_offset >= buffsize))
      TIFFError ("", "Offset out of range: Row %d, Col %d, Ray %d, Angle %8.4f, Offset: %8d", y, x, ray, start_angle + (ray * degrees_per_step), hough_offset);

    Many thanks for sharing your work.

    • There was a typo in my previous post: the bounds checking should look like this:

      if ((hough_offset < 0) || (hough_offset >= buffsize))
        TIFFError ("", "Offset out of range: Row %d, Col %d, Ray %d, Angle %8.4f, Offset: %8d", y, x, ray, start_angle + (ray * degrees_per_step), hough_offset);
      else
        (*(hough + hough_offset))++;
