This manual is for GNU Ocrad (version 0.29, 18 January 2024).
Copyright © 2003-2024 Antonio Diaz Diaz.
This manual is free documentation: you have unlimited permission to copy, distribute, and modify it.
GNU Ocrad is an OCR (Optical Character Recognition) program and library based on a feature extraction method. It reads images in png or pnm formats and produces text in byte (8-bit) or UTF-8 formats. The formats pbm (bitmap), pgm (greyscale), and ppm (color) are collectively known as pnm.
Ocrad includes a layout analyser able to separate the columns and blocks of text normally found on printed pages.
For best results the characters should be at least 20 pixels high. If they are smaller, try the option --scale. Scanning the image at 300 dpi usually produces a character size good enough for ocrad.
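As a rough illustration of the 20-pixel guideline above, the following C sketch computes the smallest integer enlargement factor that brings a measured character height up to the recommended minimum. The helper name suggested_scale and the constant MIN_CHAR_HEIGHT are hypothetical and not part of ocrad; the result is only a starting value to try with --scale.

```c
/* Hypothetical helper, not part of ocrad. */
enum { MIN_CHAR_HEIGHT = 20 };  /* recommended minimum height in pixels */

/* Return the smallest integer factor f >= 1 such that
   char_height * f >= MIN_CHAR_HEIGHT, or 0 if char_height is not positive. */
int suggested_scale(int char_height)
{
  if (char_height <= 0) return 0;
  if (char_height >= MIN_CHAR_HEIGHT) return 1;
  return (MIN_CHAR_HEIGHT + char_height - 1) / char_height;  /* ceiling division */
}
```

For characters about 8 pixels high this gives 3, so --scale=3 would be a reasonable first attempt.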
The character set internally used by ocrad is ISO 10646, also known as UCS (Universal Character Set), which can represent over two thousand million characters (2^31).
As it is impractical to try to recognize one among so many different characters, you can tell ocrad what character sets to recognize. You do this with the option --charset.
If the input page contains characters from only one character set, say 'ISO-8859-15', you can use the default 'byte' output format. But in a page with 'ISO-8859-9' and 'ISO-8859-15' characters, you can't tell if a code of 0xFD represents a 'latin small letter i dotless' or a 'latin small letter y with acute'. You should use --format=utf8 instead. Of course, you may request UTF-8 output in any case.
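The ambiguity described above disappears in UTF-8 because the two characters have different code points: the byte 0xFD is U+0131 in ISO-8859-9 and U+00FD in ISO-8859-15. The following C sketch encodes a code point as UTF-8 to show that the two produce different byte sequences. The function utf8_encode is a hypothetical helper written for this illustration, not part of ocrad, and it only handles code points below 0x10000.

```c
#include <stddef.h>

/* Hypothetical helper, not part of ocrad.
   Encode a code point below 0x10000 as UTF-8 into out (room for 3 bytes
   plus a terminating NUL). Returns the number of bytes written, excluding
   the NUL. */
size_t utf8_encode(unsigned cp, char out[4])
{
  size_t n = 0;
  if (cp < 0x80)
    out[n++] = (char)cp;                           /* 1 byte: ASCII */
  else if (cp < 0x800) {
    out[n++] = (char)(0xC0 | (cp >> 6));           /* 2 bytes */
    out[n++] = (char)(0x80 | (cp & 0x3F));
  } else {
    out[n++] = (char)(0xE0 | ((cp >> 12) & 0x0F)); /* 3 bytes */
    out[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[n++] = (char)(0x80 | (cp & 0x3F));
  }
  out[n] = '\0';
  return n;
}
```

With this helper, U+0131 (dotless i) becomes the bytes C4 B1 and U+00FD (y with acute) becomes C3 BD, so a reader of the UTF-8 output can tell them apart.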
NOTE: 10^9 is a thousand millions, a billion is a million millions (million^2), a trillion is a million million millions (million^3), and so on. Please, don't "embrace and extend" the meaning of prefixes, making communication among all people difficult. Thanks.
The format for running ocrad is:
ocrad [options] [files]
A hyphen '-' used as a file argument means standard input. It can be mixed with other files and is read just once, the first time it appears in the command line. Ocrad can read concatenated files from standard input. Remember to prepend ./ to any file name beginning with a hyphen, or use '--'.
ocrad supports the following options:
-h
--help
-V
--version
-a
--append
-c name
--charset=name
-e name
--filter=name
-E file
--user-filter=file
-f
--force
-F name
--format=name
-i
--invert
-l
--layout
-o file
--output=file
-q
--quiet
-s value
--scale=value
-t name
--transform=name
-T value
--threshold=value
-u left,top,width,height
--cut=left,top,width,height
-v
--verbose
-x file
--export=file
Exit status: 0 for a normal exit, 1 for environmental problems (file not found, invalid command-line options, I/O errors, etc), 2 to indicate a corrupt or invalid input file, 3 for an internal consistency error (e.g., bug) which caused ocrad to panic.
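A program that runs ocrad as a child process may want to turn the exit status into a message. The C sketch below maps the four documented values to short descriptions; the function name ocrad_exit_meaning is hypothetical, and obtaining the status (for example from wait) is left to the caller.

```c
/* Hypothetical helper, not part of ocrad.
   Map an ocrad exit status to the meaning documented above. */
const char *ocrad_exit_meaning(int status)
{
  switch (status) {
    case 0:  return "normal exit";
    case 1:  return "environmental problem";
    case 2:  return "corrupt or invalid input file";
    case 3:  return "internal consistency error";
    default: return "undocumented status";
  }
}
```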
Filters replace some characters in the text output with different characters and remove some other characters from the output. For example, when recognizing a text that is known to contain just numbers, any character recognized as a 'Z' will probably be a '2'.
Filters don't enable the recognition of characters, just filter them from the output. Use --charset to enable the recognition of a character set different from the default ISO-8859-15.
Ocrad provides both built-in filters and user-defined filters.
The format of a user-defined filter file (see --user-filter) is very simple. Each line contains either a character conversion or a word that specifies the default behaviour for unlisted characters.
A character conversion is a comma-separated list of quoted characters ('c'), character sets ([0-9A-Z]), character codes (U0063), or character ranges (U0000 - UFFFF), and an optional conversion (an equal sign (=) followed by a quoted character or a character code). The characters in the list are converted to the character in the conversion. If no conversion is specified, the character is left unmodified (converted to itself).
The default behaviour is to discard unlisted characters, i.e. those characters not appearing in the file, either by themselves or included in a set or range. If a line containing just the word 'leave' is found in the file, unlisted characters are left unmodified. If the word is 'mark', unlisted characters are marked as unrecognized.
The destination character of a conversion is considered as listed by default. Every character may be listed more than once, even as part of different conversions. The last conversion affecting a given character is the one that is performed.
Character sets and quoted characters may contain escape sequences.
The character '#' at the beginning of a line or after whitespace starts a comment that extends to the end of the line.
Ranges of characters may be specified in character sets by writing the starting and ending characters with a '-' between them. Thus, '[A-Z]' matches any ASCII uppercase letter. '-' may be specified by placing it first or last. ']' may be specified by placing it first. If the first character after the left bracket is '^', it indicates a "complemented set", which matches any character except the ones between the brackets.
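The set rules above can be sketched as a small matcher. This is only an illustration of the documented syntax, not ocrad's parser: the function set_matches is hypothetical, it receives the text between the brackets (so locating the closing ']' is assumed to have been done already), and it does not handle escape sequences.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, not part of ocrad.
   'set' is the text between the brackets, e.g. "0-9A-Z" for [0-9A-Z].
   Returns true if the byte c belongs to the set. */
bool set_matches(const char *set, unsigned char c)
{
  bool complement = false, found = false;
  size_t i, len;

  if (set[0] == '^') { complement = true; ++set; }  /* complemented set */
  len = strlen(set);
  for (i = 0; i < len; ++i) {
    unsigned char lo = (unsigned char)set[i];
    /* A '-' with a character on each side forms a range;
       a '-' placed first or last is taken literally. */
    if (i + 2 < len && set[i + 1] == '-') {
      unsigned char hi = (unsigned char)set[i + 2];
      if (c >= lo && c <= hi) found = true;
      i += 2;
    }
    else if (c == lo) found = true;
  }
  return found != complement;
}
```

For example, set_matches("A-Z", 'Q') is true, while set_matches("^0-9", '5') is false because the set is complemented.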
Literals (quoted characters and character sets) are decoded as ISO-8859-15. Character codes are decoded as UCS2. Thus, a 'latin capital letter y with diaeresis' is specified in a set as '[\xBE]', but its code is 'U0178'.
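The difference between a literal byte and a character code can be made concrete with a conversion from ISO-8859-15 to UCS. ISO-8859-15 agrees with the first 256 UCS code points except at eight positions, which the C sketch below lists. The function name iso8859_15_to_ucs is hypothetical and is not taken from the ocrad sources.

```c
/* Hypothetical helper, not part of ocrad.
   Convert one ISO-8859-15 byte to its UCS code point. */
unsigned iso8859_15_to_ucs(unsigned char byte)
{
  switch (byte) {
    case 0xA4: return 0x20AC;  /* euro sign */
    case 0xA6: return 0x0160;  /* latin capital letter s with caron */
    case 0xA8: return 0x0161;  /* latin small letter s with caron */
    case 0xB4: return 0x017D;  /* latin capital letter z with caron */
    case 0xB8: return 0x017E;  /* latin small letter z with caron */
    case 0xBC: return 0x0152;  /* latin capital ligature oe */
    case 0xBD: return 0x0153;  /* latin small ligature oe */
    case 0xBE: return 0x0178;  /* latin capital letter y with diaeresis */
    default:   return byte;    /* all other bytes map to the same value */
  }
}
```

This is why the same character is written '[\xBE]' as a literal but 'U0178' as a character code.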
Spaces and control characters are unaffected by filters, except that leading, trailing, and duplicate spaces produced by the removal of other characters are themselves removed.
Here is an example user-defined filter file equivalent to the built-in filter 'numbers':
leave # remove this line to get 'numbers_only'
'D', 'O', 'Q', 'o' = '0'
'I', 'L', 'l', '|' = '1'
'Z', 'z' = '2'
'3'
'A', 'q' = '4'
'S', 's' = '5'
'G', 'b', U00F3 = '6' # U00F3 = latin small letter o with acute
'J', 'T' = '7'
'&', 'B' = '8'
'g' = '9'
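To show how the rules for this file combine, here is a minimal C sketch that applies the conversions above to a byte string. It is an illustration under stated assumptions, not ocrad's implementation: the names unlisted_mode, build_numbers_table, and apply_filter are hypothetical, the input is assumed to be ISO-8859-15 bytes, only the 'leave' and discard behaviours are modelled ('mark' is omitted), and the cleanup of spaces left behind by removed characters is not performed.

```c
#include <stddef.h>

/* Hypothetical names, not part of ocrad. */
enum unlisted_mode { UNLISTED_DISCARD, UNLISTED_LEAVE };

/* Fill table so that table[c] is the replacement for byte c,
   or 0 if c is to be discarded. */
void build_numbers_table(unsigned char table[256], enum unlisted_mode mode)
{
  static const struct { const char *from; char to; } conv[] = {
    { "DOQo", '0' }, { "ILl|", '1' }, { "Zz", '2' }, { "3", '3' },
    { "Aq", '4' }, { "Ss", '5' }, { "Gb\xF3", '6' }, { "JT", '7' },
    { "&B", '8' }, { "g", '9' } };
  const size_t nconv = sizeof conv / sizeof conv[0];
  size_t i;

  /* Unlisted characters: kept in 'leave' mode, discarded otherwise.
     Space and the bytes below it are never affected by the filter. */
  for (i = 0; i < 256; ++i)
    table[i] = (mode == UNLISTED_LEAVE || i <= ' ') ? (unsigned char)i : 0;
  /* The destination of every conversion counts as listed. */
  for (i = 0; i < nconv; ++i)
    table[(unsigned char)conv[i].to] = (unsigned char)conv[i].to;
  /* Apply the conversions in file order, so the last one wins. */
  for (i = 0; i < nconv; ++i) {
    const char *p;
    for (p = conv[i].from; *p; ++p)
      table[(unsigned char)*p] = (unsigned char)conv[i].to;
  }
}

/* Filter the NUL-terminated string s in place. */
void apply_filter(const unsigned char table[256], char *s)
{
  char *out = s;
  for (; *s; ++s) {
    unsigned char r = table[(unsigned char)*s];
    if (r) *out++ = (char)r;
  }
  *out = '\0';
}
```

With UNLISTED_LEAVE the string "l.OOO" becomes "1.000"; with UNLISTED_DISCARD, which corresponds to removing the 'leave' line, the unlisted '.' is dropped and the result is "1000".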
Ocrad provides the following built-in filters (see --filter):
--filter=letters
--filter=letters_only
--filter=numbers
--filter=numbers_only
--filter=same_height
--filter=text_block
--filter=upper_num
--filter=upper_num_mark
--filter=upper_num_only
This constant is defined in 'ocradlib.h' and works as a version test macro. The application should check at compile time that OCRAD_API_VERSION is equal to the version required by the application:
#if !defined OCRAD_API_VERSION || OCRAD_API_VERSION != 28
#error "ocradlib 0.28 needed."
#endif
Before version 0.28, ocradlib didn't define OCRAD_API_VERSION.
OCRAD_API_VERSION is defined as (major * 1000 + minor).
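The encoding can be reversed with integer division and remainder. The two helpers in the C sketch below are hypothetical (they are not declared in 'ocradlib.h') and assume that minor stays below 1000, as the formula implies.

```c
/* Hypothetical helpers, not part of ocradlib.
   Split a value encoded as (major * 1000 + minor). */
int api_major(int api_version) { return api_version / 1000; }
int api_minor(int api_version) { return api_version % 1000; }
```

For version 0.28 the encoded value is 28, so api_major(28) is 0 and api_minor(28) is 28.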
NOTE: Version test macros are the library's way of announcing functionality to the application. They should not be confused with feature test macros, which allow the application to announce to the library its desire to have certain symbols and prototypes exposed.
If OCRAD_API_VERSION >= 28, this function is declared in 'ocradlib.h' (else it doesn't exist). It returns the OCRAD_API_VERSION of the library object code being used. The application should check at run time that the value returned by OCRAD_api_version is equal to the version required by the application.
#if defined OCRAD_API_VERSION && OCRAD_API_VERSION >= 28
if( OCRAD_api_version() != 28 ) show_error( "ocradlib 0.28 needed." );
#endif
This string constant is defined in the header file 'ocradlib.h' and represents the version of the library being used at compile time.
This function returns a string representing the version of the library being used at run time.
These are the OCRAD library functions. In case of error, all of them return -1 or a null pointer, except 'OCRAD_open' whose return value must be checked by calling 'OCRAD_get_errno' before using it.
Initialize the internal library state and return a pointer that can only be used as the ocrdes argument for the other OCRAD functions, or a null pointer if the descriptor could not be allocated.
The pointer returned must be checked by calling 'OCRAD_get_errno' before using it. If 'OCRAD_get_errno' does not return 'OCRAD_ok', the pointer returned must not be used and should be freed with 'OCRAD_close' to avoid memory leaks.
Free all dynamically allocated data structures for this descriptor. After a call to 'OCRAD_close', ocrdes can no longer be used as an argument to any OCRAD function.
Return the current error code for ocrdes. See Library error codes.
Load image into the internal buffer. If invert is true, image levels are inverted (white on black). Loading a new image deletes any previous text results.
Load an image from the file filename into the internal buffer. If invert is true, image levels are inverted (white on black). Loading a new image deletes any previous text results.
Set the output format to 'byte' (if utf8=false) or to 'utf8' (if utf8=true). By default ocrad produces 'byte' (8 bit) output.
Set the binarization threshold for greymap and RGB images. threshold values between 0 and 255 set a fixed threshold. A value of -1 sets an automatic threshold. Pixel values greater than the resulting threshold are converted to white. The default threshold value if this function is not called is 127.
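The fixed-threshold rule stated above ("pixel values greater than the resulting threshold are converted to white") can be sketched in C as follows. The function binarize is a hypothetical illustration operating on a plain array of grey levels; it is not the library's code, and the automatic threshold selected by -1 is not modelled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of ocradlib.
   For each of the n grey levels, set white[i] to true when the level is
   strictly greater than threshold (0 to 255), and to false (black) otherwise. */
void binarize(const unsigned char *grey, bool *white, size_t n, int threshold)
{
  size_t i;
  for (i = 0; i < n; ++i)
    white[i] = grey[i] > threshold;
}
```

With the default threshold of 127, a pixel of value 127 becomes black and a pixel of value 128 becomes white.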
Scale up the image in the internal buffer by value. If value is negative, the image is scaled down by -value.
Recognize the image loaded in the internal buffer and produce text results which can be later retrieved with the functions 'OCRAD_result_*'. The same image can be recognized as many times as desired, for example setting a new threshold each time for 3D greymap recognition. Every time this function is called, the text results produced replace any previous ones. If layout is true, page layout analysis is enabled, probably producing more than one text block.
Return the number of text blocks found in the image, or 0 if no text was found. The value returned is usually 1, but can be larger if layout analysis was requested.
Return the number of text lines contained in the text block given.
Return the total number of text characters contained in the image recognized.
Return the number of text characters contained in the text block given.
Return the number of text characters contained in the text line given.
Return the line of text specified by blocknum and linenum.
Return the byte result for the first character in the image. Return 0 if the image has no characters or if the first character could not be recognized. This function is a convenient shortcut to the result for images containing a single character.
Most library functions return -1 or a null pointer to indicate that they have failed. But this return value only tells you that an error has occurred. To find out what kind of error it was, you need to check the error code by calling 'OCRAD_get_errno'.
Library functions don't change the value returned by 'OCRAD_get_errno' when they succeed; thus, the value returned by 'OCRAD_get_errno' after a successful call is not necessarily OCRAD_ok, and you should not use 'OCRAD_get_errno' to determine whether a call failed. If the call failed, then you can examine 'OCRAD_get_errno'.
The error codes are defined in the header file 'ocradlib.h'.
The value of this constant is 0 and is used to indicate that there is no error.
At least one of the arguments passed to the library function was invalid.
No memory available. The system cannot allocate more virtual memory because its capacity is full.
A library function was called in the wrong order. For example 'OCRAD_result_line' was called before 'OCRAD_recognize'.
A bug was detected in the library. Please, report it. See Problems.
There are a lot of image formats, but ocrad is able to decode only four of them: png, pbm, pgm, and ppm. In this chapter you will find command examples and advice about how to convert image files to a format that ocrad can manage.
Ocrad is mainly a research project. Many of the algorithms ocrad uses are ad hoc, and will change in successive releases as I myself gain understanding about OCR issues.
The overall working of ocrad may be described as follows:
1) Read the image.
2) Optionally, perform some transformations (cut, rotate, scale, etc).
3) Optionally, perform layout detection.
4) Remove frames and pictures.
5) Detect characters and group them in lines.
6) Recognize characters (very ad hoc; one algorithm per character).
7) Correct some ambiguities (transform l.OOO into 1.000, etc).
8) Optionally, apply one or more filters to the text.
9) Output text result.
Ocrad recognizes characters by their shape. It is fast because it does not compare the shape of every character against some sort of database of shapes and then choose the best match. Instead, ocrad compares only the shape differences that are relevant to choosing between two character categories, much like a binary search.
As there is no such thing as a free lunch, this approach has some drawbacks. It makes ocrad very sensitive to character defects, and makes it difficult to modify ocrad to recognize new characters.
Calling ocrad with option -x produces an OCR results file (ORF), that is, a parsable file containing the OCR results. The ORF format is as follows:
For each text block in the source image, the following data follows:
For each line in every text block, the following data follows:
Running './ocrad -x test.orf testsuite/test.pbm' in the source directory will give you an example ORF file.
There are probably bugs in ocrad. There are certainly errors and omissions in this manual. If you report them, they will get fixed. If you don't, no one will ever know about them and they will remain unfixed for all eternity, if not longer.
If you find a bug in GNU Ocrad, please send electronic mail to [email protected]. Include the version number, which you can find by running 'ocrad --version'.