Franken+
About
Franken+ was developed by Bryan Tarpley at the Initiative for the Digital Humanities, Media, and Culture at Texas A&M University. It is a specialized tool designed to help users perform OCR on documents printed in historic fonts.

Franken+ ingests output from PRIMALab's Aletheia. Currently, only output from Aletheia version 2.1 is supported; that version is still available for download on their site under "previous version". Among many other things, Aletheia allows the user to open a scanned text document, binarize it, and draw a boundary around each individual character. Franken+ currently works only with Aletheia projects in which the characters have been outlined using polygons (not boxes). Once each character has been bounded appropriately, Aletheia saves the project as an XML file that adheres to PRIMA's PAGE XML format.

Given the binarized image and the XML file generated with Aletheia, Franken+ extracts an individual .tif image for each character outlined in Aletheia and gives the user the opportunity to hand-pick the best instances of each letter, producing a "font" that consists only of hand-picked images. Using this font, Franken+ creates synthetic TIF images of text "printed" in the font, along with corresponding BOX files that record where each character sits in the image. Franken+ then uses these image and BOX file pairs to automate the font training process for Google's open-source Tesseract OCR engine, and allows the user to test the trained font on images of documents printed with the relevant historic font.
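To illustrate what a BOX file for a synthetic image contains, here is a minimal Python sketch. It is not Franken+'s own code: the `Glyph` class, the `layout_line` function, and all parameter names are hypothetical, and the layout is simplified so that every glyph sits directly on one baseline (no descenders, no spaces). The one fact it relies on is the Tesseract box-file line format, `symbol left bottom right top page`, with coordinates measured from the bottom-left corner of the image.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """A hand-picked character image, reduced to its label and pixel size (hypothetical type)."""
    char: str
    width: int
    height: int


def layout_line(glyphs, page_height, x0=10, baseline_y=50, gap=2, page=0):
    """Place glyphs left to right on one baseline and return box-file lines.

    baseline_y is measured from the TOP of the image (normal image
    coordinates). Box files measure from the BOTTOM, so y values are flipped.
    """
    lines = []
    x = x0
    for g in glyphs:
        left = x
        right = x + g.width
        top_img = baseline_y - g.height   # glyph top, counted from image top
        bottom_img = baseline_y           # glyph bottom rests on the baseline
        # Convert to the bottom-left origin used by Tesseract box files.
        bottom = page_height - bottom_img
        top = page_height - top_img
        lines.append(f"{g.char} {left} {bottom} {right} {top} {page}")
        x = right + gap                   # advance past this glyph plus spacing
    return lines


if __name__ == "__main__":
    sample = [Glyph("a", 10, 12), Glyph("b", 8, 20)]
    for line in layout_line(sample, page_height=100):
        print(line)
    # prints:
    # a 10 50 20 62 0
    # b 22 50 30 70 0
```

A real layout would also have to handle descenders, word spacing, and line breaks; the sketch only shows how pixel positions in the synthetic image map onto box-file coordinates.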
For more information please see http://dh-emopweb.tamu.edu/Franken+/