eMOP | eMOP


eMOP is funded by a grant from the Andrew W. Mellon Foundation, and as such all code produced by and for eMOP is available under the Apache License, Version 2.0.

emop-controller
The code that implements the entire eMOP workflow.
emop-dashboard
The online dashboard that powers the eMOP workflow.
FrankenPlus
A tool created for eMOP that allows users to create Tesseract training data from their own typeface samples.
hOCR deNoising
A tool created for eMOP post-processing that removes noise from Tesseract's hOCR output.
Juxta-cl
A command-line version of Juxta that compares OCR output to groundtruth files.
Page Corrector
A tool created for eMOP that uses dictionary files and a Google 3-gram database to correct Tesseract output.
Page Evaluator
A tool created for eMOP that evaluates OCR output to determine how correctable it is.
RETAS
A tool created for eMOP that compares OCR output to groundtruth files.
TesseractTraining
A collection of training data created for Tesseract by eMOP using Franken+.
Publishing Imprint DB
Printer, seller, and location information culled from the imprint lines of the entire eMOP dataset. These XML files (EEBO and ECCO separately) contain only those entries for which we have an ESTC number.
Cobre
A robust image comparison environment that presents versions of texts alongside each other in filmstrip view and collates these images of different texts, while allowing users to adjust the collation.
Aletheia Web Layout Editor
A tool for identifying and transcribing paratext on a page image in TypeWright.

Copyright 2014 Initiative for Digital Humanities, Media, and Culture at Texas A&M University

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Why eMOP Matters

The Early Modern OCR Project (Lead PI, Dr. Laura Mandell) is an effort, on the one hand, to make access to texts more transparent and, on the other, to preserve a literary cultural heritage. The printing process in the hand-press period (roughly 1475-1800), while systematized to a certain extent, nonetheless produced texts with fluctuating baselines, mixed fonts, and varied concentrations of ink (among many other variables). These factors, combined with the poor quality of the images in which many of these books have been preserved (in EEBO and, to a lesser extent, ECCO), create a problem for Optical Character Recognition (OCR) software that is trying to translate the images of these pages into archivable, mineable texts. By using innovative applications of OCR technology and crowd-sourced corrections, eMOP will solve this OCR problem.

